TWI836255B - Method and apparatus in designing a personalized virtual singer using singing voice conversion - Google Patents

Method and apparatus in designing a personalized virtual singer using singing voice conversion

Info

Publication number
TWI836255B
Authority
TW
Taiwan
Prior art keywords
loss
features
conversion model
speech conversion
audio data
Prior art date
Application number
TW110130295A
Other languages
Chinese (zh)
Other versions
TW202309875A (en)
Inventor
蘇豐文
甘霖 江
蘇時頤
Original Assignee
國立清華大學
Filing date
Publication date
Application filed by 國立清華大學
Priority to TW110130295A
Publication of TW202309875A
Application granted
Publication of TWI836255B

Abstract

A method and an apparatus for designing a personalized virtual singer using singing voice conversion are provided. In the method, a music score file represented in symbolic notation is parsed to extract multiple lyrics and multiple notes. Audio data for each of the notes is loaded from an audio database. A vocoder performs acoustic modeling on the audio data to adjust each piece of audio data, and the adjusted audio data are concatenated to generate singing voice data. A generator of a voice conversion model converts multiple acoustic features in the singing voice data into output features conditioned on target attributes, and the voice conversion model is trained according to multiple types of losses to obtain synthesized singing voice data with optimized output features. The method can turn a fixed-timbre virtual singer into multiple personalized virtual singers.

Description

Method and device for designing a personalized virtual singer through singing voice conversion

The present invention relates to a singing voice synthesis method and device, and in particular to a method and device for designing a personalized virtual singer through singing voice conversion.

Singing voice synthesis is a technology that generates a human singing voice. People have long tried different approaches to singing synthesis, such as converting conversational speech into a song sung according to a music score and lyrics. However, timbre conversion for singing voices has mostly been one-to-one, and is therefore limited to a single singer, or to singing a given song only with one designated personalized timbre.

In other words, to convert person B's timbre into person A's timbre, person A must already have spoken or sung the song in question; a song originally sung by B cannot otherwise be rendered in A's timbre.

The present invention provides a method and device for designing a personalized virtual singer through singing voice conversion, which enable the virtual singer to sing any song from its score and lyrics in a specific person's timbre; that is, the specific person's timbre is transferred onto the song sung by the virtual singer.

The present invention provides a method for designing a personalized virtual singer through singing voice conversion, suitable for an electronic device having a processor. The method includes the following steps: parsing a music score file represented in symbolic notation to extract multiple lyrics and multiple notes; loading the audio data of each extracted note from an audio database; using a vocoder to perform acoustic modeling on the audio data, adjusting each piece of audio data and concatenating the adjusted audio data to generate singing voice data; and using a generator of a voice conversion model to convert multiple acoustic features in the singing voice data into output features conditioned on target audio attributes, and training the voice conversion model according to multiple losses of the voice conversion model to obtain synthesized singing voice data with optimized output features.

The present invention provides a device for designing a personalized virtual singer through singing voice conversion, which includes a connection device, a storage device, and a processor. The storage device stores a computer program. The processor, coupled to the connection device and the storage device, is configured to load and execute the computer program in the storage device to: parse a music score file represented in symbolic notation to extract multiple lyrics and multiple notes; load the audio data of each extracted note from an audio database; use a vocoder to perform acoustic modeling on the audio data, adjusting each piece of audio data and concatenating the adjusted audio data to generate singing voice data; and use a generator of a voice conversion model to convert multiple acoustic features in the singing voice data into output features conditioned on target audio attributes, and train the voice conversion model according to multiple losses of the voice conversion model to obtain synthesized singing voice data with optimized output features.

To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below together with the accompanying drawings.

10: Apparatus for designing a personalized virtual singer through singing voice conversion
12: Connection device
14: Storage device
16: Processor
30: System
31: Music score
32: Lyrics + notes
33: Audio database
34: Pre-recorded note audio data
35: Synthesized singing voice data
40: Singing voice synthesizer
41: Vocoder
42: Fundamental frequency
43: Spectral envelope
44: Aperiodic envelope
45: Pitch adjustment
46: Duration
47: Singing expression
48: Voice conversion model
400: StarGAN-VC model
S202~S208, S1~S5: Steps

FIG. 1 is a block diagram of an apparatus for designing a personalized virtual singer through singing voice conversion according to an embodiment of the present invention.

FIG. 2 is a flowchart of a method for designing a personalized virtual singer through singing voice conversion according to an embodiment of the present invention.

FIG. 3 is a system architecture diagram of a singing voice synthesizer according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of a StarGAN-VC model according to an embodiment of the present invention.

Embodiments of the present invention combine a concatenative singing voice synthesizer with a multi-speaker voice conversion model. The concatenative synthesizer parses a symbolic music representation (e.g., a MusicXML file) to synthesize a virtual singing voice, and the voice conversion model then converts this virtual singing voice into different timbres. The concatenative synthesizer is realized by pre-recording the pronunciations of all Chinese syllables. For voice conversion, the embodiments use a generative adversarial network (GAN)-based voice conversion (VC) model to convert between speakers' voices, so that the model can generate multiple voices with distinguishable timbres.

FIG. 1 is a block diagram of an apparatus for designing a personalized virtual singer through singing voice conversion according to an embodiment of the present invention. Referring to FIG. 1, the apparatus 10 for designing a personalized virtual singer through singing voice conversion (hereinafter, the device 10) is, for example, a computing device with processing capability, such as a file server, database server, application server, workstation, or personal computer, and includes a connection device 12, a storage device 14, and a processor 16, whose functions are described as follows. The connection device 12 is, for example, any wired or wireless interface device that can connect to and access an audio database located remotely or locally (i.e., stored in the storage device 14) to query and receive audio data. For a wired connection, the connection device 12 may be a universal serial bus (USB), RS232, universal asynchronous receiver/transmitter (UART), inter-integrated circuit (I2C), serial peripheral interface (SPI), DisplayPort, or Thunderbolt interface, but is not limited thereto. For a wireless connection, the connection device 12 may be a device supporting communication protocols such as wireless fidelity (Wi-Fi), RFID, Bluetooth, infrared, near-field communication (NFC), or device-to-device (D2D), but is not limited thereto. In some embodiments, the connection device 12 may also be a network card supporting Ethernet or wireless network standards such as 802.11g, 802.11n, or 802.11ac, without limitation.

The storage device 14 is, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk, or similar component, or a combination thereof, and stores the computer program executed by the processor 16. In an embodiment, the storage device 14 may also store an audio database built by the device 10 or audio data downloaded from a remote audio database, without limitation.

The processor 16 is, for example, a central processing unit (CPU), or another programmable general-purpose or special-purpose microprocessor, microcontroller, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), programmable logic device (PLD), or similar device, or a combination of these devices, but the embodiment is not limited thereto. In this embodiment, the processor 16 loads the computer program from the storage device 14 to execute the multi-speaker singing voice synthesis method of the embodiments of the present invention.

FIG. 2 is a flowchart of a method for designing a personalized virtual singer through singing voice conversion according to an embodiment of the present invention. Referring to FIG. 1 and FIG. 2 together, the method of this embodiment is applicable to the device 10 of FIG. 1. The detailed steps of the multi-speaker singing voice synthesis method of this embodiment are described below with reference to the devices and components of the device 10.

In step S202, the processor 16 of the device 10 parses a music score file represented in symbolic notation to extract multiple lyrics and multiple notes. In an embodiment, the processor 16 may also parse singing expression information (legato, vibrato, etc.) from the score file, without limitation. The symbolic notation is, for example, MusicXML, but is not limited thereto.
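As an illustration of this parsing step, the following is a minimal sketch, not the patented implementation, of extracting lyrics and notes from a MusicXML file using Python's standard library; the file name score.xml and the returned tuple layout are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def parse_musicxml(path):
    """Extract (lyric, step, octave, duration) tuples from a MusicXML file.

    Minimal sketch: a full parser would also handle ties, rests, and
    expression marks such as legato and vibrato.
    """
    notes = []
    for note in ET.parse(path).iter("note"):
        pitch = note.find("pitch")
        if pitch is None:  # a rest has no pitch element
            continue
        step = pitch.findtext("step")            # e.g. "C"
        octave = int(pitch.findtext("octave"))   # e.g. 4
        duration = int(note.findtext("duration", default="0"))
        lyric = note.findtext("lyric/text", default="")
        notes.append((lyric, step, octave, duration))
    return notes

# Usage (assumed file name):
# for lyric, step, octave, dur in parse_musicxml("score.xml"):
#     print(lyric, f"{step}{octave}", dur)
```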

In step S204, the processor 16 loads, through the connection device 12, the audio data of each extracted note from the audio database. The audio database is, for example, located on a remote server or stored in the local storage device 14, but is not limited thereto.

In step S206, the processor 16 uses a vocoder to perform acoustic modeling on the audio data, adjusting each piece of audio data and concatenating the adjusted audio data to generate singing voice data. The vocoder is, for example, the WORLD vocoder, but is not limited thereto.
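For the vocoder step, a sketch using pyworld, the open-source Python binding of the WORLD vocoder, shows the decomposition into the three features used below (fundamental frequency, spectral envelope, aperiodic envelope); the file handling via soundfile is an assumption made for the example.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def world_decompose(path):
    """Decompose a waveform into the WORLD vocoder's three features."""
    x, fs = sf.read(path)                         # mono float waveform
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs)                     # fundamental frequency contour
    sp = pw.cheaptrick(x, f0, t, fs)              # harmonic spectral envelope
    ap = pw.d4c(x, f0, t, fs)                     # aperiodic envelope
    return f0, sp, ap, fs

# Round-trip check: resynthesis should sound close to the original.
# f0, sp, ap, fs = world_decompose("syllable.wav")
# y = pw.synthesize(f0, sp, ap, fs)
```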

In step S208, the processor 16 uses a generator of a voice conversion model to convert multiple acoustic features in the singing voice data into output features conditioned on target audio attributes, and trains the voice conversion model according to multiple losses of the voice conversion model to obtain synthesized singing voice data with optimized output features. The voice conversion model is, for example, a model based on the star generative adversarial network (StarGAN)-voice conversion (VC) architecture, and the losses include, for example, one of, or a combination of, an adversarial loss, a classification loss, a cycle-consistency loss, and an identity-mapping loss, but are not limited thereto.

In detail, FIG. 3 is a system architecture diagram of a singing voice synthesizer according to an embodiment of the present invention. Referring to FIG. 3, this embodiment discloses a system 30 including a singing voice synthesizer 40. The synthesizer 40 is, for example, a concatenative singing voice synthesizer, which concatenates samples from a pre-recorded audio database 33. Therefore, for every syllable, a large amount of pronunciation audio must be pre-recorded to build the audio database 33.

In an embodiment, the singing voice synthesizer 40 is implemented for Chinese, so a large amount of audio data would be needed to cover all Chinese characters. During singing, however, the pitch of each note changes anyway, so the lexical tone of an individual character matters less; recording only 413 base pronunciations is enough to cover all Chinese characters.

The input to the singing voice synthesizer 40 is a music score 31 in a musical notation representation, for example MusicXML, an XML-based file format for representing Western musical notation that is supported by more than 250 notation programs, including graphical user interface programs such as MuseScore. Because the MusicXML used by the synthesizer of this embodiment is XML-based, additional tags can be added to represent singing expression techniques such as legato and vibrato.

In step S1 of the synthesis process, the system 30 first parses the MusicXML input into lyrics and notes 32 (including pitch, note value, and/or note duration). In an embodiment, the system 30 also parses the MusicXML input for singing expression information (legato, vibrato, etc.), without limitation. For each syllable in the lyrics, in step S2 the system 30 loads the pre-recorded audio data 34 of that syllable from the audio database 33. Then, in step S3, the singing voice synthesizer 40 uses a vocoder 41 (e.g., the WORLD vocoder) to perform acoustic modeling on the syllable audio data 34, which includes decomposing the audio signal, according to the fundamental frequency 42, into a harmonic spectral envelope 43 and an aperiodic envelope 44. In step S4, the fundamental frequency feature is applied to the pitch frequency of each note for pitch adjustment 45, and the audio data 34 is trimmed to match the note's duration 46. In addition, singing expression effects 47 such as legato, vibrato, and fade-in/fade-out can be inserted between characters to make the concatenated sound smoother. Finally, all of the above data and the adjusted audio of all syllables are fed into the voice conversion model 48 for conversion and concatenation, and in step S5 the singing voice synthesizer 40 outputs synthesized singing voice data 35 with the selected personal timbre.
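The pitch adjustment 45 and duration matching 46 of steps S3 and S4 can be sketched on top of the WORLD features above. Flattening the note to a single target frequency ratio is a simplifying assumption made here for brevity; an actual system would preserve the natural f0 dynamics and then layer legato, vibrato, and fades on top.

```python
import numpy as np

def shift_pitch(f0, target_hz):
    """Scale the voiced part of an f0 contour toward a note's frequency,
    preserving its relative shape (unvoiced frames stay at 0)."""
    voiced = f0 > 0
    if not voiced.any():
        return f0
    out = f0.copy()
    out[voiced] *= target_hz / f0[voiced].mean()
    return out

def fit_duration(f0, sp, ap, n_frames):
    """Trim or edge-pad the WORLD features so a syllable spans exactly
    the note duration expressed in analysis frames."""
    def fit(a):
        if len(a) >= n_frames:
            return a[:n_frames]
        pad = [(0, n_frames - len(a))] + [(0, 0)] * (a.ndim - 1)
        return np.pad(a, pad, mode="edge")
    return fit(f0), fit(sp), fit(ap)
```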

For the voice conversion part, embodiments of the present invention use a model based on the StarGAN-VC architecture. StarGAN-VC is a non-parallel many-to-many voice conversion model, built on a variant of the original generative adversarial network (GAN) called StarGAN, which can learn many-to-many mappings simultaneously with a single encoder-type generator network whose output attributes are controlled by an auxiliary input. StarGAN-VC also trains the generator with an adversarial loss, so that the generator's output becomes indistinguishable from real speech, and ensures that the mapping between each pair of attribute domains preserves linguistic information. An advantage of StarGAN-VC is that, at test time, it requires no information about the attributes of the input audio.

In detail, FIG. 4 is a schematic diagram of a StarGAN-VC model according to an embodiment of the present invention. Referring to FIG. 4, the goal of the StarGAN-VC model 400 of this embodiment is to obtain a generator G that learns mappings among multiple domains/speakers, converting input acoustic features x into output features x' conditioned on a target audio attribute c': G(x, c') → x'.

where x ∈ ℝ^{Q×T} is the acoustic feature sequence, Q is the feature dimension, T is the length of the feature sequence, and c and c' are the domain codes of the source and target speakers, respectively, with c, c' ∈ {1, ..., N}, where N is the number of domains/speakers.

The voice conversion model solves the optimization problem according to the adversarial loss, classification loss, cycle-consistency loss, and identity-mapping loss, described as follows. The adversarial loss describes how distinguishable the converted features are from real features, and is defined as:

L_adversarial = E_{x,c}[log D(x, c)] + E_{x,c'}[log(1 − D(G(x, c'), c'))]

where D is a target-conditional discriminator. By maximizing this adversarial loss, the discriminator D learns the optimal decision boundary, conditioned on the target audio attribute c', between converted features and real features. Conversely, by minimizing the adversarial loss, the generator G makes the converted features conditioned on the target audio attribute c' indistinguishable from real features.

The classification loss enables the voice conversion model to synthesize acoustic features that belong to the target domain. The classifier C is trained on real acoustic features:

L_classification^real = E_{x,c}[−log C(c | x)]

By minimizing this classification loss, the classifier C classifies real acoustic features into the corresponding target domain c'.

In addition, the generator G is optimized against the classifier C:

L_classification^fake = E_{x,c'}[−log C(c' | G(x, c'))]

Here the target domain c' refers to a specific singer. G(x, c') is the audio whose acoustic features x are generated by the generator G conditioned on the voice of c'. The probability that the classifier C identifies this audio as c' is C(c' | G(x, c')). By minimizing this classification loss, the generator G produces acoustic features that are classified into the target domain c'.

Although the adversarial loss and classification loss above encourage the converted acoustic features to be, respectively, realistic and classifiable, they do not guarantee that the converted acoustic features preserve the content of the input. To remedy this, the following cycle-consistency loss can be used:

L_cycleconsistency = E_{x,c,c'}[ || G(G(x, c'), c) − x ||_1 ]

This cyclic constraint encourages the generator G to find optimal source/target pairings that do not corrupt the content.

To further constrain the preservation of the input, the following identity-mapping loss can be used:

L_identitymapping = E_{x,c}[ || G(x, c) − x ||_1 ]

In summary, the minimization objectives of StarGAN-VC for the discriminator D, the classifier C, and the generator G are:

L_D = −L_adversarial
L_C = L_classification^real
L_G = L_adversarial + λ_classification · L_classification^fake + λ_cycleconsistency · L_cycleconsistency + λ_identitymapping · L_identitymapping

where λ_classification ≥ 0, λ_cycleconsistency ≥ 0, and λ_identitymapping ≥ 0 are regularization parameters that weight the importance of the classification loss, cycle-consistency loss, and identity-mapping loss, respectively, relative to the adversarial loss.

During synthesis, the singing voice synthesizer selects the audio data of each syllable from the pre-built audio database according to the lyrics in the score, and then uses the vocoder to decompose the audio waveform into three main features: the fundamental frequency, the spectral envelope, and the aperiodic envelope. The fundamental frequency feature is then used to model the melody, and together with the above envelopes forms the acoustic features of the voice conversion model.

Using StarGAN-VC as the voice conversion model, three parts are trained: the generator G, the discriminator D, and the classifier C described above. For the generator G, a two-dimensional convolutional neural network (CNN) can be used, treating the input acoustic feature sequence as a one-channel two-dimensional image. For the real/fake discriminator D, the PatchGAN concept can be used: originally developed for image-to-image problems, it tries to decide whether each N×N patch of an image is real or fake. The discriminator D is run convolutionally over the whole image and all responses are averaged to form its final output. In the voice conversion model, the discriminator D thus classifies patches or segments of the audio rather than the whole audio; although this makes the classifier C's task harder, it effectively improves the discriminator's performance in voice conversion. Finally, for the domain classifier C, a gated convolutional neural network (gated CNN) can be used; its gating mechanism reaches results comparable to a long short-term memory (LSTM) network at higher speed. The networks can be trained for 200k steps with the Adam optimizer, with a batch size of 8, learning rates of 0.0001 for the generator G and the discriminator D, and a momentum term of 0.5. In addition, λ_classification, λ_cycleconsistency, and λ_identitymapping can all be set to 10, for example, but are not limited thereto; those skilled in the art may change them to other values as needed.
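The quoted training configuration (Adam, 200k steps, batch size 8, learning rate 0.0001, momentum term 0.5, all three λ weights set to 10) could be wired up as in the sketch below; the data loader, the three modules, and the stargan_vc_losses helper from the previous example are assumed, and beta2 = 0.999 is an assumption since only the momentum term (beta1) is stated.

```python
import torch

# Assumed defined elsewhere: G, D, C (torch.nn.Module) and `loader`,
# a DataLoader yielding (features, source_label, target_label) batches of 8.
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_C = torch.optim.Adam(C.parameters(), lr=1e-4, betas=(0.5, 0.999))

step, total_steps = 0, 200_000
while step < total_steps:
    for x, c, c_tgt in loader:
        # Discriminator and classifier updates.
        loss_D, loss_C, _ = stargan_vc_losses(G, D, C, x, c, c_tgt)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
        opt_C.zero_grad(); loss_C.backward(); opt_C.step()

        # Generator update on a fresh forward pass.
        _, _, loss_G = stargan_vc_losses(G, D, C, x, c, c_tgt)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()

        step += 1
        if step >= total_steps:
            break
```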

In summary, the method and apparatus for designing a personalized virtual singer through singing voice conversion according to the embodiments of the present invention use a unit-selection concatenation method to pre-record the various pronunciations of written characters and build them into a virtual singer that can accurately sing any song that has a score and lyrics. In addition, the StarGAN algorithm trains the timbre conversion of multiple speakers together with the virtual singer's timbre, so that these timbres become interchangeable. Thus, for any selected speaker, the singing voice synthesis apparatus of the embodiments can combine that speaker's timbre with the virtual singer's singing ability and sing any song in that speaker's timbre.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone with ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.

S202~S208: Steps

Claims (14)

1. A method for designing a personalized virtual singer through singing voice conversion, suitable for an electronic device having a processor, the method comprising the following steps: parsing a music score file represented in symbolic notation to extract a plurality of lyrics and a plurality of notes; loading audio data of each of the extracted notes from an audio database; using a vocoder to decompose a waveform of the audio data into a fundamental frequency, a spectral envelope, and an aperiodic envelope, modeling the spectral envelope and the aperiodic envelope according to the fundamental frequency as a plurality of acoustic features of a voice conversion model, applying the fundamental frequency to adjust a pitch of the audio data corresponding to each of the notes, and concatenating the adjusted audio data to generate singing voice data; and using a generator of the voice conversion model to convert the plurality of acoustic features in the singing voice data into output features conditioned on target audio attributes of a virtual singer, and training the voice conversion model according to a plurality of losses of the voice conversion model, minimizing or maximizing each of the computed losses according to its type, to obtain synthesized singing voice data with optimized output features, wherein the losses comprise one of, or a combination of, an adversarial loss, a classification loss, a cycle-consistency loss, and an identity-mapping loss.

2. The method of claim 1, further comprising: recording a plurality of audio data for each of a plurality of words of a language and storing them in the audio database.

3. The method of claim 1, wherein the step of training the voice conversion model according to the plurality of losses of the voice conversion model comprises: computing the adversarial loss, which describes the degree to which the converted output features differ from the acoustic features; and maximizing the adversarial loss so that a discriminator of the voice conversion model learns an optimal decision boundary, conditioned on the target audio attributes, between the converted output features and the acoustic features.

4. The method of claim 1, wherein the step of training the voice conversion model according to the plurality of losses of the voice conversion model comprises: computing the classification loss, which describes how well the voice conversion model synthesizes the acoustic features belonging to a target domain; and training a classifier of the voice conversion model by minimizing the classification loss so that the acoustic features are classified into the corresponding target domain.

5. The method of claim 1, wherein the step of training the voice conversion model according to the plurality of losses of the voice conversion model comprises: computing the cycle-consistency loss, which describes the linguistic consistency between the converted output features and the acoustic features; and minimizing the cycle-consistency loss to ensure that the output features are linguistically consistent with the acoustic features.

6. The method of claim 1, wherein the step of training the voice conversion model according to the plurality of losses of the voice conversion model comprises: computing the identity-mapping loss, which describes how well the converted output features preserve the components of the input features; and minimizing the identity-mapping loss to preserve the components of the input features in the output features.

7. The method of claim 1, wherein the step of training the voice conversion model according to the plurality of losses of the voice conversion model comprises: multiplying the computed classification loss, cycle-consistency loss, and identity-mapping loss by corresponding weights to weight their importance relative to the adversarial loss.

8. A device for designing a personalized virtual singer through singing voice conversion, comprising: a connection device; a storage device storing a computer program; and a processor, coupled to the connection device and the storage device, configured to load and execute the computer program in the storage device to: parse a music score file represented in symbolic notation to extract a plurality of lyrics and a plurality of notes; connect to an audio database through the connection device to load audio data of each of the extracted notes; use a vocoder to decompose a waveform of the audio data into a fundamental frequency, a spectral envelope, and an aperiodic envelope, model the spectral envelope and the aperiodic envelope according to the fundamental frequency as a plurality of acoustic features of a voice conversion model, apply the fundamental frequency to adjust a pitch of the audio data corresponding to each of the notes, and concatenate the adjusted audio data to generate singing voice data; and use a generator of the voice conversion model to convert the plurality of acoustic features in the singing voice data into output features conditioned on target audio attributes of a virtual singer, and train the voice conversion model according to a plurality of losses of the voice conversion model, minimizing or maximizing each of the computed losses according to its type, to obtain synthesized singing voice data with optimized output features, wherein the losses comprise one of, or a combination of, an adversarial loss, a classification loss, a cycle-consistency loss, and an identity-mapping loss.

9. The device of claim 8, wherein the processor further records a plurality of audio data for each of a plurality of words of a language and stores them in the audio database.

10. The device of claim 8, wherein the processor computes the adversarial loss, which describes the degree to which the converted output features differ from the acoustic features, and maximizes the adversarial loss so that a discriminator of the voice conversion model learns an optimal decision boundary, conditioned on the target audio attributes, between the converted output features and the acoustic features.

11. The device of claim 8, wherein the processor computes the classification loss, which describes how well the voice conversion model synthesizes the acoustic features belonging to a target domain, and trains a classifier of the voice conversion model by minimizing the classification loss so that the acoustic features are classified into the corresponding target domain.

12. The device of claim 8, wherein the processor computes the cycle-consistency loss, which describes the linguistic consistency between the converted output features and the acoustic features, and minimizes the cycle-consistency loss to ensure that the output features are linguistically consistent with the acoustic features.

13. The device of claim 8, wherein the processor computes the identity-mapping loss, which describes how well the converted output features preserve the components of the input features, and minimizes the identity-mapping loss to preserve the components of the input features in the output features.

14. The device of claim 8, wherein the processor multiplies the computed classification loss, cycle-consistency loss, and identity-mapping loss by corresponding weights to weight their importance relative to the adversarial loss.
TW110130295A 2021-08-17 Method and apparatus in designing a personalized virtual singer using singing voice conversion TWI836255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW110130295A TWI836255B (en) 2021-08-17 Method and apparatus in designing a personalized virtual singer using singing voice conversion


Publications (2)

Publication Number Publication Date
TW202309875A TW202309875A (en) 2023-03-01
TWI836255B (en) 2024-03-21

