TWI836255B - Method and apparatus in designing a personalized virtual singer using singing voice conversion - Google Patents
Method and apparatus in designing a personalized virtual singer using singing voice conversion
- Publication number
- TWI836255B (application number TW110130295A)
- Authority
- TW
- Taiwan
- Prior art keywords
- loss
- features
- conversion model
- speech conversion
- audio data
- Prior art date
Description
The present invention relates to a singing voice synthesis method and apparatus, and more particularly to a method and apparatus for designing a personalized virtual singer through singing voice conversion.
Singing voice synthesis is a technology for generating human singing. Various approaches have long been tried, such as converting conversational speech into a song sung according to a score and lyrics. However, timbre conversion for singing has mostly been one-to-one, so it is limited either to a single singer or to singing one designated song in a designated personalized timbre.
In other words, to convert singer B's timbre into singer A's timbre, singer A must already have spoken or sung the song in question; a song originally sung by B cannot otherwise be rendered in A's timbre.
The present invention provides a method and apparatus for designing a personalized virtual singer through singing voice conversion, enabling the virtual singer to sing any song according to its score and lyrics in a specific person's timbre; that is, the specific person's timbre is transferred onto the song sung by the virtual singer.
The present invention provides a method for designing a personalized virtual singer through singing voice conversion, suitable for an electronic device having a processor. The method includes the following steps: parsing a score file expressed in a symbolic notation to extract a plurality of lyrics and a plurality of notes; loading audio data of each extracted note from an audio database; performing acoustic modeling on the audio data with a vocoder to adjust each piece of audio data, and concatenating the adjusted audio data to generate singing voice data; and using the generator of a voice conversion model to convert a plurality of acoustic features in the singing voice data into output features conditioned on a target audio attribute, and training the voice conversion model according to multiple losses of the model to obtain synthesized singing voice data with optimized output features.
The present invention provides an apparatus for designing a personalized virtual singer through singing voice conversion, which includes a connection device, a storage device, and a processor. The storage device stores a computer program. The processor, coupled to the connection device and the storage device, is configured to load and execute the computer program to: parse a score file expressed in a symbolic notation to extract a plurality of lyrics and a plurality of notes; load audio data of each extracted note from an audio database; perform acoustic modeling on the audio data with a vocoder to adjust each piece of audio data and concatenate the adjusted audio data to generate singing voice data; and use the generator of a voice conversion model to convert a plurality of acoustic features in the singing voice data into output features conditioned on a target audio attribute, and train the voice conversion model according to multiple losses of the model to obtain synthesized singing voice data with optimized output features.
To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.
10: apparatus for designing a personalized virtual singer through singing voice conversion
12: connection device
14: storage device
16: processor
30: system
31: score
32: lyrics + notes
33: audio database
34: pre-recorded note audio data
35: synthesized singing voice data
40: singing voice synthesizer
41: vocoder
42: fundamental frequency
43: spectral envelope
44: aperiodic envelope
45: pitch adjustment
46: duration
47: singing expression
48: voice conversion model
400: StarGAN-VC model
S202~S208, S1~S5: steps
FIG. 1 is a block diagram of an apparatus for designing a personalized virtual singer through singing voice conversion according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method for designing a personalized virtual singer through singing voice conversion according to an embodiment of the present invention.
FIG. 3 is a system architecture diagram of a singing voice synthesizer according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of the StarGAN-VC model according to an embodiment of the present invention.
Embodiments of the present invention combine a concatenative singing voice synthesizer with a multi-speaker voice conversion model. The concatenative synthesizer parses a symbolic music representation file (e.g., MusicXML) to synthesize a virtual singing voice, after which the voice conversion model converts this virtual singing voice into different timbres. The concatenative synthesizer is realized by pre-recording the pronunciations of all Chinese syllables. For voice conversion, the embodiments use a generative adversarial network (GAN)-based voice conversion (VC) model to convert voices between speakers, so that the model can generate multiple voices with distinguishable timbres.
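The MusicXML parsing mentioned above can be illustrated with a minimal Python sketch using only the standard library. This is not the embodiment's actual parser; the element names (`note`, `pitch`, `lyric`) follow the public MusicXML schema, and the tiny score string is invented for illustration.

```python
import xml.etree.ElementTree as ET

# A two-note toy score: each note carries a pitch, a duration, and a lyric.
SCORE = """<score-partwise>
  <part id="P1"><measure number="1">
    <note><pitch><step>C</step><octave>4</octave></pitch>
      <duration>2</duration><lyric><text>你</text></lyric></note>
    <note><pitch><step>E</step><octave>4</octave></pitch>
      <duration>2</duration><lyric><text>好</text></lyric></note>
  </measure></part>
</score-partwise>"""

def parse_score(xml_text):
    """Return (lyric, (step, octave), duration) triples from a MusicXML string."""
    root = ET.fromstring(xml_text)
    events = []
    for note in root.iter("note"):
        pitch = note.find("pitch")
        lyric = note.find("lyric/text")
        events.append((
            lyric.text if lyric is not None else None,
            (pitch.findtext("step"), int(pitch.findtext("octave"))) if pitch is not None else None,
            int(note.findtext("duration")),
        ))
    return events

print(parse_score(SCORE))
```

A real score would also carry note values, measures, and expression tags (legato, vibrato), which a full parser would extract in the same way.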
FIG. 1 is a block diagram of an apparatus for designing a personalized virtual singer through singing voice conversion according to an embodiment of the present invention. Referring to FIG. 1, the apparatus 10 for designing a personalized virtual singer through singing voice conversion (hereinafter, apparatus 10) is, for example, a computing device with computing capability such as a file server, a database server, an application server, a workstation, or a personal computer, and includes a connection device 12, a storage device 14, and a processor 16, whose functions are described as follows:
The connection device 12 is, for example, any wired or wireless interface device that can connect to and access an audio database located remotely or locally (i.e., stored in the storage device 14) to query and receive audio data. For wired connections, the connection device 12 may be an interface such as universal serial bus (USB), RS-232, universal asynchronous receiver/transmitter (UART), inter-integrated circuit (I2C), serial peripheral interface (SPI), DisplayPort, or Thunderbolt, but is not limited thereto. For wireless connections, the connection device 12 may be a device supporting communication protocols such as wireless fidelity (Wi-Fi), RFID, Bluetooth, infrared, near-field communication (NFC), or device-to-device (D2D), but is not limited thereto. In some embodiments, the connection device 12 may also be a network card supporting Ethernet or wireless network standards such as 802.11g, 802.11n, or 802.11ac, without limitation.
The storage device 14 is, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk, a similar component, or a combination of the above, and stores the computer program executable by the processor 16. In an embodiment, the storage device 14 may also store an audio database built by the apparatus 10 or audio data downloaded from a remote audio database; no limitation is imposed here.
The processor 16 is, for example, a central processing unit (CPU), or another programmable general-purpose or special-purpose microprocessor, microcontroller, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), programmable logic device (PLD), a similar device, or a combination of these devices, but the embodiment is not limited thereto. In this embodiment, the processor 16 loads the computer program from the storage device 14 to execute the multi-speaker singing voice synthesis method of the embodiments of the present invention.
FIG. 2 is a flowchart of a method for designing a personalized virtual singer through singing voice conversion according to an embodiment of the present invention. Referring to FIG. 1 and FIG. 2 together, the method of this embodiment is applicable to the apparatus 10 of FIG. 1. The detailed steps of the multi-speaker singing voice synthesis method of this embodiment are described below in conjunction with the devices and components of the apparatus 10.
In step S202, the processor 16 of the apparatus 10 parses a score file expressed in a symbolic notation to extract a plurality of lyrics and a plurality of notes. In an embodiment, the processor 16 may further parse singing expression information (legato, vibrato, etc.) out of the score file; no limitation is imposed here. The symbolic notation is, for example, MusicXML, but is not limited thereto.
In step S204, the processor 16 loads, through the connection device 12, the audio data of each extracted note from an audio database. The audio database is, for example, located on a remote server or stored in the local storage device 14, but is not limited thereto.
In step S206, the processor 16 performs acoustic modeling on the audio data using a vocoder to adjust each piece of audio data, and concatenates the adjusted audio data to generate singing voice data. The vocoder is, for example, the WORLD vocoder, but is not limited thereto.
In step S208, the processor 16 uses the generator of a voice conversion model to convert a plurality of acoustic features in the singing voice data into output features conditioned on a target audio attribute, and trains the voice conversion model according to multiple losses of the model to obtain synthesized singing voice data with optimized output features. The voice conversion model is, for example, a model based on the star generative adversarial network (StarGAN) voice conversion (VC) architecture, and the losses include, for example, one of or a combination of an adversarial loss, a classification loss, a cycle-consistency loss, and an identity-mapping loss, but are not limited thereto.
In detail, FIG. 3 is a system architecture diagram of a singing voice synthesizer according to an embodiment of the present invention. Referring to FIG. 3, an embodiment of the present invention discloses a system 30 including a singing voice synthesizer 40. The singing voice synthesizer 40 is, for example, a concatenative singing voice synthesizer, which concatenates samples from a pre-recorded audio database 33. Therefore, a large amount of pronunciation audio must be recorded in advance for each syllable to build the audio database 33.
In an embodiment, the singing voice synthesizer 40 is realized for Chinese, so a large amount of audio data would be needed to cover all Chinese characters. However, since the pitch of each note changes during singing, the lexical tone of a character matters less; recording only 413 pronunciations suffices to cover all Chinese characters.
The input to the singing voice synthesizer 40 is a score 31 in a symbolic music notation, for example MusicXML, an XML-based file format for representing Western musical notation that is supported by more than 250 notation programs, including graphical user interface programs such as MuseScore. Because the MusicXML used by the singing voice synthesizer of this embodiment is XML-based, additional tags can be added to represent singing expression techniques such as legato and vibrato.
In step S1 of the synthesis process, the system 30 first parses the MusicXML input into lyrics and notes 32 (including pitch, note value, and/or note duration). In an embodiment, the system 30 also parses the MusicXML input into singing expression information (legato, vibrato, etc.); no limitation is imposed here. For each character in the lyrics, in step S2 the system 30 loads the pre-recorded audio data 34 of that character from the audio database 33. Then, in step S3, the singing voice synthesizer 40 performs acoustic modeling on the character audio data 34 with a vocoder 41 (e.g., the WORLD vocoder), which includes decomposing the audio signal 34 into a harmonic spectral envelope 43 and an aperiodic envelope 44 according to the fundamental frequency 42. In step S4, the fundamental frequency feature is applied to the pitch frequency of each note for pitch adjustment 45, and the audio data 34 is trimmed to match the note's duration 46. In addition, singing expression 47 effects such as legato, vibrato, and fade-in/fade-out can be inserted between characters to make the concatenated voice smoother. Finally, all of the above data together with the adjusted audio data of all characters are fed into the voice conversion model 48 for conversion and concatenation, and in step S5 the singing voice synthesizer 40 outputs synthesized singing voice data 35 with the selected personal timbre.
For the voice conversion part, the embodiments of the present invention use a model based on the StarGAN-VC architecture. StarGAN-VC is a non-parallel many-to-many voice conversion model built on StarGAN, a variant of the original generative adversarial network (GAN). It can learn many-to-many mappings simultaneously with a single-encoder generator network, in which the attribute of the generator output is controlled by an auxiliary input. StarGAN-VC also trains the generator with an adversarial loss so that the generator's outputs become indistinguishable from real speech, and it ensures that the mapping between each pair of attribute domains preserves linguistic information. An advantage of StarGAN-VC is that at test time it requires no information about the input audio attribute.
In detail, FIG. 4 is a schematic diagram of the StarGAN-VC model according to an embodiment of the present invention. Referring to FIG. 4, the goal of the StarGAN-VC model 400 of this embodiment is to obtain a generator $G$ that learns the mappings among multiple domains/speakers, including converting an input acoustic feature $x$ into an output feature $x'$ conditioned on a target audio attribute $c'$:

$$G(x, c') \rightarrow x'$$

where $x \in \mathbb{R}^{Q \times T}$ is the acoustic feature sequence, $Q$ is the feature dimension, $T$ is the feature sequence length, and $c$ and $c'$ are the domain codes of the source and target speakers, respectively, with $c, c' \in \{1, \ldots, N\}$ and $N$ the number of domains/speakers.
The voice conversion model solves the optimization problem according to the adversarial loss, the classification loss, the cycle-consistency loss, and the identity-mapping loss, described as follows. The adversarial loss measures how distinguishable the converted features are from real features, and is defined as

$$\mathcal{L}_{adv} = \mathbb{E}_{c' \sim p(c),\, y \sim p(y \mid c')}\left[\log D(y, c')\right] + \mathbb{E}_{x \sim p(x),\, c' \sim p(c)}\left[\log\left(1 - D(G(x, c'), c')\right)\right]$$

where $D$ is the target-conditioned discriminator. By maximizing this adversarial loss, the discriminator $D$ learns the best decision boundary between converted features and real features conditioned on the target audio attribute $c'$. Conversely, by minimizing the adversarial loss, the generator $G$ makes the converted features conditioned on $c'$ indistinguishable from real features.
The classification loss enables the voice conversion model to synthesize acoustic features belonging to the target domain. The classifier $C$ is trained on real acoustic features:

$$\mathcal{L}_{cls}^{C} = -\,\mathbb{E}_{c' \sim p(c),\, y \sim p(y \mid c')}\left[\log C(c' \mid y)\right]$$

By minimizing this classification loss, the classifier $C$ learns to classify real acoustic features into the corresponding target domain $c'$.
In addition, the generator $G$ is optimized against the classifier $C$:

$$\mathcal{L}_{cls}^{G} = -\,\mathbb{E}_{x \sim p(x),\, c' \sim p(c)}\left[\log C(c' \mid G(x, c'))\right]$$

Here the target domain $c'$ refers to a specific singer. $G(x, c')$ is the audio with acoustic features $x$ generated by the generator $G$ conditioned on the voice of $c'$, and the probability that this audio is classified as $c'$ by the classifier $C$ is $C(c' \mid G(x, c'))$. By minimizing this classification loss, the generator $G$ learns to produce acoustic features that are classified into the target domain $c'$.
Although the adversarial loss and the classification loss above encourage the converted acoustic features to be realistic and classifiable, respectively, they do not guarantee that the converted features preserve the content of the input. To remedy this defect, the following cycle-consistency loss can be employed:

$$\mathcal{L}_{cyc} = \mathbb{E}_{c \sim p(c),\, x \sim p(x \mid c),\, c' \sim p(c)}\left[\left\lVert G(G(x, c'), c) - x \right\rVert_{1}\right]$$

This cyclic constraint encourages the generator $G$ to find optimal source-and-target pairings that do not compromise the content.
To further enforce preservation of the input, the following identity-mapping loss can be employed:

$$\mathcal{L}_{id} = \mathbb{E}_{c' \sim p(c),\, x \sim p(x \mid c')}\left[\left\lVert G(x, c') - x \right\rVert_{1}\right]$$
In summary, the minimization objectives of StarGAN-VC, for the discriminator $D$, the classifier $C$, and the generator $G$, are listed as follows:

$$\mathcal{L}_{D} = -\mathcal{L}_{adv}$$

$$\mathcal{L}_{C} = \lambda_{cls}\,\mathcal{L}_{cls}^{C}$$

$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{G} + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{id}\,\mathcal{L}_{id}$$

where $\lambda_{cls} \geq 0$, $\lambda_{cyc} \geq 0$, and $\lambda_{id} \geq 0$ are regularization parameters that weight the importance of the classification loss, the cycle-consistency loss, and the identity-mapping loss, respectively, relative to the adversarial loss.
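The loss algebra above can be exercised numerically. The following numpy sketch re-implements the four losses on toy arrays; the discriminator and classifier outputs are invented probabilities rather than a trained model, and the example combines them with all three λ weights set to 10 as in the embodiment described later.

```python
import numpy as np

def adversarial_loss(d_real, d_fake):
    """L_adv: maximized by the discriminator D, minimized by the generator G."""
    return np.mean(np.log(d_real)) + np.mean(np.log1p(-d_fake))

def classification_loss(domain_probs, target_domain):
    """-E[log C(c'|.)]: cross-entropy toward the target domain c'."""
    return -np.mean(np.log(domain_probs[:, target_domain]))

def cycle_consistency_loss(x, x_cycled):
    """L1 distance after converting c -> c' -> c; preserves the content."""
    return np.mean(np.abs(x_cycled - x))

def identity_mapping_loss(x, x_identity):
    """G conditioned on the input's own domain should reproduce the input."""
    return np.mean(np.abs(x_identity - x))

# Invented scores: D rates real features ~0.85 and converted ones ~0.35,
# and C assigns probability 0.8/0.7 to the target domain 0.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.4, 0.3])
probs = np.array([[0.8, 0.2], [0.7, 0.3]])
x = np.ones((2, 36))                                # toy (batch, feature) tensor
l_g = (adversarial_loss(d_real, d_fake)
       + 10.0 * classification_loss(probs, 0)       # lambda_cls = 10
       + 10.0 * cycle_consistency_loss(x, x * 1.1)  # lambda_cyc = 10
       + 10.0 * identity_mapping_loss(x, x))        # lambda_id = 10
print(f"L_G = {l_g:.3f}")
```

Note how a perfect identity mapping contributes zero loss, while the 10% cycle-reconstruction error dominates the total because of its λ weight.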
During synthesis, the singing voice synthesizer selects the audio data of each character from the pre-built audio database according to the lyrics in the score, and then uses the vocoder to decompose the audio waveform into three main features: the fundamental frequency, the spectral envelope, and the aperiodic envelope. The fundamental frequency feature is then used to model the melody, and the above envelopes serve as the acoustic features of the voice conversion model.
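The per-note pitch adjustment, duration trimming, and fade-based smoothing can be sketched in numpy as follows. The median-based pitch-shift ratio and the linear fades are illustrative assumptions, not the embodiment's exact WORLD-based processing.

```python
import numpy as np

def adjust_note(f0, audio, target_hz, dur_samples, fade=64):
    """Shift the voiced part of an f0 contour toward the note's target pitch,
    trim/zero-pad the audio to the note duration, and apply linear
    fade-in/fade-out so adjacent units concatenate smoothly."""
    voiced = f0 > 0
    ratio = target_hz / np.median(f0[voiced])   # median-based shift (assumption)
    shifted_f0 = np.where(voiced, f0 * ratio, 0.0)
    out = np.zeros(dur_samples)
    n = min(dur_samples, len(audio))
    out[:n] = audio[:n]                         # trim (or zero-pad) to duration
    ramp = np.linspace(0.0, 1.0, fade)
    out[:fade] *= ramp                          # fade-in
    out[-fade:] *= ramp[::-1]                   # fade-out
    return shifted_f0, out

f0 = np.array([0.0, 200.0, 210.0, 190.0, 0.0])  # toy contour (0 = unvoiced)
audio = np.ones(1000)
shifted_f0, segment = adjust_note(f0, audio, 440.0, 800)
print(np.median(shifted_f0[shifted_f0 > 0]))    # voiced median moved near 440 Hz
```

In the actual pipeline the shifted contour would be handed back to the WORLD vocoder together with the spectral and aperiodic envelopes for resynthesis.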
With StarGAN-VC as the voice conversion model, three parts are trained: the generator $G$, the discriminator $D$, and the classifier $C$ described above. For the generator $G$, a two-dimensional convolutional neural network (CNN) may be used, in which the input acoustic feature sequence is treated as a single-channel two-dimensional image. For the real/fake discriminator $D$, the PatchGAN concept may be used; originally proposed for image-to-image problems, it tries to decide whether each $N \times N$ patch of an image is real or fake. The discriminator $D$ is run convolutionally over the whole image, and all responses are averaged to give the final output of the discriminator $D$. In the voice conversion model, the discriminator $D$ thus classifies patches or segments of the audio rather than the entire audio; although this makes the task harder, it effectively improves the performance of the discriminator in voice conversion. Finally, for the domain classifier $C$, a gated convolutional neural network (gated CNN) may be used; with its gating mechanism, it can reach results comparable to a long short-term memory (LSTM) network at a higher speed. The above networks can be trained for 200k steps with the Adam optimizer, with a batch size of 8; the learning rates of the generator $G$ and the discriminator $D$ can each be set to 0.0001, and the momentum term can be set to 0.5. In addition, $\lambda_{cls}$, $\lambda_{cyc}$, and $\lambda_{id}$ can all be set to 10, for example, but are not limited thereto; those skilled in the art may modify these values as actually needed.
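The training recipe above can be collected into a plain configuration sketch. The values are the ones stated in the description (200k Adam steps, batch size 8, learning rate 1e-4, momentum term 0.5, all three λ weights 10); the key names themselves are illustrative, not from the original.

```python
# Hypothetical key names; values follow the embodiment's reported recipe.
TRAIN_CONFIG = {
    "optimizer": "adam",
    "train_steps": 200_000,
    "batch_size": 8,
    "lr_generator": 1e-4,
    "lr_discriminator": 1e-4,
    "beta1": 0.5,                      # Adam momentum term
    "lambda_classification": 10.0,
    "lambda_cycle_consistency": 10.0,
    "lambda_identity_mapping": 10.0,
}
print(TRAIN_CONFIG["train_steps"], TRAIN_CONFIG["batch_size"])
```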
In summary, the method and apparatus for designing a personalized virtual singer through singing voice conversion according to embodiments of the present invention use a unit-selection concatenation method to pre-record multiple pronunciations of text and build a virtual singer that can accurately sing any song having a score and lyrics. In addition, the StarGAN algorithm is used to train the timbre conversion of multiple speakers together with the virtual singer's timbre, so that the timbres are mutually interchangeable. Thereby, for a selected speaker, the singing voice synthesis apparatus of the embodiments can combine that speaker's timbre with the virtual singer's singing ability to sing any song in that speaker's timbre.
Although the present invention has been disclosed above through embodiments, they are not intended to limit the present invention. Anyone with ordinary skill in the art may make slight changes and refinements without departing from the spirit and scope of the present invention; therefore, the protection scope of the present invention shall be defined by the appended claims.
S202~S208: steps
Claims (14)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW110130295A TWI836255B (en) | 2021-08-17 | Method and apparatus in designing a personalized virtual singer using singing voice conversion |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202309875A (en) | 2023-03-01
TWI836255B (en) | 2024-03-21