TW201120871A - Singing voice synthesis method - Google Patents

Singing voice synthesis method

Info

Publication number
TW201120871A
Authority
TW
Taiwan
Prior art keywords
sound
value
pitch
vocal
voice
Prior art date
Application number
TW98142510A
Other languages
Chinese (zh)
Other versions
TWI385644B (en)
Inventor
Shi-Jinn Horng
Chun-Cheng Loong
Original Assignee
Univ Nat Taiwan Science Tech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Taiwan Science Tech filed Critical Univ Nat Taiwan Science Tech
Priority to TW98142510A priority Critical patent/TWI385644B/en
Publication of TW201120871A publication Critical patent/TW201120871A/en
Application granted granted Critical
Publication of TWI385644B publication Critical patent/TWI385644B/en

Links

Abstract

The invention discloses a singing voice synthesis method for synthesizing a singing voice from a plurality of voice samples. The singing voice synthesis method includes the following steps: retrieving a melody from a music file; retrieving the voice samples from a voice sample database according to the melody; determining a pitch curve related to the melody; calculating harmonic parameters of the synthesized singing voice according to the pitch curve and the harmonic parameters of the voice samples; and calculating the synthesized singing voice according to its harmonic parameters.

Description

VI. Description of the Invention

[Technical Field]

The present invention relates to a singing voice synthesis method.

[Prior Art]

With the rapid development of computer hardware, increasingly complex computation has become practical, and speech synthesis is now used in many places, such as station-name announcement systems on buses, telephone voice systems and automatic teller machines.

However, most of today's common speech systems are designed to imitate spoken language, and the voice samples are usually combined by simple concatenation, so the synthesized speech sounds like a robot talking. In recent years the discontinuities between adjacent speech units have been smoothed so that the synthesized speech sounds more fluent, but for a singing voice, whose pitch varies, the discontinuity problem is far more pronounced.

Moreover, the smoothing described above is not sufficient to approximate a singing voice. In particular, when the required length of a note to be synthesized differs from the length of the voice sample, simply stretching or shortening the sample makes the synthesized voice sound even less natural.

[Summary of the Invention]

An object of the present invention is to provide a singing voice synthesis method for synthesizing a plurality of voice samples into a synthesized singing voice.

The singing voice synthesis method comprises the following steps: (a) retrieving a score from a score file; (b) retrieving the plurality of voice samples from a voice sample database according to the score; (c) determining a pitch curve related to the score according to the score; (d) calculating harmonic parameters of the synthesized singing voice according to the pitch curve and the harmonic parameters of the plurality of voice samples; and (e) calculating the synthesized singing voice according to the harmonic parameters of the synthesized singing voice.

In step (b), a plurality of lyrics are first added to the score, and the voice samples corresponding to those lyrics are then retrieved from the voice sample database. Step (c) further comprises adjusting the pitch curve according to the variation of the curve itself, so that the adjusted pitch curve better matches the pitch inflections an actual singer produces in response to the pitch changes of the score.

In step (d), a Fourier transform is first applied to the voiced part of each voice sample to obtain the harmonic parameters of that sample, and the harmonic parameters of the synthesized singing voice are then calculated from the pitch curve and the harmonic parameters of the samples. When the length of a voice sample differs from the length required by the pitch curve, the duration of the sustain section of the voiced part is adjusted. In addition, the frequency-domain envelope of the voice sample and the frequency-domain envelope of the corresponding part of the synthesized singing voice are kept substantially the same, so the synthesized voice retains the timbre of the voice sample.

In step (e), the unvoiced part of each voice sample is copied according to the harmonic parameters of the synthesized singing voice; the initial phase of the harmonic parameters corresponding to the voiced part of the voice sample is then determined; next, the time-domain amplitude of the synthesized singing voice corresponding to the voice sample is calculated by phase superposition from those harmonic parameters; finally, the time-domain amplitude is adjusted according to the corresponding volume, which yields the synthesized singing voice.

The singing voice synthesis method of the present invention therefore preserves the timbre of the original voice samples and adjusts the pitch curve according to its own variation, so that the synthesized singing voice is clearly better than the prior art in clarity, naturalness and fluency.

The advantages and spirit of the present invention can be further understood from the following detailed description and the accompanying drawings.

[Description of the Embodiments]

Please refer to Figure 1, which is a flow chart of a singing voice synthesis method according to an embodiment of the present invention. The method of this embodiment consists of two main parts: the synthesis of the singing voice and the synthesis of the instrumental accompaniment. After both are produced, they are mixed and written to a WAV file (or another audio format) for later playback.

As shown in Figure 1, the singing-voice part starts by retrieving a score from a score file, as in step S102. In general, the score file is a Musical Instrument Digital Interface (MIDI) file.
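A minimal sketch of this score-extraction step is given below. The patent does not name a particular MIDI parser, so the use of the `mido` library and the (start, pitch, duration) tuple layout are illustrative assumptions, not part of the disclosed method.

```python
# Illustrative sketch of step S102: pull (start, pitch, duration) note events out of a
# MIDI file. The mido library and this tuple layout are assumptions, not the patent's own code.
import mido

def extract_score(midi_path):
    """Return a sorted list of (start_sec, midi_note, duration_sec) tuples."""
    notes, active, clock = [], {}, 0.0
    for msg in mido.MidiFile(midi_path):        # iterating a MidiFile yields deltas in seconds
        clock += msg.time
        if msg.type == 'note_on' and msg.velocity > 0:
            active[msg.note] = clock
        elif msg.type in ('note_off', 'note_on'):  # note_on with velocity 0 also ends a note
            start = active.pop(msg.note, None)
            if start is not None:
                notes.append((start, msg.note, clock - start))
    return sorted(notes)
```

In the embodiment, the extracted notes are then tabulated together with lyrics, volume and turning-note counts (Table 1 below) before the pitch curve is built.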

A score extracted from a MIDI file in this way contains no lyrics, although the invention is not limited to MIDI. To simplify the subsequent processing, the extracted score is represented as a table, as in Table 1 below. To make the later retrieval of voice samples easier, the lyrics (for example 如 and 閃 in the table) can be filled into the table first; of course, if the score file already contains lyrics, the extracted score can include them directly and no separate entry is needed.

Table 1: Part of a score table
Lyrics | Volume | Turning notes | Start time | Pitch | Duration
如     | 1.1    | 1             | 4.70588    | 2     | 0.490196
閃     | 1.1    | 1             | 5.294115   | 2     | 0.490196

It should be added that the pitch values in the score can be converted into real frequency values according to twelve-tone equal temperament, F = F_C4 × 2^(pitch/12), where F is the actual frequency, F_C4 is the frequency of middle C (about 261.6 Hz), and pitch is the pitch value read from the score (Table 1).

The pitch curve of the score is then determined, as in step S104. The pitch curve obtained directly from the pitches and durations in the score (a duration being the time for which a pitch is held) can be further adjusted based on the fact that a singer changes the pitch of the actual singing voice in response to pitch changes, that is, in response to jumps in the pitch curve. Usually, when the previous note differs from the current note, the pitch curve at the beginning of the current note rises and then falls, or falls and then rises. The present invention therefore proposes the adjustment rules described below.
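A minimal sketch of the conversion and of sampling the score into a pitch curve follows. The reference value F_C4 = 261.63 Hz (standard A440 tuning) and the 10 ms curve resolution are illustrative choices rather than values fixed by the text.

```python
# Sketch of step S104: convert score pitch values to frequencies (twelve-tone equal
# temperament, F = F_C4 * 2**(pitch/12)) and sample them into a piecewise-constant pitch curve.
# F_C4 = 261.63 Hz and the 10 ms step are illustrative assumptions.
F_C4 = 261.63  # frequency of middle C in Hz (A440 tuning assumed)

def pitch_to_freq(pitch):
    # pitch is counted in semitones relative to middle C, as in the text
    # (for MIDI note numbers, subtract 60 first).
    return F_C4 * 2.0 ** (pitch / 12.0)

def score_to_pitch_curve(notes, step=0.01):
    """notes: list of (start_sec, pitch, duration_sec); returns one frequency value per
    `step` seconds, with 0.0 where no note is sounding."""
    end = max(s + d for s, _, d in notes)
    curve = [0.0] * (int(end / step) + 1)
    for start, pitch, dur in notes:
        f = pitch_to_freq(pitch)
        for i in range(int(start / step), int((start + dur) / step)):
            curve[i] = f
    return curve
```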

In the pitch curve, if a target pitch value is not equal to the next pitch value, the target pitch value is adjusted according to the variation of the next pitch value relative to the target pitch value. In this case, three adjustment rules are further distinguished according to the pitch value preceding the target pitch value. Each rule defines an adjustment multiple y as a function of the normalized position x (0 ≤ x ≤ 1) within the note, where p(i) is the target (current) pitch value and p(i+1) is the next pitch value; the rules combine a damped oscillation term of the form (sin(5πx)/10)·e^(−2.5πx), weighted between (1 − x) and x, with a correction proportional to the relative pitch difference (p(i+1) − p(i))/p(i).

If the preceding pitch value is lower than the target pitch value, the first rule is applied; if the preceding pitch value equals the target pitch value, the second rule is applied; and if the preceding pitch value is higher than the target pitch value, the third rule is applied, which takes the form

y = (1 − (sin(5πx)/10)·e^(−2.5πx))·(1 − x) + (1 + α·(p(i+1) − p(i))/p(i))·x,

where α is a fixed weighting coefficient.
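The following sketch shows how one such onset rule can be applied over the beginning of a note. The damped-sinusoid expression follows the general form given above, but the coefficient alpha and the onset length are illustrative assumptions rather than the exact constants of the disclosure.

```python
import math

# Sketch of one onset-adjustment rule: scale the pitch curve at the start of a note by a
# multiple y(x), x in [0, 1], built from a damped sinusoid plus a term proportional to the
# relative pitch difference. alpha and the onset length are illustrative assumptions.
def onset_multiple(x, p_cur, p_next, alpha=1.0):
    damped = (math.sin(5 * math.pi * x) / 10.0) * math.exp(-2.5 * math.pi * x)
    return (1.0 - damped) * (1.0 - x) + (1.0 + alpha * (p_next - p_cur) / p_cur) * x

def apply_onset(curve, note_start_idx, onset_len, p_cur, p_next):
    """Scale `onset_len` samples of the pitch curve starting at `note_start_idx`."""
    for i in range(onset_len):
        x = i / max(onset_len - 1, 1)
        curve[note_start_idx + i] *= onset_multiple(x, p_cur, p_next)
    return curve
```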

In addition, if the target pitch value in the pitch curve is equal to the next pitch value, the target pitch value is adjusted by a predetermined multiple. In this case as well, three adjustment rules are distinguished according to the pitch value preceding the target pitch value: one for a preceding pitch that is lower, one for a preceding pitch that is equal, and one for a preceding pitch that is higher than the target pitch value. These rules again express the adjustment multiple y as a function of the normalized position x (0 ≤ x ≤ 1), using damped oscillation terms of the form (sin(5πx)/10)·e^(−2.5πx) and, in the equal-pitch rule, a small ripple term of the form (1 + cos(2πx))/200.
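The selection among the six rules can be organized as a small decision table on the previous, current and next pitch values. In the sketch below the rule callables are neutral stand-ins for the six expressions above, so only the dispatch structure should be read as following the text.

```python
# Sketch of how the six onset rules are dispatched: first split on whether the current pitch
# equals the next pitch, then on how the previous pitch compares with the current one.
# The rule callables are placeholder stand-ins for the six expressions above.
def make_constant_rule(value=1.0):
    return lambda x, p_cur, p_next: value

RULES = {
    # (current pitch differs from next, comparison of previous pitch with current)
    (True,  'lower'):  make_constant_rule(),
    (True,  'equal'):  make_constant_rule(),
    (True,  'higher'): make_constant_rule(),
    (False, 'lower'):  make_constant_rule(),
    (False, 'equal'):  make_constant_rule(),
    (False, 'higher'): make_constant_rule(),
}

def pick_rule(p_prev, p_cur, p_next):
    differs = p_cur != p_next
    cmp = 'lower' if p_prev < p_cur else 'equal' if p_prev == p_cur else 'higher'
    return RULES[(differs, cmp)]
```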

The adjustments above are based on the mutual influence of adjacent pitches. Usually, however, when the time gap between a note and the following note is large enough, the note is no longer affected by the following one. Before the rules above are applied, it is therefore first determined whether the time difference between the target pitch value and the next pitch value is greater than a predetermined time length; this predetermined time length can be set to 0.9 seconds, although the invention is not limited to this value. If it is greater, the next pitch value is treated as being equal to the target pitch value when the pitch curve is adjusted, that is, the latter three rules are applied.

After the pitch curve has been adjusted, the following step S106 can be carried out next or in parallel. As shown in step S106, the plurality of voice samples are retrieved from a voice sample database according to the lyrics filled into the score, and a spectral analysis is performed to obtain the harmonic parameters of each voice sample. Because of the characteristics of the human voice, an unvoiced part and a voiced part can be observed in the time-domain waveform of a syllable. The unvoiced part is an aperiodic waveform from which harmonic parameters cannot be extracted, so the Fourier transform is applied only to the voiced part. Since the voice samples are recorded from a human voice, each sample likewise contains an unvoiced part and a voiced part; the voiced part is Fourier-transformed to obtain the harmonic parameters of the sample, while the unvoiced part is later copied directly into the synthesized singing voice.

For the Fourier analysis, the voiced part of a voice sample is first cut, following its time-domain waveform, into frames taken every 256 sample points; at a sampling rate of 22050 Hz this corresponds to roughly 12 ms per frame. Each frame of 512 sample points is multiplied by a Hamming window and a fast Fourier transform is applied, which yields several sets of harmonic parameters, each set containing an amplitude, a frequency and a phase. The amplitude, frequency and phase of the fundamental are obtained by cubic Hermite spline interpolation. Because the speaker recorded the voice samples at as uniform a pitch, duration and volume as possible, the search range for the fundamental can be set to 200-400 Hz, taking the peak with the largest amplitude as the initial value. Once the parameters of the fundamental are determined, the initial values for the other harmonics are set at its multiples and refined by the same recursive cubic Hermite spline calculation to obtain their harmonic parameters.
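A compact sketch of this per-frame analysis is given below. The frame and hop sizes follow the figures quoted above; the plain peak picking is a simplification that stands in for the cubic-Hermite-spline refinement used in the text.

```python
import numpy as np

# Sketch of step S106: split the voiced part into 512-point frames taken every 256 samples,
# apply a Hamming window and an FFT, and estimate the fundamental in the 200-400 Hz band.
# Simple bin picking below is a stand-in for the cubic Hermite spline refinement.
FS = 22050          # sampling rate of the voice samples
FRAME, HOP = 512, 256

def analyse_voiced(x):
    window = np.hamming(FRAME)
    freqs = np.fft.rfftfreq(FRAME, d=1.0 / FS)
    band = (freqs >= 200) & (freqs <= 400)
    frames = []
    for start in range(0, len(x) - FRAME + 1, HOP):
        spectrum = np.fft.rfft(x[start:start + FRAME] * window)
        mag, phase = np.abs(spectrum), np.angle(spectrum)
        k0 = np.argmax(np.where(band, mag, 0.0))        # fundamental bin within 200-400 Hz
        f0 = freqs[k0]
        harmonics = []
        for h in range(1, int(freqs[-1] // f0) + 1):     # harmonics at multiples of f0
            k = np.argmin(np.abs(freqs - h * f0))
            harmonics.append((freqs[k], mag[k], phase[k]))
        frames.append(harmonics)
    return frames
```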
With the harmonic parameters of the voice samples and the pitch curve generated from the score, the harmonic parameters of the synthesized singing voice can be calculated, as shown in step S108. The voice samples all have roughly the same length, but the lengths required by the pitch curve differ from note to note, so the length of each voice sample must be adjusted.

In general, the voiced part of a voice sample follows the ADSR model (shown in Figure 2), which consists of an attack (A) section, a decay (D) section, a sustain (S) section and a release (R) section. When a person lengthens a sung note, it is mainly the sustain section that is prolonged. Therefore, when the length of a voice sample differs from the length of the corresponding part of the pitch curve, the duration of the sustain section of the voiced part is adjusted. If the required shortening is too large, the attack, decay and release sections are also shortened in proportion to the ratio between the length of the synthesized note and the length of the voice sample. To make this processing easier, the length of each section of every voice sample is tabulated, as in Table 2 below, with the number of sample points as the unit of length. When every voice sample has the same total length, the R (release) section can be obtained by subtracting the unvoiced part, the AD (attack and decay) sections and the S (sustain) section from the total length, so it is not listed separately; the invention is, however, not limited to this.

Table 2: Part of a voice sample table
Voice sample ID | Unvoiced part | AD section | S section
c018            | 1543          | 1419       | 10272
c019            | 1482          | 1389       | 13431
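The sustain-first length adjustment can be sketched at the frame level: sustain frames are repeated or dropped until the target length is reached, and the other sections are only scaled when the shortening is severe. The concrete threshold used here (the target no longer covering the A/D and R sections) is one possible reading of "if the shortening is too large", not a rule stated in the text.

```python
# Sketch of the ADSR-based length adjustment on per-frame harmonic parameters:
# stretch or shrink mainly the sustain (S) frames. Treating "target shorter than AD + R"
# as the trigger for proportional shortening is an illustrative assumption.
def resize_frames(ad_frames, s_frames, r_frames, target_len):
    fixed = len(ad_frames) + len(r_frames)
    if target_len > fixed:
        # Normal case: keep A/D and R intact, repeat or drop sustain frames evenly.
        n_s = target_len - fixed
        idx = [int(i * len(s_frames) / n_s) for i in range(n_s)] if s_frames else []
        return ad_frames + [s_frames[i] for i in idx] + r_frames
    # Severe shortening: scale every section by the same ratio.
    ratio = target_len / (fixed + len(s_frames))
    def scale(frames):
        if not frames:
            return []
        n = max(int(round(len(frames) * ratio)), 1)
        return [frames[int(i * len(frames) / n)] for i in range(n)]
    return scale(ad_frames) + scale(s_frames) + scale(r_frames)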
What must be kept, however, is the timbre of the voice sample, which is carried by its frequency-domain envelope. When the voice sample is adjusted as described above, the shape of the envelope should remain essentially unchanged; that is, the frequency-domain envelope of the voice sample and the frequency-domain envelope of the part of the synthesized singing voice that corresponds to it are kept substantially the same, so that the adjusted synthesized voice still has the timbre of the voice sample. To control the envelope precisely, the cubic Hermite spline already used in the spectral analysis is used again, its main purpose being to compute the amplitudes of the harmonic frequencies of the synthesized voice: for each synthesis harmonic, the analysis frequency points on either side of it are located and the amplitude inside that interval is interpolated, giving the amplitude of that harmonic of the synthesized voice.

In addition, when the score contains turning notes, that is, when one voice sample corresponds to several pitch values, the corresponding voice sample has to be divided. If the pitch curve corresponding to a voice sample contains several pitch values, the first pitch value corresponds to the attack section, the decay section and part of the sustain section of the voiced part, and the last pitch value corresponds to the release section and part of the sustain section. With one turning note (two pitch values for the same lyric), the first pitch value corresponds to the attack, the decay and part of the sustain, and the second to the rest of the sustain and the release. With two turning notes (three pitch values for the same lyric), the first pitch value corresponds to the attack, the decay and part of the sustain, the second to part of the sustain, and the third to part of the sustain and the release. More turning notes are handled by analogy and are not described further.

As for simulating vibrato, in actual singing vibrato is mostly used when the same note is held for a long time, so a predetermined time length can be preset as the criterion for whether vibrato should be added. If the duration corresponding to a pitch value in the pitch curve is greater than this predetermined length, a sine-wave function is added to the harmonic parameters of the synthesized singing voice for that pitch value. This predetermined time length can be set to 1 second, although the invention is not limited to this value.
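The envelope-preserving step can be sketched as follows: the analysed (frequency, amplitude) pairs of a frame define the spectral envelope, and the amplitudes of the synthesis harmonics (multiples of the target fundamental) are read off that envelope by interpolation. `np.interp` is used as a linear stand-in for the cubic Hermite spline named in the text, and the 11025 Hz upper limit simply reflects the 22050 Hz sampling rate.

```python
import numpy as np

# Sketch of the envelope-preserving pitch shift for one analysis frame: keep the spectral
# envelope defined by the analysed harmonics and resample it at the new harmonic frequencies.
# np.interp (linear) stands in for the cubic Hermite spline interpolation used in the text.
def shifted_harmonics(frame, f0_target, f_max=11025.0):
    """frame: list of (freq_hz, amplitude, phase) from the analysis; returns new (freq, amp)."""
    freqs = np.array([f for f, _, _ in frame])
    amps = np.array([a for _, a, _ in frame])
    order = np.argsort(freqs)
    freqs, amps = freqs[order], amps[order]
    new = []
    k = 1
    while k * f0_target <= f_max:
        f = k * f0_target
        new.append((f, float(np.interp(f, freqs, amps))))   # amplitude read off the envelope
        k += 1
    return new
```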
After step S108, the synthesized singing voice is calculated from its harmonic parameters, as shown in step S110. Since the unvoiced part of a voice sample is not Fourier-transformed, it is added to the synthesized singing voice by direct copying. Next, the initial phase of the harmonic parameters of the synthesized voice corresponding to the voiced part of the voice sample is determined; then, using the harmonic equations and phase superposition, the amplitude at every sample point is computed and summed to obtain the time-domain amplitude of the synthesized singing voice corresponding to that voice sample. Finally, the time-domain amplitude is adjusted according to the volume recorded in the score. The synthesized sound of one syllable is thus completed, and once the other syllables have been synthesized in the same way they are joined into the complete synthesized singing voice.

Because an actual singer sings lower-pitched lyrics relatively softly, that is, at a lower volume, the time-domain amplitude is further adjusted as follows. If the pitch value corresponding to the time-domain amplitude is lower than a predetermined pitch value, the amplitude is scaled by a predetermined multiple, for example with the predetermined pitch value set to low C and the predetermined multiple set to 0.6 (the invention is not limited to these values); if the pitch value is equal to or greater than the predetermined pitch value, the time-domain amplitude is scaled linearly with the pitch value, for example by

vol = (0.25 / 12) × note + 0.85,

where vol is the factor by which the time-domain amplitude is multiplied and note is the pitch value read from the score (expressed as an integer, with middle C equal to 0). Re-adjusting the computed time-domain amplitude in this way makes the resulting synthesized singing voice follow the volume changes of a real singer more closely.
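The voiced part of a syllable can then be rendered by summing sinusoids whose phases are carried forward frame to frame, followed by the volume rule quoted above. Holding the parameters constant within a hop, and taking "low C" as one octave below middle C (note value -12), are assumptions made here for illustration; the 0.6 factor and the linear formula follow the example values in the text.

```python
import numpy as np

# Sketch of step S110 for the voiced part: additive synthesis with per-harmonic phase
# accumulation ("phase superposition"), then the pitch-dependent volume scaling.
FS, HOP = 22050, 256

def synthesize_voiced(frames, init_phases=None):
    """frames: list per hop of [(freq_hz, amplitude), ...]; returns a float32 waveform."""
    out = np.zeros(len(frames) * HOP, dtype=np.float32)
    phases = dict(init_phases or {})
    for i, frame in enumerate(frames):
        t = np.arange(HOP) / FS
        block = np.zeros(HOP)
        for h, (f, a) in enumerate(frame):
            phi = phases.get(h, 0.0)
            block += a * np.sin(2 * np.pi * f * t + phi)
            phases[h] = (phi + 2 * np.pi * f * HOP / FS) % (2 * np.pi)  # carry phase forward
        out[i * HOP:(i + 1) * HOP] = block
    return out

def volume_factor(note):
    """note: pitch value from the score, in semitones with middle C = 0."""
    if note < -12:                       # below low C (taken here as C3): fixed 0.6 attenuation
        return 0.6
    return 0.25 / 12.0 * note + 0.85     # otherwise scale linearly with the pitch value
```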
At this point the complete synthesized singing voice is finished. The background music is synthesized separately (steps S112, S114 and S116) and then mixed with the synthesized singing voice (step S118). After mixing, the result can be stored directly in WAV format (the invention is not limited to this) for playback by a player or playback software (step S120). The synthesis of the instrument sounds and the mixing and playback procedures are conventional and are not described further. In addition, many of the coefficients in the formulas above are examples and can be re-tuned according to actual measurements.

In summary, the singing voice synthesis method of the present invention uses the frequency-domain envelope of the voice samples, so that the timbre of the samples is preserved after the pitch has been adjusted, and it adjusts the pitch curve according to the variation of the curve itself, imitating the pitch inflections a real singer makes for different melodies. Together with the handling of turning notes and vibrato, this makes the synthesized singing voice noticeably better than earlier synthesized voices in clarity, naturalness and fluency. If voice sample databases with more timbres (adults, children, male, female and so on) are built, the choice of voice samples becomes more flexible and the synthesized singing voice can be made even closer to real singing.

The features and spirit of the present invention are described in detail through the preferred embodiments above in the hope of making them clearer; the preferred embodiments disclosed above are not intended to limit the scope of the invention. On the contrary, the intention is to cover various modifications and equivalent arrangements within the scope of the claims of the present application.

[Brief Description of the Drawings]

Figure 1 is a flow chart of a singing voice synthesis method according to an embodiment of the present invention.

Figure 2 is a schematic diagram of the sections of the ADSR model.

[Description of the Main Reference Numerals]

S102-S120: steps

Claims (17)

VII. Claims:

1. A singing voice synthesis method for synthesizing a plurality of voice samples into a synthesized singing voice, the singing voice synthesis method comprising the following steps:
(a) retrieving a score from a score file;
(b) determining, according to the score, a pitch curve related to the score;
(c) retrieving, according to the score, the plurality of voice samples from a voice sample database;
(d) calculating harmonic parameters related to the synthesized singing voice according to the pitch curve and harmonic parameters of the plurality of voice samples; and
(e) calculating the synthesized singing voice according to the harmonic parameters related to the synthesized singing voice.

2. The singing voice synthesis method of claim 1, wherein step (c) is performed by the following steps:
adding a plurality of lyrics to the score; and
retrieving, according to the plurality of lyrics, the plurality of voice samples corresponding to the plurality of lyrics from the voice sample database.

3. The singing voice synthesis method of claim 1, wherein step (b) comprises:
(b1) adjusting the pitch curve according to the variation of the pitch curve.

4. The singing voice synthesis method of claim 3, wherein step (b1) comprises:
in the pitch curve, if a target pitch value is not equal to the next pitch value, adjusting the target pitch value according to the variation of the next pitch value relative to the target pitch value; and
if the target pitch value is equal to the next pitch value, adjusting the target pitch value by a predetermined multiple.

5. The singing voice synthesis method of claim 4, wherein if the time difference between the target pitch value and the next pitch value is greater than a predetermined time length, the target pitch value is adjusted by the predetermined multiple.

6. The singing voice synthesis method of claim 5, wherein the predetermined time length is 0.9 seconds.

7. The singing voice synthesis method of claim 1, wherein each voice sample comprises an unvoiced part and a voiced part, and step (d) is performed by the following steps:
(d1) performing a Fourier transform on the voiced part of the voice sample to obtain the harmonic parameters of the voice sample; and
(d2) calculating the harmonic parameters of the synthesized singing voice according to the pitch curve and the harmonic parameters of the voice sample.

8. The singing voice synthesis method of claim 7, wherein step (d1) comprises calculating the harmonic parameters of the voice sample by using a cubic Hermite spline.

9. The singing voice synthesis method of claim 7, wherein the voiced part of the voice sample comprises an attack section, a decay section, a sustain section and a release section, and step (d2) comprises: if the time length of the voice sample differs from the time length of the pitch curve corresponding to the voice sample, adjusting the time length of the sustain section of the voiced part of the voice sample.

10. The singing voice synthesis method of claim 9, wherein if the pitch curve corresponding to the voice sample contains a plurality of pitch values, the first of the plurality of pitch values corresponds to the attack section, the decay section and part of the sustain section of the voiced part of the voice sample, and the last pitch value corresponds to the release section and part of the sustain section of the voiced part of the voice sample.

11. The singing voice synthesis method of claim 7, wherein in step (d2) the frequency-domain envelope of the voice sample and the frequency-domain envelope of the part of the synthesized singing voice corresponding to the voice sample are substantially the same.

12. The singing voice synthesis method of claim 7, wherein step (d2) comprises calculating the harmonic parameters of the synthesized singing voice by using a cubic Hermite spline.

13. The singing voice synthesis method of claim 7, wherein step (d2) further comprises: if, in the pitch curve, the time length corresponding to a pitch value is greater than a predetermined time length, adding a sine-wave function to the harmonic parameters of the synthesized singing voice corresponding to that pitch value.

14. The singing voice synthesis method of claim 7, further comprising determining the unvoiced part and the voiced part of the voice sample, wherein step (e) is performed by the following steps:
copying the unvoiced part of the corresponding voice sample according to the harmonic parameters of the synthesized singing voice;
determining the initial phase of the harmonic parameters of the synthesized singing voice corresponding to the voiced part of the voice sample;
calculating, by phase superposition, the time-domain amplitude of the synthesized singing voice corresponding to the voice sample according to the harmonic parameters of the synthesized singing voice corresponding to the voiced part of the voice sample; and
adjusting the time-domain amplitude according to the corresponding volume.

15. The singing voice synthesis method of claim 14, wherein step (e) further comprises the following steps:
if the pitch value corresponding to the time-domain amplitude is lower than a predetermined pitch value, adjusting the time-domain amplitude by a predetermined multiple; and
if the pitch value corresponding to the time-domain amplitude is equal to or greater than the predetermined pitch value, linearly adjusting the time-domain amplitude according to the pitch value.

16. The singing voice synthesis method of claim 15, wherein the predetermined pitch value is low C and the predetermined multiple is 0.6.

17. The singing voice synthesis method of claim 1, wherein the harmonic parameters of the voice sample or the harmonic parameters of the synthesized singing voice comprise a plurality of sets of frequency, amplitude and phase.
TW98142510A 2009-12-11 2009-12-11 Singing voice synthesis method TWI385644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW98142510A TWI385644B (en) 2009-12-11 2009-12-11 Singing voice synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW98142510A TWI385644B (en) 2009-12-11 2009-12-11 Singing voice synthesis method

Publications (2)

Publication Number Publication Date
TW201120871A true TW201120871A (en) 2011-06-16
TWI385644B TWI385644B (en) 2013-02-11

Family

ID=45045352

Family Applications (1)

Application Number Title Priority Date Filing Date
TW98142510A TWI385644B (en) 2009-12-11 2009-12-11 Singing voice synthesis method

Country Status (1)

Country Link
TW (1) TWI385644B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070137463A1 (en) * 2005-12-19 2007-06-21 Lumsden David J Digital Music Composition Device, Composition Software and Method of Use

Also Published As

Publication number Publication date
TWI385644B (en) 2013-02-11

Similar Documents

Publication Publication Date Title
CN106023969B (en) Method for applying audio effects to one or more tracks of a music compilation
JP4207902B2 (en) Speech synthesis apparatus and program
TWI394142B (en) System, method, and apparatus for singing voice synthesis
US8626497B2 (en) Automatic marking method for karaoke vocal accompaniment
CN112382257B (en) Audio processing method, device, equipment and medium
WO2020171033A1 (en) Sound signal synthesis method, generative model training method, sound signal synthesis system, and program
JP4645241B2 (en) Voice processing apparatus and program
JP6175812B2 (en) Musical sound information processing apparatus and program
JP7359164B2 (en) Sound signal synthesis method and neural network training method
TWI377558B (en) Singing synthesis systems and related synthesis methods
JP4844623B2 (en) CHORAL SYNTHESIS DEVICE, CHORAL SYNTHESIS METHOD, AND PROGRAM
TW200813977A (en) Automatic pitch following method and system for music accompaniment device
TWI377557B (en) Apparatus and method for correcting a singing voice
JP4024440B2 (en) Data input device for song search system
JP4757971B2 (en) Harmony sound adding device
TW201120871A (en) Singing voice synthesis method
JP3879524B2 (en) Waveform generation method, performance data processing method, and waveform selection device
TWI394141B (en) Karaoke song accompaniment automatic scoring method
JP7107427B2 (en) Sound signal synthesis method, generative model training method, sound signal synthesis system and program
JP4565846B2 (en) Pitch converter
JP2007240552A (en) Musical instrument sound recognition method, musical instrument annotation method and music piece searching method
JPH01288900A (en) Singing voice accompanying device
WO2020171035A1 (en) Sound signal synthesis method, generative model training method, sound signal synthesis system, and program
JP5810947B2 (en) Speech segment specifying device, speech parameter generating device, and program
JP5549325B2 (en) Sound processor

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees