TWI385644B - Singing voice synthesis method - Google Patents


Info

Publication number: TWI385644B
Application number: TW98142510A
Other versions: TW201120871A (Chinese, zh)
Inventors: Shi Jinn Horng, Chun Cheng Loong
Original Assignee: Univ Nat Taiwan Science Tech
Prior art keywords: pitch, vocal, voice, value, sample
Application filed by Univ Nat Taiwan Science Tech
Priority to TW98142510A priority Critical patent/TWI385644B/en
Publication of TW201120871A publication Critical patent/TW201120871A/en
Application granted
Publication of TWI385644B publication Critical patent/TWI385644B/en

Landscapes

  • Auxiliary Devices For Music (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Description

Singing voice synthesis method

The present invention relates to a singing voice synthesis method.

With the rapid advance of computer hardware, increasingly complex computation has become feasible, and speech synthesis is now widely used, for example in bus stop-announcement systems, telephone voice systems, and automated teller machines.

However, most speech systems in common use are aimed at simulating ordinary speech, and speech samples are usually combined in a straightforward concatenative way, so the synthesized speech sounds robotic. In recent years, smoothing has been applied to the discontinuities between adjacent speech segments, making synthesized speech sound more fluent, but for a singing voice, whose pitch varies, these discontinuities are far more noticeable, and such smoothing is no longer sufficient to approximate a singing voice. In particular, when the length of a segment to be synthesized differs from the length of the speech sample, simply stretching or shortening the sample makes the synthesized voice sound even less natural.

One aspect of the present invention provides a singing voice synthesis method for synthesizing a plurality of speech samples into a synthesized singing voice.

The singing voice synthesis method comprises the following steps: (a) extracting a musical score from a song file; (b) retrieving the plurality of speech samples from a speech sample database according to the score; (c) determining, according to the score, a pitch curve associated with the score; (d) calculating harmonic parameters of the synthesized singing voice according to the pitch curve and the harmonic parameters of the plurality of speech samples; and (e) calculating the synthesized singing voice from its harmonic parameters.

In step (b), a plurality of lyrics are first added to the score, and the plurality of speech samples corresponding to those lyrics are then retrieved from a speech sample database. Step (c) further comprises adjusting the pitch curve according to its own curve variation, so that the adjusted pitch curve better matches the pitch adjustments an actual singer makes in response to the pitch changes in the score.

In step (d), a Fourier transform is first applied to the voiced portion of each speech sample to obtain its harmonic parameters, and the harmonic parameters of the plurality of speech samples are then adjusted according to the pitch curve to obtain the harmonic parameters of the synthesized singing voice. If the duration of a speech sample differs from the duration of the corresponding part of the pitch curve, the length of the sustain segment of the voiced portion of that sample is adjusted. In addition, the frequency-domain envelope of the speech sample and the corresponding envelope of the synthesized singing voice are kept substantially the same, so that the synthesized singing voice retains the timbre of the speech sample.

In step (e), the unvoiced portion of the corresponding speech sample is first copied according to the harmonic parameters of the synthesized singing voice; the initial phase of the harmonic parameters corresponding to the voiced portion of the sample is then determined; next, the time-domain amplitude of the synthesized singing voice corresponding to the speech sample is calculated from those harmonic parameters by phase accumulation; finally, the time-domain amplitude is adjusted according to the corresponding volume, which yields the synthesized singing voice.

The singing voice synthesis method of the present invention therefore preserves the timbre of the original speech samples and adjusts the pitch curve according to its own variation, so that the synthesized singing voice is markedly clearer, more natural, and more fluent than the prior art.

The advantages and spirit of the present invention will be further understood from the following detailed description and the accompanying drawings.

Referring to FIG. 1, FIG. 1 is a flow chart of a singing voice synthesis method according to an embodiment of the present invention. The method mainly consists of two parts: the synthesis of the human voice and the synthesis of the instrumental accompaniment. Once both have been produced, they are mixed and written to a WAV file (or another audio file format) for playback.

As shown in FIG. 1, the singing voice synthesis part begins by extracting a musical score from a score file, as in step S102. In general, the score file may be a Musical Instrument Digital Interface (MIDI) file, in which case the extracted score contains no lyrics, although the invention is not limited to this. To simplify later processing, the extracted score is represented as a table, as shown in Table 1 below. To make the retrieval of speech samples easier, the lyrics may be filled into Table 1 first (for example, the characters "一" and "閃" in Table 1); of course, if the score file already contains lyrics, the extracted score can include them without separate entry.
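The following Python sketch illustrates one way such a score table could be built from a MIDI file. It uses the third-party mido parser purely for illustration; the table fields and the shift of MIDI note 60 (middle C) to 0 are assumptions consistent with the description rather than details taken from the patent.

```python
import mido  # third-party MIDI parser, used here only for illustration

def score_table_from_midi(path):
    """Build a simple (pitch, start, duration) table from a MIDI file.

    Iterating a MidiFile yields messages with delta times in seconds; the
    pitch is shifted so that middle C (MIDI note 60) becomes 0, matching the
    integer pitch values used in the description. Lyrics are not present in
    a plain MIDI file and would be filled in afterwards, as the text states.
    """
    mid = mido.MidiFile(path)
    now, active, table = 0.0, {}, []
    for msg in mid:                      # merged tracks, times in seconds
        now += msg.time
        if msg.type == "note_on" and msg.velocity > 0:
            active[msg.note] = now
        elif msg.type in ("note_off", "note_on"):   # note_on with velocity 0
            start = active.pop(msg.note, None)
            if start is not None:
                table.append({"pitch": msg.note - 60,
                              "start": start,
                              "duration": now - start})
    return table
```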

As a further note, pitch values in the score are integers; the actual frequency follows from twelve-tone equal temperament:

F = F_C4 × 2^(pitch/12)

where F is the resulting frequency, F_C4 is the frequency of middle C (about 261.626 Hz), and pitch is the pitch value read from the score table.
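As an illustration, the equal-temperament conversion above can be written directly as a small Python helper; the function name and the example call are ours, not part of the patent.

```python
def pitch_to_frequency(pitch: int, f_c4: float = 261.626) -> float:
    """Twelve-tone equal temperament: F = F_C4 * 2 ** (pitch / 12).

    `pitch` is the integer pitch value read from the score table
    (middle C = 0, as in the description).
    """
    return f_c4 * 2.0 ** (pitch / 12.0)

# e.g. pitch_to_frequency(12) -> ~523.25 Hz, one octave above middle C
```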

A pitch curve for the score is then determined from the score, as in step S104. The pitch curve obtained directly from the pitches and note durations (the time each pitch is held) in the score can be further adjusted to reflect the fact that a singer alters the pitch of the actual singing voice in response to pitch changes, that is, in response to the variation of the pitch curve. Typically, when the previous note differs in pitch from the current note, the pitch curve at the beginning of the current note rises and then falls, or falls and then rises. The present invention therefore proposes the adjustment rules described below.

In the pitch curve, if a target pitch value is not equal to the next pitch value, the target pitch value is adjusted according to the rate of change of the next pitch value relative to the target. Three adjustment rules are distinguished according to the pitch value that precedes the target. If the preceding pitch value is lower than the target, the first adjustment rule is applied (its equation appears only as an image in the source), in which 0 ≤ x ≤ 1, y is the adjustment multiplier, p_i is the current target pitch value, and p_i+1 is the next pitch value. If the preceding pitch value equals the target, the second adjustment rule is applied; if the preceding pitch value is higher than the target, the third adjustment rule is applied (both likewise given as equations in the source images).

In addition, if the target pitch value equals the next pitch value, the target pitch value is adjusted by a predetermined multiplier. Here too, three adjustment rules are distinguished according to the preceding pitch value: a first rule when the preceding pitch value is lower than the target, a second when it equals the target, and a third when it is higher (each rule's equation appears only as an image in the source).

The adjustments above assume that neighbouring pitches influence one another. However, when the time gap between a note and the next note exceeds a certain amount, the note is no longer affected by the one that follows. Before the rules above are applied, it is therefore checked whether the time difference between the target pitch value and the next pitch value exceeds a predetermined length, which may be set to 0.9 seconds, although the invention is not limited to this value. If it does, the next pitch value is treated as equal to the target pitch value when adjusting the pitch curve; that is, the latter set of three rules is applied.
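The dispatch among the six rules (and the 0.9-second exception) can be sketched as follows; since the adjustment equations themselves appear only as images in the source, this Python sketch only selects which rule applies and deliberately stops short of computing the multiplier y.

```python
def select_adjustment_rule(prev_pitch, cur_pitch, next_pitch,
                           gap_to_next_sec, max_gap_sec=0.9):
    """Pick which of the six pitch-curve adjustment rules applies.

    Returns (group, rule_number); the numeric adjustment curves y = f(x),
    0 <= x <= 1, are not reproduced here because they are not given in text
    form in the source.
    """
    # If the next note starts more than ~0.9 s later, it no longer influences
    # the current note: treat it as having the same pitch.
    if gap_to_next_sec > max_gap_sec:
        next_pitch = cur_pitch

    group = "same_next" if next_pitch == cur_pitch else "different_next"

    if prev_pitch < cur_pitch:
        rule = 1      # preceding pitch lower than the target
    elif prev_pitch == cur_pitch:
        rule = 2      # preceding pitch equal to the target
    else:
        rule = 3      # preceding pitch higher than the target
    return group, rule
```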

After the pitch curve has been adjusted, step S106 can follow (or run in parallel). As shown in step S106, the plurality of speech samples corresponding to the lyrics filled into the score are retrieved from a speech sample database and analysed spectrally to obtain the harmonic parameters of each sample. Because of the nature of the human voice, the time-domain waveform of a syllable contains an unvoiced part and a voiced part. The unvoiced part is aperiodic and cannot be described by harmonic parameters, so the Fourier transform is applied only to the voiced part. Since the speech samples are recorded from a human voice, each sample likewise contains an unvoiced part and a voiced part: the voiced part is Fourier-transformed to obtain the sample's harmonic parameters, while the unvoiced part is copied directly into the synthesized singing voice.

For the Fourier analysis, the voiced part of a speech sample is first cut, according to its time-domain waveform, into frames of equal length, each containing 256 sample points; with a sampling rate of 22050 Hz, each frame is therefore about 12 ms long. Two frames (512 sample points) are multiplied by a Hamming window and passed through a fast Fourier transform, yielding several sets of harmonic parameters, each consisting of an amplitude, a frequency, and a phase. The amplitude, frequency, and phase of the fundamental are obtained by cubic Hermite spline interpolation. Because the voice talent records the samples at as uniform a pitch, duration, and volume as possible, the search range for the recursive fundamental estimation can be restricted to 200-400 Hz, with the largest-amplitude peak taken as the initial value. Once the fundamental parameters are determined, initial search values for the other harmonics are set from their multiples of the fundamental, and the same cubic Hermite spline recursion yields the remaining harmonic parameters.
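A minimal Python sketch of this analysis step is given below; it performs the framing, Hamming windowing, FFT, and the 200-400 Hz fundamental search described above, while the cubic-Hermite refinement and the search for the higher harmonics are omitted.

```python
import numpy as np

FS = 22050          # sampling rate used in the description
FRAME = 256         # samples per frame (~12 ms)

def analyze_frame_pair(voiced, start):
    """FFT analysis of two consecutive frames (512 samples).

    Returns the coarse fundamental estimate found by picking the strongest
    spectral peak between 200 and 400 Hz, as (frequency, amplitude, phase).
    """
    segment = np.asarray(voiced[start:start + 2 * FRAME], dtype=float)
    windowed = segment * np.hamming(len(segment))
    spectrum = np.fft.rfft(windowed)
    freqs = np.fft.rfftfreq(len(windowed), d=1.0 / FS)

    mask = (freqs >= 200) & (freqs <= 400)          # fundamental search range
    idx = np.argmax(np.abs(spectrum[mask]))
    bin_idx = np.where(mask)[0][idx]

    return {
        "frequency": freqs[bin_idx],
        "amplitude": np.abs(spectrum[bin_idx]),
        "phase": np.angle(spectrum[bin_idx]),
    }
```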

Once the harmonic parameters of the speech samples and the pitch curve generated from the score are available, the harmonic parameters of the synthesized singing voice are computed from them, as in step S108. Every speech sample has roughly the same duration, but the durations required by the pitch curve vary, so the length of each speech sample must be adjusted.

In general, the voiced part of a speech sample follows the ADSR model (see FIG. 2), consisting of an attack (A) segment, a decay (D) segment, a sustain (S) segment, and a release (R) segment; when a person holds a note, it is mainly the sustain segment that is lengthened. Therefore, if the duration of a speech sample differs from the duration of the corresponding part of the pitch curve, the length of the sustain segment of the sample's voiced part is adjusted. If the required shortening is too large, the attack, decay, and release segments are also shortened in proportion to the ratio between the durations of the synthesized note and the speech sample. To simplify this processing, the duration of each segment of a speech sample can be tabulated, as in Table 2 below, with the number of sample points as the unit of duration. When every speech sample is constrained to the same total length, the R (release) segment can be obtained by subtracting the unvoiced part, the AD (attack and decay) segments, and the S (sustain) segment from the total length, so it is not listed separately; the invention, however, is not limited to this arrangement.
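The sketch below illustrates the sustain-length adjustment in Python on a time-domain signal; the patent performs the equivalent adjustment on per-frame harmonic parameters, and the looping/resampling strategy shown here is an assumption, since the description states only which segments are lengthened or shortened.

```python
import numpy as np

def stretch_sustain(attack_decay, sustain, release, target_len):
    """Fit a voiced sample to `target_len` samples by resizing its sustain.

    The sustain segment is looped or truncated to make up the difference;
    if the target is shorter than attack+decay+release, all segments are
    shrunk proportionally (an assumed implementation of the proportional
    shortening mentioned in the description).
    """
    fixed = len(attack_decay) + len(release)
    if target_len >= fixed:
        need = target_len - fixed
        reps = int(np.ceil(need / max(len(sustain), 1)))
        new_sustain = np.tile(sustain, max(reps, 1))[:need]
        return np.concatenate([attack_decay, new_sustain, release])
    # Target shorter than A+D+R: shrink everything proportionally by resampling.
    whole = np.concatenate([attack_decay, sustain, release])
    idx = np.linspace(0, len(whole) - 1, target_len)
    return np.interp(idx, np.arange(len(whole)), whole)
```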

Note, however, that the timbre of a speech sample is represented mainly by its frequency-domain envelope, so when the sample is adjusted as above, the shape of the envelope should remain essentially unchanged; that is, the frequency-domain envelope of the speech sample and the corresponding envelope of the synthesized singing voice should be substantially the same, so that the adjusted synthesized sound keeps the timbre of the speech sample. To keep the spectral envelope under precise control, the same cubic Hermite spline used earlier in the spectral analysis is used again, this time to compute the amplitudes of the harmonics of the synthesized sound: for each synthesized harmonic frequency, the frequency points at the two ends of the interval containing it are located, and the amplitude within that interval, i.e. the amplitude of that harmonic, is obtained by interpolation.
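A small Python sketch of this envelope read-out follows; SciPy's PCHIP interpolator is used here as a readily available cubic-Hermite-type interpolant standing in for the cubic Hermite spline named in the description.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def harmonic_amplitudes_from_envelope(sample_freqs, sample_amps, new_freqs):
    """Read amplitudes for the synthesized harmonics off the sample's envelope.

    The envelope is taken through the analyzed harmonic peaks of the speech
    sample; evaluating it at the new harmonic frequencies preserves the
    timbre of the original sample.
    """
    envelope = PchipInterpolator(sample_freqs, sample_amps, extrapolate=False)
    amps = envelope(np.asarray(new_freqs, dtype=float))
    return np.nan_to_num(amps)   # frequencies outside the envelope get 0
```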

Furthermore, if the score contains a melisma, that is, one speech sample corresponds to several pitch values, the corresponding speech sample must be divided. Specifically, if the part of the pitch curve corresponding to the sample contains several pitch values, the first pitch value corresponds to the attack segment, the decay segment, and part of the sustain segment of the sample's voiced part, and the last pitch value corresponds to the release segment and part of the sustain segment. With one transition (two pitch values for the same lyric), the first pitch value covers the attack, decay, and part of the sustain, and the second covers the remaining sustain and the release. With two transitions (three pitch values for the same lyric), the first pitch value covers the attack, decay, and part of the sustain, the second covers part of the sustain, and the third covers the remaining sustain and the release. More transitions follow the same pattern and are not described further.
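The segment assignment for a melisma can be sketched as below; the equal sharing of the sustain segment among the pitch values is an assumption, as the patent states only which ADSR segments each pitch value covers.

```python
def split_segments_for_melisma(n_pitches, n_attack_decay, n_sustain, n_release):
    """Divide the voiced part of one sample among several pitch values.

    Returns a list of (start, end) sample-point ranges, one per pitch value:
    the first covers A + D + part of S, middle ones cover parts of S, and
    the last covers the remaining S + R.
    """
    if n_pitches == 1:
        return [(0, n_attack_decay + n_sustain + n_release)]
    share = n_sustain // n_pitches       # assumed equal shares of the sustain
    ranges, start = [], 0
    for i in range(n_pitches):
        if i == 0:
            end = n_attack_decay + share                       # A + D + part of S
        elif i == n_pitches - 1:
            end = n_attack_decay + n_sustain + n_release       # rest of S + R
        else:
            end = start + share                                # part of S only
        ranges.append((start, end))
        start = end
    return ranges
```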

As for simulating vibrato: in real singing, vibrato is mostly used when the same note is held for a long time, so a predetermined duration can be set as the threshold for deciding whether a vibrato effect should be added. That is, if in the pitch curve the duration corresponding to a pitch value exceeds the predetermined duration, a sine-wave function is added to the harmonic parameters of the synthesized singing voice for that pitch value. The predetermined duration may be set to 1 second, but the invention is not limited to this value.
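A Python sketch of this vibrato rule follows; the 3 Hz depth and 5 Hz rate are placeholder values, since the description specifies only that a sine-wave function is added when the note exceeds the predetermined duration.

```python
import numpy as np

def add_vibrato(frame_freqs, frame_rate, note_duration_sec,
                min_duration=1.0, depth_hz=3.0, rate_hz=5.0):
    """Add a sinusoidal wobble to per-frame harmonic frequencies.

    Applied only when the note lasts longer than `min_duration` seconds,
    as in the description.
    """
    freqs = np.array(frame_freqs, dtype=float)
    if note_duration_sec <= min_duration:
        return freqs
    t = np.arange(len(freqs)) / frame_rate
    return freqs + depth_hz * np.sin(2.0 * np.pi * rate_hz * t)
```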

After step S108, the synthesized singing voice is computed from its harmonic parameters, as in step S110. Because the unvoiced part of a speech sample is not Fourier-transformed, it is added to the synthesized singing voice by direct copying. Next, the initial phase of the harmonic parameters of the synthesized singing voice corresponding to the voiced part of the sample is determined; then, from those harmonic parameters, the amplitude at every sample point is computed with the harmonic equations and accumulated phase and summed, giving the time-domain amplitude of the synthesized singing voice for that sample. Finally, the time-domain amplitude is adjusted according to the volume recorded in the score. This completes the synthesized singing voice for one syllable; once the remaining syllables are synthesized in the same way, they are combined into the complete synthesized singing voice.
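The phase-accumulation synthesis of the voiced part can be sketched as follows; the per-frame parameter layout and the absence of inter-frame interpolation are simplifying assumptions.

```python
import numpy as np

def synthesize_voiced(harmonics_per_frame, frame_len=256, fs=22050):
    """Additive synthesis of the voiced part with phase accumulation.

    `harmonics_per_frame` is a list of frames; each frame is a list of
    (amplitude, frequency_hz, initial_phase) tuples, and every frame is
    assumed to carry the same number of harmonics. Parameters are held
    constant within a frame in this sketch.
    """
    phases = np.array([h[2] for h in harmonics_per_frame[0]], dtype=float)
    out = np.zeros(frame_len * len(harmonics_per_frame))
    t = np.arange(frame_len)

    pos = 0
    for frame in harmonics_per_frame:
        for k, (amp, freq, _) in enumerate(frame):
            out[pos:pos + frame_len] += amp * np.cos(
                phases[k] + 2.0 * np.pi * freq * t / fs)
            # accumulate phase so the next frame continues smoothly
            phases[k] += 2.0 * np.pi * freq * frame_len / fs
        pos += frame_len
    return out
```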

Because a real singer always sings low-pitched lyrics comparatively softly, i.e. at lower volume, the volume adjustment above (the adjustment of the time-domain amplitude) proceeds as follows. If the pitch value corresponding to the time-domain amplitude is below a predetermined pitch value, the amplitude is scaled by a predetermined factor; for example, the predetermined pitch may be low C and the factor 0.6, although the invention is not limited to these values. If the pitch value is equal to or above the predetermined pitch value, the amplitude is adjusted linearly according to the pitch value, using a formula (given as an image in the source) in which vol is the factor by which the time-domain amplitude is multiplied and note is the pitch value read from the score (an integer, with middle C equal to 0). Adjusting the previously computed time-domain amplitude in this way brings the volume variation of the resulting synthesized singing voice closer to that of a real singer.
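The volume rule can be sketched as below; the fixed factor 0.6 below low C comes from the description, while the slope of the linear part is a placeholder, since that formula appears only as an image in the source.

```python
def volume_multiplier(note: int, low_pitch: int = -12, low_scale: float = 0.6):
    """Return the factor `vol` applied to the time-domain amplitude.

    Notes below the predetermined pitch (low C, i.e. -12 when middle C = 0)
    get the fixed factor 0.6. The linear rule for higher notes is assumed
    here to grow from 0.6 at low C to 1.0 one octave above middle C.
    """
    if note < low_pitch:
        return low_scale
    return min(1.0, low_scale + (note - low_pitch) * (1.0 - low_scale) / 24.0)
```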

At this point the complete synthesized singing voice is finished, but the background music must be synthesized separately (steps S112, S114, and S116) and then mixed with the singing voice (step S118). After mixing, the result can be stored directly in WAV format (although the invention is not limited to this) and played back by any player or playback software (step S120). Instrument synthesis, mixing, and playback are well known and are not described further. In addition, many of the coefficients in the formulas above are examples and may be tuned according to measurement.

In summary, the singing voice synthesis method of the present invention keeps the frequency-domain envelope of the original speech sample, so that the timbre of the sample is preserved after the pitch is changed, and it adjusts the pitch curve according to the curve's own variation, further simulating the pitch changes a real singer makes for different melodies. The method also provides ways to simulate melisma, vibrato, and similar effects, so that the synthesized singing voice is markedly clearer, more natural, and more fluent than the prior art. If more speech samples are recorded, for example samples at different pitches or with different timbres (adult, child, male, female, and so on), sample selection becomes more flexible and the synthesized singing voice can be made even closer to real singing.

The detailed description of the preferred embodiments above is intended to make the features and spirit of the present invention clearer, not to limit the scope of the invention to those embodiments. On the contrary, the intention is to cover various modifications and equivalent arrangements within the scope of the appended claims.

S102-S120: steps

FIG. 1 is a flow chart of a singing voice synthesis method according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of the segments of the ADSR model.


Claims (17)

1. A singing voice synthesis method for synthesizing a plurality of speech samples into a synthesized singing voice, the method comprising the steps of: (a) extracting a musical score from a song file; (b) determining, according to the score, a pitch curve associated with the score; (c) retrieving the plurality of speech samples from a speech sample database according to the score; (d) calculating harmonic parameters related to the synthesized singing voice according to the pitch curve and harmonic parameters of the plurality of speech samples; and (e) calculating the synthesized singing voice according to the harmonic parameters related to the synthesized singing voice.

2. The singing voice synthesis method of claim 1, wherein step (c) is carried out by: adding a plurality of lyrics to the score; and retrieving, according to the plurality of lyrics, the plurality of speech samples corresponding to the lyrics from a speech sample database.

3. The singing voice synthesis method of claim 1, wherein step (b) further comprises the step of: (b1) adjusting the pitch curve according to the curve variation of the pitch curve.

4. The singing voice synthesis method of claim 3, wherein step (b1) is carried out by: in the pitch curve, if a target pitch value is not equal to the next pitch value, adjusting the target pitch value according to the rate of change of the next pitch value relative to the target pitch value; and if the target pitch value is equal to the next pitch value, adjusting the target pitch value by a predetermined multiplier.

5. The singing voice synthesis method of claim 4, wherein in step (b1), if the time difference between the target pitch value and the next pitch value is greater than a predetermined length of time, the target pitch value is adjusted by the predetermined multiplier.

6. The singing voice synthesis method of claim 5, wherein the predetermined length of time is 0.9 seconds.

7. The singing voice synthesis method of claim 1, wherein each speech sample comprises an unvoiced portion and a voiced portion, and step (d) is carried out by: (d1) applying a Fourier transform to the voiced portion of the speech sample to obtain the harmonic parameters of the speech sample; and (d2) adjusting the harmonic parameters of the plurality of speech samples according to the pitch curve to obtain the harmonic parameters related to the synthesized singing voice.

8. The singing voice synthesis method of claim 7, wherein step (d1) comprises calculating the harmonic parameters of the speech sample using a cubic Hermite spline.
9. The singing voice synthesis method of claim 7, wherein the voiced portion of the speech sample comprises an attack segment, a decay segment, a sustain segment, and a release segment, and in step (d2), if the duration of the speech sample differs from the duration of the corresponding part of the pitch curve, the length of the sustain segment of the voiced portion of the speech sample is adjusted.

10. The singing voice synthesis method of claim 9, wherein in step (d2), if the part of the pitch curve corresponding to the speech sample contains a plurality of pitch values, the first of the plurality of pitch values corresponds to the attack segment, the decay segment, and part of the sustain segment of the voiced portion of the speech sample, and the last pitch value corresponds to the release segment and part of the sustain segment of the voiced portion of the speech sample.

11. The singing voice synthesis method of claim 7, wherein in step (d2), the frequency-domain envelope of the speech sample and the frequency-domain envelope of the synthesized singing voice corresponding to the speech sample are substantially the same.

12. The singing voice synthesis method of claim 7, wherein step (d2) comprises calculating the harmonic parameters of the synthesized singing voice using a cubic Hermite spline.

13. The singing voice synthesis method of claim 7, wherein step (d2) further comprises: if, in the pitch curve, the duration corresponding to a pitch value is greater than a predetermined length of time, adding a sine-wave function to the harmonic parameters of the synthesized singing voice corresponding to that pitch value.

14. The singing voice synthesis method of claim 7, wherein the score comprises a plurality of volumes corresponding to the pitch curve, step (d) comprises identifying the unvoiced portion of the speech sample, and step (e) is carried out by: copying the corresponding unvoiced portion of the speech sample according to the harmonic parameters of the synthesized singing voice; determining the initial phase of the harmonic parameters of the synthesized singing voice corresponding to the voiced portion of the speech sample; calculating, by phase accumulation from the harmonic parameters of the synthesized singing voice corresponding to the voiced portion of the speech sample, the time-domain amplitude of the synthesized singing voice corresponding to the speech sample; and adjusting the time-domain amplitude according to the corresponding volume.
15. The singing voice synthesis method of claim 14, wherein step (e) further comprises the steps of: if the pitch value corresponding to the time-domain amplitude is less than a predetermined pitch value, adjusting the time-domain amplitude by a predetermined multiplier; and if the pitch value corresponding to the time-domain amplitude is equal to or greater than the predetermined pitch value, adjusting the time-domain amplitude linearly according to the pitch value.

16. The singing voice synthesis method of claim 15, wherein the predetermined pitch value is low C and the predetermined multiplier is 0.6.

17. The singing voice synthesis method of claim 1, wherein the harmonic parameters of the speech sample or the harmonic parameters of the synthesized singing voice comprise a plurality of sets of frequency, amplitude, and phase.
TW98142510A 2009-12-11 2009-12-11 Singing voice synthesis method TWI385644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW98142510A TWI385644B (en) 2009-12-11 2009-12-11 Singing voice synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW98142510A TWI385644B (en) 2009-12-11 2009-12-11 Singing voice synthesis method

Publications (2)

Publication Number Publication Date
TW201120871A TW201120871A (en) 2011-06-16
TWI385644B true TWI385644B (en) 2013-02-11

Family

ID=45045352

Family Applications (1)

Application Number Title Priority Date Filing Date
TW98142510A TWI385644B (en) 2009-12-11 2009-12-11 Singing voice synthesis method

Country Status (1)

Country Link
TW (1) TWI385644B (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200746040A (en) * 2005-12-19 2007-12-16 David John Lumsden Digital music composition device, composition software and method of use

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Birkholz, P., "Articulatory Synthesis of Singing," Singing Synthesis Challenge 2007 at Interspeech '07, Antwerp
Campbell, N., "Conversational Speech Synthesis and the Need for Some Laughter," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 4, July 2006 *

Also Published As

Publication number Publication date
TW201120871A (en) 2011-06-16

Similar Documents

Publication Publication Date Title
JP5283289B2 (en) Music acoustic signal generation system
JP4207902B2 (en) Speech synthesis apparatus and program
JP4839891B2 (en) Singing composition device and singing composition program
JP4645241B2 (en) Voice processing apparatus and program
MXPA01004262A (en) Method of modifying harmonic content of a complex waveform.
WO2020171033A1 (en) Sound signal synthesis method, generative model training method, sound signal synthesis system, and program
JP2004038071A (en) Apparatus, method, and program for singing synthesis
JP2003241757A (en) Device and method for waveform generation
Ardaillon Synthesis and expressive transformation of singing voice
US20210366454A1 (en) Sound signal synthesis method, neural network training method, and sound synthesizer
Jensen The timbre model
WO2005062291A1 (en) Signal analysis method
TWI377558B (en) Singing synthesis systems and related synthesis methods
TWI377557B (en) Apparatus and method for correcting a singing voice
TWI385644B (en) Singing voice synthesis method
JP4757971B2 (en) Harmony sound adding device
JP3540159B2 (en) Voice conversion device and voice conversion method
JP5573529B2 (en) Voice processing apparatus and program
JP2007240552A (en) Musical instrument sound recognition method, musical instrument annotation method and music piece searching method
Joysingh et al. Development of large annotated music datasets using HMM based forced Viterbi alignment
US20210350783A1 (en) Sound signal synthesis method, neural network training method, and sound synthesizer
JP5810947B2 (en) Speech segment specifying device, speech parameter generating device, and program
JP7107427B2 (en) Sound signal synthesis method, generative model training method, sound signal synthesis system and program
JP5569307B2 (en) Program and editing device
TWI302296B (en)

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees