CN103035235A - Method and device for transforming voice into melody - Google Patents

Method and device for transforming voice into melody

Info

Publication number
CN103035235A
Authority
CN
China
Prior art keywords
duration
syllable
music
speech data
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102956675A
Other languages
Chinese (zh)
Inventor
杨晨
蔡莲红
周卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG filed Critical Siemens AG
Priority to CN2011102956675A priority Critical patent/CN103035235A/en
Publication of CN103035235A publication Critical patent/CN103035235A/en
Pending legal-status Critical Current

Landscapes

  • Auxiliary Devices For Music (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The invention provides a method and a device for converting speech into a melody. The method includes: acquiring input speech data and score information; adjusting the duration of each syllable in the speech data so that it aligns with the duration of the corresponding lyric in the score information; adjusting the pitch points of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note; and combining the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.

Description

Method and apparatus for converting speech into a melody
Technical field
The present invention relates to speech processing technology, and particularly to a method and apparatus for converting speech into a melody.
Background art
Melody is a fundamental element of music and one of the most effective ways to express music and human emotion. A melody is a combination of notes of various pitches and durations; in other words, it can be understood as an arrangement of notes having different pitches and durations. Usually, the notes are ordered according to a beat, which gives the note sequence its musical meaning.
Musicians and singers have professional control over and expressiveness in music and can perform songs well, but this is usually difficult for ordinary people. It is often desirable to convert a segment of input speech, in real time, into a melody that retains the speaker's own voice characteristics; however, no technology in the prior art achieves this.
Summary of the invention
In view of this, the present invention provides a method and apparatus for converting speech into a melody, which can convert speech data input by a user into a melody that retains the user's voice characteristics.
The technical solution of the present invention is as follows:
A method for converting speech into a melody comprises: obtaining speech data and score information, wherein the speech data is input by a user and the score information comprises lyric information, note information, and the correspondence between the two; adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information, and adjusting the pitch points of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note; and combining the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
Adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information specifically comprises: extracting the energy and zero-crossing rate of each frame of the input speech data; dividing the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame; cutting each speech segment into syllables according to the lyric information in the score information; and adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
Dividing the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame comprises: identifying each frame as a speech frame or a mute frame according to its energy and zero-crossing rate; and grouping adjacent speech frames into speech segments and adjacent mute frames into silent segments.
Cutting each speech segment into syllables according to the lyric information in the score information comprises: determining the speech segments corresponding to each sentence in the lyrics of the score information; determining the speech segments corresponding to each phrase contained in each sentence; and performing syllable segmentation on the speech segments corresponding to each phrase to obtain the syllables.
Adjusting the duration of each syllable so that it aligns with the corresponding lyric duration may comprise: when adjusting the duration of a syllable comprising an initial (consonant) and a final (vowel), if the duration of the syllable needs to be lengthened, keeping the duration of the initial constant and lengthening only the duration of the final; and if the duration of the syllable needs to be shortened, shortening the initial and the final simultaneously.
Alternatively, it may comprise: when a syllable is both preceded and followed by silent segments, making the duration of its initial account for 16.2% of the whole syllable duration; when the syllable is preceded but not followed by a silent segment, making the initial account for 27.6% of the whole syllable duration; when the syllable is followed but not preceded by a silent segment, making the initial account for 24.8% of the whole syllable duration; and when the syllable is neither preceded nor followed by a silent segment, making the initial account for 32.9% of the whole syllable duration.
Specifically, adjusting the pitch of the speech data according to the pitch of each note in the score information, so that each pitch point aligns with the pitch of the corresponding note, comprises: extracting the pitch information of the input speech data, the pitch information comprising the mean fundamental frequency of the speech data and each pitch point of the speech data; determining the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information; and, taking the determined key as a reference, adjusting the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
In addition, determining the key of the melody based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information comprises: determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information; if F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver and n is an empirical value, for example n = int(K/7), where int denotes rounding down; and if F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is as above.
More preferably, after the key of the melody is determined, the method further comprises: segmenting the pitch points, wherein the frequency difference between two adjacent pitch points belonging to different segments is greater than a set segmentation threshold; identifying segments whose length is less than a preset outlier length threshold as outlier (wild point) segments; and performing sinc interpolation on the frequencies of the pitch points in the outlier segments.
Alternatively, after the pitch of the speech data is adjusted, the method further comprises: in the pitch-adjusted speech data, performing sinc interpolation between the last m% of the pitch points of each note and the first m% of the pitch points of the following note, where m% is a set empirical value.
An apparatus for converting speech into a melody comprises: a user interface 600, a score management unit 610, a duration adjustment unit 620, a pitch adjustment unit 630 and a melody synthesis unit 640.
The user interface 600 is used to obtain the speech data input by the user and the score information selected from the score management unit; the speech data is input by the user, and the score information comprises lyric information, note information, and the correspondence between the two.
The score management unit 610 is used to manage score information for the user to select.
The duration adjustment unit 620 is used to adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the selected score information.
The pitch adjustment unit 630 is used to adjust the pitch of the speech data according to the pitch of each note in the selected score information, so that each pitch point aligns with the pitch of the corresponding note.
The melody synthesis unit 640 is used to combine the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
The duration adjustment unit specifically comprises: a feature extraction subunit 621, a segment identification subunit 622, a speech segmentation subunit 623 and a duration adjustment subunit 624.
The feature extraction subunit 621 is used to extract the energy and zero-crossing rate of each frame in the input speech data.
The segment identification subunit 622 is used to divide the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame.
The speech segmentation subunit 623 is used to cut the speech segments into syllables according to the lyric information in the selected score information.
The duration adjustment subunit 624 is used to adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
Specifically, the segment identification subunit 622 identifies each frame as a speech frame or a mute frame according to its energy and zero-crossing rate, groups adjacent speech frames into speech segments, and groups adjacent mute frames into silent segments.
The speech segmentation subunit 623 comprises: a first module 6231 for determining the speech segments corresponding to each sentence in the lyrics of the score information; a second module 6232 for determining the speech segments corresponding to each phrase contained in each sentence; and a third module 6233 for performing syllable segmentation on the speech segments corresponding to each phrase.
In addition, the pitch adjustment unit 630 specifically comprises: a feature extraction subunit 631, a key determination subunit 632 and a pitch adjustment subunit 633.
The feature extraction subunit 631 is used to extract the pitch information of the input speech data, the pitch information comprising the mean fundamental frequency of the speech data and each pitch point of the speech data.
The key determination subunit 632 is used to determine the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information.
The pitch adjustment subunit 633 is used to take the key determined by the key determination subunit as a reference and adjust the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
The key determination subunit 632 comprises: a fourth module 6321 for determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information; a fifth module 6322 for, when F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver, n is an empirical value, in particular n = int(K/7), and int denotes rounding down; and a sixth module 6323 for, when F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is as above.
More preferably, the pitch adjustment unit 630 also comprises a pitch smoothing subunit 634, used to segment the pitch points such that the frequency difference between two adjacent pitch points belonging to different segments is greater than the set segmentation threshold, to identify segments whose length is less than the preset outlier length threshold as outlier segments, to perform sinc interpolation on the frequencies of the pitch points in the outlier segments, and to output the result to the pitch adjustment subunit.
More preferably, the apparatus also comprises a melody smoothing unit 650, used to perform, in the speech data adjusted by the pitch adjustment unit, sinc interpolation between the last m% of the pitch points of each note and the first m% of the pitch points of the following note, and to output the result to the melody synthesis unit, where m% is a set empirical value.
As can be seen from the above description, by adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information, and by adjusting the pitch of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note, the speech data input by the user can be converted, according to the selected score information, into a melody that retains the user's voice characteristics.
Brief description of the drawings
Fig. 1 is a flowchart of the main method provided by the present invention;
Fig. 2 is a schematic diagram of score information according to an embodiment of the present invention;
Fig. 3 is a flowchart of an implementation of the duration matching algorithm provided by the present invention;
Fig. 4 is a flowchart of an implementation of the pitch matching algorithm provided by the present invention;
Fig. 5 is a flowchart of the method for smoothing the speech pitch envelope provided by the present invention;
Fig. 6 is a structural diagram of the apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
The core idea of the present invention is to match the user's speech data with score information according to the relationship between speech and melody, and finally output a melody. The main method, as shown in Fig. 1, comprises the following steps:
Step 101: obtain speech data and the corresponding score information, wherein the speech data is input by the user and the score information comprises lyric information, note information, and the correspondence between the two.
Step 102: adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information, and adjust the pitch points of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note.
Step 103: combine the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
The method provided by the present invention is described in detail below with reference to a specific embodiment, taking the song "粉刷匠" ("The Little Painter") as an example.
In step 101, the user may select the score information of the song through a user interface, or may input the score information directly. The score information mainly comprises: the lyric information, the note information, and the correspondence between the lyrics and the notes. It may also comprise information such as the title, subtitle, lyricist and composer, and time signature, as shown in Fig. 2.
The user can input speech data according to the lyrics of the song, for example by reading a passage of the lyrics aloud. The passage read by the user corresponds to the score information selected or input above; that is, the score information comprises the lyric information of the passage read by the user, the note information, and the correspondence between the lyrics and the notes.
There is no fixed order in which the speech data and the score information must be obtained; the order can be arranged according to the user's habits or preferences.
In step 102, two matching processes need to be completed. The first adjusts the duration of each syllable in the speech data so that it matches the corresponding lyric duration in the score information, i.e., the matching of speech duration and lyric duration. The second adjusts each pitch point of the speech data so that it matches the pitch of the corresponding note in the score information, i.e., the matching of speech pitch and note pitch. The two matching processes are described in detail below.
The first matching process, the matching of speech duration and lyric duration, can be realized by a duration matching algorithm (time alignment algorithm). The detailed process, shown in Fig. 3, comprises the following steps:
Step 301: extract the characteristic parameters of the speech data, including the energy and zero-crossing rate of each frame.
A frame is generally a piece of speech data of fixed length, for example in the range of 20 ms to 30 ms; the speech data within such a span can be regarded as a stationary signal (its mean and variance are relatively fixed). In a specific implementation, this length can be preset to a suitable empirical value. The energy of a frame is the sum of the squared amplitudes of the frame's samples, and the zero-crossing rate of a frame is the rate at which its amplitude crosses zero.
Step 302: divide the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame.
In this step, each frame can first be identified as a speech frame or a mute frame according to its energy and zero-crossing rate, using a preset energy threshold and zero-crossing rate threshold: a frame whose energy exceeds the energy threshold and whose zero-crossing rate exceeds the zero-crossing rate threshold is identified as a speech frame, and every other frame is identified as a mute frame; the thresholds can be set according to experience and/or experimental data. Next, runs of adjacent speech frames are marked as speech segments and runs of adjacent mute frames as silent segments, so that the speech data is divided into alternating speech segments and silent segments.
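For illustration, the following minimal Python sketch classifies frames by energy and zero-crossing rate as described above; the frame length and both thresholds are assumed values standing in for the preset empirical values left open here:

```python
import numpy as np

def classify_frames(samples, sample_rate, frame_ms=25,
                    energy_thresh=1e-3, zcr_thresh=0.02):
    """Label each fixed-length frame of a mono signal as 'speech' or 'mute'.

    frame_ms, energy_thresh and zcr_thresh are illustrative assumptions;
    the description above only says they are preset empirical values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = float(np.sum(frame ** 2))  # sum of squared amplitudes
        # zero-crossing rate: fraction of adjacent sample pairs changing sign
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        labels.append("speech" if energy > energy_thresh and zcr > zcr_thresh
                      else "mute")
    return labels
```

Runs of adjacent equal labels then give the speech segments and silent segments.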
Step 303: cut the speech segments into syllables according to the lyric information in the score information.
In the embodiments of the present invention, a syllable generally corresponds to one character.
After step 302, the speech data has been divided into alternating speech segments and silent segments. In this step, the speech segments corresponding to each sentence in the lyrics of the score information are determined first (sentence-layer processing), then the speech segments corresponding to each phrase within each sentence (phrase-layer processing), and finally syllable segmentation is performed on the speech segments corresponding to each phrase to obtain the syllables.
Specifically, for the sentence-layer processing, the number of speech segments is in most cases greater than or equal to the number of sentences in the lyrics of the score information. In this case, cut points need to be selected, and the speech segments between two cut points are merged, yielding the speech segments corresponding to each sentence in the lyrics. A cut point can be selected according to the following formula:
min( abs( splitTime / totalTime - senLenInLRC / totalLenInLRC ) )
where splitTime is the total duration of the speech segments from the previous cut point to the currently considered cut point; totalTime is the total duration of all speech segments; senLenInLRC is the sung duration, in the score, of the sentence currently being cut; and totalLenInLRC is the sung duration of the whole score. Specifically, each silent segment is first taken as a candidate cut point, the value of abs(splitTime/totalTime - senLenInLRC/totalLenInLRC) is computed for each candidate, and the silent segment giving the minimum value is the selected cut point. In short, the method traverses the candidate cut points, one per silent segment, and finds the best cut point for each sentence in turn. Note that if the number of speech segments is smaller than the number of sentences in the lyrics of the score information, step 302 needs to be re-executed, i.e., the speech data is divided into speech segments and silent segments again.
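A sketch of this traversal, assuming candidate cut points lie in the silence after each speech segment; all names are illustrative, since only the minimised quantity is specified above:

```python
def pick_cut_point(segment_durations, start_idx, total_time,
                   sen_len_in_lrc, total_len_in_lrc):
    """Pick the cut point for the current lyric sentence by minimising
    abs(splitTime/totalTime - senLenInLRC/totalLenInLRC)."""
    target = sen_len_in_lrc / total_len_in_lrc  # sentence's share of the score
    split_time = 0.0
    best_idx, best_err = start_idx, float("inf")
    for i in range(start_idx, len(segment_durations)):
        split_time += segment_durations[i]  # speech duration since the last cut
        err = abs(split_time / total_time - target)
        if err < best_err:
            best_idx, best_err = i, err
    return best_idx  # the sentence ends after segment best_idx
```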
For the phrase-layer processing, there are three cases. In the first case, the number of speech segments corresponding to a sentence equals the number of phrases the sentence contains; the phrases and speech segments then correspond one to one. In the second case, the number of speech segments corresponding to the sentence is greater than the number of phrases; cut points are then selected according to the cut-point selection method above, and the speech segments between two cut points are merged to determine the speech segment corresponding to each phrase. In the third case, the number of speech segments corresponding to the sentence is smaller than the number of phrases; the speech segments corresponding to the sentence are then segmented into syllables directly, and the number of syllables obtained is compared with the number of characters the sentence contains, again distinguishing three cases: if the number of syllables equals the number of characters, each syllable corresponds to one character; if the number of syllables is greater than the number of characters, cut points are found as in the sentence-layer processing, and the syllables between cut points are merged and mapped to the corresponding phrases; if the number of syllables is smaller than the number of characters, the syllable with the longest duration is found and split into two syllables, and this is repeated until the number of syllables equals the number of characters. In this third case the phrase layer has also completed the work of the syllable layer, so syllable-layer processing is not performed again.
Finally, a syllable segmentation algorithm cuts the speech segment corresponding to each phrase into syllables; for example, the speech segment constituting the phrase "我是一个粉刷匠" ("I am a painter") is cut into the seven syllables 我, 是, 一, 个, 粉, 刷, 匠. Specifically, an existing state-machine syllable segmentation algorithm can be used for this cutting.
Step 304: according to the lyric durations in the score information, adjust the duration of each syllable in the speech segments so that it aligns with the corresponding lyric duration.
Because of the correspondence between the lyrics and the notes in the score information, each character in the lyrics has its own duration. For example, in the score, the characters 我 and 是 differ in duration from 匠: the syllable 匠 occupies a full beat, while 我 and 是 each occupy half a beat. In the speech data input by the user, since the input is speech without melody, the syllables may all have similar durations; therefore, the duration of each syllable in the speech data needs to be adjusted to match the duration of the corresponding syllable in the lyrics of the score information.
Note that in Chinese a character is composed of an initial and a final. Preferably, when adjusting the duration of a syllable comprising an initial and a final, if the syllable needs to be lengthened, the duration of the initial is kept constant and only the final is lengthened; if the syllable needs to be shortened, the initial and the final are shortened simultaneously. This better matches singing habits and makes the result more melodic. Based on this principle, GMM (Gaussian mixture model) clustering is used to estimate the initial and final durations, which are then adjusted; specifically:
When a syllable is both preceded and followed by silent segments, it is an isolated syllable, and the initial can be made to account for 16.2% of the whole syllable duration. When the syllable is preceded but not followed by a silent segment, it is the first syllable of a phrase or sentence, and the initial can account for 27.6% of the whole syllable duration. When the syllable is followed but not preceded by a silent segment, it is the last syllable of a phrase or sentence, and the initial can account for 24.8% of the whole syllable duration. When the syllable is neither preceded nor followed by a silent segment, the initial can account for 32.9% of the whole syllable duration.
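Since these four percentages are plain empirical constants given above, the rule reduces to a small lookup (a sketch; the function name and boolean interface are illustrative):

```python
def initial_ratio(silence_before, silence_after):
    """Target share of the syllable duration given to the initial,
    using the empirical percentages given above."""
    if silence_before and silence_after:
        return 0.162  # isolated syllable
    if silence_before:
        return 0.276  # first syllable of a phrase or sentence
    if silence_after:
        return 0.248  # last syllable of a phrase or sentence
    return 0.329      # syllable inside a phrase
```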
The second matching process, the matching of speech pitch and note pitch, can be realized by a pitch matching algorithm (pitch alignment algorithm). The detailed process, shown in Fig. 4, comprises the following steps:
Step 401: extract the characteristic parameters of the speech data, including the pitch information, i.e., the fundamental frequency of each frame. The pitch information comprises the mean fundamental frequency of the speech data and each pitch point of the speech data.
Step 402: determine the key of the melody into which the speech data is converted, based on the mean fundamental frequency over all frames of the speech data and the mean fundamental frequency of all notes in the score information.
A score carries a key as one of its own characteristics, and in many cases the fundamental frequency of the raw speech differs considerably from the pitches of the score. For the final melody formed from the adjusted speech to retain the voice characteristics of the raw speech, the key of the melody must be determined from both the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information.
Specifically, the key of the melody can be determined as follows: determine the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information. If F0_aver > P_aver, lower the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver and n is an empirical value, for example n = int(K/7), where int denotes rounding down. If F0_aver < P_aver, raise the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is as above.
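A sketch of this key determination follows; computing K from the frequency ratio via 12 * log2 and rounding it to an integer are assumptions, since K is defined above only as the semitone distance between the two means:

```python
import math

def melody_key_shift(f0_aver, p_aver):
    """Signed semitone shift applied to the speech mean F0 to obtain the
    melody key: down by K - n semitones if F0_aver > P_aver, up otherwise,
    with the empirical correction n = int(K / 7)."""
    k = round(abs(12 * math.log2(f0_aver / p_aver)))  # semitone distance K (assumed formula)
    n = int(k / 7)
    shift = k - n
    return -shift if f0_aver > p_aver else shift
```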
Step 403: taking the key determined in step 402 as a reference, adjust the frequency of each frame's pitch point to align with the pitch of the corresponding note in the score information.
Each note in the score (Do, Re, Mi, Fa, Sol, La, Si and the upper Do) has its own pitch under the key of the score. During alignment, the frequency of each pitch point is adjusted according to the key determined in step 402 so that it aligns with the pitch of the corresponding note in the score information.
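For concreteness, one plausible mapping from a scale degree to a target frequency under the determined key is equal temperament; the tuning system is not stated above, so the following is purely an assumption:

```python
def note_frequency(semitones_above_key_root, key_root_hz):
    """Equal-tempered frequency of a scale degree relative to the key root,
    f = root * 2**(s / 12); an assumed mapping, not specified above."""
    return key_root_hz * 2 ** (semitones_above_key_root / 12)

# Example: with the key root at 220 Hz, Mi (4 semitones up) is about 277.2 Hz.
```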
In addition, in the matching of speech pitch and note pitch, the smoothness of the pitch contour is a key factor determining the sound quality of the melody. In the present invention, better sound quality can be obtained by further smoothing the speech pitch envelope or the melody pitch contour.
The smoothing of the speech pitch envelope is described first. Because of errors in extracting the speech pitch parameters, sudden frequency jumps inevitably appear in the speech pitch envelope. The pitch points at these frequency discontinuities are called outlier (wild) points, and they are the main cause of degraded sound quality; therefore, the outlier points in the speech pitch envelope need to be smoothed. The smoothing method, shown in Fig. 5, comprises the following steps:
Step 501: segment the sequence of per-frame pitch points, such that the frequency difference between two adjacent pitch points belonging to different segments is greater than a set segmentation threshold (a preset empirical value).
In this step, the sequence of per-frame pitch points can be obtained by extracting the pitch point of each frame of the speech signal.
Suppose the pitch point sequence of the input speech data is P = {P_1, P_2, P_3, ..., P_N}. When the frequency difference between two adjacent pitch points is greater than the segmentation threshold Threshold, the boundary between two segments is placed between those two points: if the frequency difference between P_i and P_(i+1) is greater than Threshold, the sequence is split into {P_1, ..., P_i} and {P_(i+1), ..., P_N}. The threshold involves a scale factor a, obtained by experiment, which affects the smoothing effect; a can be set to 4.
Step 502: identify segments whose length is less than the outlier length threshold Th_Time (an empirical value, generally small, for example 0.06 seconds) as outlier segments. Suppose that, in the manner of step 501, the pitch point sequence has been divided into K segments; any segment whose length is less than Th_Time is an outlier sequence.
Step 503: perform sinc interpolation on the frequencies of the pitch points in the outlier segments.
According to the frequencies of the pitch points before and after an outlier segment, sinc interpolation is performed over the sequence length of the outlier segment.
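A sketch of steps 501 to 503 on an F0 contour. The segmentation threshold and minimum segment length stand in for the empirical values Threshold and Th_Time, and applying Whittaker-Shannon (sinc) interpolation with non-uniformly spaced known points is an approximation:

```python
import numpy as np

def sinc_interp(x_known, y_known, x_query):
    """Whittaker-Shannon (sinc) interpolation of known samples, evaluated at
    x_query; exact only for uniformly spaced samples, used here as an
    approximation across the gap left by an outlier segment."""
    x_known = np.asarray(x_known, dtype=float)
    y_known = np.asarray(y_known, dtype=float)
    return np.array([np.sum(y_known * np.sinc(xq - x_known)) for xq in x_query])

def smooth_pitch_outliers(f0, seg_threshold, min_len):
    """Split the F0 contour where adjacent points jump by more than
    seg_threshold (step 501), mark segments shorter than min_len points as
    outlier segments (step 502), and re-generate them by sinc interpolation
    from the remaining points (step 503)."""
    f0 = np.asarray(f0, dtype=float).copy()
    bounds = [0] + [i for i in range(1, len(f0))
                    if abs(f0[i] - f0[i - 1]) > seg_threshold] + [len(f0)]
    outlier = np.zeros(len(f0), dtype=bool)
    for s, e in zip(bounds[:-1], bounds[1:]):
        if e - s < min_len:
            outlier[s:e] = True  # short segment: treat as wild points
    if outlier.any() and (~outlier).any():
        idx = np.arange(len(f0))
        f0[outlier] = sinc_interp(idx[~outlier], f0[~outlier], idx[outlier])
    return f0
```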
The flow shown in Fig. 5 can be executed between step 402 and step 403: the pitch points smoothed through the pitch envelope smoothing are then adjusted in step 403 so that, under the key determined in step 402, they align with the pitches of the notes in the score information.
The smoothing of the melody pitch contour is introduced next. Because the fundamental frequency of the final melody is obtained by adjusting the frequencies of the pitch points, the melody pitch contour is a spliced result, and obvious frequency discontinuities are likely to appear between adjacent syllables of the synthesized melody, degrading sound quality. In the present invention, the last m% of the pitch points of each syllable and the first m% of the pitch points of the following syllable can be smoothed; specifically, the frequencies of these head and tail points are regenerated by sinc interpolation while the sequence length remains unchanged. Here m% is a set empirical value, which can be 20%.
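The sinc_interp helper from the previous sketch can smooth such a boundary; below is a sketch with m = 0.20 as suggested above (the two-sequence interface is illustrative):

```python
import numpy as np

def smooth_note_boundary(prev_f0, next_f0, m=0.20):
    """Re-generate the last m of prev_f0 and the first m of next_f0 by sinc
    interpolation from the surrounding points (reusing sinc_interp from the
    previous sketch); sequence lengths are unchanged."""
    joined = np.concatenate([np.asarray(prev_f0, dtype=float),
                             np.asarray(next_f0, dtype=float)])
    n_prev = len(prev_f0)
    k1, k2 = int(n_prev * m), int(len(next_f0) * m)
    mask = np.zeros(len(joined), dtype=bool)
    mask[n_prev - k1:n_prev + k2] = True  # boundary region to re-generate
    if mask.any() and (~mask).any():
        idx = np.arange(len(joined))
        joined[mask] = sinc_interp(idx[~mask], joined[~mask], idx[mask])
    return joined[:n_prev], joined[n_prev:]
```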
The above describes the method provided by the present invention; the apparatus provided by the present invention is described in detail below. As shown in Fig. 6, the apparatus comprises: a user interface 600, a score management unit 610, a duration adjustment unit 620, a pitch adjustment unit 630 and a melody synthesis unit 640.
The user interface 600 obtains the speech data input by the user and the score information selected from the score management unit; the score information comprises lyric information, note information, and the correspondence between the two.
The score management unit 610 manages score information for the user to select.
In addition, the user interface 600 can display the score information managed by the score management unit 610 for the user to select from.
The duration adjustment unit 620 adjusts the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the selected score information.
The pitch adjustment unit 630 adjusts the pitch of the speech data according to the pitch of each note in the selected score information, so that each pitch point aligns with the pitch of the corresponding note.
The melody synthesis unit 640 combines the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
The duration adjustment unit 620 may specifically comprise: a feature extraction subunit 621, a segment identification subunit 622, a speech segmentation subunit 623 and a duration adjustment subunit 624.
The feature extraction subunit 621 extracts the energy and zero-crossing rate of each frame in the input speech data.
The segment identification subunit 622 divides the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame; the segmentation method described for the method above can be used.
The speech segmentation subunit 623 cuts the speech segments into syllables according to the lyric information in the selected score information.
The duration adjustment subunit 624 adjusts the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
Specifically, the segment identification subunit 622 can identify each frame as a speech frame or a mute frame according to its energy and zero-crossing rate, group adjacent speech frames into speech segments, and group adjacent mute frames into silent segments.
In addition, the speech segmentation subunit 623 may further comprise: a first module 6231 for determining the speech segments corresponding to each sentence in the lyrics of the score information; a second module 6232 for determining the speech segments corresponding to each phrase contained in each sentence; and a third module 6233 for performing syllable segmentation on the speech segments corresponding to each phrase.
The pitch adjustment unit 630 may specifically comprise: a feature extraction subunit 631, a key determination subunit 632 and a pitch adjustment subunit 633.
The feature extraction subunit 631 extracts the pitch information of the input speech data.
The key determination subunit 632 determines the key of the melody based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information.
The pitch adjustment subunit 633, taking the key determined by the key determination subunit 632 as a reference, adjusts the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
The key determination subunit 632 may specifically comprise: a fourth module 6321 for determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information; a fifth module 6322 for, when F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver, n is an empirical value, in particular n = int(K/7), and int denotes rounding down; and a sixth module 6323 for, when F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is as above.
More preferably, in order to further improve the sound quality of the melody, either or both of the following two mechanisms can be used.
First, the pitch adjustment unit 630 further comprises a pitch smoothing subunit 634, which segments the pitch points such that the frequency difference between two adjacent pitch points belonging to different segments is greater than the set segmentation threshold, identifies segments shorter than the preset outlier length threshold as outlier segments, performs sinc interpolation on the frequencies of the pitch points in the outlier segments, and provides the interpolated speech data to the pitch adjustment subunit 633.
Second, the apparatus may further comprise a melody smoothing unit 650, which, in the speech data adjusted by the pitch adjustment unit 630, performs sinc interpolation between the last m% of the pitch points of each note and the first m% of the pitch points of the following note, and outputs the result to the melody synthesis unit 640, where m% is a set empirical value.
Finally, the melody synthesis unit 640 can output the synthesized melody to an audio playback device or play it to the user.
The present invention provides a method and apparatus for converting speech into a melody. The method comprises: obtaining the input speech data and score information; adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information; adjusting the pitch points of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note in the score information; and combining the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data. With the present invention, speech data input by a user can be converted, according to the selected score information, into a melody that retains the user's voice characteristics.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (16)

1. A method for converting speech into a melody, the method comprising:
obtaining speech data and score information, wherein the speech data is input by a user and the score information comprises lyric information, note information, and the correspondence between the two;
adjusting the duration of each syllable in the speech data so that the duration of each syllable aligns with the corresponding lyric duration in the score information;
adjusting the pitch points of the speech data according to the pitch of each note in the score information, so that the pitch points align with the pitches of the corresponding notes in the score information; and
combining the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
2. The method according to claim 1, wherein adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information comprises:
extracting the energy and zero-crossing rate of each frame in the speech data;
dividing the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame;
cutting each speech segment into syllables according to the lyric information in the score information; and
adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
3. The method according to claim 2, wherein dividing the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame comprises:
identifying each frame as a speech frame or a mute frame according to its energy and zero-crossing rate; and
grouping adjacent speech frames into speech segments and adjacent mute frames into silent segments.
4. The method according to claim 2, wherein cutting each speech segment into syllables according to the lyric information in the score information comprises:
determining the speech segments corresponding to each sentence in the lyrics of the score information;
determining the speech segments corresponding to each phrase contained in each sentence; and
performing syllable segmentation on the speech segments corresponding to each phrase to obtain the syllables.
5. The method according to claim 2, wherein adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information comprises:
when adjusting the duration of a syllable comprising an initial and a final, if the duration of the syllable needs to be lengthened, keeping the duration of the initial constant and lengthening only the duration of the final; and if the duration of the syllable needs to be shortened, shortening the initial and the final simultaneously.
6. The method according to claim 2, wherein adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information comprises:
when a syllable is both preceded and followed by silent segments, making the duration of its initial account for 16.2% of the whole syllable duration;
when the syllable is preceded but not followed by a silent segment, making the duration of its initial account for 27.6% of the whole syllable duration;
when the syllable is followed but not preceded by a silent segment, making the duration of its initial account for 24.8% of the whole syllable duration; and
when the syllable is neither preceded nor followed by a silent segment, making the duration of its initial account for 32.9% of the whole syllable duration.
7. The method according to claim 1, wherein adjusting the pitch points of the speech data according to the pitch of each note in the score information, so that the pitch points align with the pitches of the corresponding notes in the score information, comprises:
extracting the pitch information of the speech data, the pitch information comprising the mean fundamental frequency of the speech data and each pitch point of the speech data;
determining the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information; and
taking the determined key as a reference, adjusting the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
8. The method according to claim 7, wherein determining the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information, comprises:
determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information;
if F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver and n is an empirical value, in particular n = int(K/7), where int denotes rounding down; and
if F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is an empirical value, in particular n = int(K/7).
9. The method according to claim 7, wherein after determining the key of the melody into which the speech data is converted, the method further comprises:
segmenting the pitch points, wherein the frequency difference between two adjacent pitch points belonging to different segments is greater than a set segmentation threshold;
identifying segments whose length is less than a preset outlier length threshold as outlier segments; and
performing sinc interpolation on the frequencies of the pitch points in the outlier segments.
10. The method according to claim 1 or 7, wherein the method further comprises: in the pitch-adjusted speech data, performing sinc interpolation between the last m% of the pitch points of each note and the first m% of the pitch points of the following note, where m% is a set empirical value.
11. An apparatus for converting speech into a melody, the apparatus comprising: a user interface (600), a score management unit (610), a duration adjustment unit (620), a pitch adjustment unit (630) and a melody synthesis unit (640);
wherein the user interface (600) is configured to obtain the speech data input by a user and the score information selected from the score management unit, the score information comprising lyric information, note information, and the correspondence between the two;
the score management unit (610) is configured to manage score information for the user to select;
the duration adjustment unit (620) is configured to adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the selected score information;
the pitch adjustment unit (630) is configured to adjust the pitch points of the speech data according to the pitch of each note in the selected score information, so that each pitch point aligns with the pitch of the corresponding note; and
the melody synthesis unit (640) is configured to combine the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
12. The apparatus according to claim 11, wherein the duration adjustment unit specifically comprises: a feature extraction subunit (621), a segment identification subunit (622), a speech segmentation subunit (623) and a duration adjustment subunit (624);
wherein the feature extraction subunit (621) is configured to extract the energy and zero-crossing rate of each frame in the speech data;
the segment identification subunit (622) is configured to divide the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame extracted by the feature extraction subunit;
the speech segmentation subunit (623) is configured to cut the speech segments into syllables according to the lyric information in the selected score information; and
the duration adjustment subunit (624) is configured to adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
13. The apparatus according to claim 12, wherein the speech segmentation subunit (623) comprises:
a first module (6231) for determining the speech segments corresponding to each sentence in the lyrics of the score information;
a second module (6232) for determining the speech segments corresponding to each phrase contained in each sentence; and
a third module (6233) for performing syllable segmentation on the speech segments corresponding to each phrase.
14. The apparatus according to claim 11, wherein the pitch adjustment unit (630) specifically comprises: a feature extraction subunit (631), a key determination subunit (632) and a pitch adjustment subunit (633);
wherein the feature extraction subunit (631) is configured to extract the pitch information of the input speech data, the pitch information comprising the mean fundamental frequency of the speech data and each pitch point of the speech data;
the key determination subunit (632) is configured to determine the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information; and
the pitch adjustment subunit (633) is configured to take the key determined by the key determination subunit as a reference and adjust the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
15. The apparatus according to claim 14, wherein the key determination subunit (632) comprises:
a fourth module (6321) for determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information;
a fifth module (6322) for, when F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver, n is an empirical value, in particular n = int(K/7), and int denotes rounding down; and
a sixth module (6323) for, when F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is an empirical value, in particular n = int(K/7).
16. The apparatus according to claim 14, wherein the pitch adjustment unit (630) further comprises a pitch smoothing subunit (634) configured to: segment the pitch points such that the frequency difference between two adjacent pitch points belonging to different segments is greater than a set segmentation threshold; identify segments whose length is less than a preset outlier length threshold as outlier segments; and perform sinc interpolation on the frequencies of the pitch points in the outlier segments before outputting them to the pitch adjustment subunit.
CN2011102956675A 2011-09-30 2011-09-30 Method and device for transforming voice into melody Pending CN103035235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102956675A CN103035235A (en) 2011-09-30 2011-09-30 Method and device for transforming voice into melody

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102956675A CN103035235A (en) 2011-09-30 2011-09-30 Method and device for transforming voice into melody

Publications (1)

Publication Number Publication Date
CN103035235A true CN103035235A (en) 2013-04-10

Family

ID=48022067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102956675A Pending CN103035235A (en) 2011-09-30 2011-09-30 Method and device for transforming voice into melody

Country Status (1)

Country Link
CN (1) CN103035235A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0605348A2 (en) * 1992-12-30 1994-07-06 International Business Machines Corporation Method and system for speech data compression and regeneration
US5796916A (en) * 1993-01-21 1998-08-18 Apple Computer, Inc. Method and apparatus for prosody for synthetic speech prosody determination
CN101313477A (en) * 2005-12-21 2008-11-26 Lg电子株式会社 Music generating device and operating method thereof
CN101399036A * 2007-09-30 2009-04-01 三星电子株式会社 Device and method for converting voice into rap music
US20090314155A1 (en) * 2008-06-20 2009-12-24 Microsoft Corporation Synthesized singing voice waveform generator

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jordi Bonada et al.: "Synthesis of the Singing Voice by Performance Sampling and Spectral Models", IEEE Signal Processing Magazine *
Takeshi Saitou et al.: "Development of an F0 control model based on F0 dynamic characteristics for singing-voice synthesis", Speech Communication *
Takeshi Saitou et al.: "Speech-to-Singing Synthesis: Converting Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices", 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103337244B (en) * 2013-05-20 2015-08-26 北京航空航天大学 Outlier correction method for fundamental frequency curves of isolated syllables
CN103337244A (en) * 2013-05-20 2013-10-02 北京航空航天大学 Outlier correction algorithm for fundamental frequency curves of isolated syllables
CN103456295A (en) * 2013-08-05 2013-12-18 安徽科大讯飞信息科技股份有限公司 Method and system for generating fundamental frequency parameters in singing synthesis
CN103456295B (en) * 2013-08-05 2016-05-18 科大讯飞股份有限公司 Method and system for generating fundamental frequency parameters in singing synthesis
CN105829532A (en) * 2013-11-14 2016-08-03 查理·周 System and method for creating audible sound representations of atoms and molecules
CN105206257B (en) * 2015-10-14 2019-01-18 科大讯飞股份有限公司 Voice conversion method and device
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
CN107146631B (en) * 2016-02-29 2020-11-10 北京搜狗科技发展有限公司 Music identification method, note identification model establishment method, device and electronic equipment
CN107146631A (en) * 2016-02-29 2017-09-08 北京搜狗科技发展有限公司 Music identification method, note identification model establishment method, device and electronic equipment
CN106373580A (en) * 2016-09-05 2017-02-01 北京百度网讯科技有限公司 Singing synthesis method and device based on artificial intelligence
CN106373580B (en) * 2016-09-05 2019-10-15 北京百度网讯科技有限公司 Singing synthesis method and device based on artificial intelligence
CN107039024A (en) * 2017-02-10 2017-08-11 美国元源股份有限公司 Music data processing method and processing device
CN106898340A (en) * 2017-03-30 2017-06-27 腾讯音乐娱乐(深圳)有限公司 Song synthesis method and terminal
CN110741430B (en) * 2017-06-14 2023-11-14 雅马哈株式会社 Singing synthesis method and singing synthesis system
CN110741430A (en) * 2017-06-14 2020-01-31 雅马哈株式会社 Singing synthesis method and singing synthesis system
CN108053814A (en) * 2017-11-06 2018-05-18 芋头科技(杭州)有限公司 Speech synthesis system and method for simulating singing voice of user
CN108053814B (en) * 2017-11-06 2023-10-13 芋头科技(杭州)有限公司 Speech synthesis system and method for simulating singing voice of user
CN109979497A (en) * 2017-12-28 2019-07-05 阿里巴巴集团控股有限公司 Song generation method, device and system, and song data processing and playback method
CN109493684A (en) * 2018-12-10 2019-03-19 北京金三惠科技有限公司 Multifunctional digital music teaching system
CN109493684B (en) * 2018-12-10 2021-02-23 北京金三惠科技有限公司 Multifunctional digital music teaching system
CN109741724A (en) * 2018-12-27 2019-05-10 歌尔股份有限公司 Method and apparatus for producing songs, and intelligent audio device
CN109979422B (en) * 2019-02-21 2021-09-28 百度在线网络技术(北京)有限公司 Fundamental frequency processing method, device, equipment and computer readable storage medium
CN109979422A (en) * 2019-02-21 2019-07-05 百度在线网络技术(北京)有限公司 Fundamental frequency processing method, device, equipment and computer readable storage medium
CN112951198A (en) * 2019-11-22 2021-06-11 微软技术许可有限责任公司 Singing voice synthesis
CN111210850B (en) * 2020-01-10 2021-06-25 腾讯音乐娱乐科技(深圳)有限公司 Lyric alignment method and related product
CN111210850A (en) * 2020-01-10 2020-05-29 腾讯音乐娱乐科技(深圳)有限公司 Lyric alignment method and related product
WO2022012164A1 (en) * 2020-07-16 2022-01-20 百果园技术(新加坡)有限公司 Method and apparatus for converting voice into rap music, device, and storage medium
CN112750420A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method, device and equipment
CN112750420B (en) * 2020-12-23 2023-01-31 出门问问创新科技有限公司 Singing voice synthesis method, device and equipment
CN112820257A (en) * 2020-12-29 2021-05-18 吉林大学 GUI sound synthesis device based on MATLAB
CN112820257B (en) * 2020-12-29 2022-10-25 吉林大学 GUI voice synthesis device based on MATLAB
CN112786013A (en) * 2021-01-11 2021-05-11 北京有竹居网络技术有限公司 Voice synthesis method and device based on album, readable medium and electronic equipment
CN113053355A (en) * 2021-03-17 2021-06-29 平安科技(深圳)有限公司 Fole human voice synthesis method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103035235A (en) Method and device for transforming voice into melody
CN108806656B (en) Automatic generation of songs
WO2021218138A1 (en) Song synthesis method, apparatus and device, and storage medium
CN101308652B (en) Synthesizing method of personalized singing voice
CN104347080B (en) Speech analysis method and device, speech synthesis method and device, and medium storing a speech analysis program
CN108806655B (en) Automatic generation of songs
US9818396B2 (en) Method and device for editing singing voice synthesis data, and method for analyzing singing
CN104272382B (en) Template-based personalized singing synthesis method and system
CN102779508B (en) Voice library generation apparatus and method, and speech synthesis system and method
CN101399036B (en) Device and method for converting voice into rap music
Molina et al. SiPTH: Singing transcription based on hysteresis defined on the pitch-time curve
CN103915093B (en) Method and apparatus for converting speech into singing
CN112951198A (en) Singing voice synthesis
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
Tamaru et al. JVS-MuSiC: Japanese multispeaker singing-voice corpus
Mesaros Singing voice identification and lyrics transcription for music information retrieval (invited paper)
CN111370024A (en) Audio adjusting method, device and computer readable storage medium
CN112289300B (en) Audio processing method and device, electronic equipment and computer readable storage medium
Umbert et al. Generating singing voice expression contours based on unit selection
Koguchi et al. PJS: Phoneme-balanced Japanese singing-voice corpus
JP5598516B2 (en) Voice synthesis system for karaoke and parameter extraction device
JP2022120188A (en) Music reproduction system, method and program for controlling the same
Bonada et al. Hybrid neural-parametric f0 model for singing synthesis
CN104376850A (en) Estimation method for fundamental frequency of Chinese whispered speech
CN108922505B (en) Information processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130410