CN103035235A - Method and device for transforming voice into melody - Google Patents

Method and device for transforming voice into melody

Info

Publication number
CN103035235A
Authority
CN
China
Prior art keywords
duration
syllable
music
speech data
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102956675A
Other languages
Chinese (zh)
Inventor
杨晨
蔡莲红
周卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG filed Critical Siemens AG
Priority to CN2011102956675A priority Critical patent/CN103035235A/en
Publication of CN103035235A publication Critical patent/CN103035235A/en
Pending legal-status Critical Current

Landscapes

  • Auxiliary Devices For Music (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The invention provides a method and a device for converting speech into a melody. The method includes: acquiring input speech data and score information; adjusting the duration of each syllable in the speech data so that it aligns with the duration of the corresponding lyric in the score information; adjusting the pitch points of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note; and combining the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.

Description

Method and apparatus for converting speech into a melody
Technical field
The present invention relates to speech processing technology, and particularly to a method and apparatus for converting speech into a melody.
Background art
Melody is a fundamental element of music and one of the most effective ways to express music and human emotion. A melody is a combination of notes of various pitches and durations; in other words, it can be understood as an arrangement of notes having different pitches and durations. Usually, the notes are ordered according to a beat, which gives the note sequence its musical meaning.
Musicians and singers have professional control over and expressiveness in music and can perform songs well, but this is usually difficult for ordinary people. It is often desirable to convert a segment of input speech, in real time, into a melody that retains the speaker's own voice characteristics; however, no technology in the prior art achieves this.
Summary of the invention
In view of this, the present invention provides a method and apparatus for converting speech into a melody, which can convert speech data input by a user into a melody that retains the user's voice characteristics.
The technical solution of the present invention is as follows:
A method for converting speech into a melody comprises: obtaining speech data and score information, wherein the speech data is input by a user and the score information comprises lyric information, note information, and the correspondence between the two; adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information, and adjusting the pitch points of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note; and combining the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
Adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information specifically comprises: extracting the energy and zero-crossing rate of each frame of the input speech data; dividing the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame; cutting each speech segment into syllables according to the lyric information in the score information; and adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
Dividing the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame comprises: identifying each frame as a speech frame or a mute frame according to its energy and zero-crossing rate; and grouping adjacent speech frames into speech segments and adjacent mute frames into silent segments.
Cutting each speech segment into syllables according to the lyric information in the score information comprises: determining the speech segments corresponding to each sentence in the lyrics of the score information; determining the speech segments corresponding to each phrase contained in each sentence; and performing syllable segmentation on the speech segments corresponding to each phrase to obtain the syllables.
Adjusting the duration of each syllable so that it aligns with the corresponding lyric duration may comprise: when adjusting the duration of a syllable comprising an initial (consonant) and a final (vowel), if the duration of the syllable needs to be lengthened, keeping the duration of the initial constant and lengthening only the duration of the final; and if the duration of the syllable needs to be shortened, shortening the initial and the final simultaneously.
Alternatively, it may comprise: when a syllable is both preceded and followed by silent segments, making the duration of its initial account for 16.2% of the whole syllable duration; when the syllable is preceded but not followed by a silent segment, making the initial account for 27.6% of the whole syllable duration; when the syllable is followed but not preceded by a silent segment, making the initial account for 24.8% of the whole syllable duration; and when the syllable is neither preceded nor followed by a silent segment, making the initial account for 32.9% of the whole syllable duration.
Specifically, adjusting the pitch of the speech data according to the pitch of each note in the score information, so that each pitch point aligns with the pitch of the corresponding note, comprises: extracting the pitch information of the input speech data, the pitch information comprising the mean fundamental frequency of the speech data and each pitch point of the speech data; determining the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information; and, taking the determined key as a reference, adjusting the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
In addition, determining the key of the melody based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information comprises: determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information; if F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver and n is an empirical value, for example n = int(K/7), where int denotes rounding down; and if F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is as above.
More preferably, after the key of the melody is determined, the method further comprises: segmenting the pitch points, wherein the frequency difference between two adjacent pitch points belonging to different segments is greater than a set segmentation threshold; identifying segments whose length is less than a preset outlier length threshold as outlier (wild point) segments; and performing sinc interpolation on the frequencies of the pitch points in the outlier segments.
Alternatively, after the pitch of the speech data is adjusted, the method further comprises: in the pitch-adjusted speech data, performing sinc interpolation between the last m% of the pitch points of each note and the first m% of the pitch points of the following note, where m% is a set empirical value.
An apparatus for converting speech into a melody comprises: a user interface 600, a score management unit 610, a duration adjustment unit 620, a pitch adjustment unit 630 and a melody synthesis unit 640.
The user interface 600 is used to obtain the speech data input by the user and the score information selected from the score management unit; the speech data is input by the user, and the score information comprises lyric information, note information, and the correspondence between the two.
The score management unit 610 is used to manage score information for the user to select.
The duration adjustment unit 620 is used to adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the selected score information.
The pitch adjustment unit 630 is used to adjust the pitch of the speech data according to the pitch of each note in the selected score information, so that each pitch point aligns with the pitch of the corresponding note.
The melody synthesis unit 640 is used to combine the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
The duration adjustment unit specifically comprises: a feature extraction subunit 621, a segment identification subunit 622, a speech segmentation subunit 623 and a duration adjustment subunit 624.
The feature extraction subunit 621 is used to extract the energy and zero-crossing rate of each frame in the input speech data.
The segment identification subunit 622 is used to divide the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame.
The speech segmentation subunit 623 is used to cut the speech segments into syllables according to the lyric information in the selected score information.
The duration adjustment subunit 624 is used to adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
Specifically, the segment identification subunit 622 identifies each frame as a speech frame or a mute frame according to its energy and zero-crossing rate, groups adjacent speech frames into speech segments, and groups adjacent mute frames into silent segments.
The speech segmentation subunit 623 comprises: a first module 6231 for determining the speech segments corresponding to each sentence in the lyrics of the score information; a second module 6232 for determining the speech segments corresponding to each phrase contained in each sentence; and a third module 6233 for performing syllable segmentation on the speech segments corresponding to each phrase.
In addition, the pitch adjustment unit 630 specifically comprises: a feature extraction subunit 631, a key determination subunit 632 and a pitch adjustment subunit 633.
The feature extraction subunit 631 is used to extract the pitch information of the input speech data, the pitch information comprising the mean fundamental frequency of the speech data and each pitch point of the speech data.
The key determination subunit 632 is used to determine the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information.
The pitch adjustment subunit 633 is used to take the key determined by the key determination subunit as a reference and adjust the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
The key determination subunit 632 comprises: a fourth module 6321 for determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information; a fifth module 6322 for, when F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver, n is an empirical value, in particular n = int(K/7), and int denotes rounding down; and a sixth module 6323 for, when F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is as above.
More preferably, the pitch adjustment unit 630 also comprises a pitch smoothing subunit 634, used to segment the pitch points such that the frequency difference between two adjacent pitch points belonging to different segments is greater than the set segmentation threshold, to identify segments whose length is less than the preset outlier length threshold as outlier segments, to perform sinc interpolation on the frequencies of the pitch points in the outlier segments, and to output the result to the pitch adjustment subunit.
More preferably, the apparatus also comprises a melody smoothing unit 650, used to perform, in the speech data adjusted by the pitch adjustment unit, sinc interpolation between the last m% of the pitch points of each note and the first m% of the pitch points of the following note, and to output the result to the melody synthesis unit, where m% is a set empirical value.
As can be seen from the above description, by adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information, and by adjusting the pitch of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note, the speech data input by the user can be converted, according to the selected score information, into a melody that retains the user's voice characteristics.
Brief description of the drawings
Fig. 1 is a flowchart of the main method provided by the present invention;
Fig. 2 is a schematic diagram of score information according to an embodiment of the present invention;
Fig. 3 is a flowchart of an implementation of the duration matching algorithm provided by the present invention;
Fig. 4 is a flowchart of an implementation of the pitch matching algorithm provided by the present invention;
Fig. 5 is a flowchart of the method for smoothing the speech pitch envelope provided by the present invention;
Fig. 6 is a structural diagram of the apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
The core idea of the present invention is to match the user's speech data with score information according to the relationship between speech and melody, and finally output a melody. The main method, as shown in Fig. 1, comprises the following steps:
Step 101: obtain speech data and the corresponding score information, wherein the speech data is input by the user and the score information comprises lyric information, note information, and the correspondence between the two.
Step 102: adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information, and adjust the pitch points of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note.
Step 103: combine the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
The method provided by the present invention is described in detail below with reference to a specific embodiment, taking the song "粉刷匠" ("The Little Painter") as an example.
In step 101, the user may select the score information of the song through a user interface, or may input the score information directly. The score information mainly comprises: the lyric information, the note information, and the correspondence between the lyrics and the notes. It may also comprise information such as the title, subtitle, lyricist and composer, and time signature, as shown in Fig. 2.
The user can input speech data according to the lyrics of the song, for example by reading a passage of the lyrics aloud. The passage read by the user corresponds to the score information selected or input above; that is, the score information comprises the lyric information of the passage read by the user, the note information, and the correspondence between the lyrics and the notes.
There is no fixed order in which the speech data and the score information must be obtained; the order can be arranged according to the user's habits or preferences.
In step 102, two matching processes need to be completed. The first adjusts the duration of each syllable in the speech data so that it matches the corresponding lyric duration in the score information, i.e., the matching of speech duration and lyric duration. The second adjusts each pitch point of the speech data so that it matches the pitch of the corresponding note in the score information, i.e., the matching of speech pitch and note pitch. The two matching processes are described in detail below.
The first matching process, the matching of speech duration and lyric duration, can be realized by a duration matching algorithm (time alignment algorithm). The detailed process, shown in Fig. 3, comprises the following steps:
Step 301: extract the characteristic parameters of the speech data, including the energy and zero-crossing rate of each frame.
A frame is generally a piece of speech data of fixed length, for example in the range of 20 ms to 30 ms; the speech data within such a span can be regarded as a stationary signal (its mean and variance are relatively fixed). In a specific implementation, this length can be preset to a suitable empirical value. The energy of a frame is the sum of the squared amplitudes of the frame's samples, and the zero-crossing rate of a frame is the rate at which its amplitude crosses zero.
Step 302: divide the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame.
In this step, each frame can first be identified as a speech frame or a mute frame according to its energy and zero-crossing rate, using a preset energy threshold and zero-crossing rate threshold: a frame whose energy exceeds the energy threshold and whose zero-crossing rate exceeds the zero-crossing rate threshold is identified as a speech frame, and every other frame is identified as a mute frame; the thresholds can be set according to experience and/or experimental data. Next, runs of adjacent speech frames are marked as speech segments and runs of adjacent mute frames as silent segments, so that the speech data is divided into alternating speech segments and silent segments.
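For illustration, the following minimal Python sketch classifies frames by energy and zero-crossing rate as described above; the frame length and both thresholds are assumed values standing in for the preset empirical values left open here:

```python
import numpy as np

def classify_frames(samples, sample_rate, frame_ms=25,
                    energy_thresh=1e-3, zcr_thresh=0.02):
    """Label each fixed-length frame of a mono signal as 'speech' or 'mute'.

    frame_ms, energy_thresh and zcr_thresh are illustrative assumptions;
    the description above only says they are preset empirical values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = float(np.sum(frame ** 2))  # sum of squared amplitudes
        # zero-crossing rate: fraction of adjacent sample pairs changing sign
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        labels.append("speech" if energy > energy_thresh and zcr > zcr_thresh
                      else "mute")
    return labels
```

Runs of adjacent equal labels then give the speech segments and silent segments.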
Step 303: cut the speech segments into syllables according to the lyric information in the score information.
In the embodiments of the present invention, a syllable generally corresponds to one character.
After step 302, the speech data has been divided into alternating speech segments and silent segments. In this step, the speech segments corresponding to each sentence in the lyrics of the score information are determined first (sentence-layer processing), then the speech segments corresponding to each phrase within each sentence (phrase-layer processing), and finally syllable segmentation is performed on the speech segments corresponding to each phrase to obtain the syllables.
Specifically, for the sentence-layer processing, the number of speech segments is in most cases greater than or equal to the number of sentences in the lyrics of the score information. In this case, cut points need to be selected, and the speech segments between two cut points are merged, yielding the speech segments corresponding to each sentence in the lyrics. A cut point can be selected according to the following formula:
min( abs( splitTime / totalTime - senLenInLRC / totalLenInLRC ) )
where splitTime is the total duration of the speech segments from the previous cut point to the currently considered cut point; totalTime is the total duration of all speech segments; senLenInLRC is the sung duration, in the score, of the sentence currently being cut; and totalLenInLRC is the sung duration of the whole score. Specifically, each silent segment is first taken as a candidate cut point, the value of abs(splitTime/totalTime - senLenInLRC/totalLenInLRC) is computed for each candidate, and the silent segment giving the minimum value is the selected cut point. In short, the method traverses the candidate cut points, one per silent segment, and finds the best cut point for each sentence in turn. Note that if the number of speech segments is smaller than the number of sentences in the lyrics of the score information, step 302 needs to be re-executed, i.e., the speech data is divided into speech segments and silent segments again.
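A sketch of this traversal, assuming candidate cut points lie in the silence after each speech segment; all names are illustrative, since only the minimised quantity is specified above:

```python
def pick_cut_point(segment_durations, start_idx, total_time,
                   sen_len_in_lrc, total_len_in_lrc):
    """Pick the cut point for the current lyric sentence by minimising
    abs(splitTime/totalTime - senLenInLRC/totalLenInLRC)."""
    target = sen_len_in_lrc / total_len_in_lrc  # sentence's share of the score
    split_time = 0.0
    best_idx, best_err = start_idx, float("inf")
    for i in range(start_idx, len(segment_durations)):
        split_time += segment_durations[i]  # speech duration since the last cut
        err = abs(split_time / total_time - target)
        if err < best_err:
            best_idx, best_err = i, err
    return best_idx  # the sentence ends after segment best_idx
```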
For the phrase-layer processing, there are three cases. In the first case, the number of speech segments corresponding to a sentence equals the number of phrases the sentence contains; the phrases and speech segments then correspond one to one. In the second case, the number of speech segments corresponding to the sentence is greater than the number of phrases; cut points are then selected according to the cut-point selection method above, and the speech segments between two cut points are merged to determine the speech segment corresponding to each phrase. In the third case, the number of speech segments corresponding to the sentence is smaller than the number of phrases; the speech segments corresponding to the sentence are then segmented into syllables directly, and the number of syllables obtained is compared with the number of characters the sentence contains, again distinguishing three cases: if the number of syllables equals the number of characters, each syllable corresponds to one character; if the number of syllables is greater than the number of characters, cut points are found as in the sentence-layer processing, and the syllables between cut points are merged and mapped to the corresponding phrases; if the number of syllables is smaller than the number of characters, the syllable with the longest duration is found and split into two syllables, and this is repeated until the number of syllables equals the number of characters. In this third case the phrase layer has also completed the work of the syllable layer, so syllable-layer processing is not performed again.
Finally, a syllable segmentation algorithm cuts the speech segment corresponding to each phrase into syllables; for example, the speech segment constituting the phrase "我是一个粉刷匠" ("I am a painter") is cut into the seven syllables 我, 是, 一, 个, 粉, 刷, 匠. Specifically, an existing state-machine syllable segmentation algorithm can be used for this cutting.
Step 304: according to the lyric durations in the score information, adjust the duration of each syllable in the speech segments so that it aligns with the corresponding lyric duration.
Because of the correspondence between the lyrics and the notes in the score information, each character in the lyrics has its own duration. For example, in the score, the characters 我 and 是 differ in duration from 匠: the syllable 匠 occupies a full beat, while 我 and 是 each occupy half a beat. In the speech data input by the user, since the input is speech without melody, the syllables may all have similar durations; therefore, the duration of each syllable in the speech data needs to be adjusted to match the duration of the corresponding syllable in the lyrics of the score information.
Note that in Chinese a character is composed of an initial and a final. Preferably, when adjusting the duration of a syllable comprising an initial and a final, if the syllable needs to be lengthened, the duration of the initial is kept constant and only the final is lengthened; if the syllable needs to be shortened, the initial and the final are shortened simultaneously. This better matches singing habits and makes the result more melodic. Based on this principle, GMM (Gaussian mixture model) clustering is used to estimate the initial and final durations, which are then adjusted; specifically:
When a syllable is both preceded and followed by silent segments, it is an isolated syllable, and the initial can be made to account for 16.2% of the whole syllable duration. When the syllable is preceded but not followed by a silent segment, it is the first syllable of a phrase or sentence, and the initial can account for 27.6% of the whole syllable duration. When the syllable is followed but not preceded by a silent segment, it is the last syllable of a phrase or sentence, and the initial can account for 24.8% of the whole syllable duration. When the syllable is neither preceded nor followed by a silent segment, the initial can account for 32.9% of the whole syllable duration.
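Since these four percentages are plain empirical constants given above, the rule reduces to a small lookup (a sketch; the function name and boolean interface are illustrative):

```python
def initial_ratio(silence_before, silence_after):
    """Target share of the syllable duration given to the initial,
    using the empirical percentages given above."""
    if silence_before and silence_after:
        return 0.162  # isolated syllable
    if silence_before:
        return 0.276  # first syllable of a phrase or sentence
    if silence_after:
        return 0.248  # last syllable of a phrase or sentence
    return 0.329      # syllable inside a phrase
```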
The second matching process, the matching of speech pitch and note pitch, can be realized by a pitch matching algorithm (pitch alignment algorithm). The detailed process, shown in Fig. 4, comprises the following steps:
Step 401: extract the characteristic parameters of the speech data, including the pitch information, i.e., the fundamental frequency of each frame. The pitch information comprises the mean fundamental frequency of the speech data and each pitch point of the speech data.
Step 402: determine the key of the melody into which the speech data is converted, based on the mean fundamental frequency over all frames of the speech data and the mean fundamental frequency of all notes in the score information.
A score carries a key as one of its own characteristics, and in many cases the fundamental frequency of the raw speech differs considerably from the pitches of the score. For the final melody formed from the adjusted speech to retain the voice characteristics of the raw speech, the key of the melody must be determined from both the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information.
Specifically, the key of the melody can be determined as follows: determine the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information. If F0_aver > P_aver, lower the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver and n is an empirical value, for example n = int(K/7), where int denotes rounding down. If F0_aver < P_aver, raise the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is as above.
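A sketch of this key determination follows; computing K from the frequency ratio via 12 * log2 and rounding it to an integer are assumptions, since K is defined above only as the semitone distance between the two means:

```python
import math

def melody_key_shift(f0_aver, p_aver):
    """Signed semitone shift applied to the speech mean F0 to obtain the
    melody key: down by K - n semitones if F0_aver > P_aver, up otherwise,
    with the empirical correction n = int(K / 7)."""
    k = round(abs(12 * math.log2(f0_aver / p_aver)))  # semitone distance K (assumed formula)
    n = int(k / 7)
    shift = k - n
    return -shift if f0_aver > p_aver else shift
```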
Step 403: taking the key determined in step 402 as a reference, adjust the frequency of each frame's pitch point to align with the pitch of the corresponding note in the score information.
Each note in the score (Do, Re, Mi, Fa, Sol, La, Si and the upper Do) has its own pitch under the key of the score. During alignment, the frequency of each pitch point is adjusted according to the key determined in step 402 so that it aligns with the pitch of the corresponding note in the score information.
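For concreteness, one plausible mapping from a scale degree to a target frequency under the determined key is equal temperament; the tuning system is not stated above, so the following is purely an assumption:

```python
def note_frequency(semitones_above_key_root, key_root_hz):
    """Equal-tempered frequency of a scale degree relative to the key root,
    f = root * 2**(s / 12); an assumed mapping, not specified above."""
    return key_root_hz * 2 ** (semitones_above_key_root / 12)

# Example: with the key root at 220 Hz, Mi (4 semitones up) is about 277.2 Hz.
```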
In addition, in the matching of speech pitch and note pitch, the smoothness of the pitch contour is a key factor determining the sound quality of the melody. In the present invention, better sound quality can be obtained by further smoothing the speech pitch envelope or the melody pitch contour.
The smoothing of the speech pitch envelope is described first. Because of errors in extracting the speech pitch parameters, sudden frequency jumps inevitably appear in the speech pitch envelope. The pitch points at these frequency discontinuities are called outlier (wild) points, and they are the main cause of degraded sound quality; therefore, the outlier points in the speech pitch envelope need to be smoothed. The smoothing method, shown in Fig. 5, comprises the following steps:
Step 501: segment the sequence of per-frame pitch points, such that the frequency difference between two adjacent pitch points belonging to different segments is greater than a set segmentation threshold (a preset empirical value).
In this step, the sequence of per-frame pitch points can be obtained by extracting the pitch point of each frame of the speech signal.
Suppose the pitch point sequence of the input speech data is P = {P_1, P_2, P_3, ..., P_N}. When the frequency difference between two adjacent pitch points is greater than the segmentation threshold Threshold, the boundary between two segments is placed between those two points: if the frequency difference between P_i and P_(i+1) is greater than Threshold, the sequence is split into {P_1, ..., P_i} and {P_(i+1), ..., P_N}. The threshold involves a scale factor a, obtained by experiment, which affects the smoothing effect; a can be set to 4.
Step 502: identify segments whose length is less than the outlier length threshold Th_Time (an empirical value, generally small, for example 0.06 seconds) as outlier segments. Suppose that, in the manner of step 501, the pitch point sequence has been divided into K segments; any segment whose length is less than Th_Time is an outlier sequence.
Step 503: perform sinc interpolation on the frequencies of the pitch points in the outlier segments.
According to the frequencies of the pitch points before and after an outlier segment, sinc interpolation is performed over the sequence length of the outlier segment.
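A sketch of steps 501 to 503 on an F0 contour. The segmentation threshold and minimum segment length stand in for the empirical values Threshold and Th_Time, and applying Whittaker-Shannon (sinc) interpolation with non-uniformly spaced known points is an approximation:

```python
import numpy as np

def sinc_interp(x_known, y_known, x_query):
    """Whittaker-Shannon (sinc) interpolation of known samples, evaluated at
    x_query; exact only for uniformly spaced samples, used here as an
    approximation across the gap left by an outlier segment."""
    x_known = np.asarray(x_known, dtype=float)
    y_known = np.asarray(y_known, dtype=float)
    return np.array([np.sum(y_known * np.sinc(xq - x_known)) for xq in x_query])

def smooth_pitch_outliers(f0, seg_threshold, min_len):
    """Split the F0 contour where adjacent points jump by more than
    seg_threshold (step 501), mark segments shorter than min_len points as
    outlier segments (step 502), and re-generate them by sinc interpolation
    from the remaining points (step 503)."""
    f0 = np.asarray(f0, dtype=float).copy()
    bounds = [0] + [i for i in range(1, len(f0))
                    if abs(f0[i] - f0[i - 1]) > seg_threshold] + [len(f0)]
    outlier = np.zeros(len(f0), dtype=bool)
    for s, e in zip(bounds[:-1], bounds[1:]):
        if e - s < min_len:
            outlier[s:e] = True  # short segment: treat as wild points
    if outlier.any() and (~outlier).any():
        idx = np.arange(len(f0))
        f0[outlier] = sinc_interp(idx[~outlier], f0[~outlier], idx[outlier])
    return f0
```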
The flow shown in Fig. 5 can be executed between step 402 and step 403: the pitch points smoothed through the pitch envelope smoothing are then adjusted in step 403 so that, under the key determined in step 402, they align with the pitches of the notes in the score information.
The smoothing of the melody pitch contour is introduced next. Because the fundamental frequency of the final melody is obtained by adjusting the frequencies of the pitch points, the melody pitch contour is a spliced result, and obvious frequency discontinuities are likely to appear between adjacent syllables of the synthesized melody, degrading sound quality. In the present invention, the last m% of the pitch points of each syllable and the first m% of the pitch points of the following syllable can be smoothed; specifically, the frequencies of these head and tail points are regenerated by sinc interpolation while the sequence length remains unchanged. Here m% is a set empirical value, which can be 20%.
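The sinc_interp helper from the previous sketch can smooth such a boundary; below is a sketch with m = 0.20 as suggested above (the two-sequence interface is illustrative):

```python
import numpy as np

def smooth_note_boundary(prev_f0, next_f0, m=0.20):
    """Re-generate the last m of prev_f0 and the first m of next_f0 by sinc
    interpolation from the surrounding points (reusing sinc_interp from the
    previous sketch); sequence lengths are unchanged."""
    joined = np.concatenate([np.asarray(prev_f0, dtype=float),
                             np.asarray(next_f0, dtype=float)])
    n_prev = len(prev_f0)
    k1, k2 = int(n_prev * m), int(len(next_f0) * m)
    mask = np.zeros(len(joined), dtype=bool)
    mask[n_prev - k1:n_prev + k2] = True  # boundary region to re-generate
    if mask.any() and (~mask).any():
        idx = np.arange(len(joined))
        joined[mask] = sinc_interp(idx[~mask], joined[~mask], idx[mask])
    return joined[:n_prev], joined[n_prev:]
```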
The above describes the method provided by the present invention; the apparatus provided by the present invention is described in detail below. As shown in Fig. 6, the apparatus comprises: a user interface 600, a score management unit 610, a duration adjustment unit 620, a pitch adjustment unit 630 and a melody synthesis unit 640.
The user interface 600 obtains the speech data input by the user and the score information selected from the score management unit; the score information comprises lyric information, note information, and the correspondence between the two.
The score management unit 610 manages score information for the user to select.
In addition, the user interface 600 can display the score information managed by the score management unit 610 for the user to select from.
The duration adjustment unit 620 adjusts the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the selected score information.
The pitch adjustment unit 630 adjusts the pitch of the speech data according to the pitch of each note in the selected score information, so that each pitch point aligns with the pitch of the corresponding note.
The melody synthesis unit 640 combines the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
The duration adjustment unit 620 may specifically comprise: a feature extraction subunit 621, a segment identification subunit 622, a speech segmentation subunit 623 and a duration adjustment subunit 624.
The feature extraction subunit 621 extracts the energy and zero-crossing rate of each frame in the input speech data.
The segment identification subunit 622 divides the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame; the segmentation method described for the method above can be used.
The speech segmentation subunit 623 cuts the speech segments into syllables according to the lyric information in the selected score information.
The duration adjustment subunit 624 adjusts the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
Specifically, the segment identification subunit 622 can identify each frame as a speech frame or a mute frame according to its energy and zero-crossing rate, group adjacent speech frames into speech segments, and group adjacent mute frames into silent segments.
In addition, the speech segmentation subunit 623 may further comprise: a first module 6231 for determining the speech segments corresponding to each sentence in the lyrics of the score information; a second module 6232 for determining the speech segments corresponding to each phrase contained in each sentence; and a third module 6233 for performing syllable segmentation on the speech segments corresponding to each phrase.
The pitch adjustment unit 630 may specifically comprise: a feature extraction subunit 631, a key determination subunit 632 and a pitch adjustment subunit 633.
The feature extraction subunit 631 extracts the pitch information of the input speech data.
The key determination subunit 632 determines the key of the melody based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information.
The pitch adjustment subunit 633, taking the key determined by the key determination subunit 632 as a reference, adjusts the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
The key determination subunit 632 may specifically comprise: a fourth module 6321 for determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information; a fifth module 6322 for, when F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver, n is an empirical value, in particular n = int(K/7), and int denotes rounding down; and a sixth module 6323 for, when F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is as above.
More preferably, in order to further improve the sound quality of the melody, either or both of the following two mechanisms can be used.
First, the pitch adjustment unit 630 further comprises a pitch smoothing subunit 634, which segments the pitch points such that the frequency difference between two adjacent pitch points belonging to different segments is greater than the set segmentation threshold, identifies segments shorter than the preset outlier length threshold as outlier segments, performs sinc interpolation on the frequencies of the pitch points in the outlier segments, and provides the interpolated speech data to the pitch adjustment subunit 633.
Second, the apparatus may further comprise a melody smoothing unit 650, which, in the speech data adjusted by the pitch adjustment unit 630, performs sinc interpolation between the last m% of the pitch points of each note and the first m% of the pitch points of the following note, and outputs the result to the melody synthesis unit 640, where m% is a set empirical value.
Finally, the melody synthesis unit 640 can output the synthesized melody to an audio playback device or play it to the user.
The present invention provides a method and apparatus for converting speech into a melody. The method comprises: obtaining the input speech data and score information; adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information; adjusting the pitch points of the speech data according to the pitch of each note in the score information so that each pitch point aligns with the pitch of the corresponding note in the score information; and combining the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data. With the present invention, speech data input by a user can be converted, according to the selected score information, into a melody that retains the user's voice characteristics.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (16)

1. A method for converting speech into a melody, the method comprising:
obtaining speech data and score information, wherein the speech data is input by a user and the score information comprises lyric information, note information, and the correspondence between the two;
adjusting the duration of each syllable in the speech data so that the duration of each syllable aligns with the corresponding lyric duration in the score information;
adjusting the pitch points of the speech data according to the pitch of each note in the score information, so that the pitch points align with the pitches of the corresponding notes in the score information; and
combining the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
2. The method according to claim 1, wherein adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information comprises:
extracting the energy and zero-crossing rate of each frame in the speech data;
dividing the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame;
cutting each speech segment into syllables according to the lyric information in the score information; and
adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
3. The method according to claim 2, wherein dividing the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame comprises:
identifying each frame as a speech frame or a mute frame according to its energy and zero-crossing rate; and
grouping adjacent speech frames into speech segments and adjacent mute frames into silent segments.
4. The method according to claim 2, wherein cutting each speech segment into syllables according to the lyric information in the score information comprises:
determining the speech segments corresponding to each sentence in the lyrics of the score information;
determining the speech segments corresponding to each phrase contained in each sentence; and
performing syllable segmentation on the speech segments corresponding to each phrase to obtain the syllables.
5. The method according to claim 2, wherein adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information comprises:
when adjusting the duration of a syllable comprising an initial and a final, if the duration of the syllable needs to be lengthened, keeping the duration of the initial constant and lengthening only the duration of the final; and if the duration of the syllable needs to be shortened, shortening the initial and the final simultaneously.
6. The method according to claim 2, wherein adjusting the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information comprises:
when a syllable is both preceded and followed by silent segments, making the duration of its initial account for 16.2% of the whole syllable duration;
when the syllable is preceded but not followed by a silent segment, making the duration of its initial account for 27.6% of the whole syllable duration;
when the syllable is followed but not preceded by a silent segment, making the duration of its initial account for 24.8% of the whole syllable duration; and
when the syllable is neither preceded nor followed by a silent segment, making the duration of its initial account for 32.9% of the whole syllable duration.
7. The method according to claim 1, wherein adjusting the pitch points of the speech data according to the pitch of each note in the score information, so that the pitch points align with the pitches of the corresponding notes in the score information, comprises:
extracting the pitch information of the speech data, the pitch information comprising the mean fundamental frequency of the speech data and each pitch point of the speech data;
determining the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information; and
taking the determined key as a reference, adjusting the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
8. The method according to claim 7, wherein determining the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information, comprises:
determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information;
if F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver and n is an empirical value, in particular n = int(K/7), where int denotes rounding down; and
if F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is an empirical value, in particular n = int(K/7).
9. The method according to claim 7, wherein after determining the key of the melody into which the speech data is converted, the method further comprises:
segmenting the pitch points, wherein the frequency difference between two adjacent pitch points belonging to different segments is greater than a set segmentation threshold;
identifying segments whose length is less than a preset outlier length threshold as outlier segments; and
performing sinc interpolation on the frequencies of the pitch points in the outlier segments.
10. The method according to claim 1 or 7, wherein the method further comprises: in the pitch-adjusted speech data, performing sinc interpolation between the last m% of the pitch points of each note and the first m% of the pitch points of the following note, where m% is a set empirical value.
11. An apparatus for converting speech into a melody, the apparatus comprising: a user interface (600), a score management unit (610), a duration adjustment unit (620), a pitch adjustment unit (630) and a melody synthesis unit (640);
wherein the user interface (600) is configured to obtain the speech data input by a user and the score information selected from the score management unit, the score information comprising lyric information, note information, and the correspondence between the two;
the score management unit (610) is configured to manage score information for the user to select;
the duration adjustment unit (620) is configured to adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the selected score information;
the pitch adjustment unit (630) is configured to adjust the pitch points of the speech data according to the pitch of each note in the selected score information, so that each pitch point aligns with the pitch of the corresponding note; and
the melody synthesis unit (640) is configured to combine the pitch points after pitch adjustment with the syllables after duration adjustment to form melody data.
12. The apparatus according to claim 11, wherein the duration adjustment unit specifically comprises: a feature extraction subunit (621), a segment identification subunit (622), a speech segmentation subunit (623) and a duration adjustment subunit (624);
wherein the feature extraction subunit (621) is configured to extract the energy and zero-crossing rate of each frame in the speech data;
the segment identification subunit (622) is configured to divide the speech data into speech segments and silent segments according to the energy and zero-crossing rate of each frame extracted by the feature extraction subunit;
the speech segmentation subunit (623) is configured to cut the speech segments into syllables according to the lyric information in the selected score information; and
the duration adjustment subunit (624) is configured to adjust the duration of each syllable in the speech data so that it aligns with the corresponding lyric duration in the score information.
13. The apparatus according to claim 12, wherein the speech segmentation subunit (623) comprises:
a first module (6231) for determining the speech segments corresponding to each sentence in the lyrics of the score information;
a second module (6232) for determining the speech segments corresponding to each phrase contained in each sentence; and
a third module (6233) for performing syllable segmentation on the speech segments corresponding to each phrase.
14. The apparatus according to claim 11, wherein the pitch adjustment unit (630) specifically comprises: a feature extraction subunit (631), a key determination subunit (632) and a pitch adjustment subunit (633);
wherein the feature extraction subunit (631) is configured to extract the pitch information of the input speech data, the pitch information comprising the mean fundamental frequency of the speech data and each pitch point of the speech data;
the key determination subunit (632) is configured to determine the key of the melody into which the speech data is converted, based on the mean fundamental frequency of the speech data and the mean fundamental frequency of all notes in the score information; and
the pitch adjustment subunit (633) is configured to take the key determined by the key determination subunit as a reference and adjust the frequency of each pitch point of the speech data to align with the pitch of each note in the score information.
15. The apparatus according to claim 14, wherein the key determination subunit (632) comprises:
a fourth module (6321) for determining the mean fundamental frequency F0_aver of the speech data and the mean fundamental frequency P_aver of all notes in the score information;
a fifth module (6322) for, when F0_aver > P_aver, lowering the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver exceeds P_aver, n is an empirical value, in particular n = int(K/7), and int denotes rounding down; and
a sixth module (6323) for, when F0_aver < P_aver, raising the mean fundamental frequency of the speech data by K - n semitones to obtain the key of the melody, where K is the number of semitones by which F0_aver is lower than P_aver and n is an empirical value, in particular n = int(K/7).
16. The apparatus according to claim 14, wherein the pitch adjustment unit (630) further comprises a pitch smoothing subunit (634) configured to: segment the pitch points such that the frequency difference between two adjacent pitch points belonging to different segments is greater than a set segmentation threshold; identify segments whose length is less than a preset outlier length threshold as outlier segments; and perform sinc interpolation on the frequencies of the pitch points in the outlier segments before outputting them to the pitch adjustment subunit.
CN2011102956675A 2011-09-30 2011-09-30 Method and device for transforming voice into melody Pending CN103035235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102956675A CN103035235A (en) 2011-09-30 2011-09-30 Method and device for transforming voice into melody

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102956675A CN103035235A (en) 2011-09-30 2011-09-30 Method and device for transforming voice into melody

Publications (1)

Publication Number Publication Date
CN103035235A true CN103035235A (en) 2013-04-10

Family

ID=48022067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102956675A Pending CN103035235A (en) 2011-09-30 2011-09-30 Method and device for transforming voice into melody

Country Status (1)

Country Link
CN (1) CN103035235A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0605348A2 (en) * 1992-12-30 1994-07-06 International Business Machines Corporation Method and system for speech data compression and regeneration
US5796916A (en) * 1993-01-21 1998-08-18 Apple Computer, Inc. Method and apparatus for prosody for synthetic speech prosody determination
CN101313477A (en) * 2005-12-21 2008-11-26 Lg电子株式会社 Music generating device and operating method thereof
CN101399036A * 2007-09-30 2009-04-01 三星电子株式会社 Device and method for converting voice into rap music
US20090314155A1 (en) * 2008-06-20 2009-12-24 Microsoft Corporation Synthesized singing voice waveform generator

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jordi Bonada et al.: "Synthesis of the Singing Voice by Performance Sampling and Spectral Models", IEEE Signal Processing Magazine *
Takeshi Saitou et al.: "Development of an F0 control model based on F0 dynamic characteristics for singing-voice synthesis", Speech Communication *
Takeshi Saitou et al.: "Speech-to-Singing Synthesis: Converting Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices", 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103337244B (en) * 2013-05-20 2015-08-26 北京航空航天大学 Outlier correction method for fundamental frequency curves of isolated syllables
CN103337244A (en) * 2013-05-20 2013-10-02 北京航空航天大学 Outlier correction algorithm for fundamental frequency curves of isolated syllables
CN103456295A (en) * 2013-08-05 2013-12-18 安徽科大讯飞信息科技股份有限公司 Method and system for generating fundamental frequency parameters in singing synthesis
CN103456295B (en) * 2013-08-05 2016-05-18 科大讯飞股份有限公司 Method and system for generating fundamental frequency parameters in singing synthesis
CN105829532A (en) * 2013-11-14 2016-08-03 查理·周 System and method for creating audible sound representations of atoms and molecules
CN105206257B (en) * 2015-10-14 2019-01-18 科大讯飞股份有限公司 Voice conversion method and device
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
CN107146631B (en) * 2016-02-29 2020-11-10 北京搜狗科技发展有限公司 Music identification method, note identification model establishment method, device and electronic equipment
CN107146631A (en) * 2016-02-29 2017-09-08 北京搜狗科技发展有限公司 Music identification method, note identification model establishment method, device and electronic equipment
CN106373580A (en) * 2016-09-05 2017-02-01 北京百度网讯科技有限公司 Singing synthesis method and device based on artificial intelligence
CN106373580B (en) * 2016-09-05 2019-10-15 北京百度网讯科技有限公司 Singing synthesis method and device based on artificial intelligence
CN107039024A (en) * 2017-02-10 2017-08-11 美国元源股份有限公司 Music data processing method and processing device
CN106898340A (en) * 2017-03-30 2017-06-27 腾讯音乐娱乐(深圳)有限公司 Song synthesis method and terminal
CN110741430B (en) * 2017-06-14 2023-11-14 雅马哈株式会社 Singing synthesis method and singing synthesis system
CN110741430A (en) * 2017-06-14 2020-01-31 雅马哈株式会社 Singing synthesis method and singing synthesis system
CN108053814A (en) * 2017-11-06 2018-05-18 芋头科技(杭州)有限公司 Speech synthesis system and method for simulating singing voice of user
CN108053814B (en) * 2017-11-06 2023-10-13 芋头科技(杭州)有限公司 Speech synthesis system and method for simulating singing voice of user
CN109979497A (en) * 2017-12-28 2019-07-05 阿里巴巴集团控股有限公司 Song generation method, device and system, and song data processing and playback method
CN109493684A (en) * 2018-12-10 2019-03-19 北京金三惠科技有限公司 Multifunctional digital music teaching system
CN109493684B (en) * 2018-12-10 2021-02-23 北京金三惠科技有限公司 Multifunctional digital music teaching system
CN109741724A (en) * 2018-12-27 2019-05-10 歌尔股份有限公司 Method and apparatus for producing songs, and intelligent audio device
CN109979422B (en) * 2019-02-21 2021-09-28 百度在线网络技术(北京)有限公司 Fundamental frequency processing method, device, equipment and computer readable storage medium
CN109979422A (en) * 2019-02-21 2019-07-05 百度在线网络技术(北京)有限公司 Fundamental frequency processing method, device, equipment and computer readable storage medium
CN112951198A (en) * 2019-11-22 2021-06-11 微软技术许可有限责任公司 Singing voice synthesis
CN111210850B (en) * 2020-01-10 2021-06-25 腾讯音乐娱乐科技(深圳)有限公司 Lyric alignment method and related product
CN111210850A (en) * 2020-01-10 2020-05-29 腾讯音乐娱乐科技(深圳)有限公司 Lyric alignment method and related product
WO2022012164A1 (en) * 2020-07-16 2022-01-20 百果园技术(新加坡)有限公司 Method and apparatus for converting voice into rap music, device, and storage medium
CN112750420A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method, device and equipment
CN112750420B (en) * 2020-12-23 2023-01-31 出门问问创新科技有限公司 Singing voice synthesis method, device and equipment
CN112820257A (en) * 2020-12-29 2021-05-18 吉林大学 GUI sound synthesis device based on MATLAB
CN112820257B (en) * 2020-12-29 2022-10-25 吉林大学 GUI voice synthesis device based on MATLAB
CN112786013A (en) * 2021-01-11 2021-05-11 北京有竹居网络技术有限公司 Voice synthesis method and device based on album, readable medium and electronic equipment
CN113053355A (en) * 2021-03-17 2021-06-29 平安科技(深圳)有限公司 Fole human voice synthesis method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103035235A (en) Method and device for transforming voice into melody
CN108806656B (en) Automatic generation of songs
WO2021218138A1 (en) Song synthesis method, apparatus and device, and storage medium
CN101308652B (en) Synthesizing method of personalized singing voice
CN104347080B (en) Speech analysis method and device, speech synthesis method and device, and medium storing a speech analysis program
CN108806655B (en) Automatic generation of songs
US9818396B2 (en) Method and device for editing singing voice synthesis data, and method for analyzing singing
CN104272382B (en) Template-based personalized singing synthesis method and system
CN102779508B (en) Voice library generation apparatus and method, and speech synthesis system and method
CN101399036B (en) Device and method for converting voice into rap music
Molina et al. SiPTH: Singing transcription based on hysteresis defined on the pitch-time curve
CN103915093B (en) Method and apparatus for converting speech into singing
CN112951198A (en) Singing voice synthesis
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
Tamaru et al. JVS-MuSiC: Japanese multispeaker singing-voice corpus
Mesaros Singing voice identification and lyrics transcription for music information retrieval (invited paper)
CN111370024A (en) Audio adjusting method, device and computer readable storage medium
CN112289300B (en) Audio processing method and device, electronic equipment and computer readable storage medium
Umbert et al. Generating singing voice expression contours based on unit selection
Koguchi et al. PJS: Phoneme-balanced Japanese singing-voice corpus
JP5598516B2 (en) Voice synthesis system for karaoke and parameter extraction device
JP2022120188A (en) Music reproduction system, method and program for controlling the same
Bonada et al. Hybrid neural-parametric f0 model for singing synthesis
CN104376850A (en) Estimation method for fundamental frequency of Chinese whispered speech
CN108922505B (en) Information processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130410