CN101399036A - Device and method for converting voice into rap music - Google Patents

Device and method for converting voice into rap music Download PDF

Info

Publication number
CN101399036A
CN101399036A CNA2007101641328A CN200710164132A
Authority
CN
China
Prior art keywords
music
accompaniment music
syllable
voice
chinese musical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101641328A
Other languages
Chinese (zh)
Other versions
CN101399036B (en)
Inventor
朱璇
史媛媛
邓菁
严基完
李在原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Samsung C&T Corp
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd, Samsung Electronics Co Ltd filed Critical Beijing Samsung Telecommunications Technology Research Co Ltd
Priority to CN2007101641328A priority Critical patent/CN101399036B/en
Publication of CN101399036A publication Critical patent/CN101399036A/en
Application granted granted Critical
Publication of CN101399036B publication Critical patent/CN101399036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Abstract

The invention provides a device and method for converting speech into rap music. The device comprises: an accompaniment music generating part for generating rap accompaniment music; a speech conversion part that converts speech input by a user into rap form based on the accompaniment music generated by the accompaniment music generating part; and a music mixer for mixing the rap accompaniment music generated by the accompaniment music generating part with the rap-form speech converted by the speech conversion part, to form the rap music.

Description

Device and method for converting speech into rap music
Technical field
The present invention relates to a device and method for converting voice (speech) into rap music, and more particularly, to a device and method that can operate directly on a user's input speech, converting that speech into rap music while preserving the pitch and timbre of the user's voice.
Background art
Rap is a popular musical form today and one of the core elements of hip-hop music. Rap music is characterized by a succession of rhymed sentences spoken rapidly over a mechanical, rhythmic background. Rap music usually takes simple percussion as its accompaniment, and much rap music has no accompaniment at all. The lyrics of rap music are often humorous and witty, and frequently ironic. Rap music is therefore very popular today, especially among young people.
As people pursue personal expression, more and more listeners wish not only to hear rap sung by others but to make rap music of their own, as a personal signature, for example as a personalized mobile phone ringtone. However, singing rap usually requires the singer to have some knowledge of music theory and some singing skill, which is difficult for ordinary users. An ordinary user would therefore often prefer simply to speak a few sentences and have the spoken voice (speech) converted into rap music with accompaniment, thereby making rap music of his or her own. That is, a music synthesis technology is needed that converts the user's spoken voice into rap music.
Two classes of technology currently dominate the field of music synthesis: automatic music composition, which composes music from pre-stored templates, and singing voice synthesis, which synthesizes a singing voice.
A traditional automatic music composition system mainly comprises five parts: a template library containing multiple waveform fragments or multiple sets of musical feature data (mostly MIDI-based); music structure templates or rules for organizing a melody automatically; a MIDI interface allowing the user to input a tune or other signals, such as pitch and rhythm; an interactive interface allowing the user to modify pitch, duration, chords, rhythm, repetition, instruments, equalization, filtering, mixing, and so on; and a playback device for the composed melody. For example, several traditional automatic music composition technologies are disclosed in U.S. Patents US6,835,884B2, US6,576,828B2, US6,175,072B1, US6,153,821, US6,169,242, US6,353,170 and US5,801,694.
A traditional singing voice synthesis system mainly consists of six parts: an audio database containing multiple waveform fragments or sound modeling parameters; an input device for receiving the score and lyrics; a language device for selecting sound units and concatenating the selected units into speech or singing voice; a device for synthesizing and smoothly concatenating speech or singing voice based on a speech synthesis method; a device for modifying the pitch, duration, spectrographic properties, etc. of the synthesized speech or singing voice according to the score and singing conditions; and a playback device for the synthesized singing voice. For example, several traditional singing voice synthesis technologies are disclosed in U.S. Patents US6,304,846 and US7,016,841.
However, these two classes of technology have the following shortcomings: the user cannot input the lyrics simply by speaking; rap accompaniment music cannot be generated automatically; and the user's voice cannot be converted into rap form automatically.
Therefore, there is a need for a device and method that can convert a user's input speech into rap music.
Summary of the invention
An object of the present invention is to provide a device and method that can operate directly on a user's input speech, converting that speech into rap music while preserving the pitch and timbre of the user's voice.
According to an aspect of the present invention, a device for converting speech into rap music is provided, the device comprising: an accompaniment music generating part for generating rap accompaniment music; a speech conversion part for converting the user's input speech into rap form based on the accompaniment music generated by the accompaniment music generating part; and a music mixer for mixing the rap accompaniment music generated by the accompaniment music generating part with the rap-form speech converted by the speech conversion part, to form the rap music.
According to another aspect of the present invention, a method for converting speech into rap music is provided, comprising the steps of: a) generating rap accompaniment music; b) converting the user's input speech into rap form based on the generated accompaniment music; and c) mixing the generated rap accompaniment music with the converted rap-form speech, to form the rap music.
Brief description of the drawings
The above and/or other objects and advantages of the present invention will become apparent from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram of the structure of a device for converting speech into rap music according to an exemplary embodiment of the present invention;
Fig. 2 is a block diagram of the structure of a syllable segmentation unit according to an exemplary embodiment of the present invention;
Fig. 3 illustrates the decoding network of an unsupervised syllable segmentation unit according to an exemplary embodiment of the present invention;
Fig. 4 illustrates the forced alignment method used by a supervised syllable segmentation unit according to an exemplary embodiment of the present invention;
Fig. 5 is a block diagram of the structure of a prosody analyzer according to an exemplary embodiment of the present invention;
Fig. 6 shows the stressed notes in three different music meters;
Fig. 7 is a block diagram of the structure of an accompaniment music generator according to an exemplary embodiment of the present invention;
Fig. 8 is a block diagram of the structure of a speech converter according to an exemplary embodiment of the present invention;
Figs. 9A to 11B show examples of the user interface provided by the device for converting speech into rap music according to an exemplary embodiment of the present invention;
Fig. 12 is a flowchart of a method for converting speech into rap music according to an exemplary embodiment of the present invention;
Figs. 13 to 15 illustrate in detail the steps of the method shown in Fig. 12.
Detailed description of embodiments
Embodiments of the present invention will now be described in detail, examples of which are illustrated in the accompanying drawings, in which like reference numerals refer to like parts throughout. The embodiments are described below with reference to the drawings in order to explain the present invention.
Basically, the device for converting speech into rap music according to the present invention can be divided into two parts: an accompaniment music generating part and a voice signal conversion part.
In the accompaniment music generating part, a rap accompaniment music template library is built. This library contains typical rap accompaniment pieces classified by music meter, tempo, and instrument. Since rap accompaniment is usually in 4/4 time, most of the accompaniment templates in the library are in 4/4 time.
Rap accompaniment music is generally simple and highly repetitive. The accompaniment templates in the library are therefore mostly short music fragments of only about 8 measures. During accompaniment generation, a rhythmic accompaniment is usually formed by repeating these short fragments continuously.
Besides the accompaniment in the rap accompaniment template library, the user can also import a favorite music fragment as the accompaniment. In addition, if the prosodic information of the accompaniment is unknown, a device can be provided that automatically detects the position and intensity of each beat in the accompaniment to extract its rhythm information.
In the voice signal conversion part, the input speech is first divided into consonants and vowels by a syllable segmentation unit based on a speech recognition algorithm, and the possible position of each rhyme foot can be identified by a prosody analyzer. In addition, the duration and position of each syllable of the input speech can be modified according to the rhythm pattern of the user-selected or imported accompaniment, and the onset of each vowel in the input speech is synchronized with the onset of a beat in the accompaniment. Vowels on strong-beat positions can be emphasized by increasing their intensity and changing their pitch contour. After the speech conversion, the boundaries between adjacent syllables are smoothed to guarantee the continuity of the waveform.
In addition, the device for converting speech into rap music according to the present invention can also comprise an interactive user interface part. Through this interface, the user can import a favorite music fragment as rap accompaniment. The user can also modify the alignment between the syllables of the input speech and the beats of the accompaniment, and the speech can then be re-aligned so that it is synchronized with the accompaniment. The user can modify the pitch attributes of each syllable of the input speech (such as raising, lowering, or bending its pitch contour) and the intensity of each syllable (such as strengthening or weakening it). Furthermore, the user can decorate the synthesized rap music with pre-stored audio fragments, and slow down or speed up the synthesized rap music.
A device for converting speech into rap music according to an exemplary embodiment of the present invention is described in more detail below with reference to Fig. 1.
Fig. 1 is a block diagram of a device for converting speech into rap music according to an exemplary embodiment of the present invention. Referring to Fig. 1, the device for converting speech into rap music (hereinafter abbreviated as the "speech-to-rap conversion device") 100 comprises: a user interface 101, a syllable segmentation unit 102, a speech converter 103, a prosody analyzer 104, an accompaniment music generator 105, and a music mixer 106.
As shown in Fig. 1, the user can input a speech signal through a microphone, and can input operation commands through a keyboard. The rap music synthesized by the speech-to-rap conversion device 100 can be played through a loudspeaker.
The user interface 101 provides a user interface for the syllable segmentation unit 102, the speech converter 103, the accompaniment music generator 105, and the music mixer 106.
The syllable segmentation unit 102 segments each syllable of the input speech signal. Its function can be realized with a speech recognizer by detecting the short pauses and silence in the input speech. In addition, unvoiced/voiced decisions can be adopted to divide each syllable into consonant and vowel. The syllable segmentation unit 102 can output information about the duration and intensity of each syllable to the speech converter 103, and can also provide the syllable recognition result to the prosody analyzer 104 so that the rhythm pattern of the input speech can be extracted. The structure and operation of the syllable segmentation unit 102 are described in more detail later with reference to Figs. 2 to 4.
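As a rough illustration of the pause-and-silence detection mentioned above, the sketch below segments a signal into voiced runs using frame energy alone. This is a deliberate simplification of the recognizer-based segmentation the patent describes; the frame length and threshold are assumptions chosen for the example.

```python
# Hedged sketch: energy-based pause/silence detection, a simplified
# stand-in for recognizer-based syllable segmentation.
import numpy as np

def segment_by_silence(signal, frame_len=160, energy_thresh=0.01):
    """Split a mono signal into voiced segments at low-energy frames.

    Returns a list of (start_frame, end_frame) pairs (end exclusive)
    for runs of frames whose mean energy exceeds the threshold.
    """
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    voiced = energy > energy_thresh

    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                     # voiced run begins
        elif not v and start is not None:
            segments.append((start, i))   # voiced run ends at a pause
            start = None
    if start is not None:
        segments.append((start, n_frames))
    return segments
```

In a real system, the segment boundaries would come from the HMM recognizer's short-pause and silence models rather than a fixed energy threshold.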
The prosody analyzer 104 analyzes the rhythm pattern of the speech signal input by the user. From the output of the syllable segmentation unit 102 (i.e., the speech recognizer), the prosody analyzer 104 counts the duration and intensity of each syllable of the input speech, detects possible rhyme feet, and then summarizes the rhythm pattern of the input speech. The resulting rhythm pattern can be input to the accompaniment music generator 105 for selecting a suitable accompaniment. The structure of the prosody analyzer 104 is described in more detail later with reference to Figs. 5 and 6.
The accompaniment music generator 105 generates rap accompaniment music for the speech input by the user. Based on the rhythm pattern of the input speech provided by the prosody analyzer 104, the accompaniment music generator 105 can automatically select a suitable accompaniment template from the rap accompaniment template library and repeat it continuously, to achieve the best rhythm-pattern match between the input speech and the accompaniment. The accompaniment music generator 105 can also provide an interactive user interface through which the user can import a favorite music fragment as the accompaniment; the accompaniment music generator 105 then automatically extracts the rhythm pattern of the imported accompaniment and obtains the best matching path with the input speech. The selected or imported accompaniment is output to the music mixer 106, and its rhythm pattern is output to the speech converter 103. The structure of the accompaniment music generator 105 is described in more detail later with reference to Fig. 7.
According to the best rhythm-pattern matching result between the user's input speech and the accompaniment provided by the accompaniment music generator 105, the speech converter 103 converts the syllable-segmented input speech so that it conforms to the rhythm pattern of the accompaniment. To realize this function, the speech converter 103 performs the following steps: synchronizing the onset of each vowel in the input speech with the onset of a beat in the accompaniment; modifying the duration of each syllable according to the rhythm pattern of the accompaniment; modifying the duration of pauses according to the rhythm pattern of the accompaniment; for syllables that need emphasis, increasing their intensity and changing their pitch contour; and smoothing the boundaries between adjacent syllables. The converted speech is input to the music mixer 106. The structure and operation of the speech converter 103 are described in more detail later with reference to Fig. 8.
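The emphasis step can be pictured with a toy sketch: boost the amplitude of a stressed vowel and tilt its pitch contour upward. Representing the pitch contour as a per-frame f0 array and using a linear ramp are assumptions for illustration only; an actual implementation would apply a pitch-modification technique such as PSOLA.

```python
# Toy sketch of "strengthen a syllable": amplitude boost plus a gradual
# rise of the pitch contour. Not the patent's implementation.
import numpy as np

def emphasize(vowel_samples, f0_contour, gain=1.4, rise=1.1):
    """Return (boosted samples, raised pitch contour) for a stressed vowel."""
    boosted = np.clip(gain * vowel_samples, -1.0, 1.0)  # avoid overflow
    ramp = np.linspace(1.0, rise, len(f0_contour))      # gradual f0 rise
    return boosted, f0_contour * ramp
```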
The music mixer 106 mixes the accompaniment from the accompaniment music generator 105 with the speech converted by the speech converter 103 to produce the synthesized rap music. The music mixer 106 can also provide an interactive user interface through which the user can change the mixing ratio of each track and adjust the equalization of the rap music. In addition, through the provided user interface, the user can add specific audio effects to the synthesized rap music, or adjust parameters of the rap music, such as its tempo.
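The mixer's core operation can be sketched as a gain-weighted sum of the two tracks. The gain values and the hard clipping used here are assumptions for illustration; a production mixer would also handle equalization and effects.

```python
# Minimal mixing sketch: weighted sum of voice and accompaniment tracks.
import numpy as np

def mix_tracks(voice, accompaniment, voice_gain=1.0, accomp_gain=0.6):
    """Mix two equal-rate mono tracks, padding the shorter with silence
    and clipping the sum to [-1, 1]."""
    n = max(len(voice), len(accompaniment))
    out = np.zeros(n)
    out[:len(voice)] += voice_gain * voice
    out[:len(accompaniment)] += accomp_gain * accompaniment
    return np.clip(out, -1.0, 1.0)
```

Exposing `voice_gain` and `accomp_gain` corresponds to the user-adjustable mixing ratio of each track mentioned above.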
The structures and functions of the syllable segmentation unit 102, the speech converter 103, the prosody analyzer 104, the accompaniment music generator 105, and the music mixer 106 of the speech-to-rap conversion device 100 are described in more detail below with reference to Figs. 2 to 8.
Fig. 2 is a block diagram of the syllable segmentation unit 102 of the speech-to-rap conversion device 100 according to an exemplary embodiment of the present invention. Referring to Fig. 2, the syllable segmentation unit 102 comprises a feature extractor 201, a hidden Markov model (HMM) database 202, an unsupervised syllable segmentation unit 203, and a supervised syllable segmentation unit 204.
The unsupervised syllable segmentation unit 203 and the supervised syllable segmentation unit 204 work in different modes. In automatic mode, the input of the syllable segmentation unit 102 is the speech signal input by the user, and its output is the duration, intensity, pronunciation, etc. of each syllable of the input speech, output by the unsupervised syllable segmentation unit 203. If the user manually inputs the lyrics of the speech through the user interface, the output of the syllable segmentation unit 102 comes from the supervised syllable segmentation unit 204.
Each component of the syllable segmentation unit 102 is explained in more detail below.
The feature extractor 201 is a feature extraction unit that extracts conventional Mel-frequency cepstral coefficient (MFCC) feature vectors from the user's input speech. Each feature vector comprises 13 MFCC features, 13 first-order MFCC difference features, and 13 second-order MFCC difference features, for 39 dimensions in total. The time width and time shift of each analysis window are 20 milliseconds and 10 milliseconds, respectively. Since MFCCs are a common feature in the field of speech recognition, widely used in many speech recognition systems and well known to those skilled in the art, their details are not described here.
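The assembly of the 39-dimensional vector can be sketched as stacking the 13 base coefficients with their first- and second-order differences. The simple adjacent-frame difference used below is an assumption; real front ends typically compute deltas over a regression window.

```python
# Sketch: build 39-dim vectors (13 MFCC + delta + delta-delta) from a
# matrix of 13-dim base MFCCs, one row per 10 ms frame.
import numpy as np

def add_deltas(mfcc):
    """mfcc: (n_frames, 13) array -> (n_frames, 39) array."""
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])    # first-order diffs
    delta2 = np.diff(delta, axis=0, prepend=delta[:1]) # second-order diffs
    return np.hstack([mfcc, delta, delta2])
```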
A hidden Markov model (HMM) is a statistical recognition model commonly used in speech recognition systems. Using hundreds of hours of labeled speech corpus together with the feature extractor 201, phone-based HMMs can be trained by the expectation-maximization (EM) algorithm. The trained HMMs are stored in the HMM database 202 and serve as one of the input parameters of the unsupervised syllable segmentation unit 203 and the supervised syllable segmentation unit 204. The details of MFCC feature vectors and HMMs are described, for example, in "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development" by X.D. Huang, A. Acero and H.W. Hon, published by Prentice Hall PTR, and in "Digital Processing of Speech Signals" by Yang Hangjun, Chi Huisheng et al., published by Electronic Industry Press, and are therefore not described in greater detail here.
The unsupervised syllable segmentation unit 203 automatically segments the syllables of the input speech. In this case, the content of the input speech is unknown.
The input parameters of the unsupervised syllable segmentation unit 203 include: the 39-dimensional MFCC feature vectors extracted from the input speech by the feature extractor 201; and the EM-trained HMMs pre-stored in the HMM database 202. The parameters output by the unsupervised syllable segmentation unit 203 can include the duration and energy intensity of each syllable of the input speech, the onset of each vowel, the possible positions of rhyme feet, etc.
Fig. 3 shows the decoding network of the unsupervised syllable segmentation unit 203. As shown in Fig. 3, the decoding network is a syllable loop decoded with the Viterbi method, where "syllable N" denotes the state chain of the N-th syllable, "SP" denotes the state chain of the short-pause model, and "SIL" denotes the state chain of the silence model. The output of the decoding network is the state string that best matches the input speech. By backtracking along the time axis, the consonant and vowel in each syllable can be segmented.
Like the unsupervised syllable segmentation unit 203, the supervised syllable segmentation unit 204 also takes the 39-dimensional feature vector stream and the pre-trained HMMs as input. Unlike the unsupervised syllable segmentation unit 203, however, the supervised syllable segmentation unit 204 has an additional input parameter: the lyrics of the input speech signal. Since the user has input not only the speech signal (e.g., through a microphone) but also the corresponding lyrics, the content of the input speech is known in advance. Therefore, the supervised syllable segmentation unit 204 does not need the syllable-loop decoding network described above.
According to an exemplary embodiment of the present invention, the supervised syllable segmentation unit 204 segments the syllables of the input speech using the forced alignment method commonly used in the training stage of speech recognition. Fig. 4 shows the forced alignment method used in the supervised syllable segmentation unit 204 according to an exemplary embodiment of the present invention. In Fig. 4, the horizontal axis represents the feature vectors of the input speech along time, the vertical axis represents the state chains ordered according to the input lyrics, and each point represents the likelihood score computed from the current feature vector and state. According to the maximum likelihood criterion, the best matching path can be traced, and each syllable of the input speech can then be segmented. Since the content of the input speech is known in advance, the syllable segmentation result of the supervised syllable segmentation unit 204 is very accurate.
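The forced alignment idea can be sketched as a small dynamic program: frames are assigned left-to-right to a fixed state chain, each frame either staying in the current state or advancing to the next, maximizing the summed log-likelihood. This is a minimal illustration under simplified assumptions (no HMM transition probabilities, single-state granularity), not the patent's implementation.

```python
# Minimal forced-alignment sketch over a frame-by-state log-likelihood
# matrix: monotonic path from the first to the last state.
import numpy as np

def forced_align(loglik):
    """loglik: (n_frames, n_states) log-likelihood of each frame under
    each state. Returns the state index assigned to each frame."""
    T, S = loglik.shape
    D = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    D[0, 0] = loglik[0, 0]                 # path must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = D[t - 1, s]
            move = D[t - 1, s - 1] if s > 0 else -np.inf
            if move > stay:
                D[t, s], back[t, s] = move + loglik[t, s], s - 1
            else:
                D[t, s], back[t, s] = stay + loglik[t, s], s
    path = [S - 1]                         # path must end in last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

Backtracking the best path yields the frame index where each state (and hence each syllable) begins, which is exactly the boundary information the segmentation unit outputs.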
The prosody analyzer 104 is described in detail below with reference to Figs. 5 and 6.
Fig. 5 is a block diagram of the structure of the prosody analyzer 104 according to an exemplary embodiment of the present invention. As shown in Fig. 5, the prosody analyzer 104 comprises: a voice activity detector 301, a rhyme detector 302, a stress detector 303, a syllable duration normalization unit 304, and a rhythm pattern generator 305.
First, rhythm patterns are briefly explained. In Western music, the most typical music meters are 2/4, 3/4, and 4/4, indicating that each measure contains two, three, or four quarter notes, respectively. The meter of rap music is usually 4/4. For each meter, certain rules determine which notes are stressed. When a singer sings a song, he is more or less influenced by the rules of the meter.
Fig. 6 shows the stressed notes in three different meters, where black dots denote stressed notes, gray dots denote secondary stressed notes, and white dots denote unstressed notes. As described above, the meter determines which notes are stressed.
In addition, rhyme should also be considered in order to obtain a better musical rhythm pattern. A rhyme is generally defined by words having the same vowel. If every sentence ends with the same rhyme, the sentences will sound rhymed like a poem. In fact, the lyrics of most pleasant-sounding songs rhyme.
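The rhyme criterion above can be illustrated with a toy helper that marks lines whose final syllables share the same vowel. Representing syllables as (consonant, vowel) pairs is an assumption for this example; a real system would use the phone labels output by the recognizer.

```python
# Toy rhyme check: lines whose last syllable has the most common
# final vowel are treated as rhyming.
def find_rhyming_lines(lines):
    """lines: list of syllable lists, each syllable a (consonant, vowel)
    pair. Returns indices of lines whose final vowel equals the most
    common final vowel."""
    finals = [line[-1][1] for line in lines]           # last vowel per line
    target = max(set(finals), key=finals.count)        # dominant rhyme
    return [i for i, v in enumerate(finals) if v == target]
```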
To obtain better rap music, the rhythm pattern of the input speech signal should be analyzed. As described above, three important elements need to be considered in analyzing the rhythm pattern: stress, rhyme, and meter. The prosody analyzer 104 extracts these elements from the input speech signal.
The voice activity detector 301 in the prosody analyzer 104 searches for "SIL (silence)" in the syllable string recognized by the syllable segmentation unit 102. Silence usually indicates the end of a sentence. Information about the found silences is input to the rhyme detector 302 for rhyme analysis.
The input parameters of the rhyme detector 302 include: the positions of the silences in the input speech detected by the voice activity detector 301, and the vowel before each silence. Using this information, the rhyme detector 302 can find the positions of the rhyme feet.
The stress detector 303 uses the energy intensity of each syllable of the input speech and finds the syllables with higher intensity as the stresses of the input speech.
The syllable duration normalization unit 304 normalizes the duration of each syllable to the length of a whole note, half note, quarter note, or eighth note.
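This normalization can be sketched as snapping each measured duration to the nearest of the four note values. The 0.5-second quarter note below corresponds to a 120 BPM tempo and is an assumption for the example.

```python
# Sketch: quantize a syllable duration to the nearest note value,
# expressed as a fraction of a whole note.
def normalize_duration(duration_s, quarter_s=0.5):
    """Snap a syllable duration (seconds) to the closest of whole, half,
    quarter, or eighth note."""
    note_values = [1.0, 0.5, 0.25, 0.125]   # whole .. eighth
    whole_s = 4 * quarter_s
    frac = duration_s / whole_s
    return min(note_values, key=lambda v: abs(v - frac))
```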
The rhythm pattern generator 305 uses the outputs of the rhyme detector 302, the stress detector 303, and the syllable duration normalization unit 304 to generate the rhythm pattern of the input speech.
The structure and function of the accompaniment music generator 105 are described below with reference to Fig. 7.
Fig. 7 is a block diagram of the structure of the accompaniment music generator 105 according to an exemplary embodiment of the present invention. Referring to Fig. 7, according to its working mode (automatic mode or semi-automatic mode), the accompaniment music generator 105 is broadly divided into two parts: an automatic accompaniment generating part and a semi-automatic accompaniment generating part.
The automatic accompaniment generating part is introduced first. It receives the rhythm pattern of the input speech signal from the prosody analyzer 104, and can comprise an accompaniment template library 401, a template repeater 402, a template selector 403, and a music signal generator 404.
The accompaniment template library 401 is a database of accompaniment fragments. Each accompaniment template is classified by its meter, tempo, instruments, score, etc.
Three simple meters are adopted in the accompaniment template library 401, namely 2/4, 3/4, and 4/4; their characteristics were described with reference to Fig. 6. The library includes three typical tempos, namely 60, 90, and 120 BPM (beats per minute), representing slow, medium, and fast, respectively. In addition, dozens of instruments are selected in the library, such as string, keyboard, and wind instruments, and several solo pieces are pre-stored for each meter, tempo, and instrument.
As for the file format of the accompaniment templates, the MIDI format is in fact more suitable than raw or compressed waveform formats. Since MIDI is a symbolic music format, it is more flexible for changing attributes such as tempo or instrument. Of course, music files in raw or compressed waveform format can also serve as accompaniment templates; they are just less flexible to use than MIDI files.
The rhythm-pattern information of each accompaniment template, such as the position of each beat, the position and intensity of strong beats, and the rhyme positions, can be pre-stored in the memory of the accompaniment template library 401.
Rap accompaniment music is usually very simple, consisting of only 4 or 8 repeated measures. Therefore, the accompaniment music templates in the accompaniment music template library 401 are generally very short. The template repeating unit 402 repeats a short accompaniment music template before it is used in the template selector 403. To guarantee the sound quality of the speech conversion, the duration after template repetition must lie between 0.5 and 2 times the duration of the input speech. For example, if the input speech is 40 seconds long, the accompaniment music should be between 20 and 80 seconds long. If one accompaniment music template is 8 seconds long, it should be repeated 3 times (24 seconds), 4 times (32 seconds), 5 times (40 seconds), 6 times (48 seconds), 7 times (56 seconds), 8 times (64 seconds), 9 times (72 seconds) and 10 times (80 seconds), forming 8 candidate accompaniment music templates. The most suitable accompaniment music template is then selected by a dynamic programming (DP) algorithm.
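The repetition rule described above can be sketched as a short routine; the function below is illustrative only (the name and the plain-seconds interface are not from the patent):

```python
import math

def valid_repetitions(speech_sec, template_sec):
    """All repetition counts whose total length lies between
    0.5 and 2 times the input speech length."""
    lo = max(1, math.ceil(0.5 * speech_sec / template_sec))
    hi = math.floor(2.0 * speech_sec / template_sec)
    return list(range(lo, hi + 1))

# The example in the text: 40 s of speech with an 8 s template
# yields the 8 candidates of 3..10 repetitions (24 s .. 80 s).
```

Each candidate repetition count produces one candidate template, and the DP matching described below then picks among them.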
Based on the rhythm model of the input speech received from the prosody analyzer 104 and the rhythm model of each accompaniment music template, the template selector 403 uses a dynamic programming (DP) algorithm to compute the matching score between the input speech and each accompaniment music template. The accompaniment music template that obtains the highest matching score is the accompaniment music that best matches the input speech. The template selector 403 may also send the DP matching result to the speech converter 103 as a reference for the speech conversion.
The DP algorithm is as follows:
D(i,j) = MAX{D(i-1,j), D(i-1,j-1), D(i-2,j-1)} + d(i,j)
For all (i,j):
Initial: d(i,j) = 0
Matching a stressed beat: d(i,j) = d(i,j) + 2
Matching a weak beat: d(i,j) = d(i,j) + 1
Matching a rhyme foot: d(i,j) = d(i,j) + 1
Here, i denotes the index in the syllable sequence of the input speech, and j denotes the index in the beat sequence of the accompaniment music. d(i,j) is the local matching score between the rhythm model of the i-th syllable of the input speech and the rhythm model of the j-th beat of the accompaniment music, and D(i,j) is the accumulated matching score between the two up to the i-th syllable and the j-th beat.
The initial value of every d(i,j) is 0. If the i-th syllable of the input speech and the j-th beat of the accompaniment music are both stressed beats, 2 points are added to d(i,j); if both are weak beats, 1 point is added to d(i,j); and if both fall on a rhyme foot, 1 point is added to d(i,j).
D(i,j) is obtained by selecting the maximum of the three values D(i-1,j), D(i-1,j-1) and D(i-2,j-1) and then adding d(i,j); it is the accumulated matching score between the rhythm model of the i-th syllable of the input speech and that of the j-th beat of the accompaniment music.
For each D(i,j), the information about which of the three values D(i-1,j), D(i-1,j-1) and D(i-2,j-1) it was accumulated from needs to be recorded. By comparing the dynamic matching scores, the accompaniment music template with the highest score is selected as the final accompaniment music, and the optimal matching path between the rhythm model of the input speech and that of this accompaniment music can be recovered by back-tracing the recorded path transitions.
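As a concrete illustration of the recursion and back-tracing just described, a minimal sketch follows; the dictionary representation of syllables and beats is an assumed convenience, not part of the patent:

```python
def local_score(syllable, beat):
    """d(i,j): +2 if both stressed, +1 if both weak, +1 if both rhyme."""
    s = 0
    if syllable["stressed"] and beat["stressed"]:
        s += 2
    if not syllable["stressed"] and not beat["stressed"]:
        s += 1
    if syllable.get("rhyme") and beat.get("rhyme"):
        s += 1
    return s

def dp_match(syllables, beats):
    """Return the accumulated score D(n,m) and the back-traced path."""
    n, m = len(syllables), len(beats)
    NEG = float("-inf")
    D = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, arg = NEG, None
            # the three predecessors of the recursion
            for pi, pj in ((i - 1, j), (i - 1, j - 1), (i - 2, j - 1)):
                if pi >= 0 and pj >= 0 and D[pi][pj] > best:
                    best, arg = D[pi][pj], (pi, pj)
            if arg is not None:
                D[i][j] = best + local_score(syllables[i - 1], beats[j - 1])
                back[i][j] = arg   # record which predecessor was used
    # recover the optimal matching path by back-tracing
    path, cell = [], (n, m)
    while cell is not None and cell != (0, 0):
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    return D[n][m], list(reversed(path))
```

Running `dp_match` once per candidate template and keeping the template with the highest returned score corresponds to the selection step described above.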
If the accompaniment music template is pre-stored in MIDI format, the music signal generator 404 is further needed to synthesize the music signal according to specific parameters. The generated music signal is sent to the music mixer 106 as the rap accompaniment music.
The semi-automatic accompaniment music generating portion will now be described. As shown in Fig. 7, the semi-automatic accompaniment music generating portion receives a music fragment specified by the user through the user interface 101, and mainly comprises an audio-signal beat detector 405, a beat intensity detector 406 and a rhythm matching unit 407.
The audio-signal beat detector 405 detects the percussion beats in the music audio signal. According to an exemplary embodiment of the present invention, the percussion beats in the music audio signal can be detected, for example, by detecting peaks in each sub-band of the music audio signal, counting the peaks of each sub-band signal, combining all peaks at the same time tag, and locating the position of each beat. However, those skilled in the art should appreciate that other common methods can also be adopted to detect the beats of an audio signal (see J. Foote, "The Beat Spectrum: A New Approach to Rhythm Analysis", in Proc. of ICME, pp. 881-884, 2001; E. Scheirer, "Tempo and Beat Analysis of Acoustic Music Signals", J. Acoust. Soc. Am., vol. 103, no. 1, pp. 588-601, 1998; and M. Alonso, B. David and G. Richard, "Tempo and Beat Estimation of Musical Signals", in Proc. of ISMIR, pp. 158-163, 2004).
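A toy rendering of the sub-band peak combination described above might look like the following; it assumes the audio has already been reduced to per-band onset-strength sequences, which a real system would compute from filtered sub-band signals:

```python
def detect_beats(subband_onsets, threshold=1.0):
    """Combine per-band onset strengths at each time tag and keep
    local maxima above `threshold` as candidate beat positions
    (frame indices). `subband_onsets` is a list of equal-length
    per-band sequences; the threshold value is illustrative."""
    frames = len(subband_onsets[0])
    combined = [sum(band[t] for band in subband_onsets) for t in range(frames)]
    beats = []
    for t in range(1, frames - 1):
        # a peak: strictly above the next frame, not below the previous
        if combined[t] > threshold and combined[t - 1] <= combined[t] > combined[t + 1]:
            beats.append(t)
    return beats
```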
The beat intensity detector 406 obtains the intensity of each beat detected by the audio-signal beat detector 405. It computes and compares the intensities of the detected beats and divides all beats into two classes: stressed beats and non-stressed beats.
By combining the outputs of the audio-signal beat detector 405 and the beat intensity detector 406, a rhythm model in simplified form can be extracted. This rhythm model contains only information about the stressed beats, and this stressed-beat information is used in the rhythm matching unit 407 for matching against the input speech.
The rhythm matching unit 407 obtains the optimal matching path between the rhythm model of the input speech and the rhythm model of the specified accompaniment music. The method used in the rhythm matching unit 407 is similar to the DP matching algorithm used in the template selector 403. The difference is that the rhythm model of the accompaniment music extracted by the beat intensity detector is in simplified form.
The DP matching algorithm used in the rhythm matching unit 407 is as follows:
D(i,j) = MAX{D(i-1,j), D(i-1,j-1), D(i-2,j-1)} + d(i,j)
For all (i,j):
Initial: d(i,j) = 0
Matching a stressed beat: d(i,j) = d(i,j) + 2
Matching an unstressed beat: d(i,j) = d(i,j) + 1
Here, i denotes the index in the syllable sequence of the input speech, and j denotes the index in the beat sequence of the accompaniment music; d(i,j) is the local matching score between the rhythm model of the i-th syllable of the input speech and that of the j-th beat of the accompaniment music, and D(i,j) is the corresponding accumulated matching score.
The initial value of every d(i,j) is 0. If the i-th syllable of the input speech and the j-th beat of the accompaniment music are both stressed beats, 2 points are added to d(i,j); if both are non-stressed beats, 1 point is added to d(i,j).
D(i,j) is obtained by selecting the maximum of the three values D(i-1,j), D(i-1,j-1) and D(i-2,j-1) and then adding d(i,j); it is the accumulated matching score between the rhythm model of the i-th syllable of the input speech and that of the j-th beat of the accompaniment music.
For each D(i,j), the information about which of the three values D(i-1,j), D(i-1,j-1) and D(i-2,j-1) it was accumulated from needs to be recorded. By comparing the dynamic matching scores, the accompaniment music with the highest score is selected as the final accompaniment music, and the optimal matching path between the rhythm model of the input speech and that of this accompaniment music can be recovered by back-tracing the recorded path transitions.
As described above, the accompaniment music generator 105 comprises an automatic accompaniment music generating portion and a semi-automatic accompaniment music generating portion that operate in different modes. The output of the accompaniment music generator 105 mainly comprises: the audio signal of the accompaniment music, to be used in the music mixer 106; the rhythm model of the accompaniment music, to be used in the speech converter 103; and the optimal matching path between the rhythm model of the input speech and the rhythm model of the accompaniment music, also to be used in the speech converter 103.
Next, the structure and function of the speech converter 103 in the speech-to-rap conversion apparatus 100 are described with reference to Fig. 8.
Fig. 8 is a block diagram of the structure of the speech converter 103 according to an exemplary embodiment of the present invention. Referring to Fig. 8, the speech converter 103 converts the syllable-segmented input speech into rap form according to the rhythm model of the input speech, the rhythm model of the generated accompaniment music, and the optimal matching path between them. As shown in Fig. 8, the speech converter 103 comprises: a start-position synchronizing unit 501, a syllable duration modifying unit 502, a syllable stressing unit 503, a syllable weakening unit 504 and a syllable edge smoother 505.
The start-position synchronizing unit 501 synchronizes the start positions of the vowels in the input speech with the start positions of the beats of the accompaniment music.
The syllable duration modifying unit 502 modifies the duration of each syllable in the input speech according to the rhythm model of the accompaniment music, and also modifies the durations of the pauses according to that rhythm model.
The syllable stressing unit 503 increases the intensity of the syllables that need to be stressed, and changes the pitch contour of such a syllable into a raised shape to obtain the stressing effect.
If the intensity of a syllable at a weak-beat position is too high, the syllable weakening unit 504 reduces the intensity of that syllable.
The syllable edge smoother 505 smooths the boundary between adjacent syllables. It adjusts the phase of the audio signal to guarantee waveform continuity between each pair of adjacent syllables.
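The patent specifies phase adjustment at the boundary; as a hedged stand-in, a plain linear cross-fade illustrates the kind of boundary smoothing involved (the overlap length and weighting are illustrative choices, not the patent's method):

```python
def crossfade(a, b, overlap):
    """Join two syllable waveforms (lists of samples) with a linear
    cross-fade over `overlap` samples so the joined waveform has no
    abrupt discontinuity at the syllable boundary."""
    assert 0 < overlap <= min(len(a), len(b))
    mixed = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)            # fade-in weight for b
        mixed.append((1 - w) * a[len(a) - overlap + k] + w * b[k])
    return list(a[:len(a) - overlap]) + mixed + list(b[overlap:])
```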
In addition to the automatic mode described above, the speech converter 103 can also be controlled through the user interface 101. The speech converter 103 can provide a user interface for changing the matching result; for example, the user can combine two syllables into one beat, or stretch one syllable over two beats. The speech converter 103 can also provide a user interface for changing the attributes of each syllable; for example, the user can raise, lower or bend the pitch of any syllable, and can strengthen or weaken the intensity of any syllable.
The music mixer 106 receives the audio track of the rap-form voice from the speech converter 103 and the audio track of the accompaniment music from the accompaniment music generator 105, and mixes them into rap music with accompaniment.
The main operations of the music mixer 106 are as follows: synthesizing the two audio tracks at a ratio specified by the user or at the default 1:1 ratio; providing a user interface for the user to embellish the synthesized music; providing a user interface for the user to slow down or speed up the rap music; and optionally equalizing the rap music. To embellish the synthesized music, the user can add interjections (such as "ha", "yeah" and the like), add some natural sounds, and add scratch sounds and other sound effects, such as wind sounds or a bicycle bell, to the rap music.
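The synthesis of the two tracks reduces to a weighted per-sample sum; the sketch below normalizes a user ratio into two gains, which is one plausible reading of the text rather than the patent's exact mixing rule:

```python
def mix_tracks(voice, accomp, ratio=1.0):
    """Mix a rap-voice track and an accompaniment track sample by
    sample. `ratio` is the voice-to-accompaniment gain ratio; the
    default 1.0 gives the 1:1 mix. The shorter track is zero-padded."""
    n = max(len(voice), len(accomp))
    v = list(voice) + [0.0] * (n - len(voice))
    a = list(accomp) + [0.0] * (n - len(accomp))
    w = ratio / (ratio + 1.0)          # normalize so the gains sum to 1
    return [w * x + (1.0 - w) * y for x, y in zip(v, a)]
```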
Next, a user interface according to an exemplary embodiment of the present invention will be described, taking the application of the speech-to-rap conversion apparatus 100 to a mobile phone as an example. However, those skilled in the art should appreciate that the present invention is not limited thereto. The speech-to-rap conversion apparatus 100 can also be applied to other devices, for example a music player, a PDA, a PC, and so on.
Fig. 9A to Fig. 12B illustrate the user interfaces provided in the speech-to-rap conversion apparatus 100 according to an exemplary embodiment of the present invention.
Figs. 9A and 9B show the user interface provided by the speech-to-rap conversion apparatus 100 in the automatic mode. As shown in Fig. 9A, the user only needs to record a segment of speech, for example through the microphone on a mobile phone, and the speech-to-rap conversion apparatus 100 according to the present invention can then automatically convert the input speech into a rap song with rap accompaniment music. Afterwards, as shown in Fig. 9B, the user can enjoy the synthesized rap music, for example through the earphones of the mobile phone.
Fig. 10A shows the user interface provided by the speech-to-rap conversion apparatus 100 according to the present invention for importing an accompaniment music fragment. As described above, besides selecting accompaniment music from the accompaniment music template library, the user can also import a favorite music fragment as the rap accompaniment music through the user interface shown in Fig. 10A. The speech-to-rap conversion apparatus 100 according to the present invention can extract prosodic information from the imported music fragment, obtain the optimal matching path, and apply it to the input speech. Fig. 10B shows the user interface provided by the speech-to-rap conversion apparatus 100 according to the present invention for entering the lyrics of the input speech signal. The user can enter the lyrics through the user interface shown in Fig. 10B, making the analysis of the rhythm model of the input speech more accurate.
Figs. 11A and 11B show the user interfaces provided by the speech-to-rap conversion apparatus 100 according to the present invention for modifying and editing the attributes of the rap music.
As shown in Fig. 11A, through this user interface the user can edit the result of syllable segmentation and rhythm-model matching, and the attributes of each syllable of the input speech can be modified. Because the current example targets an application on a mobile phone, some frequently used keys on the mobile phone can be assigned dedicated functions. For example, the up/down/left/right keys on the mobile phone can serve as direction keys that move the on-screen cursor to select the syllable to be edited; number keys 1, 2 and 3 can serve as keys for adjusting the pitch of a syllable (raising, bending or lowering); number keys 4 and 6 can serve as keys for strengthening and weakening the intensity of a syllable; and number keys 7 and 9 can serve as keys for moving the selected syllable to the previous or the next beat.
Fig. 11B shows the user interface provided by the speech-to-rap conversion apparatus 100 according to the present invention for adding sound effects to the synthesized rap music. Similarly to the example above, number keys 1 to 6 can serve as keys for adding interjections (such as "ha", "yeah" and the like) to embellish the rap music, and number keys 7 and 9 can be used to speed up and slow down the rap music without changing its pitch and timbre. In addition, through this user interface the user can also play back and listen to the synthesized rap music and add the desired sound effects at the proper positions with the corresponding keys, thereby obtaining better rap music.
Next, a method of converting speech into rap music according to the present invention is described with reference to Figs. 12 to 15.
Fig. 12 is a flowchart illustrating a method of converting speech into rap music according to an exemplary embodiment of the present invention. As shown in Fig. 12, when the user inputs a segment of speech, for example through a microphone, each syllable of the input speech signal is segmented in step 1201. As described with reference to Fig. 2, the method of converting speech into rap music according to the present invention can train phoneme HMMs in advance with the EM algorithm on a speech corpus of hundreds of hours to build an HMM database, and extract 39-dimensional MFCC feature vectors from the input speech, so that, using the HMMs in the HMM database, the duration and intensity of each syllable of the speech input by the user, the start position of each vowel, and so on can be obtained. In addition, if the user has also entered the lyrics corresponding to the speech, the syllables of the input speech signal can be segmented more accurately in step 1201.
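The 39-dimensional vectors mentioned above are the 13 static MFCCs stacked with their first- and second-order differences. A minimal sketch follows, in which a simple one-frame difference stands in for the regression-based deltas usually used in HMM front ends:

```python
def add_deltas(mfcc):
    """Stack 13 static MFCCs per frame with their first- and
    second-order differences to form 39-dimensional vectors.
    `mfcc` is a list of 13-element frames; the first frame's
    difference is defined as zero."""
    def diff(frames):
        # difference between each frame and the previous one
        return [[c - p for c, p in zip(cur, prev)]
                for prev, cur in zip([frames[0]] + frames[:-1], frames)]
    d1 = diff(mfcc)
    d2 = diff(d1)
    return [f + a + b for f, a, b in zip(mfcc, d1, d2)]
```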
In step 1203, the input speech is analyzed according to the result obtained in step 1201 to obtain the rhythm model of the input speech. Step 1203 mainly considers three elements commonly found in rap: stress, rhyme and the music beat pattern.
The process of step 1203 is described in more detail with reference to Fig. 13. Referring to Fig. 13, when analyzing the rhythm model of the input speech, first, in step 1301, the silences (which usually indicate the end of a sentence) in the speech segmented in step 1201 are detected. Subsequently, in step 1303, the positions of the rhymes in the input speech are detected using the positions of the silences detected in step 1301 and the vowel before each silence.
Next, in step 1305, the positions of the stresses are detected using the intensity of each syllable in the input speech. In step 1307, the duration of each syllable is normalized to the length of a whole note, a half note, a quarter note or an eighth note. Using the results of the above steps, the rhythm model of the input speech is generated in step 1309.
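The normalization of step 1307 amounts to snapping each syllable duration to the nearest of the four note lengths; a sketch, assuming the quarter-note duration is known from the tempo (the nearest-neighbour choice is an assumed reading of "normalized"):

```python
def quantize_duration(dur_sec, quarter_sec):
    """Snap a syllable duration (seconds) to the nearest of the four
    note lengths named in the text: whole, half, quarter or eighth."""
    lengths = {
        "whole":   4 * quarter_sec,
        "half":    2 * quarter_sec,
        "quarter": quarter_sec,
        "eighth":  quarter_sec / 2,
    }
    return min(lengths, key=lambda name: abs(lengths[name] - dur_sec))
```

For example, at 120 BPM a quarter note lasts 0.5 seconds, so a 0.24-second syllable normalizes to an eighth note.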
Returning to Fig. 12, in step 1205 following step 1203, the rap accompaniment music is generated.
In the method of converting speech into rap music according to the present invention, there are two ways of generating the rap accompaniment music: an automatic accompaniment music generating mode, in which suitable accompaniment music is selected from the accompaniment music template library according to the rhythm model of the input speech; and a semi-automatic accompaniment music generating mode, in which the user can import a favorite music fragment as the accompaniment music.
Below, the processes of step 1205 in these two modes are described in more detail with reference to Figs. 14A and 14B, respectively.
Fig. 14A shows the process of generating accompaniment music in the automatic accompaniment music generating mode. Referring to Fig. 14A, in the automatic accompaniment music generating mode, a pre-established accompaniment music template library is used to select the accompaniment music. Typical accompaniment music templates, classified according to their meter, tempo, instrument and so on, are stored in the accompaniment music template library. Because rap accompaniment music is usually very simple, consisting of only 4 or 8 measures, the accompaniment music templates in the accompaniment music template library are usually very short. Therefore, in step 1401, the accompaniment music templates from the accompaniment music template library are first repeated to form accompaniment music templates of appropriate length. Then, in step 1403, suitable accompaniment music is selected from the accompaniment music template library based on the analysis result of the rhythm model of the input speech.
As described above with reference to Fig. 7, a DP algorithm can be adopted to compute the matching score between the input speech and each repeated accompaniment music template, and the accompaniment music template with the highest matching score is selected. At the same time, the rhythm model of the selected accompaniment music can also be obtained. In addition, if the stored accompaniment music template is in MIDI format, the music signal needs to be synthesized in step 1405.
Fig. 14B shows the process of generating accompaniment music in the semi-automatic accompaniment music generating mode. Referring to Fig. 14B, in the semi-automatic accompaniment music generating mode, if the user imports a specific music fragment as the accompaniment music, the percussion beats in the imported audio signal are first detected in step 1407. Afterwards, in step 1409, the intensity of each beat obtained in step 1407 is detected, and all beats are divided into stressed beats and non-stressed beats according to their intensities. From the detection results of the above steps, the simple rhythm model information of the imported music fragment, containing only information about the stressed beats, can be obtained.
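The two-class split of step 1409 can be sketched as a comparison of each beat's strength against a reference level; the mean-based threshold below is an assumed choice, since the patent does not fix a particular one:

```python
def classify_beats(strengths):
    """Split detected beats into stressed / unstressed by comparing
    each beat's strength with the mean strength (assumed threshold)."""
    mean = sum(strengths) / len(strengths)
    return ["stressed" if s > mean else "unstressed" for s in strengths]
```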
In step 1411, the optimal matching path between the rhythm model of the input speech and the rhythm model of the specified accompaniment music is obtained according to the rhythm model of the imported music. In step 1411, the DP matching algorithm can likewise be adopted to compute the optimal matching path between the input speech and the accompaniment music.
Returning to Fig. 12, using the rhythm model of the accompaniment music and the optimal matching path obtained in step 1205, the syllable-segmented input speech is converted into rap form in step 1207.
In detail, referring to Fig. 15, in the process of converting the input speech into rap form, first, in step 1501, the start positions of the vowels in the input speech are synchronized with the start positions of the beats of the accompaniment music according to the rhythm model of the accompaniment music. In step 1503, the duration of each syllable and of each pause in the input speech is modified according to the rhythm model of the accompaniment music. Then, in step 1505, for the syllables in the input speech that need to be stressed, the stressing effect is obtained by changing the pitch contour of each such syllable into a raised shape.
In step 1507, if the intensity of a syllable at a weak-beat position is too high, the intensity of that syllable can be reduced. Finally, in step 1509, the boundaries between adjacent syllables are smoothed so that the waveform between adjacent syllables is continuous, without breakpoints. It should be noted that steps 1501 to 1509 can be performed in a different order as required.
In addition, during step 1207 the speech conversion process can also be controlled by the user. For example, the user can raise, lower or bend the pitch of any syllable, or strengthen or weaken the intensity of any syllable.
Returning to Fig. 12, after the conversion of the input speech is finished, in step 1209 the rap music with accompaniment is synthesized using the accompaniment music obtained in step 1205 and the speech converted in step 1207. In addition, the user can also add sound effects to the synthesized rap music through an interactive user interface.
It should be noted that in some modifications of the present invention, the functions or operations described in the flowchart blocks may be performed out of the order shown. For example, two blocks or operations shown in succession may in fact be executed substantially concurrently, or may sometimes be executed in the reverse order, depending on the functionality involved.
The present invention provides an apparatus and a method for converting speech input by a user into rap music. The apparatus and method can act directly on the speech input by the user, convert it into rap form and provide it with accompaniment music, while preserving the pitch and timbre of the user's voice. At the same time, the apparatus and method of the present invention also provide rich interactive interfaces for the user, allowing the user to modify the attributes of each word of the input speech to achieve the effect the user desires.
With the apparatus and method for converting speech into rap music according to the present invention, rap music with the user's own character can be produced automatically from the speech input by the user. In addition, rap accompaniment can be generated automatically for the speech input by the user, and the user can also select other favorite music as the accompaniment music. The apparatus and method can be applied to portable devices with recording and playback capability, such as mobile phones, music players and PDAs, and can also be applied to equipment such as personal computers and portable computers.
The apparatus and method for converting speech into rap music according to the present invention satisfy the user's wish to compose his or her own rap song, and thereby satisfy the user's pursuit of individuality. The user can simply speak a favorite article, poem or set of lyrics, for example into a microphone, to produce rap music performed in the user's own voice. Therefore, even a user without any knowledge of composition or singing skill can make his or her own rap music with the apparatus and method of the present invention.
Although several exemplary embodiments have been shown and described, those skilled in the art should understand that changes may be made to these exemplary embodiments without departing from the principle and spirit of the embodiments, the scope of which is defined in the claims and their equivalents.

Claims (36)

1. An apparatus for converting speech into rap music, comprising:
an accompaniment music generating portion for generating rap accompaniment music;
a speech conversion portion for converting speech input by a user into rap form based on the accompaniment music generated by the accompaniment music generating portion; and
a music mixing portion for mixing the rap accompaniment music generated by the accompaniment music generating portion with the rap-form speech converted by the speech conversion portion to form rap music.
2. The apparatus as claimed in claim 1, wherein the speech conversion portion comprises:
a syllable segmenter for dividing the input speech into a plurality of syllables to obtain information about each syllable;
a prosody analyzer for analyzing the input speech based on the information obtained by the syllable segmenter to detect the rhythm model of the input speech; and
a speech converter for converting the syllable-segmented input speech into rap form according to the rhythm model of the accompaniment music generated by the accompaniment music generating portion.
3. The apparatus as claimed in claim 2, wherein the syllable segmenter comprises:
a feature extractor for extracting MFCC feature vectors from the input speech;
an HMM database storing phoneme-based HMMs trained by the expectation-maximization algorithm; and
a syllable segmentation unit for dividing the input speech into a plurality of syllables using the MFCC feature vectors extracted by the feature extractor and the HMMs stored in the HMM database.
4. The apparatus as claimed in claim 3, wherein the information about each syllable comprises at least one of the duration and intensity of each syllable, the start position of each vowel, and the possible position of a rhyme.
5. The apparatus as claimed in claim 3, wherein each MFCC feature vector comprises 13-dimensional MFCC features, 13-dimensional MFCC first-order difference features and 13-dimensional MFCC second-order difference features, 39 dimensions in total.
6. The apparatus as claimed in claim 3, wherein the syllable segmentation unit comprises:
an unsupervised syllable segmentation unit for segmenting the syllables of the input speech using the extracted MFCC feature vectors and the stored HMMs; and
a supervised syllable segmentation unit for segmenting the syllables of the input speech using the extracted MFCC feature vectors, the stored HMMs and lyrics input by the user.
7. The apparatus as claimed in claim 6, wherein the decoding network of the unsupervised syllable segmentation unit is a syllable loop using the Viterbi method.
8. The apparatus as claimed in claim 6, wherein the supervised syllable segmentation unit adopts a forced alignment method to segment the syllables of the input speech.
9. The apparatus as claimed in claim 2, wherein the prosody analyzer detects at least one of the stresses, rhymes and music beats in the input speech.
10. The apparatus as claimed in claim 9, wherein the prosody analyzer comprises:
a voice activity detector for detecting silences in the input speech;
a rhyme detector for detecting the positions of rhymes based on the positions of the silences detected by the voice activity detector and the vowel in front of each silence;
a stress detector for detecting syllables with high intensity; and
a syllable duration normalization unit for normalizing the duration of each syllable.
11, equipment as claimed in claim 2, wherein, described accompaniment music generating portion comprises:
The accompaniment music template base, storage Chinese musical telling accompaniment music template;
The template selector switch based on the rhythm model by the detected input voice of described prosodic analysis device, is selected Chinese musical telling accompaniment music template, as the accompaniment music of input voice from the accompaniment music template base.
12, equipment as claimed in claim 11, wherein, described template selector switch calculates the coupling mark between input voice and each Chinese musical telling accompaniment music template, and selects to have the accompaniment music of the Chinese musical telling accompaniment music template of the highest coupling mark as the input voice.
13, equipment as claimed in claim 12, wherein, described template selector switch use dynamic programming algorithm is calculated the coupling mark between input voice and each Chinese musical telling accompaniment music template.
14, equipment as claimed in claim 11, wherein, described accompaniment music generating portion also comprises:
The template repetitive according to the length of input voice, carries out repetition with the Chinese musical telling accompaniment music template in the accompaniment music template base.
15, equipment as claimed in claim 14, wherein, the template repetitive carries out repetition to Chinese musical telling accompaniment music template, so that the length of the Chinese musical telling accompaniment music template after repeating is between 0.5~2 times of the length of importing voice.
16. The device of claim 11, wherein the rap accompaniment music templates in the accompaniment music template library are in MIDI format or in a music format annotated with rhythm patterns.
17. The device of claim 16, wherein the accompaniment music generating portion further comprises:
a music signal generator, which generates a music signal based on the MIDI-format rap accompaniment music template selected by the template selector.
18. The device of claim 11, wherein the rap accompaniment music templates in the accompaniment music template library are classified according to at least one of music beat, rhythm, musical instrument, and music score.
19. The device of claim 18, wherein the prosodic information of the rap accompaniment music templates is pre-stored in the accompaniment music template library.
20. The device of claim 2, wherein the accompaniment music generating portion comprises:
a beat detector, which detects the beats of a piece of music specified by the user as the accompaniment music for the input speech;
a beat intensity detector, which detects the intensity of the beats detected by the beat detector and divides all beats into strong beats and non-strong beats, thereby obtaining the rhythm pattern of the specified accompaniment music;
a rhythm matching unit, which calculates the optimal matching path between the rhythm pattern of the input speech and the rhythm pattern of the specified accompaniment music.
21. The device of claim 20, wherein the rhythm matching unit uses a dynamic programming algorithm to calculate the optimal matching path.
22. The device of claim 11 or 20, wherein the speech converter comprises:
a start position synchronization unit, which synchronizes the start positions of the vowels of the input speech with the start positions of the beats of the accompaniment music;
a syllable duration modification unit, which modifies the duration of each syllable and of each pause in the input speech according to the rhythm pattern of the accompaniment music;
a syllable enhancement unit, which increases the intensity of the syllables in the input speech that need enhancement, according to the rhythm pattern of the accompaniment music;
a syllable weakening unit, which reduces the intensity of the syllables in the input speech that fall on weak-beat positions, according to the rhythm pattern of the accompaniment music;
a syllable edge smoothing unit, which smooths the boundaries between adjacent syllables.
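Smoothing the boundary between adjacent syllables is commonly implemented as a short crossfade between the two segments. The claims do not specify a method, so the linear crossfade below is only an assumption, and all names are hypothetical.

```python
def crossfade(a, b, overlap):
    """Concatenate two sample sequences with a linear crossfade of
    `overlap` samples to smooth the syllable boundary."""
    assert overlap <= len(a) and overlap <= len(b)
    out = a[:len(a) - overlap]
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)                       # fade-in weight for b
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out
```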
23. The device of claim 22, further comprising a user interface through which the user can perform at least one of the following operations: editing the results of syllable segmentation and rhythm pattern matching; adding and modifying sound effects; and changing the speed of the rap music.
24. A method for converting speech into rap music, comprising:
a) generating rap accompaniment music;
b) converting the speech input by the user into rap form, based on the generated accompaniment music;
c) mixing the generated rap accompaniment music with the converted rap-form speech to form rap music.
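The final mixing step can be sketched as a sample-wise weighted sum of the converted voice and the accompaniment. The gain values below are illustrative defaults, not values from the patent.

```python
def mix(voice, accompaniment, voice_gain=1.0, acc_gain=0.6):
    """Sample-wise mix of converted voice and accompaniment signals.

    The shorter signal is zero-padded to the length of the longer one.
    """
    n = max(len(voice), len(accompaniment))
    out = []
    for i in range(n):
        v = voice[i] if i < len(voice) else 0.0
        a = accompaniment[i] if i < len(accompaniment) else 0.0
        out.append(voice_gain * v + acc_gain * a)
    return out
```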
25. The method of claim 24, wherein step b) comprises:
d) segmenting the input speech into a plurality of syllables to obtain information about each syllable;
e) analyzing the input speech based on the obtained information to detect the rhythm pattern of the input speech;
f) converting the syllable-segmented input speech into rap form according to the rhythm pattern of the accompaniment music.
26. The method of claim 25, wherein step d) comprises:
g) extracting MFCC feature vectors from the input speech;
h) segmenting the input speech into a plurality of syllables using the extracted MFCC feature vectors and HMMs pre-stored in an HMM database.
27. The method of claim 26, wherein step h) comprises:
when the user has not input lyrics, segmenting the syllables of the input speech using the extracted MFCC feature vectors and the stored HMMs with a syllable loop under the Viterbi method; and
when the user has input lyrics, segmenting the syllables of the input speech by forced alignment, using the extracted MFCC feature vectors, the stored HMMs, and the lyrics input by the user.
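The Viterbi method referenced above is the standard most-likely-path recursion over HMM states. The toy two-state (silence/syllable) discrete model below illustrates the recursion only; the actual system would decode over acoustic HMM states with MFCC observation likelihoods rather than these hand-set probabilities.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Standard Viterbi decoding: most likely state sequence for a
    discrete observation sequence."""
    # V[t][s] = (best probability ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = (prob, prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))
```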
28. The method of claim 25, wherein step e) comprises:
detecting silences in the input speech;
detecting rhyme positions based on the positions of the detected silences and the vowel preceding each silence;
detecting syllables of high intensity; and
normalizing the duration of each syllable.
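The first and last steps listed above (silence detection and duration normalization) can be sketched with short-time energy thresholding and mean-scaling. The frame length, energy threshold, and normalization target are assumptions for illustration, not values from the patent.

```python
def detect_silence(samples, frame_len=160, threshold=0.01):
    """Mark each frame silent (True) when its mean energy falls below
    the threshold; returns a list of per-frame flags."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy < threshold)
    return flags

def normalize_durations(durations):
    """Scale syllable durations so they average to 1.0 (one beat unit)."""
    mean = sum(durations) / len(durations)
    return [d / mean for d in durations]
```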
29. The method of claim 25, wherein step a) comprises:
i) selecting a rap accompaniment music template from an accompaniment music template library, based on the rhythm pattern of the input speech detected in step e), as the accompaniment music for the input speech.
30. The method of claim 29, wherein, in step i), a dynamic programming algorithm is used to calculate the matching score between the input speech and each rap accompaniment music template, and the rap accompaniment music template with the highest matching score is selected as the accompaniment music for the input speech.
31. The method of claim 29, wherein step a) further comprises:
j) repeating the rap accompaniment music template from the accompaniment music template library according to the length of the input speech.
32. The method of claim 31, wherein, in step j), the rap accompaniment music template is repeated so that the length of the repeated template is between 0.5 and 2 times the length of the input speech.
33. The method of claim 29, wherein, when the rap accompaniment music templates in the accompaniment music template library are in MIDI format, step a) further comprises: generating a music signal based on the selected MIDI-format rap accompaniment music template.
34. The method of claim 25, wherein step a) comprises:
detecting the beats of a piece of music specified by the user as the accompaniment music for the input speech;
detecting the intensity of the beats and dividing all beats into strong beats and non-strong beats, thereby obtaining the rhythm pattern of the specified accompaniment music; and
calculating the optimal matching path between the rhythm pattern of the input speech and the rhythm pattern of the specified accompaniment music.
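The strong/non-strong beat split above can be sketched with a simple threshold on detected beat intensities. Only a binary split is required; the mean-intensity threshold used here is an assumption.

```python
def classify_beats(beat_strengths):
    """Divide beats into 'strong' and 'weak' using the mean intensity
    as the divider (an illustrative threshold choice)."""
    mean = sum(beat_strengths) / len(beat_strengths)
    return ["strong" if s >= mean else "weak" for s in beat_strengths]
```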
35. The method of claim 34, wherein a dynamic programming algorithm is used to calculate the optimal matching path.
36. The method of claim 29 or 34, wherein step f) comprises:
synchronizing the start positions of the vowels of the input speech with the start positions of the beats of the accompaniment music;
modifying the duration of each syllable and of each pause in the input speech according to the rhythm pattern of the accompaniment music;
increasing the intensity of the syllables in the input speech that need enhancement, according to the rhythm pattern of the accompaniment music;
reducing the intensity of the syllables in the input speech that fall on weak-beat positions, according to the rhythm pattern of the accompaniment music; and
smoothing the boundaries between adjacent syllables.
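Synchronizing vowel starts with beat starts amounts to shifting each syllable onset to its nearest beat time. A minimal sketch with hypothetical names; a real implementation would then time-scale each syllable rather than merely shift it.

```python
def align_syllables_to_beats(syllable_onsets, beat_times):
    """Map each syllable onset (seconds) to the nearest beat time and
    return the time shift needed to land the onset on that beat."""
    shifts = []
    for onset in syllable_onsets:
        nearest = min(beat_times, key=lambda b: abs(b - onset))
        shifts.append(nearest - onset)
    return shifts
```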
CN2007101641328A 2007-09-30 2007-09-30 Device and method for conversing voice to be rap music Active CN101399036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101641328A CN101399036B (en) 2007-09-30 2007-09-30 Device and method for conversing voice to be rap music

Publications (2)

Publication Number Publication Date
CN101399036A true CN101399036A (en) 2009-04-01
CN101399036B CN101399036B (en) 2013-05-29

Family

ID=40517542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101641328A Active CN101399036B (en) 2007-09-30 2007-09-30 Device and method for conversing voice to be rap music

Country Status (1)

Country Link
CN (1) CN101399036B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0816181A (en) * 1994-06-24 1996-01-19 Roland Corp Effect addition device
US5641927A (en) * 1995-04-18 1997-06-24 Texas Instruments Incorporated Autokeying for musical accompaniment playing apparatus
US6169242B1 (en) * 1999-02-02 2001-01-02 Microsoft Corporation Track-based music performance architecture
JP4416244B2 (en) * 1999-12-28 2010-02-17 パナソニック株式会社 Pitch converter
CN100561577C (en) * 2006-09-11 2009-11-18 北京中星微电子有限公司 The method for changing speed of voice signal and system

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024453B (en) * 2009-09-09 2012-05-23 财团法人资讯工业策进会 Singing sound synthesis system, method and device
CN101694772B (en) * 2009-10-21 2014-07-30 北京中星微电子有限公司 Method for converting text into rap music and device thereof
CN101694772A (en) * 2009-10-21 2010-04-14 北京中星微电子有限公司 Method for converting text into rap music and device thereof
CN103035235A (en) * 2011-09-30 2013-04-10 西门子公司 Method and device for transforming voice into melody
US20130144626A1 (en) * 2011-12-04 2013-06-06 David Shau Rap music generation
US20150120308A1 (en) * 2012-03-29 2015-04-30 Smule, Inc. Computationally-Assisted Musical Sequencing and/or Composition Techniques for Social Music Challenge or Competition
US10262644B2 (en) 2012-03-29 2019-04-16 Smule, Inc. Computationally-assisted musical sequencing and/or composition techniques for social music challenge or competition
US9666199B2 (en) 2012-03-29 2017-05-30 Smule, Inc. Automatic conversion of speech into song, rap, or other audible expression having target meter or rhythm
US20200105281A1 (en) * 2012-03-29 2020-04-02 Smule, Inc. Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm
JP2015515647A (en) * 2012-03-29 2015-05-28 スミュール, インク.Smule, Inc. Automatic utterance conversion to songs, rap, or other audible expressions with the desired time signature or rhythm
US9324330B2 (en) 2012-03-29 2016-04-26 Smule, Inc. Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm
US10290307B2 (en) 2012-03-29 2019-05-14 Smule, Inc. Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm
WO2013149188A1 (en) * 2012-03-29 2013-10-03 Smule, Inc. Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm
CN103680500B (en) * 2012-08-29 2018-10-16 北京百度网讯科技有限公司 A kind of method and apparatus of speech recognition
CN103680500A (en) * 2012-08-29 2014-03-26 北京百度网讯科技有限公司 Speech recognition method and device
US11264058B2 (en) * 2012-12-12 2022-03-01 Smule, Inc. Audiovisual capture and sharing framework with coordinated, user-selectable audio and video effects filters
US9459768B2 (en) 2012-12-12 2016-10-04 Smule, Inc. Audiovisual capture and sharing framework with coordinated user-selectable audio and video effects filters
US10607650B2 (en) 2012-12-12 2020-03-31 Smule, Inc. Coordinated audio and video capture and sharing framework
US20170125057A1 (en) * 2012-12-12 2017-05-04 Smule, Inc. Audiovisual capture and sharing framework with coordinated, user-selectable audio and video effects filters
CN103915093A (en) * 2012-12-31 2014-07-09 安徽科大讯飞信息科技股份有限公司 Method and device for realizing voice singing
CN103915093B (en) * 2012-12-31 2019-07-30 科大讯飞股份有限公司 A kind of method and apparatus for realizing singing of voice
CN105632474A (en) * 2014-11-20 2016-06-01 卡西欧计算机株式会社 Automatic composition apparatus and method and storage medium
CN105976802A (en) * 2016-04-22 2016-09-28 成都涂鸦科技有限公司 Music automatic generation system based on machine learning technology
CN105931625A (en) * 2016-04-22 2016-09-07 成都涂鸦科技有限公司 Rap music automatic generation method based on character input
CN105931624A (en) * 2016-04-22 2016-09-07 成都涂鸦科技有限公司 Rap music automatic generation method based on voice input
WO2017190674A1 (en) * 2016-05-04 2017-11-09 腾讯科技(深圳)有限公司 Method and device for processing audio data, and computer storage medium
US10789290B2 (en) 2016-05-04 2020-09-29 Tencent Technology (Shenzhen) Company Limited Audio data processing method and apparatus, and computer storage medium
CN106328176A (en) * 2016-08-15 2017-01-11 广州酷狗计算机科技有限公司 Method and device for generating song audio
CN106328176B (en) * 2016-08-15 2019-04-30 广州酷狗计算机科技有限公司 A kind of method and apparatus generating song audio
CN107978318A (en) * 2016-10-21 2018-05-01 咪咕音乐有限公司 A kind of real-time sound mixing method and device
US10468050B2 (en) 2017-03-29 2019-11-05 Microsoft Technology Licensing, Llc Voice synthesized participatory rhyming chat bot
CN107818792A (en) * 2017-10-25 2018-03-20 北京奇虎科技有限公司 Audio conversion method and device
CN110634465A (en) * 2018-06-25 2019-12-31 阿里巴巴集团控股有限公司 Music matching method, mobile terminal, data processing method and music matching system
CN109949824A (en) * 2019-01-24 2019-06-28 江南大学 City sound event classification method based on N-DenseNet and higher-dimension mfcc feature
CN111696498B (en) * 2019-03-14 2023-08-15 卡西欧计算机株式会社 Keyboard musical instrument and computer-implemented method of keyboard musical instrument
CN111696498A (en) * 2019-03-14 2020-09-22 卡西欧计算机株式会社 Keyboard musical instrument and computer-implemented method of keyboard musical instrument
CN110390925B (en) * 2019-08-02 2021-08-10 湖南国声声学科技股份有限公司深圳分公司 Method for synchronizing voice and accompaniment, terminal, Bluetooth device and storage medium
CN110390925A (en) * 2019-08-02 2019-10-29 湖南国声声学科技股份有限公司深圳分公司 Voice and accompaniment synchronous method, terminal, bluetooth equipment and storage medium
CN112417201A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Audio information pushing method and system, electronic equipment and computer readable medium
CN110838286B (en) * 2019-11-19 2024-05-03 腾讯科技(深圳)有限公司 Model training method, language identification method, device and equipment
CN110838286A (en) * 2019-11-19 2020-02-25 腾讯科技(深圳)有限公司 Model training method, language identification method, device and equipment
CN112825244B (en) * 2019-11-21 2024-04-12 阿里巴巴集团控股有限公司 Music audio generation method and device
CN112825244A (en) * 2019-11-21 2021-05-21 阿里巴巴集团控股有限公司 Dubbing music audio generation method and apparatus
CN112885318A (en) * 2019-11-29 2021-06-01 阿里巴巴集团控股有限公司 Multimedia data generation method and device, electronic equipment and computer storage medium
CN111276115A (en) * 2020-01-14 2020-06-12 孙志鹏 Cloud beat
CN111402843B (en) * 2020-03-23 2021-06-11 北京字节跳动网络技术有限公司 Rap music generation method and device, readable medium and electronic equipment
CN111402843A (en) * 2020-03-23 2020-07-10 北京字节跳动网络技术有限公司 Rap music generation method and device, readable medium and electronic equipment
CN111681637A (en) * 2020-04-28 2020-09-18 平安科技(深圳)有限公司 Song synthesis method, device, equipment and storage medium
CN111681637B (en) * 2020-04-28 2024-03-22 平安科技(深圳)有限公司 Song synthesis method, device, equipment and storage medium
WO2021218138A1 (en) * 2020-04-28 2021-11-04 平安科技(深圳)有限公司 Song synthesis method, apparatus and device, and storage medium
CN111862913A (en) * 2020-07-16 2020-10-30 广州市百果园信息技术有限公司 Method, device, equipment and storage medium for converting voice into rap music
WO2022012164A1 (en) * 2020-07-16 2022-01-20 百果园技术(新加坡)有限公司 Method and apparatus for converting voice into rap music, device, and storage medium
CN111862913B (en) * 2020-07-16 2023-09-05 广州市百果园信息技术有限公司 Method, device, equipment and storage medium for converting voice into rap music
CN112669849A (en) * 2020-12-18 2021-04-16 百度国际科技(深圳)有限公司 Method, apparatus, device and storage medium for outputting information
CN112712783B (en) * 2020-12-21 2023-09-29 北京百度网讯科技有限公司 Method and device for generating music, computer equipment and medium
CN112712783A (en) * 2020-12-21 2021-04-27 北京百度网讯科技有限公司 Method and apparatus for generating music, computer device and medium
CN113488010B (en) * 2021-06-25 2024-01-02 北京达佳互联信息技术有限公司 Music data generation method, device, equipment and storage medium
CN113488010A (en) * 2021-06-25 2021-10-08 北京达佳互联信息技术有限公司 Music data generation method, device, equipment and storage medium
CN114822492A (en) * 2022-06-28 2022-07-29 北京达佳互联信息技术有限公司 Speech synthesis method and device, electronic equipment and computer readable storage medium
CN114822492B (en) * 2022-06-28 2022-10-28 北京达佳互联信息技术有限公司 Speech synthesis method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN101399036B (en) 2013-05-29

Similar Documents

Publication Publication Date Title
CN101399036B (en) Device and method for conversing voice to be rap music
JP6547878B1 (en) Electronic musical instrument, control method of electronic musical instrument, and program
JP6610715B1 (en) Electronic musical instrument, electronic musical instrument control method, and program
JP6610714B1 (en) Electronic musical instrument, electronic musical instrument control method, and program
Durrieu et al. A musically motivated mid-level representation for pitch estimation and musical audio source separation
Mauch et al. Integrating additional chord information into HMM-based lyrics-to-audio alignment
US20210193114A1 (en) Electronic musical instruments, method and storage media
CN105788589A (en) Audio data processing method and device
Umbert et al. Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges
Molina et al. SiPTH: Singing transcription based on hysteresis defined on the pitch-time curve
JP7036141B2 (en) Electronic musical instruments, methods and programs
US11854521B2 (en) Electronic musical instruments, method and storage media
Mesaros Singing voice identification and lyrics transcription for music information retrieval invited paper
Wada et al. Sequential generation of singing f0 contours from musical note sequences based on wavenet
Ryynänen Singing transcription
Shih et al. A statistical multidimensional humming transcription using phone level hidden Markov models for query by humming systems
Wang et al. Musicyolo: A vision-based framework for automatic singing transcription
JP6801766B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
JP6835182B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
JP6819732B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
JP2001117598A (en) Device and method for voice conversion
Mesaros Singing voice recognition for music information retrieval
Ryynänen Automatic transcription of pitch content in music and selected applications
JP2007225916A (en) Authoring apparatus, authoring method and program
WO2022054496A1 (en) Electronic musical instrument, electronic musical instrument control method, and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant