Detailed Description of Embodiments
Before describing embodiments of the invention in detail, it should be noted that the embodiments reside primarily in combinations of method steps and apparatus components related to synthesizing speech from an input string. Accordingly, the apparatus components and method steps have been represented, where appropriate, by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention, so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
In this document, relational terms such as first and second, top and bottom, and the like are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship between such entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by "comprises a..." does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Referring to FIG. 1, a schematic diagram illustrates an electronic device in the form of a mobile telephone 100 according to some embodiments of the present invention. The mobile telephone 100 comprises a radio frequency communications unit 102 coupled to be in communication with a processor 103 via a common data and address bus 117. The telephone 100 also has a keypad 106 and a display screen 105, which may be, for example, a touch screen coupled to be in communication with the processor 103.
The processor 103 includes an encoder/decoder 111 with an associated code Read Only Memory (ROM) 112 for storing data used to encode and decode voice or other signals that may be transmitted or received by the mobile telephone 100. The processor 103 further includes a microprocessor 113 coupled, via the common data and address bus 117, to the encoder/decoder 111, a character Read Only Memory (ROM) 114, a Random Access Memory (RAM) 104, programmable memory 116, and a Subscriber Identity Module (SIM) interface 118. The programmable memory 116, together with a SIM operatively coupled to the SIM interface 118, each can store, among other things, a telephone number database (TND) comprising a number field for telephone numbers and a name field for identifiers associated uniquely with the telephone numbers in the number field.
The radio frequency communications unit 102 is a combined receiver and transmitter having a common antenna. The communications unit 102 has a transceiver 108 coupled to the antenna 107 via a radio frequency amplifier 109. The transceiver 108 is also coupled to a combined modulator/demodulator 110, which in turn is coupled to the encoder/decoder 111.
The microprocessor 113 has ports for coupling to the keypad 106 and to the display screen 105. The microprocessor 113 also has ports for coupling to an alert module 115, which typically contains an alert speaker, a vibrator motor, and associated drivers, as well as ports for coupling to a microphone 120 and a communications speaker 122. The character ROM 114 stores code for encoding and decoding data, such as control channel messages, that may be transmitted or received by the communications unit 102. In some embodiments of the present invention, the character ROM 114, the programmable memory 116, or the SIM may also store operating code (OC) for the microprocessor 113 and code for performing functions associated with the mobile telephone 100. For example, the programmable memory 116 may comprise speech synthesis service program code components 125 configured to cause execution of a method for synthesizing speech from an input string.
Some embodiments of the present invention thus comprise a method of synthesizing speech from an input string using the mobile telephone 100. The input string can be, for example, a text message or an email comprising a text string received at the mobile telephone 100. The method comprises processing the input string to provide a sequence of acoustic parameters. A sequence of sets of candidate micro-segments is then generated from a speech corpus using the sequence of acoustic parameters. Next, a preferred sequence of micro-segments is determined for the sequence of acoustic parameters from the sequence of sets of candidate micro-segments. Finally, the micro-segments in the preferred sequence of micro-segments are concatenated to generate synthesized speech.
Some embodiments of the present invention therefore enable speech synthesis using micro-segments and a sequence of acoustic parameters representing a target acoustic model, rather than the phonemes or diphones that are generally used. A micro-segment can be a speech segment of any length, but is generally shorter than a phoneme or diphone. For example, a micro-segment can be a 20 ms speech frame, whereas the speech segment of a phoneme generally comprises several such speech frames. Because speech segments synthesized by concatenating micro-segments can provide more frequency and prosody variation than speech segments synthesized from phonemes or diphones, the overall sound quality of a text-to-speech (TTS) system can be improved.
Referring to FIG. 2, a flow diagram illustrates a method 200 for synthesizing speech from an input string 205 according to some embodiments of the present invention. First, the input string 205 is processed to provide a sequence of acoustic parameters 230. A sequence 240 of sets 235 of candidate micro-segments is then generated from a speech corpus using the sequence of acoustic parameters 230. Next, a preferred sequence 245 of micro-segments is determined for the sequence of acoustic parameters 230 from the sequence 240 of sets 235 of candidate micro-segments. Finally, the micro-segments in the preferred sequence 245 are concatenated to generate a synthesized speech signal 250. For example, speech frames 255 corresponding to the micro-segments in the preferred sequence 245 can be loaded into the RAM 104 of the mobile telephone 100, and then concatenated and played over the communications speaker 122 to generate the synthesized speech signal 250.
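The four stages described above can be sketched in a few lines of Python. This is a minimal illustration only, not an implementation of the patented method: each acoustic parameter is reduced to a single target value, each corpus segment to a single feature value, and the function names and the greedy selection in stage three are invented for the sketch (the text itself calls for a Viterbi search over combined costs).

```python
# Minimal sketch of the four stages (hypothetical helper names; each
# acoustic parameter and micro-segment is a single number for illustration).
def synthesize(acoustic_params, corpus, n_candidates=3):
    # Stage 2: for each target parameter, collect a set of candidate
    # micro-segments from the corpus that most closely match it.
    candidate_sets = [
        sorted(corpus, key=lambda seg: abs(seg - target))[:n_candidates]
        for target in acoustic_params
    ]
    # Stage 3: pick a preferred sequence (greedily here; the described
    # method uses a search over target plus concatenation costs).
    preferred = [cands[0] for cands in candidate_sets]
    # Stage 4: concatenate the selected micro-segments.
    return preferred

corpus = [0.1, 0.4, 0.5, 0.9, 1.2]
print(synthesize([0.48, 1.0], corpus))  # -> [0.5, 0.9]
```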
Referring to FIG. 3, a general flow diagram further illustrates a method 300 for synthesizing speech from an input string according to some embodiments of the present invention. At step 305, the input string is processed to provide a sequence of acoustic parameters. For example, the acoustic parameters in the sequence of acoustic parameters 230 can comprise spectrum parameters, pitch parameters, and energy parameters.
According to some embodiments of the present invention, the acoustic parameters, also referred to as target speech units, are generated from the input string using prosodic positions. For example, a prosodic position can comprise the position of a syllable in a word and the position of that word in a sentence.
The spectrum parameters can be modeled using known spectral feature representation methods, including, for example, Linear Predictive Coding (LPC), Line Spectral Pairs (LSP), or Mel-Frequency Cepstral Coefficient (MFCC) methods. The spectrum parameters of a phoneme can thus be determined using its prosodic position. For example, a spectral model such as a Gaussian Mixture Model (GMM) can be used to map acoustic features of a phoneme, such as its prosodic position, to spectrum parameters. The pitch parameters can be determined using a pitch model, in which the pitch contour of a syllable is defined according to the prosodic position of the syllable. The pitch model can comprise pitch contour patterns, for example WO_stress, WO_unstress, WF_stress, WF_unstress, and WS.
For the energy parameters, different strategies can be used for the voiced part and the unvoiced part of a syllable. For the voiced part, energy contour patterns can be defined for the syllable. Different energy contours for the syllable can be defined conditioned on the position of a CV-like unit in the syllable and/or on whether the syllable is stressed. For the unvoiced part, energy contour patterns can be defined per phoneme, and each (unvoiced) phoneme can have one or more energy contour patterns. The energy contour of an unvoiced phoneme can depend on the position of the phoneme in the syllable and the position of the syllable in the word. To reduce the amount of memory required, phonemes can share the same energy contour pattern if those (unvoiced) phonemes have similar positions and similar articulation patterns. For example, the phonemes "s", "sh" and "ch" can share one energy contour pattern, and likewise "g", "d" and "k" can share another.
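The sharing of energy contour patterns among unvoiced phonemes can be illustrated as a simple two-level lookup. The pattern names and contour values below are invented for illustration; only the grouping of "s"/"sh"/"ch" and "g"/"d"/"k" comes from the text above.

```python
# Hypothetical illustration of unvoiced phonemes sharing energy contour
# patterns; the contour values themselves are invented.
SHARED_CONTOURS = {
    "fricative_like": [0.2, 0.6, 0.6, 0.3],  # shared by 's', 'sh', 'ch'
    "stop_like": [0.0, 0.8, 0.2, 0.0],       # shared by 'g', 'd', 'k'
}

PHONEME_TO_CONTOUR = {
    "s": "fricative_like", "sh": "fricative_like", "ch": "fricative_like",
    "g": "stop_like", "d": "stop_like", "k": "stop_like",
}

def energy_contour(phoneme):
    return SHARED_CONTOURS[PHONEME_TO_CONTOUR[phoneme]]

# 's' and 'ch' resolve to the very same stored pattern, reducing memory use.
print(energy_contour("s") is energy_contour("ch"))  # -> True
```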
At step 310, a sequence of sets of candidate micro-segments is generated from a speech corpus 315 using the sequence of acoustic parameters. According to some embodiments of the present invention, the sets of candidate micro-segments can be generated using a target cost function and a duration model. For example, the target cost function can be a weighted sum of a spectrum cost, a pitch cost, and an energy cost, where a lower target cost can mean that the acoustic features of a candidate micro-segment closely match the acoustic parameter. For example, for each acoustic parameter in the sequence of acoustic parameters 230, the mobile telephone 100 can search the speech corpus 315 to find a set of candidate micro-segments (e.g., speech frames) having acoustic features that closely match the acoustic parameter and its estimated duration. The closely matching speech frames can then be selected to generate the sequence 240 of sets 235 of candidate micro-segments.
To reduce processing time, the speech frames in the speech corpus 315 can be classified into several sets of speech frames using the prosodic positions of the speech frames, and candidate micro-segments can be searched for within the one of the sets of speech frames whose prosodic position closely matches that of the acoustic parameter.
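This prosodic-position bucketing can be sketched as follows. The prosodic position labels and scalar frame features below are invented; the point of the sketch is only that candidate search scans one bucket instead of the whole corpus.

```python
from collections import defaultdict

# Sketch of classifying corpus speech frames by prosodic position so that
# candidate search only scans one bucket (positions/features are invented).
def build_index(frames):
    index = defaultdict(list)
    for prosodic_position, feature in frames:
        index[prosodic_position].append(feature)
    return index

def find_candidates(index, prosodic_position, target_feature, n=2):
    bucket = index.get(prosodic_position, [])
    return sorted(bucket, key=lambda f: abs(f - target_feature))[:n]

frames = [("word_initial", 0.2), ("word_initial", 0.7),
          ("word_final", 0.5), ("word_final", 0.9)]
index = build_index(frames)
print(find_candidates(index, "word_final", 0.8))  # -> [0.9, 0.5]
```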
At step 320, a preferred sequence of micro-segments is determined for the sequence of acoustic parameters from the sets of candidate micro-segments. For example, a Viterbi algorithm can be used to determine the preferred sequence 245 of micro-segments, where the path cost function of the Viterbi algorithm can be the sum of a target cost function and a concatenation cost function.
According to some embodiments of the present invention, the target cost function can be a weighted sum of a spectrum cost function, a pitch cost function, and an energy cost function. For example, the spectrum cost function can be a measure of the degree of difference in spectral features between a candidate micro-segment and an acoustic parameter (also referred to as a target micro-segment) in the sequence of acoustic parameters 230. Similarly, the pitch cost function and the energy cost function can measure the degree of difference in pitch and energy features, respectively, between the acoustic parameter and the candidate micro-segment. For example, the target cost function can be defined as follows:
C_T(u_{i,k}) = K_T^S · C_T^S(u_{i,k}) + K_T^P · C_T^P(u_{i,k}) + K_T^E · C_T^E(u_{i,k})    (Equation 1)

where u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter in the sequence of acoustic parameters 230, C_T(u_{i,k}) is the target cost function, C_T^S(u_{i,k}) is the spectrum cost function, C_T^P(u_{i,k}) is the pitch cost function, C_T^E(u_{i,k}) is the energy cost function, and K_T^S, K_T^P and K_T^E are weight values.
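Equation 1 translates directly into a weighted sum. In the sketch below the three component costs are taken as given scalars and the weight values are invented; a real system would compute the components from spectral, pitch, and energy features.

```python
# Target cost of Equation 1: a weighted sum of the spectrum, pitch and
# energy costs of a candidate. Component costs and weights are invented.
def target_cost(spectrum_cost, pitch_cost, energy_cost,
                k_s=0.5, k_p=0.3, k_e=0.2):
    return k_s * spectrum_cost + k_p * pitch_cost + k_e * energy_cost

print(round(target_cost(1.0, 2.0, 3.0), 6))  # -> 1.7
```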
The concatenation cost function can be a weighted sum of a spectrum difference function, a pitch difference function, and an energy difference function. The spectrum difference function can measure the degree of difference in spectral features between two adjacent micro-segments. Likewise, the pitch difference function and the energy difference function can measure the degree of difference in pitch and energy features, respectively, between two adjacent micro-segments. For example, the concatenation cost function can be defined as follows:
C_C(u_{i-1,j}, u_{i,k}) = K_C^S · C_C^S(u_{i-1,j}, u_{i,k}) + K_C^P · C_C^P(u_{i-1,j}, u_{i,k}) + K_C^E · C_C^E(u_{i-1,j}, u_{i,k})    (Equation 2)

where u_{i-1,j} is the j-th candidate micro-segment for the (i-1)-th acoustic parameter in the sequence of acoustic parameters 230, u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter, C_C(u_{i-1,j}, u_{i,k}) is the concatenation cost function, C_C^S(u_{i-1,j}, u_{i,k}) is the spectrum difference function between u_{i-1,j} and u_{i,k}, C_C^P(u_{i-1,j}, u_{i,k}) is the pitch difference function between u_{i-1,j} and u_{i,k}, C_C^E(u_{i-1,j}, u_{i,k}) is the energy difference function between u_{i-1,j} and u_{i,k}, and K_C^S, K_C^P and K_C^E are weight values.
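A Viterbi search over the sum of a target cost and a concatenation cost can be sketched as follows. This is a toy reduction, not the patented method: each candidate is a single scalar feature, the target cost collapses to |candidate − target| and the concatenation cost to |candidate − previous candidate|, and the weight w_concat is invented.

```python
# Sketch of a Viterbi search whose path cost is target cost (Equation 1,
# reduced here to |candidate - target|) plus concatenation cost (Equation 2,
# reduced to |candidate - previous|). Features are invented scalars.
def viterbi_select(targets, candidate_sets, w_concat=1.0):
    # best[k] = (accumulated cost, path) for paths ending at candidate k
    best = [(abs(c - targets[0]), [c]) for c in candidate_sets[0]]
    for target, cands in zip(targets[1:], candidate_sets[1:]):
        best = [
            min(
                (cost + abs(c - target) + w_concat * abs(c - path[-1]),
                 path + [c])
                for cost, path in best
            )
            for c in cands
        ]
    return min(best)[1]

# a heavier concatenation weight favors starting at 0.6 (worse target
# match) because it joins more smoothly onto the following 0.5
print(viterbi_select([0.0, 0.5], [[0.0, 0.6], [0.5, 0.45]], w_concat=2.0))
# -> [0.6, 0.5]
```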
Then, at step 325, the micro-segments in the preferred sequence of micro-segments are concatenated to generate synthesized speech.
Referring to FIG. 4, a general flow diagram illustrates sub-steps of step 305 of the method 300, in which the input string is processed to provide a sequence of acoustic parameters, according to some embodiments of the present invention. At step 405, the input string is processed to provide a phoneme sequence. For example, the input string 205 can be a text message or email message received at the mobile telephone 100, and the phoneme sequence can be a string representing the pronunciation of the text message in the form of a phonetic alphabet.
At step 410, syllable boundaries are determined in the phoneme sequence to provide a syllable sequence. For example, an English word may comprise several syllables, and the syllable boundaries within the word are determined to provide the syllable sequence. For example, the phoneme sequence "ihksplehn", representing the English word "explain", can be divided into a syllable sequence comprising the two syllables "ihk" and "splehn".
Then, at step 415, sub-syllable units are identified in the syllable sequence to provide a sub-syllable sequence. A sub-syllable unit can be equal to or smaller than a syllable, and can be a CV-like speech unit (which can comprise a consonant and a vowel). The sub-syllable sequence can thus comprise CV-like speech units and consonants. For example, in the syllable sequence ("ihk" + "splehn"), two CV-like speech units ("ih" and "lehn") can be identified. The corresponding sub-syllable sequence can then be ("ih" + "k" + "s" + "p" + "lehn").
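The splitting of a syllable into CV-like units and residual consonants can be sketched as a greedy longest match against an inventory of known CV-like units. The two-entry inventory below is invented to reproduce the "explain" example above; a real system would derive the inventory from a lexicon.

```python
# Sketch of splitting syllables into CV-like units and lone consonants by
# greedy longest match. The CV-unit inventory here is invented for the
# "explain" example only.
CV_UNITS = {("ih",), ("l", "eh", "n")}

def sub_syllables(syllable):
    units, i = [], 0
    while i < len(syllable):
        # try the longest CV-like unit starting at position i
        for j in range(len(syllable), i, -1):
            if tuple(syllable[i:j]) in CV_UNITS:
                units.append("".join(syllable[i:j]))
                i = j
                break
        else:
            units.append(syllable[i])  # no match: emit a lone consonant
            i += 1
    return units

# "explain" -> syllables ("ih","k") and ("s","p","l","eh","n")
print(sub_syllables(["ih", "k"]) + sub_syllables(["s", "p", "l", "eh", "n"]))
# -> ['ih', 'k', 's', 'p', 'lehn']
```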
According to some embodiments of the present invention, representing the pronunciation of the input text using CV-like speech units can reduce the number of basic units needed to describe words. For example, a dictionary comprising 202,000 words may comprise 24,980 syllables, but only 6,707 CV-like units.
Then, at step 420, the sub-syllable sequence is processed to provide a sequence of micro-segment descriptions. For example, by using a duration model to estimate the duration of each element in the sub-syllable sequence, the number of micro-segments needed to synthesize speech for each element can be estimated. For example, consider the CV-like speech unit (sub-syllable) "ih". If the estimated duration of the CV-like speech unit is approximately equal to five micro-segments, the sub-syllable can be mapped to five micro-segment descriptions as follows:

ih → ih_f ih_f ih_f ih_f ih_f,

where ih_f is a micro-segment description.
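The mapping of a sub-syllable to a run of micro-segment descriptions, given its estimated duration, can be sketched as follows; the 20 ms frame length is taken from the earlier example, and the function name is invented.

```python
# Sketch of step 420: expand a sub-syllable into as many micro-segment
# descriptions as its estimated duration covers (frame length assumed 20 ms).
def describe(sub_syllable, estimated_duration_ms, frame_ms=20):
    n_frames = round(estimated_duration_ms / frame_ms)
    return [sub_syllable + "_f"] * n_frames

# an 'ih' estimated at about 100 ms maps to five micro-segment descriptions
print(describe("ih", 100))  # -> ['ih_f', 'ih_f', 'ih_f', 'ih_f', 'ih_f']
```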
According to some embodiments of the present invention, the estimated duration of a sub-syllable can be obtained using a duration model comprising average durations of phonemes and prosodic attributes of the phonemes. For example, the duration of a phoneme p can be obtained according to the following equation:

L_p = k × L_avg    (Equation 3)

where L_p is the estimated duration of the phoneme p, L_avg is the average phoneme duration of the phoneme p, and k is a prosodic attribute coefficient obtained from factors including the number of phonemes in the syllable containing the phoneme p, the number of syllables in the word containing that syllable, and the type of the phoneme p.
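Equation 3 can be sketched directly. The average durations and the rule for deriving the coefficient k below are invented for illustration; the text only specifies which factors k depends on, not its functional form.

```python
# Sketch of the duration model of Equation 3: L_p = k * L_avg. The average
# durations and the formula for the coefficient k are invented.
AVG_DURATION_MS = {"ih": 90, "k": 60, "s": 80}

def coefficient(phonemes_in_syllable, syllables_in_word):
    # invented rule: phonemes shorten in longer syllables and longer words
    return 1.0 / (1.0 + 0.1 * (phonemes_in_syllable - 1)
                  + 0.05 * (syllables_in_word - 1))

def estimated_duration(phoneme, phonemes_in_syllable, syllables_in_word):
    k = coefficient(phonemes_in_syllable, syllables_in_word)
    return k * AVG_DURATION_MS[phoneme]

# 'ih' in a 2-phoneme syllable of a 2-syllable word (as in "explain")
print(round(estimated_duration("ih", 2, 2)))  # -> 78
```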
Then, at step 425, the sequence of micro-segment descriptions is processed to provide the sequence of acoustic parameters. For example, each micro-segment description in the sequence can be mapped to acoustic parameters describing the acoustic features of that micro-segment description, where the acoustic features can be, for example, spectral (frequency) features and prosodic features (pitch, energy, or duration). The sequence of micro-segment descriptions can comprise a plurality of micro-segment descriptions, each of which is generally a description of a speech micro-segment smaller than a phoneme. For each micro-segment description in the sequence, the acoustic parameters can be estimated using acoustic models. For example, the acoustic parameters can comprise a spectrum parameter s_n, a pitch parameter p_n, and an energy parameter e_n.
Referring to FIG. 5, a diagram illustrates a pitch model used according to some embodiments of the present invention, comprising five normalized pitch contour patterns: WO_stress 505, WO_unstress 510, WF_stress 515, WF_unstress 520, and WS 525. The WO_stress 505 pitch contour pattern defines the pitch contour of a stressed syllable located at the beginning or middle of a word having a plurality of syllables. The WO_unstress 510 pitch contour pattern defines the pitch contour of an unstressed syllable located at the beginning or middle of a word having a plurality of syllables. The WF_stress 515 pitch contour pattern defines the pitch contour of a stressed syllable located at the end of a word having a plurality of syllables. The WF_unstress 520 pitch contour pattern defines the pitch contour of an unstressed syllable located at the end of a word having a plurality of syllables. The WS 525 pitch contour pattern defines the pitch contour of the syllable in a word having only one syllable.
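The selection among the five pitch contour patterns reduces to a single-syllable check plus two binary attributes (word position and stress), which can be sketched as:

```python
# Sketch of selecting one of the five pitch contour patterns from the
# prosodic position and stress of a syllable.
def pitch_pattern(syllables_in_word, is_word_final, is_stressed):
    if syllables_in_word == 1:
        return "WS"
    position = "WF" if is_word_final else "WO"
    stress = "stress" if is_stressed else "unstress"
    return f"{position}_{stress}"

print(pitch_pattern(1, True, True))    # -> WS
print(pitch_pattern(2, False, True))   # -> WO_stress
print(pitch_pattern(2, True, False))   # -> WF_unstress
```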
Advantages of some embodiments of the present invention thus include improved sound quality of synthesized speech. Compared with speech segments synthesized by concatenating phonemes or diphones, speech segments synthesized by concatenating micro-segments can provide improved speech continuity and more prosodic variation. The overall sound quality of a TTS system can thus be improved, which is particularly beneficial in resource-limited portable devices such as mobile telephones and Personal Digital Assistants (PDAs).
It will be appreciated that the embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of synthesizing speech from an input string described herein. The non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as steps of a method to synthesize speech from an input string. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more Application Specific Integrated Circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill in the art, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such software instructions, programs and ICs with minimal experimentation.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art will appreciate that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims, including any amendments made during the pendency of this application and all equivalents of those claims as issued.