Detailed Description of Embodiments
Before describing embodiments of the invention in detail, it should be noted that the embodiments reside primarily in combinations of method steps and apparatus components related to synthesizing speech from an input string. Accordingly, the apparatus components and method steps have been represented, where appropriate, by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention, so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
In this document, relational terms such as first and second, top and bottom, and the like are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship between such entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by "comprises a..." does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Referring to FIG. 1, a schematic diagram illustrates an electronic device in the form of a mobile telephone 100 according to some embodiments of the present invention. The mobile telephone 100 comprises a radio frequency communications unit 102 coupled to be in communication with a processor 103 via a common data and address bus 117. The telephone 100 also has a keypad 106 and a display screen 105, which may be, for example, a touch screen coupled to be in communication with the processor 103.
The processor 103 includes an encoder/decoder 111 with an associated code Read Only Memory (ROM) 112 for storing data used to encode and decode voice or other signals that may be transmitted or received by the mobile telephone 100. The processor 103 further includes a microprocessor 113 coupled, via the common data and address bus 117, to the encoder/decoder 111, a character Read Only Memory (ROM) 114, a Random Access Memory (RAM) 104, programmable memory 116, and a Subscriber Identity Module (SIM) interface 118. The programmable memory 116, together with a SIM operatively coupled to the SIM interface 118, each can store, among other things, a telephone number database (TND) comprising a number field for telephone numbers and a name field for identifiers associated uniquely with the telephone numbers in the number field.
The radio frequency communications unit 102 is a combined receiver and transmitter having a common antenna. The communications unit 102 has a transceiver 108 coupled to the antenna 107 via a radio frequency amplifier 109. The transceiver 108 is also coupled to a combined modulator/demodulator 110, which in turn is coupled to the encoder/decoder 111.
The microprocessor 113 has ports for coupling to the keypad 106 and to the display screen 105. The microprocessor 113 also has ports for coupling to an alert module 115, which typically contains an alert speaker, a vibrator motor, and associated drivers, as well as ports for coupling to a microphone 120 and a communications speaker 122. The character ROM 114 stores code for encoding and decoding data, such as control channel messages, that may be transmitted or received by the communications unit 102. In some embodiments of the present invention, the character ROM 114, the programmable memory 116, or the SIM may also store operating code (OC) for the microprocessor 113 and code for performing functions associated with the mobile telephone 100. For example, the programmable memory 116 may comprise speech synthesis service program code components 125 configured to cause execution of a method for synthesizing speech from an input string.
Some embodiments of the present invention thus comprise a method of synthesizing speech from an input string using the mobile telephone 100. The input string can be, for example, a text message or an email comprising a text string received at the mobile telephone 100. The method comprises processing the input string to provide a sequence of acoustic parameters. A sequence of sets of candidate micro-segments is then generated from a speech corpus using the sequence of acoustic parameters. Next, a preferred sequence of micro-segments is determined for the sequence of acoustic parameters from the sequence of sets of candidate micro-segments. Finally, the micro-segments in the preferred sequence of micro-segments are concatenated to generate synthesized speech.
Some embodiments of the present invention therefore enable speech synthesis using micro-segments and a sequence of acoustic parameters representing a target acoustic model, rather than the phonemes or diphones that are generally used. A micro-segment can be a speech segment of any length, but is generally shorter than a phoneme or diphone. For example, a micro-segment can be a 20 ms speech frame, whereas the speech segment of a phoneme generally comprises several such speech frames. Because speech segments synthesized by concatenating micro-segments can provide more frequency and prosody variation than speech segments synthesized from phonemes or diphones, the overall sound quality of a text-to-speech (TTS) system can be improved.
Referring to FIG. 2, a flow diagram illustrates a method 200 for synthesizing speech from an input string 205 according to some embodiments of the present invention. First, the input string 205 is processed to provide a sequence of acoustic parameters 230. A sequence 240 of sets 235 of candidate micro-segments is then generated from a speech corpus using the sequence of acoustic parameters 230. Next, a preferred sequence 245 of micro-segments is determined for the sequence of acoustic parameters 230 from the sequence 240 of sets 235 of candidate micro-segments. Finally, the micro-segments in the preferred sequence 245 are concatenated to generate a synthesized speech signal 250. For example, speech frames 255 corresponding to the micro-segments in the preferred sequence 245 can be loaded into the RAM 104 of the mobile telephone 100, and then concatenated and played over the communications speaker 122 to generate the synthesized speech signal 250.
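The four stages described above can be sketched in a few lines of Python. This is a minimal illustration only, not an implementation of the patented method: each acoustic parameter is reduced to a single target value, each corpus segment to a single feature value, and the function names and the greedy selection in stage three are invented for the sketch (the text itself calls for a Viterbi search over combined costs).

```python
# Minimal sketch of the four stages (hypothetical helper names; each
# acoustic parameter and micro-segment is a single number for illustration).
def synthesize(acoustic_params, corpus, n_candidates=3):
    # Stage 2: for each target parameter, collect a set of candidate
    # micro-segments from the corpus that most closely match it.
    candidate_sets = [
        sorted(corpus, key=lambda seg: abs(seg - target))[:n_candidates]
        for target in acoustic_params
    ]
    # Stage 3: pick a preferred sequence (greedily here; the described
    # method uses a search over target plus concatenation costs).
    preferred = [cands[0] for cands in candidate_sets]
    # Stage 4: concatenate the selected micro-segments.
    return preferred

corpus = [0.1, 0.4, 0.5, 0.9, 1.2]
print(synthesize([0.48, 1.0], corpus))  # -> [0.5, 0.9]
```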
Referring to FIG. 3, a general flow diagram further illustrates a method 300 for synthesizing speech from an input string according to some embodiments of the present invention. At step 305, the input string is processed to provide a sequence of acoustic parameters. For example, the acoustic parameters in the sequence of acoustic parameters 230 can comprise spectrum parameters, pitch parameters, and energy parameters.
According to some embodiments of the present invention, the acoustic parameters, also referred to as target speech units, are generated from the input string using prosodic positions. For example, a prosodic position can comprise the position of a syllable in a word and the position of that word in a sentence.
The spectrum parameters can be modeled using known spectral feature representation methods, including, for example, Linear Predictive Coding (LPC), Line Spectral Pairs (LSP), or Mel-Frequency Cepstral Coefficient (MFCC) methods. The spectrum parameters of a phoneme can thus be determined using its prosodic position. For example, a spectral model such as a Gaussian Mixture Model (GMM) can be used to map acoustic features of a phoneme, such as its prosodic position, to spectrum parameters. The pitch parameters can be determined using a pitch model, in which the pitch contour of a syllable is defined according to the prosodic position of the syllable. The pitch model can comprise pitch contour patterns, for example WO_stress, WO_unstress, WF_stress, WF_unstress, and WS.
For the energy parameters, different strategies can be used for the voiced part and the unvoiced part of a syllable. For the voiced part, energy contour patterns can be defined for the syllable. Different energy contours for the syllable can be defined conditioned on the position of a CV-like unit in the syllable and/or on whether the syllable is stressed. For the unvoiced part, energy contour patterns can be defined per phoneme, and each (unvoiced) phoneme can have one or more energy contour patterns. The energy contour of an unvoiced phoneme can depend on the position of the phoneme in the syllable and the position of the syllable in the word. To reduce the amount of memory required, phonemes can share the same energy contour pattern if those (unvoiced) phonemes have similar positions and similar articulation patterns. For example, the phonemes "s", "sh" and "ch" can share one energy contour pattern, and likewise "g", "d" and "k" can share another.
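The sharing of energy contour patterns among unvoiced phonemes can be illustrated as a simple two-level lookup. The pattern names and contour values below are invented for illustration; only the grouping of "s"/"sh"/"ch" and "g"/"d"/"k" comes from the text above.

```python
# Hypothetical illustration of unvoiced phonemes sharing energy contour
# patterns; the contour values themselves are invented.
SHARED_CONTOURS = {
    "fricative_like": [0.2, 0.6, 0.6, 0.3],  # shared by 's', 'sh', 'ch'
    "stop_like": [0.0, 0.8, 0.2, 0.0],       # shared by 'g', 'd', 'k'
}

PHONEME_TO_CONTOUR = {
    "s": "fricative_like", "sh": "fricative_like", "ch": "fricative_like",
    "g": "stop_like", "d": "stop_like", "k": "stop_like",
}

def energy_contour(phoneme):
    return SHARED_CONTOURS[PHONEME_TO_CONTOUR[phoneme]]

# 's' and 'ch' resolve to the very same stored pattern, reducing memory use.
print(energy_contour("s") is energy_contour("ch"))  # -> True
```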
At step 310, a sequence of sets of candidate micro-segments is generated from a speech corpus 315 using the sequence of acoustic parameters. According to some embodiments of the present invention, the sets of candidate micro-segments can be generated using a target cost function and a duration model. For example, the target cost function can be a weighted sum of a spectrum cost, a pitch cost, and an energy cost, where a lower target cost can mean that the acoustic features of a candidate micro-segment closely match the acoustic parameter. For example, for each acoustic parameter in the sequence of acoustic parameters 230, the mobile telephone 100 can search the speech corpus 315 to find a set of candidate micro-segments (e.g., speech frames) having acoustic features that closely match the acoustic parameter and its estimated duration. The closely matching speech frames can then be selected to generate the sequence 240 of sets 235 of candidate micro-segments.
To reduce processing time, the speech frames in the speech corpus 315 can be classified into several sets of speech frames using the prosodic positions of the speech frames, and candidate micro-segments can be searched for within the one of the sets of speech frames whose prosodic position closely matches that of the acoustic parameter.
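This prosodic-position bucketing can be sketched as follows. The prosodic position labels and scalar frame features below are invented; the point of the sketch is only that candidate search scans one bucket instead of the whole corpus.

```python
from collections import defaultdict

# Sketch of classifying corpus speech frames by prosodic position so that
# candidate search only scans one bucket (positions/features are invented).
def build_index(frames):
    index = defaultdict(list)
    for prosodic_position, feature in frames:
        index[prosodic_position].append(feature)
    return index

def find_candidates(index, prosodic_position, target_feature, n=2):
    bucket = index.get(prosodic_position, [])
    return sorted(bucket, key=lambda f: abs(f - target_feature))[:n]

frames = [("word_initial", 0.2), ("word_initial", 0.7),
          ("word_final", 0.5), ("word_final", 0.9)]
index = build_index(frames)
print(find_candidates(index, "word_final", 0.8))  # -> [0.9, 0.5]
```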
At step 320, a preferred sequence of micro-segments is determined for the sequence of acoustic parameters from the sets of candidate micro-segments. For example, a Viterbi algorithm can be used to determine the preferred sequence 245 of micro-segments, where the path cost function of the Viterbi algorithm can be the sum of a target cost function and a concatenation cost function.
According to some embodiments of the present invention, the target cost function can be a weighted sum of a spectrum cost function, a pitch cost function, and an energy cost function. For example, the spectrum cost function can be a measure of the degree of difference in spectral features between a candidate micro-segment and an acoustic parameter (also referred to as a target micro-segment) in the sequence of acoustic parameters 230. Similarly, the pitch cost function and the energy cost function can measure the degree of difference in pitch and energy features, respectively, between the acoustic parameter and the candidate micro-segment. For example, the target cost function can be defined as follows:
C_T(u_{i,k}) = K_T^S · C_T^S(u_{i,k}) + K_T^P · C_T^P(u_{i,k}) + K_T^E · C_T^E(u_{i,k})    (Equation 1)

where u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter in the sequence of acoustic parameters 230, C_T(u_{i,k}) is the target cost function, C_T^S(u_{i,k}) is the spectrum cost function, C_T^P(u_{i,k}) is the pitch cost function, C_T^E(u_{i,k}) is the energy cost function, and K_T^S, K_T^P and K_T^E are weight values.
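Equation 1 translates directly into a weighted sum. In the sketch below the three component costs are taken as given scalars and the weight values are invented; a real system would compute the components from spectral, pitch, and energy features.

```python
# Target cost of Equation 1: a weighted sum of the spectrum, pitch and
# energy costs of a candidate. Component costs and weights are invented.
def target_cost(spectrum_cost, pitch_cost, energy_cost,
                k_s=0.5, k_p=0.3, k_e=0.2):
    return k_s * spectrum_cost + k_p * pitch_cost + k_e * energy_cost

print(round(target_cost(1.0, 2.0, 3.0), 6))  # -> 1.7
```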
The concatenation cost function can be a weighted sum of a spectrum difference function, a pitch difference function, and an energy difference function. The spectrum difference function can measure the degree of difference in spectral features between two adjacent micro-segments. Likewise, the pitch difference function and the energy difference function can measure the degree of difference in pitch and energy features, respectively, between two adjacent micro-segments. For example, the concatenation cost function can be defined as follows:
C_C(u_{i-1,j}, u_{i,k}) = K_C^S · C_C^S(u_{i-1,j}, u_{i,k}) + K_C^P · C_C^P(u_{i-1,j}, u_{i,k}) + K_C^E · C_C^E(u_{i-1,j}, u_{i,k})    (Equation 2)

where u_{i-1,j} is the j-th candidate micro-segment for the (i-1)-th acoustic parameter in the sequence of acoustic parameters 230, u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter, C_C(u_{i-1,j}, u_{i,k}) is the concatenation cost function, C_C^S(u_{i-1,j}, u_{i,k}) is the spectrum difference function between u_{i-1,j} and u_{i,k}, C_C^P(u_{i-1,j}, u_{i,k}) is the pitch difference function between u_{i-1,j} and u_{i,k}, C_C^E(u_{i-1,j}, u_{i,k}) is the energy difference function between u_{i-1,j} and u_{i,k}, and K_C^S, K_C^P and K_C^E are weight values.
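A Viterbi search over the sum of a target cost and a concatenation cost can be sketched as follows. This is a toy reduction, not the patented method: each candidate is a single scalar feature, the target cost collapses to |candidate − target| and the concatenation cost to |candidate − previous candidate|, and the weight w_concat is invented.

```python
# Sketch of a Viterbi search whose path cost is target cost (Equation 1,
# reduced here to |candidate - target|) plus concatenation cost (Equation 2,
# reduced to |candidate - previous|). Features are invented scalars.
def viterbi_select(targets, candidate_sets, w_concat=1.0):
    # best[k] = (accumulated cost, path) for paths ending at candidate k
    best = [(abs(c - targets[0]), [c]) for c in candidate_sets[0]]
    for target, cands in zip(targets[1:], candidate_sets[1:]):
        best = [
            min(
                (cost + abs(c - target) + w_concat * abs(c - path[-1]),
                 path + [c])
                for cost, path in best
            )
            for c in cands
        ]
    return min(best)[1]

# a heavier concatenation weight favors starting at 0.6 (worse target
# match) because it joins more smoothly onto the following 0.5
print(viterbi_select([0.0, 0.5], [[0.0, 0.6], [0.5, 0.45]], w_concat=2.0))
# -> [0.6, 0.5]
```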
Then, at step 325, the micro-segments in the preferred sequence of micro-segments are concatenated to generate synthesized speech.
Referring to FIG. 4, a general flow diagram illustrates sub-steps of step 305 of the method 300, in which the input string is processed to provide a sequence of acoustic parameters, according to some embodiments of the present invention. At step 405, the input string is processed to provide a phoneme sequence. For example, the input string 205 can be a text message or email message received at the mobile telephone 100, and the phoneme sequence can be a string representing the pronunciation of the text message in the form of a phonetic alphabet.
At step 410, syllable boundaries are determined in the phoneme sequence to provide a syllable sequence. For example, an English word may comprise several syllables, and the syllable boundaries within the word are determined to provide the syllable sequence. For example, the phoneme sequence "ihksplehn", representing the English word "explain", can be divided into a syllable sequence comprising the two syllables "ihk" and "splehn".
Then, at step 415, sub-syllable units are identified in the syllable sequence to provide a sub-syllable sequence. A sub-syllable unit can be equal to or smaller than a syllable, and can be a CV-like speech unit (which can comprise a consonant and a vowel). The sub-syllable sequence can thus comprise CV-like speech units and consonants. For example, in the syllable sequence ("ihk" + "splehn"), two CV-like speech units ("ih" and "lehn") can be identified. The corresponding sub-syllable sequence can then be ("ih" + "k" + "s" + "p" + "lehn").
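The splitting of a syllable into CV-like units and residual consonants can be sketched as a greedy longest match against an inventory of known CV-like units. The two-entry inventory below is invented to reproduce the "explain" example above; a real system would derive the inventory from a lexicon.

```python
# Sketch of splitting syllables into CV-like units and lone consonants by
# greedy longest match. The CV-unit inventory here is invented for the
# "explain" example only.
CV_UNITS = {("ih",), ("l", "eh", "n")}

def sub_syllables(syllable):
    units, i = [], 0
    while i < len(syllable):
        # try the longest CV-like unit starting at position i
        for j in range(len(syllable), i, -1):
            if tuple(syllable[i:j]) in CV_UNITS:
                units.append("".join(syllable[i:j]))
                i = j
                break
        else:
            units.append(syllable[i])  # no match: emit a lone consonant
            i += 1
    return units

# "explain" -> syllables ("ih","k") and ("s","p","l","eh","n")
print(sub_syllables(["ih", "k"]) + sub_syllables(["s", "p", "l", "eh", "n"]))
# -> ['ih', 'k', 's', 'p', 'lehn']
```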
According to some embodiments of the present invention, representing the pronunciation of the input text using CV-like speech units can reduce the number of basic units needed to describe words. For example, a dictionary comprising 202,000 words may comprise 24,980 syllables, but only 6,707 CV-like units.
Then, at step 420, the sub-syllable sequence is processed to provide a sequence of micro-segment descriptions. For example, by using a duration model to estimate the duration of each element in the sub-syllable sequence, the number of micro-segments needed to synthesize speech for each element can be estimated. For example, consider the CV-like speech unit (sub-syllable) "ih". If the estimated duration of the CV-like speech unit is approximately equal to five micro-segments, the sub-syllable can be mapped to five micro-segment descriptions as follows:

ih → ih_f ih_f ih_f ih_f ih_f,

where ih_f is a micro-segment description.
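The mapping of a sub-syllable to a run of micro-segment descriptions, given its estimated duration, can be sketched as follows; the 20 ms frame length is taken from the earlier example, and the function name is invented.

```python
# Sketch of step 420: expand a sub-syllable into as many micro-segment
# descriptions as its estimated duration covers (frame length assumed 20 ms).
def describe(sub_syllable, estimated_duration_ms, frame_ms=20):
    n_frames = round(estimated_duration_ms / frame_ms)
    return [sub_syllable + "_f"] * n_frames

# an 'ih' estimated at about 100 ms maps to five micro-segment descriptions
print(describe("ih", 100))  # -> ['ih_f', 'ih_f', 'ih_f', 'ih_f', 'ih_f']
```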
According to some embodiments of the present invention, the estimated duration of a sub-syllable can be obtained using a duration model comprising average durations of phonemes and prosodic attributes of the phonemes. For example, the duration of a phoneme p can be obtained according to the following equation:

L_p = k × L_avg    (Equation 3)

where L_p is the estimated duration of the phoneme p, L_avg is the average phoneme duration of the phoneme p, and k is a prosodic attribute coefficient obtained from factors including the number of phonemes in the syllable containing the phoneme p, the number of syllables in the word containing that syllable, and the type of the phoneme p.
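Equation 3 can be sketched directly. The average durations and the rule for deriving the coefficient k below are invented for illustration; the text only specifies which factors k depends on, not its functional form.

```python
# Sketch of the duration model of Equation 3: L_p = k * L_avg. The average
# durations and the formula for the coefficient k are invented.
AVG_DURATION_MS = {"ih": 90, "k": 60, "s": 80}

def coefficient(phonemes_in_syllable, syllables_in_word):
    # invented rule: phonemes shorten in longer syllables and longer words
    return 1.0 / (1.0 + 0.1 * (phonemes_in_syllable - 1)
                  + 0.05 * (syllables_in_word - 1))

def estimated_duration(phoneme, phonemes_in_syllable, syllables_in_word):
    k = coefficient(phonemes_in_syllable, syllables_in_word)
    return k * AVG_DURATION_MS[phoneme]

# 'ih' in a 2-phoneme syllable of a 2-syllable word (as in "explain")
print(round(estimated_duration("ih", 2, 2)))  # -> 78
```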
Then, at step 425, the sequence of micro-segment descriptions is processed to provide the sequence of acoustic parameters. For example, each micro-segment description in the sequence can be mapped to acoustic parameters describing the acoustic features of that micro-segment description, where the acoustic features can be, for example, spectral (frequency) features and prosodic features (pitch, energy, or duration). The sequence of micro-segment descriptions can comprise a plurality of micro-segment descriptions, each of which is generally a description of a speech micro-segment smaller than a phoneme. For each micro-segment description in the sequence, the acoustic parameters can be estimated using acoustic models. For example, the acoustic parameters can comprise a spectrum parameter s_n, a pitch parameter p_n, and an energy parameter e_n.
Referring to FIG. 5, a diagram illustrates a pitch model used according to some embodiments of the present invention, comprising five normalized pitch contour patterns: WO_stress 505, WO_unstress 510, WF_stress 515, WF_unstress 520, and WS 525. The WO_stress 505 pitch contour pattern defines the pitch contour of a stressed syllable located at the beginning or middle of a word having a plurality of syllables. The WO_unstress 510 pitch contour pattern defines the pitch contour of an unstressed syllable located at the beginning or middle of a word having a plurality of syllables. The WF_stress 515 pitch contour pattern defines the pitch contour of a stressed syllable located at the end of a word having a plurality of syllables. The WF_unstress 520 pitch contour pattern defines the pitch contour of an unstressed syllable located at the end of a word having a plurality of syllables. The WS 525 pitch contour pattern defines the pitch contour of the syllable in a word having only one syllable.
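The selection among the five pitch contour patterns reduces to a single-syllable check plus two binary attributes (word position and stress), which can be sketched as:

```python
# Sketch of selecting one of the five pitch contour patterns from the
# prosodic position and stress of a syllable.
def pitch_pattern(syllables_in_word, is_word_final, is_stressed):
    if syllables_in_word == 1:
        return "WS"
    position = "WF" if is_word_final else "WO"
    stress = "stress" if is_stressed else "unstress"
    return f"{position}_{stress}"

print(pitch_pattern(1, True, True))    # -> WS
print(pitch_pattern(2, False, True))   # -> WO_stress
print(pitch_pattern(2, True, False))   # -> WF_unstress
```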
Advantages of some embodiments of the present invention thus include improved sound quality of synthesized speech. Compared with speech segments synthesized by concatenating phonemes or diphones, speech segments synthesized by concatenating micro-segments can provide improved speech continuity and more prosodic variation. The overall sound quality of a TTS system can thus be improved, which is particularly beneficial in resource-limited portable devices such as mobile telephones and Personal Digital Assistants (PDAs).
It will be appreciated that the embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of synthesizing speech from an input string described herein. The non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as steps of a method to synthesize speech from an input string. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more Application Specific Integrated Circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill in the art, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such software instructions, programs and ICs with minimal experimentation.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art will appreciate that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims, including any amendments made during the pendency of this application and all equivalents of those claims as issued.