CN101312038A - Method for synthesizing voice - Google Patents

Method for synthesizing voice

Info

Publication number
CN101312038A
CN101312038A CNA2007101045813A CN200710104581A
Authority
CN
China
Prior art keywords
sequence
acoustic parameter
micro-segment
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101045813A
Other languages
Chinese (zh)
Other versions
CN101312038B (en)
Inventor
祖漪清
曹振海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuance Communications Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc filed Critical Motorola Inc
Priority to CN2007101045813A priority Critical patent/CN101312038B/en
Priority to PCT/US2008/062822 priority patent/WO2008147649A1/en
Publication of CN101312038A publication Critical patent/CN101312038A/en
Application granted granted Critical
Publication of CN101312038B publication Critical patent/CN101312038B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/06Elementary speech units used in speech synthesisers; Concatenation rules

Abstract

A method for synthesizing speech from an input string can be used to improve the quality of concatenative synthetic speech. The method includes processing the input string to provide an acoustic parameter sequence (step 305), generating a sequence of candidate micro-segment sets from a speech database, one set for each acoustic parameter in the acoustic parameter sequence (step 310), determining a preferred micro-segment sequence for the acoustic parameter sequence from the candidate micro-segment sets (step 315), and concatenating the micro-segments of the preferred micro-segment sequence to produce synthetic speech (step 320).

Description

Method for synthesizing speech
Technical field
The present invention relates generally to text-to-speech (TTS) synthesis, and in particular to synthesizing speech from a text string using micro-segments.
Background
Text-to-speech (TTS) conversion, also commonly referred to as concatenative text-to-speech synthesis, enables an electronic device to receive an input text string and provide an audio signal representation of the string in the form of synthetic speech. In concatenative speech synthesis, basic speech units such as phonemes or diphones are concatenated. However, for a device that synthesizes speech from basic speech units of received text strings containing an unpredictable variety of phonemes, it can be difficult to provide high-quality, natural-sounding synthetic speech. This is because the pronunciation of a phoneme, syllable, or word normally depends on its context.
Because the storage and processing capacity of many devices is limited, a speech database such as a corpus of recorded speech waveforms may not contain all expected prosodic variations of a phoneme, syllable, or word. For example, although phoneme-based concatenation across syllable boundaries, such as diphone-to-diphone concatenation, may be acceptable, phoneme-based concatenation of phone strings inside a syllable may produce unnatural sounds. This is because the concatenation points between speech segments often cause unnatural changes in the speech.
A typical diphone database for English may contain about 1200 diphones; however, to reduce concatenation at voiced-to-voiced boundaries inside voiced sounds, the database requires clusters of n phonemes. A database containing all pronunciations of all characters can therefore be surprisingly large. Thus, most TTS systems must estimate the proper pronunciation of an input text string based on acoustic analysis using a database of limited size. In particular, when the database is to be built into a handheld electronic device with limited memory capacity, the size of the database is severely constrained.
Description of drawings
To facilitate understanding and practice of the present invention, reference will now be made to the illustrative embodiments described with reference to the accompanying drawings, in which like reference numerals denote identical or functionally similar elements throughout the several views. The drawings, together with the detailed description below, are incorporated in and form part of the specification, and serve to further describe the embodiments and to explain various principles and advantages in accordance with the present invention, in which:
Fig. 1 is a schematic diagram illustrating an electronic device in the form of a mobile telephone, in accordance with some embodiments of the present invention;
Fig. 2 is a flow diagram illustrating a method for synthesizing speech from an input text string, in accordance with some embodiments of the present invention;
Fig. 3 is a general flow diagram illustrating a method for synthesizing speech from an input string, in accordance with some embodiments of the present invention;
Fig. 4 is a general flow diagram illustrating a method for processing an input string to provide an acoustic parameter sequence, in accordance with some embodiments of the present invention; and
Fig. 5 is a diagram illustrating a pitch model comprising five normalized pitch contour models, in accordance with some embodiments of the present invention.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve understanding of embodiments of the present invention.
Detailed description
Before describing in detail embodiments in accordance with the present invention, it should be observed that the embodiments reside primarily in combinations of method steps and apparatus components related to synthesizing speech from an input string. Accordingly, the apparatus components and method steps have been represented, where appropriate, by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention, so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
In this document, relational terms such as first and second, top and bottom, and the like are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. An element preceded by "comprises a ..." does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Referring to Fig. 1, a schematic diagram illustrates an electronic device in the form of a mobile telephone 100, in accordance with some embodiments of the present invention. The mobile telephone 100 comprises a radio frequency communications unit 102 coupled to be in communication with a processor 103 via a common data and address bus 117. The telephone 100 also has a keypad 106 and a display screen 105, which can be, for example, a touch screen coupled to be in communication with the processor 103.
The processor 103 includes an encoder/decoder 111 with an associated code read-only memory (ROM) 112 that stores data for encoding and decoding voice or other signals that may be transmitted or received by the mobile telephone 100. The processor 103 also includes a microprocessor 113 coupled, via the common data and address bus 117, to the encoder/decoder 111, a character read-only memory (ROM) 114, a random access memory (RAM) 104, a programmable memory 116, and a subscriber identity module (SIM) interface 118. The programmable memory 116 and a SIM operatively coupled to the SIM interface 118 each can store, among other things, a telephone number database (TND) comprising a number field for telephone numbers and a name field for identifiers uniquely associated with the telephone numbers in the number field.
The radio frequency communications unit 102 is a combined receiver and transmitter having a common antenna 107. The communications unit 102 has a transceiver 108 coupled to the antenna 107 via a radio frequency amplifier 109. The transceiver 108 is also coupled to a combined modulator/demodulator 110, which is coupled to the encoder/decoder 111.
The microprocessor 113 has ports for coupling to the keypad 106 and to the display screen 105. The microprocessor 113 further has ports for coupling to an alert module 115, which typically contains an alert speaker, a vibrator motor, and associated drivers, to a microphone 120, and to a communications speaker 122. The character ROM 114 stores code for encoding and decoding data, such as control channel messages, that may be transmitted or received by the communications unit 102. In some embodiments of the present invention, the character ROM 114, the programmable memory 116, or a SIM can also store operating code (OC) for the microprocessor 113 and code for performing functions associated with the mobile telephone 100. For example, the programmable memory 116 can comprise speech synthesis service program code components 125 configured to cause execution of a method for synthesizing speech from an input string.
Thus, some embodiments of the present invention comprise a method for synthesizing speech from an input string using the mobile telephone 100. The input string can be, for example, a text string contained in a text message or an e-mail received at the mobile telephone 100. The method comprises processing the input string to provide an acoustic parameter sequence. A sequence of candidate micro-segment sets is then generated from a speech database using the acoustic parameter sequence. Next, a preferred micro-segment sequence is determined for the acoustic parameter sequence from the sequence of candidate micro-segment sets. Finally, the micro-segments of the preferred micro-segment sequence are concatenated to produce synthetic speech.
Some embodiments of the present invention thus enable speech synthesis to be performed using micro-segments and acoustic parameter sequences that represent target acoustic models, rather than phonemes or diphones as is common. A micro-segment can be a speech segment of any length but is generally shorter than a phoneme or a diphone. For example, a micro-segment can be a 20 ms speech frame, whereas the speech segment for a phoneme generally comprises several such speech frames. Because a speech segment synthesized by concatenating micro-segments can provide more frequency and prosody variation than a speech segment synthesized from phonemes or diphones, the overall sound quality of a text-to-speech (TTS) system can be improved.
Referring to Fig. 2, a flow diagram illustrates a method 200 for synthesizing speech from an input string 205, in accordance with some embodiments of the present invention. First, the input string 205 is processed to provide an acoustic parameter sequence 230. A sequence 240 of candidate micro-segment sets 235 is then generated from a speech database using the acoustic parameter sequence 230. Next, a preferred micro-segment sequence 245 is determined for the acoustic parameter sequence 230 from the sequence 240 of candidate micro-segment sets 235. Finally, the micro-segments of the preferred micro-segment sequence 245 are concatenated to produce a synthetic speech signal 250. For example, speech frames 255 corresponding to the micro-segments of the preferred micro-segment sequence 245 can be loaded into the RAM 104 of the mobile telephone 100, then concatenated and played over the communications speaker 122 to produce the synthetic speech signal 250.
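To make the data flow of method 200 concrete, the following Python sketch wires the four steps together over toy data structures. All names, the feature values, and the greedy stand-in for the selection step are hypothetical illustrations, not the patent's implementation (a Viterbi-based selection is sketched later):

from dataclasses import dataclass
from typing import List

# Toy stand-ins for the structures of Fig. 2; names are illustrative only.

@dataclass
class AcousticParam:            # one target per micro-segment description
    spectrum: List[float]       # e.g. an LPC/LSP/MFCC feature vector
    pitch: float
    energy: float

@dataclass
class MicroSegment:             # e.g. a 20 ms speech frame in the database
    spectrum: List[float]
    pitch: float
    energy: float
    samples: List[int]          # raw waveform samples

def analyze(text: str) -> List[AcousticParam]:
    # Placeholder for steps 405-425 (Fig. 4): derive per-frame targets.
    return [AcousticParam([float(ord(c) % 7)], 120.0, 1.0)
            for c in text if c.isalpha()]

def nearest(target: AcousticParam, db: List[MicroSegment], n: int = 3):
    # Placeholder candidate search; a cost-based version appears below.
    return sorted(db, key=lambda m: abs(m.pitch - target.pitch)
                  + abs(m.energy - target.energy))[:n]

def synthesize(text: str, db: List[MicroSegment]) -> List[int]:
    targets = analyze(text)                              # step 305
    candidate_sets = [nearest(t, db) for t in targets]   # step 310
    preferred = [cands[0] for cands in candidate_sets]   # step 315 (greedy stand-in)
    out: List[int] = []
    for seg in preferred:                                # step 320: concatenate
        out.extend(seg.samples)
    return out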
Referring to Fig. 3, a flow diagram further illustrates a general method 300 for synthesizing speech from an input string, in accordance with some embodiments of the present invention. At step 305, the input string is processed to provide an acoustic parameter sequence. For example, the acoustic parameters in the acoustic parameter sequence 230 can include spectrum parameters, pitch parameters, and energy parameters.
According to some embodiments of the present invention, the acoustic parameters, which are also referred to as target speech units, are generated from the input string using prosodic positions. For example, a prosodic position can include the position of a syllable in a word and the position of that word in a sentence.
The spectrum parameters can be modeled using known spectral feature representation methods, including, for example, the linear predictive coding (LPC) method, the line spectral pairs (LSP) method, or the mel-frequency cepstral coefficient (MFCC) method. Thus, using the prosodic positions, the spectrum parameters of a phoneme can be determined. For example, a position-to-spectrum model such as a Gaussian mixture model (GMM) can be used to map phoneme acoustic features, such as prosodic positions, to spectrum parameters. The pitch parameters can be determined using a pitch model that defines the pitch contour of a syllable according to the prosodic position of the syllable. The pitch model can comprise pitch contour models, for example WO_stress, WO_unstress, WF_stress, WF_unstress, or WS.
For the energy parameters, different strategies can be used for the voiced and unvoiced parts of a syllable. For the voiced parts, energy contour patterns can be defined for syllables. Different energy contour patterns can be defined conditioned on the position of a consonant-vowel-like (CV-like) unit in a syllable and/or on whether the syllable is stressed. For the unvoiced parts, energy contour patterns can be defined for phonemes, and each (unvoiced) phoneme can have one or more energy contour patterns. The energy contour of an unvoiced phoneme can depend on the position of the phoneme in a syllable and the position of the syllable in a word. To reduce the amount of memory required, some (unvoiced) phonemes can share the same energy contour pattern if they have similar positions and similar articulation patterns. For example, the phonemes "s", "sh", and "ch" can share one energy contour pattern, and similarly "g", "d", and "k" can share another.
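A minimal sketch of this pattern sharing, assuming a hypothetical table layout: unvoiced phonemes with similar positions and articulation are keyed to one shared contour, so only a few contour arrays need to be stored. The groupings follow the examples in the text; the contour values themselves are invented for illustration:

# Shared energy-contour table for unvoiced phonemes (values invented).
SHARED_GROUP = {"s": "fricative", "sh": "fricative", "ch": "fricative",
                "g": "stop", "d": "stop", "k": "stop"}

# One normalized energy contour per shared group and position of the
# phoneme in its syllable ("onset" or "coda").
ENERGY_CONTOURS = {
    ("fricative", "onset"): [0.2, 0.6, 1.0, 0.7],
    ("fricative", "coda"):  [0.8, 1.0, 0.5, 0.2],
    ("stop", "onset"):      [0.0, 1.0, 0.4, 0.1],
    ("stop", "coda"):       [0.6, 1.0, 0.3, 0.0],
}

def energy_contour(phoneme: str, position_in_syllable: str) -> list:
    return ENERGY_CONTOURS[(SHARED_GROUP[phoneme], position_in_syllable)]

# "s" and "ch" in the same position resolve to one stored contour:
assert energy_contour("s", "onset") is energy_contour("ch", "onset")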
At step 310, a sequence of candidate micro-segment sets is generated from a speech database 315 using the acoustic parameter sequence. According to some embodiments of the present invention, the candidate micro-segment sets can be generated using a target cost function and a duration model. For example, the target cost function can be a weighted sum of a spectrum cost, a pitch cost, and an energy cost, where a lower target cost indicates that the acoustic features of a candidate micro-segment closely match an acoustic parameter. For example, for each acoustic parameter in the acoustic parameter sequence 230, the mobile telephone 100 can search the speech database 315 to find a set of candidate micro-segments (e.g., speech frames) whose acoustic features closely match the acoustic parameter and its estimated duration. The closely matching speech frames can then be selected to generate the sequence 240 of candidate micro-segment sets 235.
To reduce processing time, the speech frames in the speech database 315 can be classified into several speech frame sets according to their prosodic positions, and the candidate micro-segments can be searched for within the speech frame set whose prosodic position closely matches that of the acoustic parameter, as sketched below.
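One way this bucketed candidate search could look in Python, under assumed dictionary-based frame records (a simplified distance stands in for the full target cost of Equation 1 below):

from collections import defaultdict

def build_index(frames):
    # Bucket database frames by prosodic position to shrink the search.
    index = defaultdict(list)
    for f in frames:
        index[f["prosodic_pos"]].append(f)
    return index

def candidate_set(target, index, k=10):
    # Return the k frames in the matching bucket with the lowest cost.
    bucket = index.get(target["prosodic_pos"], [])
    def cost(f):  # simplified stand-in for the Equation 1 target cost
        spec = sum((a - b) ** 2
                   for a, b in zip(f["spectrum"], target["spectrum"]))
        return (spec + abs(f["pitch"] - target["pitch"])
                + abs(f["energy"] - target["energy"]))
    return sorted(bucket, key=cost)[:k]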
At step 315, a preferred micro-segment sequence is determined for the acoustic parameter sequence from the candidate micro-segment sets. For example, a Viterbi algorithm can be used to determine the preferred micro-segment sequence 245, and the path cost function of the Viterbi algorithm can be the sum of a target cost function and a concatenation cost function.
According to some embodiments of the present invention, the target cost function can be a weighted sum of a spectrum cost function, a pitch cost function, and an energy cost function. For example, the spectrum cost function can be a measure of the degree of difference in spectral features between a candidate micro-segment and an acoustic parameter (also referred to as a target micro-segment) in the acoustic parameter sequence 230. Similarly, the pitch cost function and the energy cost function can measure the degree of difference in pitch and energy features, respectively, between an acoustic parameter and a candidate micro-segment. For example, the target cost function can be defined as follows:
C^T(u_{i,k}) = K_S^T C_S^T(u_{i,k}) + K_P^T C_P^T(u_{i,k}) + K_E^T C_E^T(u_{i,k})   (Equation 1)
where u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter in the acoustic parameter sequence 230, C^T(u_{i,k}) is the target cost function, C_S^T(u_{i,k}) is the spectrum cost function, C_P^T(u_{i,k}) is the pitch cost function, C_E^T(u_{i,k}) is the energy cost function, and K_S^T, K_P^T, and K_E^T are weight values.
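Transcribed directly into Python, Equation 1 might look as follows; the weight values and the squared-difference sub-costs are placeholders, since the patent leaves their exact form open:

K_S, K_P, K_E = 1.0, 0.5, 0.25   # hypothetical weights K_S^T, K_P^T, K_E^T

def spectrum_cost(candidate, target):
    return sum((c - t) ** 2
               for c, t in zip(candidate["spectrum"], target["spectrum"]))

def pitch_cost(candidate, target):
    return (candidate["pitch"] - target["pitch"]) ** 2

def energy_cost(candidate, target):
    return (candidate["energy"] - target["energy"]) ** 2

def target_cost(candidate, target):
    # C^T(u_{i,k}) = K_S^T C_S^T + K_P^T C_P^T + K_E^T C_E^T  (Equation 1)
    return (K_S * spectrum_cost(candidate, target)
            + K_P * pitch_cost(candidate, target)
            + K_E * energy_cost(candidate, target))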
The concatenation cost function can be a weighted sum of a spectral difference function, a pitch difference function, and an energy difference function. The spectral difference function can measure the degree of difference in spectral features between two adjacent micro-segments. Similarly, the pitch difference function and the energy difference function can measure the degree of difference in pitch and energy features, respectively, between two adjacent micro-segments. For example, the concatenation cost function can be defined as follows:
C^C(u_{i-1,j}, u_{i,k}) = K_S^C C_S^C(u_{i-1,j}, u_{i,k}) + K_P^C C_P^C(u_{i-1,j}, u_{i,k}) + K_E^C C_E^C(u_{i-1,j}, u_{i,k})   (Equation 2)
where u_{i-1,j} is the j-th candidate micro-segment for the (i-1)-th acoustic parameter in the acoustic parameter sequence 230, u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter in the acoustic parameter sequence 230, C^C(u_{i-1,j}, u_{i,k}) is the concatenation cost, C_S^C(u_{i-1,j}, u_{i,k}) is the spectral difference function between u_{i-1,j} and u_{i,k}, C_P^C(u_{i-1,j}, u_{i,k}) is the pitch difference function between u_{i-1,j} and u_{i,k}, C_E^C(u_{i-1,j}, u_{i,k}) is the energy difference function between u_{i-1,j} and u_{i,k}, and K_S^C, K_P^C, and K_E^C are weight values.
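Putting Equations 1 and 2 together, the Viterbi search over the candidate lattice can be sketched as below; target_cost is the sketch above, concat_cost transcribes Equation 2 with the same placeholder weights and sub-costs, and the dynamic program itself is textbook Viterbi rather than anything patent-specific:

def concat_cost(prev, cur, K_S=1.0, K_P=0.5, K_E=0.25):
    # C^C(u_{i-1,j}, u_{i,k}) per Equation 2, with placeholder sub-costs.
    spec = sum((a - b) ** 2
               for a, b in zip(prev["spectrum"], cur["spectrum"]))
    return (K_S * spec
            + K_P * (prev["pitch"] - cur["pitch"]) ** 2
            + K_E * (prev["energy"] - cur["energy"]) ** 2)

def viterbi(targets, candidate_sets):
    # Pick one micro-segment per target, minimizing the total path cost.
    # best holds (cost of the best path ending at this candidate, that path).
    best = [(target_cost(c, targets[0]), [c]) for c in candidate_sets[0]]
    for t, cands in zip(targets[1:], candidate_sets[1:]):
        new_best = []
        for c in cands:
            tc = target_cost(c, t)
            cost, path = min(((b_cost + concat_cost(b_path[-1], c) + tc, b_path)
                              for b_cost, b_path in best),
                             key=lambda x: x[0])
            new_best.append((cost, path + [c]))
        best = new_best
    return min(best, key=lambda x: x[0])[1]  # the preferred micro-segment sequence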
Then, at step 320, the micro-segments of the preferred micro-segment sequence are concatenated to produce synthetic speech.
Referring to Fig. 4, a general flow diagram illustrates sub-steps of step 305 of the method 300, in which an input string is processed to provide an acoustic parameter sequence, in accordance with some embodiments of the present invention. At step 405, the input string is processed to provide a phoneme sequence. For example, the input string 205 can be a text message or an e-mail message received at the mobile telephone 100, and the phoneme sequence can be a string that represents the pronunciation of the text message in a phonetic alphabet.
At step 410, syllable boundaries are determined in the phoneme sequence to provide a syllable sequence. For example, an English word may comprise several syllables, and the syllable boundaries in the word are determined to provide the syllable sequence. For example, the phoneme sequence "ihksplehn" for the English word "explain" can be divided into a syllable sequence comprising the two syllables "ihk" and "splehn".
Then, at step 415, sub-syllable units are identified in the syllable sequence to provide a sub-syllable sequence. A sub-syllable unit can be equal to or smaller than a syllable and can be a CV-like speech unit (which can comprise a consonant and a vowel). A sub-syllable sequence can thus comprise CV-like speech units and consonants. For example, in the syllable sequence ("ihk" + "splehn"), two CV-like speech units ("ih" and "lehn") can be identified, and the corresponding sub-syllable sequence can then be ("ih" + "k" + "s" + "p" + "lehn").
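As a toy illustration only (the patent does not spell out the segmentation rule), the following heuristic reproduces the worked example: a CV-like unit takes at most one onset consonant, the vowel, and an immediately following sonorant, while every other phoneme stands alone:

VOWELS = {"ih", "eh", "ae", "aa", "uw"}    # tiny illustrative inventory
SONORANTS = {"n", "m", "ng", "l", "r"}

def sub_syllable_units(syllable):
    # Split a syllable (a list of phonemes) into CV-like units and consonants.
    vowel_idx = next((j for j, p in enumerate(syllable) if p in VOWELS), None)
    if vowel_idx is None:
        return [[p] for p in syllable]       # no vowel: bare consonants
    start = max(vowel_idx - 1, 0)            # at most one onset consonant
    end = vowel_idx + 1
    if end < len(syllable) and syllable[end] in SONORANTS:
        end += 1                             # absorb a sonorant coda
    return ([[p] for p in syllable[:start]]  # leading consonants
            + [syllable[start:end]]          # the CV-like unit
            + [[p] for p in syllable[end:]]) # trailing consonants

# Reproduces the example ("ihk" + "splehn") -> ih + k + s + p + lehn:
print(sub_syllable_units(["ih", "k"]))                 # [['ih'], ['k']]
print(sub_syllable_units(["s", "p", "l", "eh", "n"]))  # [['s'], ['p'], ['l', 'eh', 'n']]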
According to some embodiments of the present invention, representing the pronunciation of input text using CV-like speech units can reduce the number of basic units needed to describe words. For example, a dictionary comprising 202,000 words may comprise 24,980 syllables but only 6,707 CV-like units.
Then, at step 420, the sub-syllable sequence is processed to provide a micro-segment description sequence. For example, by using a duration model to estimate the duration of each element in the sub-syllable sequence, the number of micro-segments needed to synthesize speech for each element can be estimated. For example, consider the CV-like speech unit (sub-syllable) "ih". If the estimated duration of the CV-like speech unit is approximately equal to five micro-segments, the sub-syllable can be mapped to five micro-segment descriptions as follows:
ih_f ih_f ih_f ih_f ih_f
where ih_f is a micro-segment description.
According to some embodiments of the present invention, the estimated duration of a sub-syllable can be obtained using a duration model, where the model comprises average phoneme durations and prosodic attributes of phonemes. For example, the duration of a phoneme p can be obtained according to the following equation:
L_p = k × L_avg   (Equation 3)
where L_p is the estimated duration of the phoneme p, L_avg is the average phoneme duration of the phoneme p, and k is a prosodic attribute coefficient obtained from factors including the number of phonemes in the syllable containing the phoneme p, the number of syllables in the word containing that syllable, and the type of the phoneme p.
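A small sketch of Equation 3 together with the mapping of step 420, under assumed numbers: the average durations, the coefficient table, and the 20 ms frame length are illustrative values (the 20 ms frame appears earlier in the text only as an example micro-segment size):

FRAME_MS = 20.0   # example micro-segment length used earlier in the text

# Hypothetical model tables: average phoneme/unit durations in ms, and
# prosodic attribute coefficients k keyed by (phonemes in the syllable,
# syllables in the word); a real model would also condition on phoneme type.
AVG_DURATION_MS = {"ih": 90.0, "k": 60.0, "s": 110.0, "p": 55.0, "lehn": 210.0}
K_TABLE = {(2, 2): 1.1, (5, 2): 0.9}

def estimated_duration(unit, phonemes_in_syllable, syllables_in_word):
    # Equation 3: L_p = k * L_avg, with k looked up from assumed factors.
    k = K_TABLE.get((phonemes_in_syllable, syllables_in_word), 1.0)
    return k * AVG_DURATION_MS[unit]

def micro_segment_descriptions(unit, phonemes_in_syllable, syllables_in_word):
    # Map a sub-syllable unit to N frame-level descriptions (step 420).
    dur = estimated_duration(unit, phonemes_in_syllable, syllables_in_word)
    n = max(1, round(dur / FRAME_MS))
    return [f"{unit}_f"] * n

# With the assumed k = 1.1, "ih" lasts ~99 ms, i.e. about five 20 ms frames:
print(micro_segment_descriptions("ih", 2, 2))
# ['ih_f', 'ih_f', 'ih_f', 'ih_f', 'ih_f']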
Then, at step 425, the micro-segment description sequence is processed to provide the acoustic parameter sequence. For example, each micro-segment description in the micro-segment description sequence can be mapped to an acoustic parameter that describes the acoustic features of that micro-segment description, where the acoustic features can be, for example, spectral (frequency) features and prosodic features (pitch, energy, or duration). The micro-segment description sequence can comprise a plurality of micro-segment descriptions, each generally describing a speech micro-segment shorter than a phoneme. For each micro-segment description in the micro-segment description sequence, the acoustic parameters can be estimated using acoustic models. For example, the acoustic parameters can comprise a spectrum parameter s_n, a pitch parameter p_n, and an energy parameter e_n.
Referring to Fig. 5, a diagram illustrates a pitch model used in accordance with some embodiments of the present invention, where the model comprises five normalized pitch contour models: WO_stress 505, WO_unstress 510, WF_stress 515, WF_unstress 520, and WS 525. The WO_stress 505 pitch contour model defines the pitch contour of a stressed syllable located at the beginning or in the middle of a word having multiple syllables. The WO_unstress 510 pitch contour model defines the pitch contour of an unstressed syllable located at the beginning or in the middle of a word having multiple syllables. The WF_stress 515 pitch contour model defines the pitch contour of a stressed syllable located at the end of a word having multiple syllables. The WF_unstress 520 pitch contour model defines the pitch contour of an unstressed syllable located at the end of a word having multiple syllables. The WS 525 pitch contour model defines the pitch contour of the syllable of a single-syllable word.
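Selecting among the five contour models of Fig. 5 is a function of syllable count, position, and stress alone, so it reduces to a small decision rule; a minimal sketch, with invented contour shapes (Fig. 5 defines only which model applies to which syllable):

# Normalized pitch contours per model; the shapes below are invented.
PITCH_CONTOURS = {
    "WO_stress":   [1.0, 1.2, 1.1],
    "WO_unstress": [0.9, 0.95, 0.9],
    "WF_stress":   [1.1, 1.0, 0.8],
    "WF_unstress": [0.95, 0.85, 0.7],
    "WS":          [1.0, 1.1, 0.9],
}

def pitch_model(syllables_in_word, syllable_index, stressed):
    # Pick the Fig. 5 contour model from a syllable's prosodic position.
    if syllables_in_word == 1:
        return "WS"                              # single-syllable word
    final = syllable_index == syllables_in_word - 1
    prefix = "WF" if final else "WO"             # word-final vs. beginning/middle
    return prefix + ("_stress" if stressed else "_unstress")

# Stressed second (final) syllable of the two-syllable word "explain":
print(pitch_model(2, 1, True))   # -> WF_stress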
Advantages of some embodiments of the present invention thus include improved sound quality of synthetic speech. Compared with speech segments synthesized by concatenating phonemes or diphones, speech segments synthesized by concatenating micro-segments can provide improved speech continuity and more prosodic variation. The overall sound quality of a TTS system can thus be improved, particularly in resource-limited portable devices such as mobile telephones and personal digital assistants (PDAs).
It will be appreciated that embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of synthesizing speech from an input string described herein. The non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as steps of a method for synthesizing speech from an input string. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application-specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such software instructions, programs, and ICs with minimal experimentation.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art will appreciate that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all of the claims. The invention is defined solely by the appended claims, including any amendments made during the pendency of this application and all equivalents of those claims as issued.

Claims (16)

1. A method for synthesizing speech from an input string, the method comprising:
processing the input string to provide an acoustic parameter sequence;
generating a sequence of candidate micro-segment sets from a speech database using the acoustic parameter sequence;
determining a preferred micro-segment sequence for the acoustic parameter sequence from the sequence of candidate micro-segment sets; and
concatenating the micro-segments of the preferred micro-segment sequence to produce synthetic speech.
2. The method of claim 1, wherein processing the input string to provide the acoustic parameter sequence comprises:
processing the input string to provide a phoneme sequence;
determining syllable boundaries in the phoneme sequence to provide a syllable sequence;
identifying sub-syllable units in the syllable sequence to provide a sub-syllable sequence;
generating a micro-segment description sequence from the sub-syllable sequence; and
processing the micro-segment description sequence to provide the acoustic parameter sequence.
3. according to the method for claim 2, wherein, the differential section is described sequence and is to use the duration model to produce from consonant joint sequence, wherein should comprise the average duration of phoneme and the rhythm attribute of phoneme by the duration model.
4. according to the method for claim 2, wherein, consonant joint sequence comprises one or more in cv class voice unit or the phoneme.
5. according to the process of claim 1 wherein, the parameters,acoustic in the parameters,acoustic sequence comprises frequency spectrum parameter, pitch parameters and energy parameter.
6. according to the process of claim 1 wherein, the set of candidate's differential section is to use target cost function and duration model to select from sound bank.
7. according to the method for claim 6, wherein, the target cost function is the weighted sum of frequency spectrum cost, tone cost and cost of energy.
8. according to the process of claim 1 wherein, preferred differential section sequence is to use viterbi algorithm definite for the parameters,acoustic sequence from the set of candidate's differential section.
9. method according to Claim 8, wherein, viterbi algorithm comprises the path cost function, this path cost function is the summation of target cost function and polyphone cost function.
10. according to the method for claim 9, wherein, the target cost function is the weighted sum of frequency spectrum cost function, tone cost function and cost of energy function.
11. according to the method for claim 9, wherein, the polyphone cost function is the weighted sum of spectral difference function, difference of pitch function and energy difference function.
12. according to the method for claim 10, wherein, the target cost function is defined as follows:
C T(u i,k)=K S TC S T(u i,j)+K P TC P T(u i,j)+K E TC E Tu i,j)
Wherein, u I, kBe k candidate's differential section of i parameters,acoustic in the parameters,acoustic sequence, C T(u I, k) be the target cost function, C T S(u I, j) be the frequency spectrum cost function, C T P(u I, j) be the tone cost function, C T E(u I, j) be the cost of energy function, and K T S, K T PAnd K T EIt is weighted value.
13. according to the method for claim 11, wherein, the polyphone cost function defines according to following equation:
C C(u i-1,j,u i,k)=
K S CC S C(u i-1,j,u i,k)+K P CC P C(u i-1,j,u i,k)+K E CC E C(u i-1,j,u i,k)
Wherein, u I-1, jBe j candidate's differential section of i-1 parameters,acoustic in the parameters,acoustic sequence, u I, kBe k candidate's differential section of i parameters,acoustic in the parameters,acoustic sequence, C C(u I-1, k, u I, k) be to be used to the u that contacts I-1, jWith u I, kThe polyphone cost, C C S(u I-1, k, u I, j) be u I-1, kWith u I, kBetween the spectral difference function, C C P(u I-1, k, u I, j) be u I-1, kWith u I, kBetween the difference of pitch function, C C E(u I-1, k, u I, j) be u I-1, kWith u I, kBetween the energy difference function, and K C S, K C PAnd K C EIt is weighted value.
14. according to the method for claim 5, wherein, pitch parameters is one of following pitch model: WO_stress, WO_unstress, WF_stress, WF_stress or WS_stress.
15. according to the method for claim 5, wherein, energy parameter comprises speech part and unvoiced segments.
16. according to the method for claim 3, wherein, the duration model is defined by following equation:
L p=k×L avg
Wherein, L pBe the estimation duration of phoneme p, L AvgBe the average phoneme duration of phoneme p, and k is according to comprising phoneme number in the syllable that contains phoneme p, comprising the rhythm attribute coefficients that a plurality of factor obtained of the type of number of syllables in the word of this phoneme p and phoneme p.
CN2007101045813A 2007-05-25 2007-05-25 Method for synthesizing voice Expired - Fee Related CN101312038B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2007101045813A CN101312038B (en) 2007-05-25 2007-05-25 Method for synthesizing voice
PCT/US2008/062822 WO2008147649A1 (en) 2007-05-25 2008-05-07 Method for synthesizing speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101045813A CN101312038B (en) 2007-05-25 2007-05-25 Method for synthesizing voice

Publications (2)

Publication Number Publication Date
CN101312038A true CN101312038A (en) 2008-11-26
CN101312038B CN101312038B (en) 2012-01-04

Family

ID=39564770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101045813A Expired - Fee Related CN101312038B (en) 2007-05-25 2007-05-25 Method for synthesizing voice

Country Status (2)

Country Link
CN (1) CN101312038B (en)
WO (1) WO2008147649A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510424B (en) * 2009-03-12 2012-07-04 孟智平 Method and system for encoding and synthesizing speech based on speech primitive
CN102779508A (en) * 2012-03-31 2012-11-14 安徽科大讯飞信息科技股份有限公司 Speech corpus generating device and method, speech synthesizing system and method
CN104115222A (en) * 2012-02-16 2014-10-22 大陆汽车有限责任公司 Method and device for phonetising data sets containing text
WO2018209556A1 (en) * 2017-05-16 2018-11-22 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for speech synthesis
CN113192522A (en) * 2021-04-22 2021-07-30 北京达佳互联信息技术有限公司 Audio synthesis model generation method and device and audio synthesis method and device
CN113314096A (en) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 Speech synthesis method, apparatus, device and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2421827C2 (en) 2009-08-07 2011-06-20 Общество с ограниченной ответственностью "Центр речевых технологий" Speech synthesis method
CN113409759B (en) * 2021-07-07 2023-04-07 浙江工业大学 End-to-end real-time speech synthesis method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19610019C2 (en) * 1996-03-14 1999-10-28 Data Software Gmbh G Digital speech synthesis process
GB2313530B (en) * 1996-05-15 1998-03-25 Atr Interpreting Telecommunica Speech synthesizer apparatus
US6173263B1 (en) * 1998-08-31 2001-01-09 At&T Corp. Method and system for performing concatenative speech synthesis using half-phonemes
US7082396B1 (en) * 1999-04-30 2006-07-25 At&T Corp Methods and apparatus for rapid acoustic unit selection from a large speech corpus
JP4080989B2 (en) * 2003-11-28 2008-04-23 株式会社東芝 Speech synthesis method, speech synthesizer, and speech synthesis program

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510424B (en) * 2009-03-12 2012-07-04 孟智平 Method and system for encoding and synthesizing speech based on speech primitive
CN104115222A (en) * 2012-02-16 2014-10-22 大陆汽车有限责任公司 Method and device for phonetising data sets containing text
US9436675B2 (en) 2012-02-16 2016-09-06 Continental Automotive Gmbh Method and device for phonetizing data sets containing text
CN102779508A (en) * 2012-03-31 2012-11-14 安徽科大讯飞信息科技股份有限公司 Speech corpus generating device and method, speech synthesizing system and method
WO2018209556A1 (en) * 2017-05-16 2018-11-22 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for speech synthesis
CN109313891A (en) * 2017-05-16 2019-02-05 北京嘀嘀无限科技发展有限公司 System and method for speech synthesis
CN109313891B (en) * 2017-05-16 2023-02-21 北京嘀嘀无限科技发展有限公司 System and method for speech synthesis
CN113314096A (en) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 Speech synthesis method, apparatus, device and storage medium
CN113192522A (en) * 2021-04-22 2021-07-30 北京达佳互联信息技术有限公司 Audio synthesis model generation method and device and audio synthesis method and device
CN113192522B (en) * 2021-04-22 2023-02-21 北京达佳互联信息技术有限公司 Audio synthesis model generation method and device and audio synthesis method and device

Also Published As

Publication number Publication date
WO2008147649A8 (en) 2010-03-04
CN101312038B (en) 2012-01-04
WO2008147649A1 (en) 2008-12-04

Similar Documents

Publication Publication Date Title
CN101312038B (en) Method for synthesizing voice
EP1168299B1 (en) Method and system for preselection of suitable units for concatenative speech
EP1139332A9 (en) Spelling speech recognition apparatus
Turk et al. Robust processing techniques for voice conversion
Qian et al. A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS
EP2462586B1 (en) A method of speech synthesis
Qian et al. Improved prosody generation by maximizing joint probability of state and longer units
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
Chou et al. A set of corpus-based text-to-speech synthesis technologies for Mandarin Chinese
Balyan et al. Speech synthesis: a review
CN112309367A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
Yan et al. Analysis and synthesis of formant spaces of British, Australian, and American accents
Soong A phonetically labeled acoustic segment (PLAS) approach to speech analysis-synthesis
KR20010092645A (en) Client-server speech information transfer system and method
Toledano et al. Initialization, training, and context-dependency in HMM-based formant tracking
Phan et al. Improvement of naturalness for an HMM-based Vietnamese speech synthesis using the prosodic information
Mullah A comparative study of different text-to-speech synthesis techniques
Chouireb et al. Towards a high quality Arabic speech synthesis system based on neural networks and residual excited vocal tract model
Kim et al. Unit Generation Based on Phrase Break Strength and Pruning for Corpus‐Based Text‐to‐Speech
Anumanchipalli et al. KLATTSTAT: knowledge-based parametric speech synthesis.
Donovan Topics in decision tree based speech synthesis
Nthite et al. End-to-End Text-To-Speech synthesis for under resourced South African languages
Karabetsos et al. HMM-based speech synthesis for the Greek language
EP1589524A1 (en) Method and device for speech synthesis
Dong et al. A Unit Selection-based Speech Synthesis Approach for Mandarin Chinese.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: NUANCE COMMUNICATIONS CO., LTD.

Free format text: FORMER OWNER: MOTOROLA INC.

Effective date: 20100909

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: ILLINOIS, UNITED STATES TO: MASSACHUSETTS, UNITED STATES

TA01 Transfer of patent application right

Effective date of registration: 20100909

Address after: Massachusetts, USA

Applicant after: Nuance Communications Inc

Address before: Illinois, USA

Applicant before: Motorola Inc.

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120104

Termination date: 20210525