WO2008147649A1 - Method for synthesizing speech

Method for synthesizing speech

Info

Publication number
WO2008147649A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
micro
cost function
speech
segment
Prior art date
Application number
PCT/US2008/062822
Other languages
French (fr)
Other versions
WO2008147649A8 (en)
Inventor
Yi-Qing Zu
Zhen-Hai Cao
Original Assignee
Motorola, Inc.
Priority date
Filing date
Publication date
Application filed by Motorola, Inc. filed Critical Motorola, Inc.
Publication of WO2008147649A1 publication Critical patent/WO2008147649A1/en
Publication of WO2008147649A8 publication Critical patent/WO2008147649A8/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/06: Elementary speech units used in speech synthesisers; Concatenation rules

Abstract

A method for synthesizing speech from an input string enables improved sound quality of Text-To-Speech synthesis. The method includes processing the input string to provide a sequence of acoustic parameters (step 305). For each acoustic parameter in the sequence of acoustic parameters, a set of candidate micro-segments is generated from a speech library (step 310). A preferred micro-segment sequence is then determined from the sets of candidate micro-segments for the sequence of acoustic parameters (step 320). Micro-segments in the preferred micro-segment sequence are then concatenated to produce synthesized speech (step 325).

Description

METHOD FOR SYNTHESIZING SPEECH
Field of the Invention
The present invention relates generally to Text-To-Speech (TTS) synthesis and in particular to synthesizing speech from a text string using micro-segments.
Background
Text-To-Speech (TTS) conversion, often referred to as concatenative text to speech synthesis, enables electronic devices to receive an input text string and provide an audio signal representation of the string in the form of synthesized speech. For concatenative speech synthesis, basic speech units such as phones or diphones are concatenated. However, a device that synthesizes speech using phone-based speech units from a non-deterministic number of received text strings can have difficulty providing high quality, realistic synthesized speech. That is because the sound of a phone, syllable, or word is often context dependent.
Due to limited memory and processing power in many devices, not all desired prosodic variations of phones, syllables, or words can be included in a speech library such as an utterance waveform corpus. For example, although a phone-based concatenation, such as diphone-to-diphone, might be acceptable for inter-syllable concatenation, the phone-based concatenation of phones within a syllable may produce unnatural sounds. That is because concatenation points between voiced-voiced segments often cause unnatural sounding transitions.
A typical diphone speech library for the English language may have around 1,200 diphones, but to reduce the concatenations within voiced-to-voiced phone boundaries, a speech library would require n-phone clusters. Thus a speech library of all pronunciations of all characters can be prohibitively large, and in most TTS systems there is therefore a need to estimate appropriate pronunciations from a more compact set of units. The size of such a speech library can be particularly limited when the speech library is embedded in a handheld electronic device having limited memory capacity.
Brief Description of the Figures
In order that the invention may be readily understood and put into practical effect, reference will now be made to exemplary embodiments as illustrated with reference to the accompanying figures, wherein like reference numbers refer to identical or functionally similar elements throughout the separate views. The figures, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate the embodiments and explain various principles and advantages, in accordance with the present invention, where:
FIG. 1 is a schematic diagram illustrating an electronic device in the form of a mobile telephone, according to some embodiments of the present invention;
FIG. 2 is a flow diagram illustrating a method for synthesizing speech from an input text string, according to some embodiments of the present invention;
FIG. 3 is a general flow diagram illustrating a method for synthesizing speech from an input string, according to some embodiments of the present invention;
FIG. 4 is a general flow diagram illustrating a method for processing an input string to provide a sequence of acoustic parameters, according to some embodiments of the present invention; and

FIG. 5 is a diagram illustrating a pitch model comprising five normalized pitch contour models, according to some embodiments of the present invention.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
Detailed Description
Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the embodiments reside primarily in combinations of method steps and apparatus components related to synthesizing speech from an input string. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein. In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non- exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by "comprises a ..." does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Referring to FIG. 1, a schematic diagram illustrates an electronic device in the form of a mobile telephone 100, according to some embodiments of the present invention. The mobile telephone 100 comprises a radio frequency communications unit 102 coupled to be in communication with a common data and address bus 117 of a processor 103. The telephone 100 also has a keypad 106 and a display screen 105, such as a touch screen, coupled to be in communication with the processor 103.
The processor 103 also includes an encoder/decoder 111 with an associated code Read Only Memory (ROM) 112 for storing data for encoding and decoding voice or other signals that may be transmitted or received by the mobile telephone 100. The processor 103 further includes a microprocessor 113 coupled, by the common data and address bus 117, to the encoder/decoder 111, a character Read Only Memory (ROM) 114, a Random Access Memory (RAM) 104, programmable memory 116 and a Subscriber Identity Module (SIM) interface 118. The programmable memory 116 and a SIM operatively coupled to the SIM interface 118 each can store, among other things, a telephone number database (TND) comprising a number field for telephone numbers and a name field for identifiers uniquely associated with the telephone numbers in the number field. The radio frequency communications unit 102 is a combined receiver and transmitter having a common antenna 107. The communications unit 102 has a transceiver 108 coupled to the antenna 107 via a radio frequency amplifier 109. The transceiver 108 is also coupled to a combined modulator/demodulator 110 that is coupled to the encoder/decoder 111. The microprocessor 113 has ports for coupling to the keypad 106 and to the display screen 105. The microprocessor 113 further has ports for coupling to an alert module 115 that typically contains an alert speaker, vibrator motor and associated drivers; to a microphone 120; and to a communications speaker 122. The character ROM 114 stores code for decoding or encoding data such as control channel messages that may be transmitted or received by the communications unit 102. In some embodiments of the present invention, the character ROM 114, the programmable memory 116, or a SIM also can store operating code (OC) for the microprocessor 113 and code for performing functions associated with the mobile telephone 100. For example, the programmable memory 116 can comprise speech synthesis services program code components 125 configured to cause execution of a method for synthesizing speech from an input string.
Thus some embodiments of the present invention include a method of using the mobile telephone 100 for synthesizing speech from an input string. For example, an input string can be a text message or an email containing a text string received at the mobile telephone 100. The method includes processing the input string to provide a sequence of acoustic parameters. A sequence of sets of candidate micro-segments is then generated from a speech library using the sequence of acoustic parameters. A preferred micro-segment sequence is then determined from the sequence of sets of candidate micro-segments for the sequence of acoustic parameters. Finally, micro-segments in the preferred micro-segment sequence are concatenated to produce synthesized speech.
Some embodiments of the present invention therefore enable speech synthesis using micro-segments and a sequence of acoustic parameters representing a target acoustic model, rather than using phones or diphones. A micro-segment can be a speech segment of any length, but is generally shorter than a phone or diphone. For example, a micro-segment can be a 20 ms speech frame, whereas a speech segment of a phone usually comprises several such speech frames. Since speech segments synthesized by concatenating micro-segments can provide more frequency and prosodic variations than speech segments synthesized by concatenating phones or diphones, overall sound quality of Text To Speech (TTS) systems can be improved.
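To make the micro-segment notion concrete, the sketch below slices a waveform into fixed-length frames. The 20 ms frame length and the 16 kHz sample rate are illustrative assumptions, not values mandated by the description.

```python
import numpy as np

def to_micro_segments(waveform: np.ndarray, sample_rate: int = 16000,
                      frame_ms: float = 20.0) -> np.ndarray:
    """Split a mono waveform into non-overlapping fixed-length micro-segments.

    Assumes a 20 ms frame, matching the example in the text; a 120 ms phone
    would then span six such frames.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per micro-segment
    n_frames = len(waveform) // frame_len
    # Drop the trailing partial frame and reshape into (n_frames, frame_len).
    return waveform[:n_frames * frame_len].reshape(n_frames, frame_len)

# Example: one second of audio yields 50 micro-segments of 320 samples each.
frames = to_micro_segments(np.zeros(16000))
print(frames.shape)  # (50, 320)
```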
Referring to FIG. 2, a flow diagram illustrates a method 200 for synthesizing speech from an input string 205, according to some embodiments of the present invention. First, the input string 205 is processed to provide a sequence 230 of acoustic parameters. A sequence 240 of sets 235 of candidate micro-segments is then generated from a speech library using the sequence 230 of acoustic parameters. A preferred micro-segment sequence 245 is then determined from the sequence 240 of the sets 235 of candidate micro-segments for the sequence 230 of acoustic parameters. Finally, micro-segments in the preferred micro-segment sequence 245 are concatenated to produce a synthesized speech signal 250. For example, speech frames 255 corresponding to the micro-segment descriptions in the preferred micro-segment sequence 245 can be loaded into the RAM 104 of the mobile telephone 100, and then concatenated and played over the communications speaker 122 to produce the synthesized speech signal 250.
Referring to FIG. 3, a flow diagram further illustrates a general method 300 for synthesizing speech from an input string, according to some embodiments of the present invention. At step 305, the input string is processed to provide a sequence of acoustic parameters. For example, an acoustic parameter of the sequence 230 of acoustic parameters can comprise a spectrum parameter, a pitch parameter, and an energy parameter.
According to some embodiments of the present invention, the acoustic parameters, also referred to as target speech units, are generated from the input string using prosodic positions. For example, a prosodic position can comprise a position of a syllable in a word and a position of the word in a sentence.
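As one concrete picture of a prosodic position, here is a hypothetical record pairing the two positions named above; the field names and the stress flag are assumptions for illustration, not structures from the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProsodicPosition:
    """Prosodic position as described in the text: where a syllable sits in
    its word and where that word sits in its sentence."""
    syllable_in_word: int   # 0-based index of the syllable within the word
    word_in_sentence: int   # 0-based index of the word within the sentence
    is_stressed: bool       # whether the syllable carries lexical stress

# "explain" as the third word of a sentence: second (stressed) syllable.
pos = ProsodicPosition(syllable_in_word=1, word_in_sentence=2, is_stressed=True)
```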
A spectrum parameter can be modeled using a known spectral feature representation method, such as a Linear Predictive Coding (LPC) method, a Line Spectral Pairs (LSP) method, or a Mel-Frequency Cepstral Coefficient (MFCC) method. Therefore, using prosodic positions, spectrum parameters of phones can be determined. For example, a positional spectrum model, such as a Gaussian Mixture Model (GMM), can be used to map acoustic features of phones, such as prosodic position, to spectrum parameters. The pitch parameter can be determined using a pitch model that defines pitch contours of syllables based on prosodic positions of the syllables. The pitch model can comprise pitch contour models, such as WO_stress, WO_unstress, WF_stress, WF_unstress, or WS.
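As one concrete way to obtain a spectrum parameter, here is a minimal MFCC extraction sketch using the librosa library. The sample rate, 20 ms window, and 13 coefficients are illustrative choices, and the text names MFCC only as one of several admissible representations; the patent does not prescribe librosa.

```python
import numpy as np
import librosa

def spectrum_parameters(waveform: np.ndarray, sample_rate: int = 16000,
                        n_mfcc: int = 13) -> np.ndarray:
    """Compute per-frame MFCC vectors usable as spectrum parameters.

    The hop equals the window length, so analysis frames do not overlap,
    roughly lining up with 20 ms micro-segments.
    """
    frame_len = int(0.020 * sample_rate)  # 320 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=frame_len)
    return mfcc.T  # shape: (n_frames, n_mfcc)
```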
For the energy parameter, different strategies can be used for a voiced part and an un-voiced part of a syllable. For a voiced part, energy contour patterns can be defined for syllables. Different energy contour patterns can be defined using a position of a cv-like unit in the syllable and/or a condition concerning whether the syllable is a stressed syllable. For an un-voiced part, energy contour patterns can be defined for phonemes. Each (un-voiced) phoneme can have one or more energy contour patterns. An energy contour of an un-voiced phoneme can depend on the position of the phoneme in a syllable and the position of the syllable in a word. To reduce the amount of memory required, some (un-voiced) phonemes can share a same energy contour pattern if the phonemes have similar positions and a similar articulation manner. For example, phonemes "s", "sh", and "ch" can share a same energy contour pattern, and similarly, "g", "d", and "k" can share another same energy contour pattern.
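One way to realize this memory-saving sharing is a simple lookup keyed by articulation class. Only the two groupings ("s", "sh", "ch") and ("g", "d", "k") come from the text; the class names and contour shapes below are invented for illustration.

```python
# Hypothetical shared energy-contour table.
SHARED_ENERGY_CONTOURS = {
    "fricative_like": [0.2, 0.6, 1.0, 0.8, 0.4],  # shared by "s", "sh", "ch"
    "stop_like":      [0.0, 1.0, 0.5, 0.1, 0.0],  # shared by "g", "d", "k"
}

PHONEME_TO_CONTOUR_CLASS = {
    "s": "fricative_like", "sh": "fricative_like", "ch": "fricative_like",
    "g": "stop_like", "d": "stop_like", "k": "stop_like",
}

def energy_contour(phoneme: str) -> list[float]:
    """Look up the shared energy contour pattern for a phoneme."""
    return SHARED_ENERGY_CONTOURS[PHONEME_TO_CONTOUR_CLASS[phoneme]]
```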
At step 310, a sequence of sets of candidate micro-segments is generated from a speech library 315 using the sequence of acoustic parameters. According to some embodiments of the present invention, the sets of candidate micro-segments can be generated using a target cost function and a duration model. For example, the target cost function can be a weighted sum of a spectrum cost, a pitch cost, and an energy cost. A lower target cost can mean that acoustic characteristics of a candidate micro-segment closely match an acoustic parameter. For example, for each acoustic parameter in the sequence 230 of acoustic parameters, the mobile telephone 100 can search the speech library 315 to find a set of candidate micro-segments (e.g., speech frames) having acoustic characteristics that closely match the acoustic parameter and an estimated duration of the acoustic parameter. Such closely matching speech frames then can be selected to generate the sequence 240 of the sets 235 of candidate micro-segments. In order to reduce processing time, speech frames in the speech library 315 can be classified into several sets of speech frames using prosodic positions of the speech frames, and the candidate micro-segments can be searched in one of the sets of speech frames that closely matches a prosodic position of the acoustic parameter (a sketch of this pre-selection follows below).

At step 320, a preferred micro-segment sequence is determined from the sets of candidate micro-segments for the sequence of acoustic parameters. For example, a Viterbi algorithm can be used to determine the preferred micro-segment sequence 245, and a path cost function of the Viterbi algorithm can be a sum of a target cost function and a concatenation cost function.
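Here is a sketch of the step 310 pre-selection mentioned above: the library's speech frames are grouped by prosodic position in advance, and only the matching group is scored with the target cost. The dictionary layout, the `top_n` pruning, and the `.position` attribute are assumptions for illustration.

```python
import heapq

def candidate_set(target, library_by_position, target_cost, top_n=10):
    """Return the top_n library frames with the lowest target cost for one
    acoustic parameter, searching only frames whose prosodic position
    matches the target's (the classification described in the text)."""
    pool = library_by_position.get(target.position, [])
    return heapq.nsmallest(top_n, pool,
                           key=lambda frame: target_cost(target, frame))
```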
According to some embodiments of the present invention, the target cost function can be a weighted sum of a spectrum cost function, a pitch cost function, and an energy cost function. For example, the spectrum cost function can be a measure of a degree of difference in spectral features between a candidate micro-segment and an acoustic parameter, also referred to as a target micro-segment, in the sequence 230 of acoustic parameters. Similarly, the pitch cost function and the energy cost function can measure degrees of difference in pitch and energy features between an acoustic parameter and a candidate micro-segment, respectively. For example, the target cost function can be defined as follows:

C^T(u_{i,k}) = K_S^T C_S^T(u_{i,k}) + K_P^T C_P^T(u_{i,k}) + K_E^T C_E^T(u_{i,k})    (Eq. 1)

where u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter in the sequence 230 of acoustic parameters, C^T(u_{i,k}) is the target cost function, C_S^T(u_{i,k}) is a spectrum cost function, C_P^T(u_{i,k}) is a pitch cost function, C_E^T(u_{i,k}) is an energy cost function, and K_S^T, K_P^T, and K_E^T are weight values.

The concatenation cost function can be a weighted sum of a spectrum difference function, a pitch difference function, and an energy difference function. The spectrum difference function can measure a degree of difference in spectral features between two adjacent micro-segments. Similarly, the pitch difference function and the energy difference function can measure degrees of difference in pitch and energy features between two adjacent micro-segments, respectively. For example, the concatenation cost function can be defined as follows:

C^C(u_{i-1,j}, u_{i,k}) = K_S^C C_S^C(u_{i-1,j}, u_{i,k}) + K_P^C C_P^C(u_{i-1,j}, u_{i,k}) + K_E^C C_E^C(u_{i-1,j}, u_{i,k})    (Eq. 2)

where u_{i-1,j} is the j-th candidate micro-segment for the (i-1)-th acoustic parameter in the sequence 230 of acoustic parameters, u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter, C^C(u_{i-1,j}, u_{i,k}) is the concatenation cost for concatenating u_{i-1,j} and u_{i,k}, C_S^C(u_{i-1,j}, u_{i,k}) is a spectrum difference function between u_{i-1,j} and u_{i,k}, C_P^C(u_{i-1,j}, u_{i,k}) is a pitch difference function between u_{i-1,j} and u_{i,k}, C_E^C(u_{i-1,j}, u_{i,k}) is an energy difference function between u_{i-1,j} and u_{i,k}, and K_S^C, K_P^C, and K_E^C are weight values.
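To show how the path cost C^T + C^C drives the selection at step 320, here is a minimal Viterbi sketch over the candidate lattice. The cost callables are left abstract so that any weighted sums per Eqs. 1 and 2 can be plugged in; this is an illustrative reconstruction under those assumptions, not the patent's implementation.

```python
import numpy as np

def viterbi_select(candidates, target_cost, concat_cost):
    """Choose one micro-segment per position, minimizing the summed path cost.

    candidates:  list over positions i of lists of candidate micro-segments.
    target_cost: target_cost(i, u) -> float, cost of candidate u against the
                 i-th acoustic parameter (e.g., a weighted sum per Eq. 1).
    concat_cost: concat_cost(u_prev, u) -> float, cost of joining adjacent
                 candidates (e.g., a weighted sum per Eq. 2).
    """
    n = len(candidates)
    best = [target_cost(0, u) for u in candidates[0]]  # path cost to each start
    back = []  # back[i-1][k]: best predecessor index for candidate k at step i
    for i in range(1, n):
        new_best, ptr = [], []
        for u in candidates[i]:
            trans = [b + concat_cost(v, u)
                     for b, v in zip(best, candidates[i - 1])]
            j = int(np.argmin(trans))
            ptr.append(j)
            new_best.append(trans[j] + target_cost(i, u))
        best, back = new_best, back + [ptr]
    # Trace the cheapest path back from the final position.
    k = int(np.argmin(best))
    chosen = [k]
    for i in range(n - 1, 0, -1):
        k = back[i - 1][k]
        chosen.append(k)
    chosen.reverse()
    return [candidates[i][k] for i, k in enumerate(chosen)]
```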
At step 325, micro-segments in the preferred micro-segment sequence are then concatenated to produce synthesized speech.
Referring to FIG. 4, a general flow diagram illustrates sub-steps of the step 305 of the method 300 of processing the input string to provide the sequence of acoustic parameters, according to some embodiments of the present invention. At step 405, the input string is processed to provide a phoneme sequence. For example, the input string 205 can be a text message or an email message received at the mobile telephone 100, and the phoneme sequence can be a string representing pronunciation of the text message in a phonetic alphabet.
At step 410, syllable boundaries are then determined in the phoneme sequence to provide a syllable sequence. For example, an English word can comprise several syllables, and boundaries of such syllables in the word then can be determined to provide the syllable sequence. For example, a phoneme sequence "ihksplehn" concerning the English word "explain" can be divided into a syllable sequence comprising two syllables, such as "ihk" and "splehn".
At step 415, sub-syllable units are then identified in the syllable sequence to provide a sub-syllable sequence. The sub-syllable units can be equal to or smaller than syllables, and can be cv-like speech units, which can comprise a consonant and a vowel. Thus, the sub-syllable sequence can comprise cv-like speech units and consonants. For example, two cv-like speech units ("ih" and "lehn") can be identified in a syllable sequence ("ihk" + "splehn"). A corresponding sub-syllable sequence then can be ("ih" + "k" + "s" + "p" + "lehn"). According to some embodiments of the present invention, using cv-like speech units to represent pronunciation of an input text can reduce a number of basic units needed to describe words. For example, a lexicon comprising 202,000 words may comprise 24,980 syllables, and only 6,707 cv-like units.

At step 420, the sub-syllable sequence is then processed to provide a micro-segment description sequence. For example, by estimating a duration of each element in the sub-syllable sequence using a duration model, a number of micro-segments needed to synthesize speech for each element can be estimated. For example, consider the following cv-like speech unit (a sub-syllable): "ih". If an estimated duration of the cv-like speech unit is approximately equal to five micro-segments, the sub-syllable can be mapped to five micro-segment descriptions as follows: ih_f ih_f ih_f ih_f ih_f, where ih_f is a micro-segment description.
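A minimal sketch of this duration-driven expansion, assuming 20 ms micro-segments and the "ih_f"-style naming used above (both illustrative conventions):

```python
def micro_segment_descriptions(sub_syllable: str, duration_ms: float,
                               frame_ms: float = 20.0) -> list[str]:
    """Expand a sub-syllable into one micro-segment description per frame.

    E.g. a 100 ms "ih" with 20 ms frames yields five descriptions,
    mirroring the five-fold "ih_f" expansion in the text.
    """
    n_frames = max(1, round(duration_ms / frame_ms))
    return [f"{sub_syllable}_f"] * n_frames

print(micro_segment_descriptions("ih", 100.0))
# ['ih_f', 'ih_f', 'ih_f', 'ih_f', 'ih_f']
```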
According to some embodiments of the present invention, an estimated duration of a sub-syllable can be obtained by applying a duration model comprising average durations of phones and prosodic attributes of phones. For example, an estimated duration of a phone p can be obtained according to the following equation:
L_p = k × L_avg    (Eq. 3)

where L_p is the estimated duration of the phone p, L_avg is an average duration of the phone p, and k is a prosodic attribute coefficient obtained from factors comprising a number of phones in a syllable containing the phone p, a number of syllables in a word containing the syllable, and a type of the phone p.
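A worked instance of Eq. 3; the average duration and the coefficient value below are invented for illustration, not taken from the patent.

```python
def estimated_phone_duration(avg_duration_ms: float, k: float) -> float:
    """Eq. 3: L_p = k * L_avg, with k the prosodic attribute coefficient."""
    return k * avg_duration_ms

# Hypothetical numbers: average "ih" duration 80 ms; a stressed, word-final
# context might lengthen it (k = 1.25), giving an estimate of 100 ms.
print(estimated_phone_duration(80.0, 1.25))  # 100.0
```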
At step 425, the micro-segment description sequence is then processed to provide the sequence of acoustic parameters. The micro-segment description sequence can comprise micro-segment descriptions, each of which is a description of a speech micro-segment that is usually smaller than a phone. Each micro-segment description in the micro-segment description sequence can be mapped to an acoustic parameter describing the micro-segment's acoustic characteristics, such as spectrum (frequency characteristics) and prosodic features. For each micro-segment description in the micro-segment description sequence, an acoustic parameter can be estimated using an acoustic model. For example, the acoustic parameter can comprise a spectrum parameter S_n, a pitch parameter p_n, and an energy parameter G_n.
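A minimal container for the per-micro-segment acoustic parameter named here; only the three component names come from the text, while the field types (a spectrum vector plus scalar pitch and energy) are assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AcousticParameter:
    """Target acoustic parameter for one micro-segment description: a
    spectrum parameter S_n, a pitch parameter p_n, and an energy
    parameter G_n, as named in the text."""
    spectrum: np.ndarray  # S_n, e.g. an MFCC vector
    pitch: float          # p_n, e.g. fundamental frequency in Hz
    energy: float         # G_n, e.g. frame RMS energy
```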
Referring to FIG. 5, a diagram illustrates a pitch model comprising five normalized pitch contour models: WO_stress 505, WO_unstress 510, WF_stress 515, WF_unstress 520, and WS 525, as used according to some embodiments of the present invention. The WO_stress 505 pitch contour model defines a pitch contour for stressed syllables positioned at a beginning or middle of words having multiple syllables. The WO_unstress 510 pitch contour model defines a pitch contour for unstressed syllables positioned at a beginning or middle of words having multiple syllables. The WF_stress 515 pitch contour model defines a pitch contour for stressed syllables positioned at an end of words having multiple syllables. The WF_unstress 520 pitch contour model defines a pitch contour for unstressed syllables positioned at an end of words having multiple syllables. The WS 525 pitch contour model defines a pitch contour for syllables in words having only one syllable.

Advantages of some embodiments of the present invention therefore include improved sound quality of synthesized speech. Speech segments synthesized by concatenating micro-segments can provide improved speech continuity and more prosodic variations than speech segments synthesized by concatenating phones or diphones. Overall sound quality of TTS systems therefore can be improved, particularly in resource-constrained handheld devices such as mobile telephones and personal digital assistants (PDAs).
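Returning to the pitch model of FIG. 5: choosing among its five contour models reduces to a small rule over syllable position and stress. A minimal sketch, where the boolean flags and string labels are illustrative assumptions and only the five model names come from the text:

```python
def pitch_contour_model(is_stressed: bool, is_word_final: bool,
                        is_monosyllabic_word: bool) -> str:
    """Select one of the five normalized pitch contour models of FIG. 5.

    WS covers one-syllable words; WO_* covers word-initial/medial syllables;
    WF_* covers word-final syllables; *_stress vs *_unstress follows stress.
    """
    if is_monosyllabic_word:
        return "WS"
    prefix = "WF" if is_word_final else "WO"
    return f"{prefix}_{'stress' if is_stressed else 'unstress'}"

print(pitch_contour_model(True, False, False))  # WO_stress
```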
It will be appreciated that embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of synthesizing speech from an input string as described herein. The non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as steps of a method for synthesizing speech from an input string. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used.
Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below.
Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims, including any amendments made during the pendency of this application and all equivalents of those claims.

Claims

We claim:
1. A method for synthesizing speech from an input string, the method comprising:
processing the input string to provide a sequence of acoustic parameters;
generating from a speech library a sequence of sets of candidate micro-segments using the sequence of acoustic parameters;
determining a preferred micro-segment sequence from the sequence of sets of candidate micro-segments for the sequence of acoustic parameters; and
concatenating micro-segments in the preferred micro-segment sequence to produce synthesized speech.
2. The method of claim 1, wherein processing the input string to provide the sequence of acoustic parameters comprises:
processing the input string to provide a phoneme sequence;
determining syllable boundaries in the phoneme sequence to provide a syllable sequence;
identifying sub-syllable units in the syllable sequence to provide a sub-syllable sequence;
generating a micro-segment description sequence from the sub-syllable sequence; and
processing the micro-segment description sequence to provide the sequence of acoustic parameters.
3. The method of claim 2, wherein the micro-segment description sequence is generated from the sub-syllable sequence using a duration model comprising average durations of phones and prosodic attributes of phones.
4. The method of claim 2, wherein the sub-syllable sequence comprises one or more of a cv-like speech unit or a phone.
5. The method of claim 1, wherein an acoustic parameter of the sequence of acoustic parameters comprises a spectrum parameter, a pitch parameter, and an energy parameter.
6. The method of claim 1, wherein sets of candidate micro-segments are selected from the speech library using a target cost function and a duration model.
7. The method of claim 6, wherein the target cost function is a weighted sum of a spectrum cost, a pitch cost, and an energy cost.
8. The method of claim 1, wherein the preferred micro-segment sequence is determined from the sets of candidate micro-segments for the sequence of acoustic parameters using a Viterbi algorithm.
9. The method of claim 8, wherein the Viterbi algorithm comprises a path cost function that is a sum of a target cost function and a concatenation cost function.
10. The method of claim 9, wherein the target cost function is a weighted sum of a spectrum cost function, a pitch cost function, and an energy cost function.
11. The method of claim 9, wherein the concatenation cost function is a weighted sum of a spectrum difference function, a pitch difference function, and an energy difference function.
12. The method of claim 10, wherein the target cost function is defined as follows:

C^T(u_{i,k}) = K_S^T C_S^T(u_{i,k}) + K_P^T C_P^T(u_{i,k}) + K_E^T C_E^T(u_{i,k})

where u_{i,k} is the k-th candidate micro-segment of the i-th acoustic parameter in the sequence of acoustic parameters, C^T(u_{i,k}) is the target cost function, C_S^T(u_{i,k}) is a spectrum cost function, C_P^T(u_{i,k}) is a pitch cost function, C_E^T(u_{i,k}) is an energy cost function, and K_S^T, K_P^T, and K_E^T are weight values.
13. The method of claim 11, wherein the concatenation cost function is defined according to the following equation:

C^C(u_{i-1,j}, u_{i,k}) = K_S^C C_S^C(u_{i-1,j}, u_{i,k}) + K_P^C C_P^C(u_{i-1,j}, u_{i,k}) + K_E^C C_E^C(u_{i-1,j}, u_{i,k})

where u_{i-1,j} is the j-th candidate micro-segment of the (i-1)-th acoustic parameter in the sequence of acoustic parameters, u_{i,k} is the k-th candidate micro-segment of the i-th acoustic parameter in the sequence of acoustic parameters, C^C(u_{i-1,j}, u_{i,k}) is the concatenation cost for concatenating u_{i-1,j} and u_{i,k}, C_S^C(u_{i-1,j}, u_{i,k}) is a spectrum difference function between u_{i-1,j} and u_{i,k}, C_P^C(u_{i-1,j}, u_{i,k}) is a pitch difference function between u_{i-1,j} and u_{i,k}, C_E^C(u_{i-1,j}, u_{i,k}) is an energy difference function between u_{i-1,j} and u_{i,k}, and K_S^C, K_P^C, and K_E^C are weight values.
14. The method of claim 5, wherein the pitch parameter is one of the following pitch models: WO_stress, WO_unstress, WF_stress, WF_unstress, or WS.
15. The method of claim 5, wherein the energy parameter comprises a voiced part and an un-voiced part.
16. The method of claim 3, wherein the duration model is defined by the following equation:

L_p = k × L_avg

where L_p is an estimated duration of a phone p, L_avg is an average duration of the phone p, and k is a prosodic attribute coefficient obtained from factors comprising a number of phones in a syllable containing the phone p, a number of syllables in a word containing the phone p, and a type of the phone p.
PCT/US2008/062822 2007-05-25 2008-05-07 Method for synthesizing speech WO2008147649A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2007101045813A CN101312038B (en) 2007-05-25 2007-05-25 Method for synthesizing voice
CN200710104581.3 2007-05-25

Publications (2)

Publication Number Publication Date
WO2008147649A1 (en)
WO2008147649A8 WO2008147649A8 (en) 2010-03-04

Family

ID=39564770

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/062822 WO2008147649A1 (en) 2007-05-25 2008-05-07 Method for synthesizing speech

Country Status (2)

Country Link
CN (1) CN101312038B (en)
WO (1) WO2008147649A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011016761A1 (en) 2009-08-07 2011-02-10 Khitrov Mikhail Vasilevich A method of speech synthesis
CN113409759A (en) * 2021-07-07 2021-09-17 浙江工业大学 End-to-end real-time speech synthesis method

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510424B (en) * 2009-03-12 2012-07-04 孟智平 Method and system for encoding and synthesizing speech based on speech primitive
DE102012202391A1 (en) * 2012-02-16 2013-08-22 Continental Automotive Gmbh Method and device for phononizing text-containing data records
CN102779508B (en) * 2012-03-31 2016-11-09 科大讯飞股份有限公司 Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
WO2018209556A1 (en) * 2017-05-16 2018-11-22 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for speech synthesis
CN113314096A (en) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 Speech synthesis method, apparatus, device and storage medium
CN113192522B (en) * 2021-04-22 2023-02-21 北京达佳互联信息技术有限公司 Audio synthesis model generation method and device and audio synthesis method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2313530A (en) * 1996-05-15 1997-11-26 Atr Interpreting Telecommunica Speech Synthesizer
US6173263B1 (en) * 1998-08-31 2001-01-09 At&T Corp. Method and system for performing concatenative speech synthesis using half-phonemes
US6308156B1 (en) * 1996-03-14 2001-10-23 G Data Software Gmbh Microsegment-based speech-synthesis process
US7082396B1 (en) * 1999-04-30 2006-07-25 At&T Corp Methods and apparatus for rapid acoustic unit selection from a large speech corpus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4080989B2 (en) * 2003-11-28 2008-04-23 株式会社東芝 Speech synthesis method, speech synthesizer, and speech synthesis program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6308156B1 (en) * 1996-03-14 2001-10-23 G Data Software Gmbh Microsegment-based speech-synthesis process
GB2313530A (en) * 1996-05-15 1997-11-26 Atr Interpreting Telecommunica Speech Synthesizer
US6173263B1 (en) * 1998-08-31 2001-01-09 At&T Corp. Method and system for performing concatenative speech synthesis using half-phonemes
US7082396B1 (en) * 1999-04-30 2006-07-25 At&T Corp Methods and apparatus for rapid acoustic unit selection from a large speech corpus

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BENZMULLER R ET AL: "Microsegment synthesis-economic principles in a low-cost solution", ICSLP 96. PROCEEDINGS., FOURTH INTERNATIONAL CONFERENCE, vol. 4, 3 October 1996 (1996-10-03), PHILADELPHIA, PA, USA, pages 2383 - 2386, XP010238145, ISBN: 978-0-7803-3555-4 *
BLACK A W ET AL: "OPTIMISING SELECTION OF UNITS FROM SPEECH DATABASES FOR CONCATENATIVE SYNTHESIS", 4TH EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. EUROSPEECH '95, vol. 1, 18 September 1995 (1995-09-18), MADRID, SPAIN, pages 581 - 584, XP000854776 *
BLOUIN C ET AL: "Concatenation cost calculation and optimisation for unit selection in tts", SPEECH SYNTHESIS, 2002. PROCEEDINGS OF 2002 IEEE WORKSHOP ON 11-13 SEPT. 2002, PISCATAWAY, NJ, USA,IEEE, 11 September 2002 (2002-09-11), pages 231 - 234, XP010653653, ISBN: 978-0-7803-7395-2 *
EL-IMAM Y A: "AN UNRESTRICTED VOCABULARY ARABIC SPEECH SYNTHESIS SYSTEM", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, IEEE INC. NEW YORK, vol. 37, no. 12, 1 December 1989 (1989-12-01), pages 1829 - 1845, XP000099485, ISSN: 0096-3518 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011016761A1 (en) 2009-08-07 2011-02-10 Khitrov Mikhail Vasilevich A method of speech synthesis
US8942983B2 (en) 2009-08-07 2015-01-27 Speech Technology Centre, Limited Method of speech synthesis
CN113409759A (en) * 2021-07-07 2021-09-17 浙江工业大学 End-to-end real-time speech synthesis method

Also Published As

Publication number Publication date
CN101312038A (en) 2008-11-26
CN101312038B (en) 2012-01-04
WO2008147649A8 (en) 2010-03-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 08755097; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 08755097; Country of ref document: EP; Kind code of ref document: A1)