WO2008147649A1 - Method for synthesizing speech

Method for synthesizing speech

Info

Publication number
WO2008147649A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
micro
cost function
speech
segment
Prior art date
Application number
PCT/US2008/062822
Other languages
French (fr)
Other versions
WO2008147649A8 (en)
Inventor
Yi-Qing Zu
Zhen-Hai Cao
Original Assignee
Motorola, Inc.
Priority date
Filing date
Publication date
Application filed by Motorola, Inc. filed Critical Motorola, Inc.
Publication of WO2008147649A1 publication Critical patent/WO2008147649A1/en
Publication of WO2008147649A8 publication Critical patent/WO2008147649A8/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/06: Elementary speech units used in speech synthesisers; Concatenation rules

Abstract

A method for synthesizing speech from an input string enables improved sound quality of Text-To-Speech synthesis. The method includes processing the input string to provide a sequence of acoustic parameters (step 305). For each acoustic parameter in the sequence of acoustic parameters, a set of candidate micro-segments is generated from a speech library (step 310). A preferred micro-segment sequence is then determined from the sets of candidate micro-segments for the sequence of acoustic parameters (step 320). Micro-segments in the preferred micro-segment sequence are then concatenated to produce synthesized speech (step 325).

Description

METHOD FOR SYNTHESIZING SPEECH
Field of the Invention
The present invention relates generally to Text-To-Speech (TTS) synthesis and in particular to synthesizing speech from a text string using micro-segments.
Background
Text-To-Speech (TTS) conversion, often referred to as concatenative text to speech synthesis, enables electronic devices to receive an input text string and provide an audio signal representation of the string in the form of synthesized speech. For concatenative speech synthesis, basic speech units such as phones or diphones are concatenated. However, a device that synthesizes speech using phone-based speech units from a non-deterministic number of received text strings can have difficulty providing high quality, realistic synthesized speech. That is because the sound of a phone, syllable, or word is often context dependent.
Due to limited memory and processing power in many devices, not all desired prosodic variations of phones, syllables, or words can be included in a speech library such as an utterance waveform corpus. For example, although a phone-based concatenation, such as diphone-to-diphone, might be acceptable for inter-syllable concatenation, the phone-based concatenation of phones within a syllable may produce unnatural sounds. That is because concatenation points between voiced-voiced segments often cause unnatural sounding transitions.
A typical diphone speech library for the English language may have around 1,200 diphones, but to reduce the concatenations within voiced-to-voiced phone boundaries, a speech library would require n-phone clusters. Thus a speech library of all pronunciations of all characters can be prohibitively large, and in most TTS systems there is therefore a need to estimate appropriate pronunciations from a more compact set of units. The size of such a speech library can be particularly limited when the speech library is embedded in a handheld electronic device having limited memory capacity.
Brief Description of the Figures
In order that the invention may be readily understood and put into practical effect, reference will now be made to exemplary embodiments as illustrated with reference to the accompanying figures, wherein like reference numbers refer to identical or functionally similar elements throughout the separate views. The figures, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate the embodiments and explain various principles and advantages, in accordance with the present invention, where:
FIG. 1 is a schematic diagram illustrating an electronic device in the form of a mobile telephone, according to some embodiments of the present invention;
FIG. 2 is a flow diagram illustrating a method for synthesizing speech from an input text string, according to some embodiments of the present invention;
FIG. 3 is a general flow diagram illustrating a method for synthesizing speech from an input string, according to some embodiments of the present invention;
FIG. 4 is a general flow diagram illustrating a method for processing an input string to provide a sequence of acoustic parameters, according to some embodiments of the present invention; and

FIG. 5 is a diagram illustrating a pitch model comprising five normalized pitch contour models, according to some embodiments of the present invention.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
Detailed Description
Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the embodiments reside primarily in combinations of method steps and apparatus components related to synthesizing speech from an input string. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein. In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non- exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by "comprises a ..." does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Referring to FIG. 1, a schematic diagram illustrates an electronic device in the form of a mobile telephone 100, according to some embodiments of the present invention. The mobile telephone 100 comprises a radio frequency communications unit 102 coupled to be in communication with a common data and address bus 117 of a processor 103. The telephone 100 also has a keypad 106 and a display screen 105, such as a touch screen, coupled to be in communication with the processor 103.
The processor 103 also includes an encoder/decoder 111 with an associated code Read Only Memory (ROM) 112 for storing data for encoding and decoding voice or other signals that may be transmitted or received by the mobile telephone 100. The processor 103 further includes a microprocessor 113 coupled, by the common data and address bus 117, to the encoder/decoder 111, a character Read Only Memory (ROM) 114, a Random Access Memory (RAM) 104, programmable memory 116 and a Subscriber Identity Module (SIM) interface 118. The programmable memory 116 and a SIM operatively coupled to the SIM interface 118 each can store, among other things, a telephone number database (TND) comprising a number field for telephone numbers and a name field for identifiers uniquely associated with the telephone numbers in the number field. The radio frequency communications unit 102 is a combined receiver and transmitter having a common antenna 107. The communications unit 102 has a transceiver 108 coupled to the antenna 107 via a radio frequency amplifier 109. The transceiver 108 is also coupled to a combined modulator/demodulator 110 that is coupled to the encoder/decoder 111. The microprocessor 113 has ports for coupling to the keypad 106 and to the display screen 105. The microprocessor 113 further has ports for coupling to an alert module 115 that typically contains an alert speaker, vibrator motor and associated drivers; to a microphone 120; and to a communications speaker 122. The character ROM 114 stores code for decoding or encoding data such as control channel messages that may be transmitted or received by the communications unit 102. In some embodiments of the present invention, the character ROM 114, the programmable memory 116, or a SIM also can store operating code (OC) for the microprocessor 113 and code for performing functions associated with the mobile telephone 100. For example, the programmable memory 116 can comprise speech synthesis services program code components 125 configured to cause execution of a method for synthesizing speech from an input string.
Thus some embodiments of the present invention include a method of using the mobile telephone 100 for synthesizing speech from an input string. For example, an input string can be a text message or an email containing a text string received at the mobile telephone 100. The method includes processing the input string to provide a sequence of acoustic parameters. A sequence of sets of candidate micro-segments is then generated from a speech library using the sequence of acoustic parameters. A preferred micro-segment sequence is then determined from the sequence of sets of candidate micro-segments for the sequence of acoustic parameters. Finally, micro-segments in the preferred micro-segment sequence are concatenated to produce synthesized speech.
Some embodiments of the present invention therefore enable speech synthesis using micro-segments and a sequence of acoustic parameters representing a target acoustic model, rather than using phones or diphones. A micro-segment can be a speech segment of any length, but is generally shorter than a phone or diphone. For example, a micro-segment can be a 20 ms speech frame, whereas a speech segment of a phone usually comprises several such speech frames. Since speech segments synthesized by concatenating micro-segments can provide more frequency and prosodic variations than speech segments synthesized by concatenating phones or diphones, overall sound quality of Text To Speech (TTS) systems can be improved.
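To make the micro-segment notion concrete, the sketch below slices a waveform into fixed-length frames. The 20 ms frame length and the 16 kHz sample rate are illustrative assumptions, not values mandated by the description.

```python
import numpy as np

def to_micro_segments(waveform: np.ndarray, sample_rate: int = 16000,
                      frame_ms: float = 20.0) -> np.ndarray:
    """Split a mono waveform into non-overlapping fixed-length micro-segments.

    Assumes a 20 ms frame, matching the example in the text; a 120 ms phone
    would then span six such frames.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per micro-segment
    n_frames = len(waveform) // frame_len
    # Drop the trailing partial frame and reshape into (n_frames, frame_len).
    return waveform[:n_frames * frame_len].reshape(n_frames, frame_len)

# Example: one second of audio yields 50 micro-segments of 320 samples each.
frames = to_micro_segments(np.zeros(16000))
print(frames.shape)  # (50, 320)
```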
Referring to FIG. 2, a flow diagram illustrates a method 200 for synthesizing speech from an input string 205, according to some embodiments of the present invention. First, the input string 205 is processed to provide a sequence 230 of acoustic parameters. A sequence 240 of sets 235 of candidate micro-segments is then generated from a speech library using the sequence 230 of acoustic parameters. A preferred micro-segment sequence 245 is then determined from the sequence 240 of the sets 235 of candidate micro-segments for the sequence 230 of acoustic parameters. Finally, micro-segments in the preferred micro-segment sequence 245 are concatenated to produce a synthesized speech signal 250. For example, speech frames 255 corresponding to the micro-segment descriptions in the preferred micro-segment sequence 245 can be loaded into the RAM 104 of the mobile telephone 100, and then concatenated and played over the communications speaker 122 to produce the synthesized speech signal 250.
Referring to FIG. 3, a flow diagram further illustrates a general method 300 for synthesizing speech from an input string, according to some embodiments of the present invention. At step 305, the input string is processed to provide a sequence of acoustic parameters. For example, an acoustic parameter of the sequence 230 of acoustic parameters can comprise a spectrum parameter, a pitch parameter, and an energy parameter.
According to some embodiments of the present invention, the acoustic parameters, also referred to as target speech units, are generated from the input string using prosodic positions. For example, a prosodic position can comprise a position of a syllable in a word and a position of the word in a sentence.
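As one concrete picture of a prosodic position, here is a hypothetical record pairing the two positions named above; the field names and the stress flag are assumptions for illustration, not structures from the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProsodicPosition:
    """Prosodic position as described in the text: where a syllable sits in
    its word and where that word sits in its sentence."""
    syllable_in_word: int   # 0-based index of the syllable within the word
    word_in_sentence: int   # 0-based index of the word within the sentence
    is_stressed: bool       # whether the syllable carries lexical stress

# "explain" as the third word of a sentence: second (stressed) syllable.
pos = ProsodicPosition(syllable_in_word=1, word_in_sentence=2, is_stressed=True)
```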
A spectrum parameter can be modeled using a known spectral feature representation method, such as a Linear Predictive Coding (LPC) method, a Line Spectral Pairs (LSP) method, or a Mel-Frequency Cepstral Coefficient (MFCC) method. Therefore, using prosodic positions, spectrum parameters of phones can be determined. For example, a positional spectrum model, such as a Gaussian Mixture Model (GMM), can be used to map acoustic features of phones, such as prosodic position, to spectrum parameters. The pitch parameter can be determined using a pitch model that defines pitch contours of syllables based on prosodic positions of the syllables. The pitch model can comprise pitch contour models, such as WO_stress, WO_unstress, WF_stress, WF_unstress, or WS.
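As one concrete way to obtain a spectrum parameter, here is a minimal MFCC extraction sketch using the librosa library. The sample rate, 20 ms window, and 13 coefficients are illustrative choices, and the text names MFCC only as one of several admissible representations; the patent does not prescribe librosa.

```python
import numpy as np
import librosa

def spectrum_parameters(waveform: np.ndarray, sample_rate: int = 16000,
                        n_mfcc: int = 13) -> np.ndarray:
    """Compute per-frame MFCC vectors usable as spectrum parameters.

    The hop equals the window length, so analysis frames do not overlap,
    roughly lining up with 20 ms micro-segments.
    """
    frame_len = int(0.020 * sample_rate)  # 320 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=frame_len)
    return mfcc.T  # shape: (n_frames, n_mfcc)
```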
For the energy parameter, different strategies can be used for a voiced part and an un-voiced part of a syllable. For a voiced part, energy contour patterns can be defined for syllables. Different energy contour patterns can be defined using a position of a cv-like unit in the syllable and/or a condition concerning whether the syllable is a stressed syllable. For an un-voiced part, energy contour patterns can be defined for phonemes. Each (un-voiced) phoneme can have one or more energy contour patterns. An energy contour of an un-voiced phoneme can depend on the position of the phoneme in a syllable and the position of the syllable in a word. To reduce the amount of memory required, some (un-voiced) phonemes can share a same energy contour pattern if the phonemes have similar positions and a similar articulation manner. For example, phonemes "s", "sh", and "ch" can share a same energy contour pattern, and similarly, "g", "d", and "k" can share another same energy contour pattern.
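One way to realize this memory-saving sharing is a simple lookup keyed by articulation class. Only the two groupings ("s", "sh", "ch") and ("g", "d", "k") come from the text; the class names and contour shapes below are invented for illustration.

```python
# Hypothetical shared energy-contour table.
SHARED_ENERGY_CONTOURS = {
    "fricative_like": [0.2, 0.6, 1.0, 0.8, 0.4],  # shared by "s", "sh", "ch"
    "stop_like":      [0.0, 1.0, 0.5, 0.1, 0.0],  # shared by "g", "d", "k"
}

PHONEME_TO_CONTOUR_CLASS = {
    "s": "fricative_like", "sh": "fricative_like", "ch": "fricative_like",
    "g": "stop_like", "d": "stop_like", "k": "stop_like",
}

def energy_contour(phoneme: str) -> list[float]:
    """Look up the shared energy contour pattern for a phoneme."""
    return SHARED_ENERGY_CONTOURS[PHONEME_TO_CONTOUR_CLASS[phoneme]]
```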
At step 310, a sequence of sets of candidate micro-segments is generated from a speech library 315 using the sequence of acoustic parameters. According to some embodiments of the present invention, the sets of candidate micro-segments can be generated using a target cost function and a duration model. For example, the target cost function can be a weighted sum of a spectrum cost, a pitch cost, and an energy cost. A lower target cost can mean that acoustic characteristics of a candidate micro-segment closely match an acoustic parameter. For example, for each acoustic parameter in the sequence 230 of acoustic parameters, the mobile telephone 100 can search the speech library 315 to find a set of candidate micro-segments (e.g., speech frames) having acoustic characteristics that closely match the acoustic parameter and an estimated duration of the acoustic parameter. Such closely matching speech frames then can be selected to generate the sequence 240 of the sets 235 of candidate micro-segments. In order to reduce processing time, speech frames in the speech library 315 can be classified into several sets of speech frames using prosodic positions of the speech frames, and the candidate micro-segments can be searched in one of the sets of speech frames that closely matches a prosodic position of the acoustic parameter (a sketch of this pre-selection follows below).

At step 320, a preferred micro-segment sequence is determined from the sets of candidate micro-segments for the sequence of acoustic parameters. For example, a Viterbi algorithm can be used to determine the preferred micro-segment sequence 245, and a path cost function of the Viterbi algorithm can be a sum of a target cost function and a concatenation cost function.
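Here is a sketch of the step 310 pre-selection mentioned above: the library's speech frames are grouped by prosodic position in advance, and only the matching group is scored with the target cost. The dictionary layout, the `top_n` pruning, and the `.position` attribute are assumptions for illustration.

```python
import heapq

def candidate_set(target, library_by_position, target_cost, top_n=10):
    """Return the top_n library frames with the lowest target cost for one
    acoustic parameter, searching only frames whose prosodic position
    matches the target's (the classification described in the text)."""
    pool = library_by_position.get(target.position, [])
    return heapq.nsmallest(top_n, pool,
                           key=lambda frame: target_cost(target, frame))
```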
According to some embodiments of the present invention, the target cost function can be a weighted sum of a spectrum cost function, a pitch cost function, and an energy cost function. For example, the spectrum cost function can be a measure of a degree of difference in spectral features between a candidate micro-segment and an acoustic parameter, also referred to as a target micro-segment, in the sequence 230 of acoustic parameters. Similarly, the pitch cost function and the energy cost function can measure degrees of difference in pitch and energy features between an acoustic parameter and a candidate micro-segment, respectively. For example, the target cost function can be defined as follows:

C^T(u_{i,k}) = K_S^T C_S^T(u_{i,k}) + K_P^T C_P^T(u_{i,k}) + K_E^T C_E^T(u_{i,k})    (Eq. 1)

where u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter in the sequence 230 of acoustic parameters, C^T(u_{i,k}) is the target cost function, C_S^T(u_{i,k}) is a spectrum cost function, C_P^T(u_{i,k}) is a pitch cost function, C_E^T(u_{i,k}) is an energy cost function, and K_S^T, K_P^T, and K_E^T are weight values.

The concatenation cost function can be a weighted sum of a spectrum difference function, a pitch difference function, and an energy difference function. The spectrum difference function can measure a degree of difference in spectral features between two adjacent micro-segments. Similarly, the pitch difference function and the energy difference function can measure degrees of difference in pitch and energy features between two adjacent micro-segments, respectively. For example, the concatenation cost function can be defined as follows:

C^C(u_{i-1,j}, u_{i,k}) = K_S^C C_S^C(u_{i-1,j}, u_{i,k}) + K_P^C C_P^C(u_{i-1,j}, u_{i,k}) + K_E^C C_E^C(u_{i-1,j}, u_{i,k})    (Eq. 2)

where u_{i-1,j} is the j-th candidate micro-segment for the (i-1)-th acoustic parameter in the sequence 230 of acoustic parameters, u_{i,k} is the k-th candidate micro-segment for the i-th acoustic parameter, C^C(u_{i-1,j}, u_{i,k}) is the concatenation cost for concatenating u_{i-1,j} and u_{i,k}, C_S^C(u_{i-1,j}, u_{i,k}) is a spectrum difference function between u_{i-1,j} and u_{i,k}, C_P^C(u_{i-1,j}, u_{i,k}) is a pitch difference function between u_{i-1,j} and u_{i,k}, C_E^C(u_{i-1,j}, u_{i,k}) is an energy difference function between u_{i-1,j} and u_{i,k}, and K_S^C, K_P^C, and K_E^C are weight values.
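To show how the path cost C^T + C^C drives the selection at step 320, here is a minimal Viterbi sketch over the candidate lattice. The cost callables are left abstract so that any weighted sums per Eqs. 1 and 2 can be plugged in; this is an illustrative reconstruction under those assumptions, not the patent's implementation.

```python
import numpy as np

def viterbi_select(candidates, target_cost, concat_cost):
    """Choose one micro-segment per position, minimizing the summed path cost.

    candidates:  list over positions i of lists of candidate micro-segments.
    target_cost: target_cost(i, u) -> float, cost of candidate u against the
                 i-th acoustic parameter (e.g., a weighted sum per Eq. 1).
    concat_cost: concat_cost(u_prev, u) -> float, cost of joining adjacent
                 candidates (e.g., a weighted sum per Eq. 2).
    """
    n = len(candidates)
    best = [target_cost(0, u) for u in candidates[0]]  # path cost to each start
    back = []  # back[i-1][k]: best predecessor index for candidate k at step i
    for i in range(1, n):
        new_best, ptr = [], []
        for u in candidates[i]:
            trans = [b + concat_cost(v, u)
                     for b, v in zip(best, candidates[i - 1])]
            j = int(np.argmin(trans))
            ptr.append(j)
            new_best.append(trans[j] + target_cost(i, u))
        best, back = new_best, back + [ptr]
    # Trace the cheapest path back from the final position.
    k = int(np.argmin(best))
    chosen = [k]
    for i in range(n - 1, 0, -1):
        k = back[i - 1][k]
        chosen.append(k)
    chosen.reverse()
    return [candidates[i][k] for i, k in enumerate(chosen)]
```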
At step 325, micro-segments in the preferred micro-segment sequence are then concatenated to produce synthesized speech.
Referring to FIG. 4, a general flow diagram illustrates sub-steps of the step 305 of the method 300 of processing the input string to provide the sequence of acoustic parameters, according to some embodiments of the present invention. At step 405, the input string is processed to provide a phoneme sequence. For example, the input string 205 can be a text message or an email message received at the mobile telephone 100, and the phoneme sequence can be a string representing pronunciation of the text message in a phonetic alphabet.
At step 410, syllable boundaries are then determined in the phoneme sequence to provide a syllable sequence. For example, an English word can comprise several syllables, and boundaries of such syllables in the word then can be determined to provide the syllable sequence. For example, a phoneme sequence "ihksplehn" concerning the English word "explain" can be divided into a syllable sequence comprising two syllables, such as "ihk" and "splehn".
At step 415, sub-syllable units are then identified in the syllable sequence to provide a sub-syllable sequence. The sub-syllable units can be equal to or smaller than syllables, and can be cv-like speech units, which can comprise a consonant and a vowel. Thus, the sub-syllable sequence can comprise cv-like speech units and consonants. For example, two cv-like speech units ("ih" and "lehn") can be identified in a syllable sequence ("ihk" + "splehn"). A corresponding sub-syllable sequence then can be ("ih" + "k" + "s" + "p" + "lehn"). According to some embodiments of the present invention, using cv-like speech units to represent pronunciation of an input text can reduce a number of basic units needed to describe words. For example, a lexicon comprising 202,000 words may comprise 24,980 syllables, and only 6,707 cv-like units.

At step 420, the sub-syllable sequence is then processed to provide a micro-segment description sequence. For example, by estimating a duration of each element in the sub-syllable sequence using a duration model, a number of micro-segments needed to synthesize speech for each element can be estimated. For example, consider the following cv-like speech unit (a sub-syllable): "ih". If an estimated duration of the cv-like speech unit is approximately equal to five micro-segments, the sub-syllable can be mapped to five micro-segment descriptions as follows: ih_f ih_f ih_f ih_f ih_f, where ih_f is a micro-segment description.
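A minimal sketch of this duration-driven expansion, assuming 20 ms micro-segments and the "ih_f"-style naming used above (both illustrative conventions):

```python
def micro_segment_descriptions(sub_syllable: str, duration_ms: float,
                               frame_ms: float = 20.0) -> list[str]:
    """Expand a sub-syllable into one micro-segment description per frame.

    E.g. a 100 ms "ih" with 20 ms frames yields five descriptions,
    mirroring the five-fold "ih_f" expansion in the text.
    """
    n_frames = max(1, round(duration_ms / frame_ms))
    return [f"{sub_syllable}_f"] * n_frames

print(micro_segment_descriptions("ih", 100.0))
# ['ih_f', 'ih_f', 'ih_f', 'ih_f', 'ih_f']
```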
According to some embodiments of the present invention, an estimated duration of a sub-syllable can be obtained by applying a duration model comprising average durations of phones and prosodic attributes of phones. For example, an estimated duration of a phone p can be obtained according to the following equation:
L_p = k × L_avg    (Eq. 3)

where L_p is the estimated duration of the phone p, L_avg is an average duration of the phone p, and k is a prosodic attribute coefficient obtained from factors comprising a number of phones in a syllable containing the phone p, a number of syllables in a word containing the syllable, and a type of the phone p.
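A worked instance of Eq. 3; the average duration and the coefficient value below are invented for illustration, not taken from the patent.

```python
def estimated_phone_duration(avg_duration_ms: float, k: float) -> float:
    """Eq. 3: L_p = k * L_avg, with k the prosodic attribute coefficient."""
    return k * avg_duration_ms

# Hypothetical numbers: average "ih" duration 80 ms; a stressed, word-final
# context might lengthen it (k = 1.25), giving an estimate of 100 ms.
print(estimated_phone_duration(80.0, 1.25))  # 100.0
```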
At step 425, the micro-segment description sequence is then processed to provide the sequence of acoustic parameters. The micro-segment description sequence can comprise micro-segment descriptions, each of which is a description of a speech micro-segment that is usually smaller than a phone. Each micro-segment description in the micro-segment description sequence can be mapped to an acoustic parameter describing the micro-segment's acoustic characteristics, such as spectrum (frequency characteristics) and prosodic features. For each micro-segment description in the micro-segment description sequence, an acoustic parameter can be estimated using an acoustic model. For example, the acoustic parameter can comprise a spectrum parameter S_n, a pitch parameter p_n, and an energy parameter G_n.
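A minimal container for the per-micro-segment acoustic parameter named here; only the three component names come from the text, while the field types (a spectrum vector plus scalar pitch and energy) are assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AcousticParameter:
    """Target acoustic parameter for one micro-segment description: a
    spectrum parameter S_n, a pitch parameter p_n, and an energy
    parameter G_n, as named in the text."""
    spectrum: np.ndarray  # S_n, e.g. an MFCC vector
    pitch: float          # p_n, e.g. fundamental frequency in Hz
    energy: float         # G_n, e.g. frame RMS energy
```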
Referring to FIG. 5, a diagram illustrates a pitch model comprising five normalized pitch contour models: WO_stress 505, WO_unstress 510, WF_stress 515, WF_unstress 520, and WS 525, as used according to some embodiments of the present invention. The WO_stress 505 pitch contour model defines a pitch contour for stressed syllables positioned at a beginning or middle of words having multiple syllables. The WO_unstress 510 pitch contour model defines a pitch contour for unstressed syllables positioned at a beginning or middle of words having multiple syllables. The WF_stress 515 pitch contour model defines a pitch contour for stressed syllables positioned at an end of words having multiple syllables. The WF_unstress 520 pitch contour model defines a pitch contour for unstressed syllables positioned at an end of words having multiple syllables. The WS 525 pitch contour model defines a pitch contour for syllables in words having only one syllable.

Advantages of some embodiments of the present invention therefore include improved sound quality of synthesized speech. Speech segments synthesized by concatenating micro-segments can provide improved speech continuity and more prosodic variations than speech segments synthesized by concatenating phones or diphones. Overall sound quality of TTS systems therefore can be improved, particularly in resource-constrained handheld devices such as mobile telephones and personal digital assistants (PDAs).
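Returning to the pitch model of FIG. 5: choosing among its five contour models reduces to a small rule over syllable position and stress. A minimal sketch, where the boolean flags and string labels are illustrative assumptions and only the five model names come from the text:

```python
def pitch_contour_model(is_stressed: bool, is_word_final: bool,
                        is_monosyllabic_word: bool) -> str:
    """Select one of the five normalized pitch contour models of FIG. 5.

    WS covers one-syllable words; WO_* covers word-initial/medial syllables;
    WF_* covers word-final syllables; *_stress vs *_unstress follows stress.
    """
    if is_monosyllabic_word:
        return "WS"
    prefix = "WF" if is_word_final else "WO"
    return f"{prefix}_{'stress' if is_stressed else 'unstress'}"

print(pitch_contour_model(True, False, False))  # WO_stress
```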
It will be appreciated that embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of synthesizing speech from an input string as described herein. The non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as steps of a method for synthesizing speech from an input string. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used.
Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below.
Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims, including any amendments made during the pendency of this application and all equivalents of those claims.

Claims

We claim:
1. A method for synthesizing speech from an input string, the method comprising:
processing the input string to provide a sequence of acoustic parameters;
generating from a speech library a sequence of sets of candidate micro-segments using the sequence of acoustic parameters;
determining a preferred micro-segment sequence from the sequence of sets of candidate micro-segments for the sequence of acoustic parameters; and
concatenating micro-segments in the preferred micro-segment sequence to produce synthesized speech.
2. The method of claim 1, wherein processing the input string to provide the sequence of acoustic parameters comprises:
processing the input string to provide a phoneme sequence;
determining syllable boundaries in the phoneme sequence to provide a syllable sequence;
identifying sub-syllable units in the syllable sequence to provide a sub-syllable sequence;
generating a micro-segment description sequence from the sub-syllable sequence; and
processing the micro-segment description sequence to provide the sequence of acoustic parameters.
3. The method of claim 2, wherein the micro-segment description sequence is generated from the sub-syllable sequence using a duration model comprising average durations of phones and prosodic attributes of phones.
4. The method of claim 2, wherein the sub-syllable sequence comprises one or more of a cv-like speech unit or a phone.
5. The method of claim 1, wherein an acoustic parameter of the sequence of acoustic parameters comprises a spectrum parameter, a pitch parameter, and an energy parameter.
6. The method of claim 1, wherein sets of candidate micro-segments are selected from the speech library using a target cost function and a duration model.
7. The method of claim 6, wherein the target cost function is a weighted sum of a spectrum cost, a pitch cost, and an energy cost.
8. The method of claim 1, wherein the preferred micro-segment sequence is determined from the sets of candidate micro-segments for the sequence of acoustic parameters using a Viterbi algorithm.
9. The method of claim 8, wherein the Viterbi algorithm comprises a path cost function that is a sum of a target cost function and a concatenation cost function.
10. The method of claim 9, wherein the target cost function is a weighted sum of a spectrum cost function, a pitch cost function, and an energy cost function.
11. The method of claim 9, wherein the concatenation cost function is a weighted sum of a spectrum difference function, a pitch difference function, and an energy difference function.
12. The method of claim 10, wherein the target cost function is defined as follows:

C^T(u_{i,k}) = K_S^T C_S^T(u_{i,k}) + K_P^T C_P^T(u_{i,k}) + K_E^T C_E^T(u_{i,k})

where u_{i,k} is the k-th candidate micro-segment of the i-th acoustic parameter in the sequence of acoustic parameters, C^T(u_{i,k}) is the target cost function, C_S^T(u_{i,k}) is a spectrum cost function, C_P^T(u_{i,k}) is a pitch cost function, C_E^T(u_{i,k}) is an energy cost function, and K_S^T, K_P^T, and K_E^T are weight values.
13. The method of claim 11, wherein the concatenation cost function is defined according to the following equation:

C^C(u_{i-1,j}, u_{i,k}) = K_S^C C_S^C(u_{i-1,j}, u_{i,k}) + K_P^C C_P^C(u_{i-1,j}, u_{i,k}) + K_E^C C_E^C(u_{i-1,j}, u_{i,k})

where u_{i-1,j} is the j-th candidate micro-segment of the (i-1)-th acoustic parameter in the sequence of acoustic parameters, u_{i,k} is the k-th candidate micro-segment of the i-th acoustic parameter in the sequence of acoustic parameters, C^C(u_{i-1,j}, u_{i,k}) is the concatenation cost for concatenating u_{i-1,j} and u_{i,k}, C_S^C(u_{i-1,j}, u_{i,k}) is a spectrum difference function between u_{i-1,j} and u_{i,k}, C_P^C(u_{i-1,j}, u_{i,k}) is a pitch difference function between u_{i-1,j} and u_{i,k}, C_E^C(u_{i-1,j}, u_{i,k}) is an energy difference function between u_{i-1,j} and u_{i,k}, and K_S^C, K_P^C, and K_E^C are weight values.
14. The method of claim 5, wherein the pitch parameter is one of the following pitch models: WO_stress, WO_unstress, WF_stress, WF_unstress, or WS.
15. The method of claim 5, wherein the energy parameter comprises a voiced part and an un-voiced part.
16. The method of claim 3, wherein the duration model is defined by the following equation:

L_p = k × L_avg

where L_p is an estimated duration of a phone p, L_avg is an average duration of the phone p, and k is a prosodic attribute coefficient obtained from factors comprising a number of phones in a syllable containing the phone p, a number of syllables in a word containing the phone p, and a type of the phone p.
PCT/US2008/062822 2007-05-25 2008-05-07 Method for synthesizing speech WO2008147649A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2007101045813A CN101312038B (en) 2007-05-25 2007-05-25 Method for synthesizing voice
CN200710104581.3 2007-05-25

Publications (2)

Publication Number Publication Date
WO2008147649A1 (en)
WO2008147649A8 WO2008147649A8 (en) 2010-03-04

Family

ID=39564770

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/062822 WO2008147649A1 (en) 2007-05-25 2008-05-07 Method for synthesizing speech

Country Status (2)

Country Link
CN (1) CN101312038B (en)
WO (1) WO2008147649A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011016761A1 (en) 2009-08-07 2011-02-10 Khitrov Mikhail Vasilevich A method of speech synthesis
CN113409759A (en) * 2021-07-07 2021-09-17 浙江工业大学 End-to-end real-time speech synthesis method

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510424B (en) * 2009-03-12 2012-07-04 孟智平 Method and system for encoding and synthesizing speech based on speech primitive
DE102012202391A1 (en) * 2012-02-16 2013-08-22 Continental Automotive Gmbh Method and device for phononizing text-containing data records
CN102779508B (en) * 2012-03-31 2016-11-09 科大讯飞股份有限公司 Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
WO2018209556A1 (en) * 2017-05-16 2018-11-22 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for speech synthesis
CN113314096A (en) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 Speech synthesis method, apparatus, device and storage medium
CN113192522B (en) * 2021-04-22 2023-02-21 北京达佳互联信息技术有限公司 Audio synthesis model generation method and device and audio synthesis method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2313530A (en) * 1996-05-15 1997-11-26 Atr Interpreting Telecommunica Speech Synthesizer
US6173263B1 (en) * 1998-08-31 2001-01-09 At&T Corp. Method and system for performing concatenative speech synthesis using half-phonemes
US6308156B1 (en) * 1996-03-14 2001-10-23 G Data Software Gmbh Microsegment-based speech-synthesis process
US7082396B1 (en) * 1999-04-30 2006-07-25 At&T Corp Methods and apparatus for rapid acoustic unit selection from a large speech corpus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4080989B2 (en) * 2003-11-28 2008-04-23 株式会社東芝 Speech synthesis method, speech synthesizer, and speech synthesis program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6308156B1 (en) * 1996-03-14 2001-10-23 G Data Software Gmbh Microsegment-based speech-synthesis process
GB2313530A (en) * 1996-05-15 1997-11-26 Atr Interpreting Telecommunica Speech Synthesizer
US6173263B1 (en) * 1998-08-31 2001-01-09 At&T Corp. Method and system for performing concatenative speech synthesis using half-phonemes
US7082396B1 (en) * 1999-04-30 2006-07-25 At&T Corp Methods and apparatus for rapid acoustic unit selection from a large speech corpus

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BENZMULLER R ET AL: "Microsegment synthesis-economic principles in a low-cost solution", ICSLP 96. PROCEEDINGS., FOURTH INTERNATIONAL CONFERENCE, vol. 4, 3 October 1996 (1996-10-03), PHILADELPHIA, PA, USA, pages 2383 - 2386, XP010238145, ISBN: 978-0-7803-3555-4 *
BLACK A W ET AL: "OPTIMISING SELECTION OF UNITS FROM SPEECH DATABASES FOR CONCATENATIVE SYNTHESIS", 4TH EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. EUROSPEECH '95, vol. 1, 18 September 1995 (1995-09-18), MADRID, SPAIN, pages 581 - 584, XP000854776 *
BLOUIN C ET AL: "Concatenation cost calculation and optimisation for unit selection in tts", SPEECH SYNTHESIS, 2002. PROCEEDINGS OF 2002 IEEE WORKSHOP ON 11-13 SEPT. 2002, PISCATAWAY, NJ, USA,IEEE, 11 September 2002 (2002-09-11), pages 231 - 234, XP010653653, ISBN: 978-0-7803-7395-2 *
EL-IMAM Y A: "AN UNRESTRICTED VOCABULARY ARABIC SPEECH SYNTHESIS SYSTEM", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, IEEE INC. NEW YORK, vol. 37, no. 12, 1 December 1989 (1989-12-01), pages 1829 - 1845, XP000099485, ISSN: 0096-3518 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011016761A1 (en) 2009-08-07 2011-02-10 Khitrov Mikhail Vasilevich A method of speech synthesis
US8942983B2 (en) 2009-08-07 2015-01-27 Speech Technology Centre, Limited Method of speech synthesis
CN113409759A (en) * 2021-07-07 2021-09-17 浙江工业大学 End-to-end real-time speech synthesis method

Also Published As

Publication number Publication date
CN101312038A (en) 2008-11-26
CN101312038B (en) 2012-01-04
WO2008147649A8 (en) 2010-03-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 08755097; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 08755097; Country of ref document: EP; Kind code of ref document: A1)