WO2004012183A2 - Concatenative text-to-speech conversion - Google Patents


Info

Publication number
WO2004012183A2
Authority
WO
WIPO (PCT)
Prior art keywords
acoustic
text
unit
speech
prosodic
Prior art date
Application number
PCT/IB2003/002965
Other languages
French (fr)
Other versions
WO2004012183A3 (en)
Inventor
Jian Cheng Huang
Fang Chen
Original Assignee
Motorola Inc
Priority date
Filing date
Publication date
Application filed by Motorola Inc
Priority to AU2003249493A priority Critical patent/AU2003249493A1/en
Priority to JP2004524006A priority patent/JP2005534070A/en
Publication of WO2004012183A2 publication Critical patent/WO2004012183A2/en
Publication of WO2004012183A3 publication Critical patent/WO2004012183A3/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention provides a method for text to speech conversion (S300) including partitioning (S303) text into segmented phonetic units and then identifying (S304) a suitable acoustic unit for each of the phonetic units. Each acoustic unit (AU) is representative of the acoustic segments forming a phonetic cluster determined by their acoustic similarity. The method (S300) then performs determining (S305) variances between prosodic parameters of an acoustic unit AU and each of the phonetic units. A step of generating (S306) acoustic parameters from the prosodic parameters of the acoustic unit and associated variances is then performed, and thereafter a step of providing (S307) an output speech signal based on the acoustic parameters is effected. The invention may provide improved synthesized speech quality and system performance with reduced memory overheads, making it suitable for portable devices.

Description

CONCATENATIVE TEXT-TO-SPEECH CONVERSION
FIELD OF THE INVENTION The present invention relates to concatenative text-to-speech (TTS) conversion. The invention is particularly useful for, but not necessarily limited to, concatenative TTS synthesis with prosodic control.
BACKGROUND OF THE INVENTION Reading large volumes of text documents stored on computers, mobile telephones, or personal data assistants (PDAs) can easily cause eye strain, and reading an electronic screen in a moving vehicle is often inconvenient. It is therefore desirable to convert text documents into speech that can be played back for the reader to listen to.
At present, almost all high quality text-to-speech (TTS) synthesis technologies are based on utterance waveform concatenation of each corresponding character, word, or phrase. The desired utterance waveforms are usually drawn from an utterance waveform corpus, which stores various sentences and phrases together with their corresponding utterance waveforms. The quality of the synthesized utterance depends on the size of such a corpus.
Figure 1 shows a typical existing concatenative TTS system. The system includes three portions: a text processing portion, an acoustic segment base, and a speech synthesizer. The system first breaks sentences and words up into word segments and then assigns the corresponding characters phonetic symbols with the assistance of a lexicon. The sequence of segmented phonetic symbols is then matched against acoustic segments from the utterance or phrase waveform corpus to obtain the best-matching acoustic segments. Finally, the selected acoustic segments are concatenated, with insertion of proper breaks, to obtain the output speech.
Such an existing TTS system normally stores the utterance waveforms directly. However, obtaining speech that is very close to a person's utterance would require storing large volumes of utterance waveforms covering the speech characteristics of all kinds of speech environments. Storing this huge amount of utterance waveforms requires a lot of memory: a high quality text-to-speech system normally requires a memory capacity of hundreds of megabytes. For a hand-held device, such as a mobile telephone or PDA, the memory capacity is usually only a few megabytes due to hardware and cost limitations. It is therefore hard to provide high quality text-to-speech on such portable devices, which limits the use of text-to-speech conversion in these technical fields.
SUMMARY OF THE INVENTION
The present invention provides a method for text to speech conversion, the method including: partitioning text into segmented phonetic units; identifying a suitable acoustic unit for each of the phonetic units, each acoustic unit being representative of acoustic segments forming a phonetic cluster determined by their acoustic similarity; determining variances between prosodic parameters of an acoustic unit and each of the phonetic units; generating acoustic parameters from the prosodic parameters of the acoustic unit and associated variances to select an acoustic segment; and providing an output speech signal based on the acoustic segment.
Suitably, the prosodic parameters include pitch, duration, or energy. Preferably, the determining is based on the position of the acoustic unit in a phrase or a sentence, co-articulation, phrase length, or adjacent characters of the acoustic unit.
The partitioning may be characterized by partitioning sentences of text into syllables. Suitably, the phonetic units are syllables. The phonetic units may be assigned a phonetic symbol. Suitably, the phonetic symbol is a Pinyin representation.
In another form, there is provided a text-to-speech converting system. The system comprises a text processor for forming a sequence of phonetic symbols from input text after word segmentation. The text-to-speech converting system further comprises an acoustic and prosodic controller that includes at least an utterance annotation corpus and an acoustic unit index (AU index) and prosodic vector (PV) selection device. The utterance annotation corpus includes at least acoustic unit (AU) indices and prosodic vectors (PV). The acoustic unit index (AU index) and prosodic vector (PV) selection device receives the sequence of phonetic symbols after the word segmentation and generates a series of control data including the acoustic unit (AU) indices and prosodic vectors (PV). The text-to-speech converting system also comprises a synthesizer that includes at least an acoustic parameter base; the synthesizer responds to the control data from the acoustic and prosodic controller, thereby synthesizing the speech. The present invention also provides a method of converting a text entry into corresponding synthetic speech through a concatenative text-to-speech system. The method comprises the steps of processing and converting a text input to generate a sequence of segmented phonetic symbols; searching an utterance annotation corpus including at least acoustic unit (AU) indices to find a maximum match and fetch a matched annotation context; substituting the matched portions of the sequence of segmented phonetic symbols with AU indices and prosodic vectors; generating a sequence of control data having at least AU indices and prosodic vectors; and generating synthetic speech in response to the control data.
The present invention further provides a method of forming a symbolic corpus. The method comprises the steps of slicing utterances into acoustic segments (AS); grouping said AS into clusters in consideration of phonetic classification and acoustic similarity; selecting an acoustic unit (AU) to represent all acoustic segments in a cluster; converting the acoustic units into respective sequences of parameters frame-by-frame; vector-quantizing the frame parameters of each AU into a sequence of vector indices; forming an AU parameter base containing frame-based scalar parameters and vector indices; finding the matched AU for each AS and determining the respective prosodic vectors between AU and AS; and substituting the acoustic segments with the phonetic symbols, AU indices, and prosodic vectors to form an utterance annotation corpus in place of the original AS waveform corpus. In this way, starting from a corpus of real persons' recorded utterances, the present invention groups the acoustic segments, saves only one acoustic unit as representative of all acoustic segments in a cluster together with the differences between the acoustic segments and that acoustic unit, and uses parameters to represent the original utterance waveforms, thereby efficiently reducing the amount of data stored in the utterance annotation corpus.
According to the present invention, phonetic symbols are used in place of the acoustic segments of each cluster, thereby efficiently reducing the amount of data to be stored and saving memory space. Besides, the present invention converts each acoustic unit waveform into a series of parameters to form an acoustic unit parameter base, using such parameters in place of the acoustic unit waveform, thereby further reducing the memory space required for storing the acoustic units. The present invention represents the acoustic segments by the difference between the acoustic units and the acoustic segments, and replaces the waveform of each acoustic segment with its phonetic symbol, its corresponding acoustic unit parameters, and the difference between them. This preserves the utterance information of the syllable corresponding to each acoustic segment, thereby reducing distortion.
The present invention provides a highly efficient text-to-speech converting method and apparatus that produce high quality synthetic speech. The required system performance and memory space make it suitable not only for normal computers, but also for small portable devices.
BRIEF DESCRIPTION OF THE DRAWING
Figure 1 is an illustration of a prior art text-to-speech conversion system;
Figure 2 is an illustration of the text-to-speech conversion system in accordance with the present invention; and
Figure 3 is a flow diagram illustrating a method for text to speech conversion in accordance with the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION
Referring to Figure 1, there is illustrated a prior art TTS conversion system. The system includes three main portions: a text processor 100, an acoustic segment base 200, and a synthesizer 300. The primary function of the text processor 100 is to normalize and segment the input text and then to assign the characters of the text corresponding phonetic symbols. The system matches the obtained phonetic symbol sequence against the phonetic symbol sequences stored in the acoustic segment base 200, and then replaces the phonetic symbols with the acoustic segments of the corresponding utterances or phrases. Finally, the synthesizer 300 concatenates these acoustic segments according to the text, with insertion of proper breaks, thereby obtaining the desired speech output. The acoustic segment base 200 stores a huge amount of text content information and utterances of that text content. The more utterance information is stored, the closer the synthetic speech is to an actual person's utterance. If a sentence of the input text were matched completely and directly with a sentence stored in the acoustic segment base, the waveform of this stored sentence, that is, the actual recorded utterance, could be used directly for speech output. However, in most situations the system cannot find such a completely matched sentence. In this case, partial matching of the words and phrases of the sentence is required, and it is therefore necessary to conduct word segmentation. Corresponding acoustic segments are then identified to provide TTS conversion.
In Figure 1, the input text is first normalized using a text normalization unit 110. Then a word segmentation unit 130, guided by a lexicon 120, carries out sentence partitioning by punctuation identification and word segmentation procedures. After the word segmentation, a phonetic symbol assignment unit 140 and an acoustic segment selection unit 250 utilize an utterance or phrase corpus 260 to search for and select acoustic segments in the acoustic segment base 200. The selected segments are sent to a break generation unit 380 and to the acoustic segment concatenation unit 370. The break generation unit 380 generates break information provided to the acoustic segment concatenation unit 370. The acoustic segment concatenation unit 370 concatenates the segments, adds the proper breaks, and outputs the speech signals to the waveform post-processing device. A waveform post-processing unit 390 then outputs the synthesized, converted speech signals.
For a concatenative TTS converting method or system, the quality of natural pronunciation depends on the size of the utterance waveform corpus and the selection of appropriate acoustic segments. To save memory space, the present invention mainly stores parameters of the utterance waveforms and then uses these parameters to synthesize the desired speech, thereby reducing the memory storage overheads. The present invention provides a method of forming an utterance annotation corpus. The method begins by forming an utterance waveform corpus: a person's utterances are recorded while reading various texts and stored in a raw utterance waveform corpus. These utterances are chosen carefully so that the raw utterance waveform corpus has a good phonetic and prosodic balance.
The utterance waveforms are partitioned into a plurality of acoustic segments (AS). Each acoustic segment AS usually corresponds to the utterance of a character in a certain language environment. Each acoustic segment is a detailed representation of a syllable or sub-syllable in a particular text and has a definite phonetic meaning. Usually, the phonetic symbol of each character may correspond to many different acoustic segments in different language environments. The object of acoustic concatenation is to find the proper acoustic segment for each character, word, or phrase in its particular language environment, and then to concatenate the acoustic segments together. According to the phonetic classification and acoustic similarity of the acoustic segments AS, the acoustic segments AS are grouped into clusters CR determined by their acoustic similarity. In each cluster CR, one acoustic segment AS, termed an acoustic unit (AU), is selected as a representation of all acoustic segments AS in that cluster CR. All acoustic units AU form an Acoustic Unit Parameter Base 231. In comparison with the prior art, the present invention uses an acoustic unit AU to represent a cluster CR; all other acoustic segments AS in a cluster CR are stored as offset parameters indicating prosodic variances relative to the acoustic unit of that cluster CR. In this regard, there is a relatively small variance between all acoustic segments AS in a cluster CR. Each acoustic unit AU is therefore converted into a sequence of parameters frame-by-frame and stored in the Acoustic Unit Parameter Base 231. Using a frame vector codebook 232, the "frame parameters" of each acoustic unit are vector-quantized into a sequence of vector indices and acoustic unit parameters. In this case, the acoustic unit indices are used to replace the actual acoustic unit data, thereby reducing the stored data. During speech concatenation and synthesis, the acoustic unit indices lead to the vector indices and acoustic unit parameters, the vector indices in turn lead to the frame parameters of the original utterance waveforms, and the frame parameters are then used to synthesize the original utterance waveforms of a person. The frames representing the acoustic units AU, where for example in the Chinese language an AU has an implied tone (1 to 5), are stored in the Acoustic Unit Parameter Base 231 in the following format:
Frame_AU_n_(pitch, duration, energy); where in this embodiment pitch has a range of 180 ~ 330 Hz, duration has a range of 165 ~ 452 ms, and energy has a range of 770 ~ 7406, derived from the measured RMS (Root Mean Square) power values of the processed and digitized utterances.
As will be apparent to a person skilled in the art, pitch, energy, and duration are prosodic features represented as prosodic vectors or parameters. Hence, an acoustic unit AU for the phonetic symbol or Pinyin "Yu (2)" may be stored as Frame_AU_51_(254, 251, 3142), and "Mao (1)" may be stored as Frame_AU_1001_(280, 190, 2519).
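As a minimal sketch of how such entries might be held in memory, the following Python fragment models the Acoustic Unit Parameter Base as a mapping from an AU index to its (pitch, duration, energy) triple; the values are the two examples given above, while the class and variable names are illustrative assumptions rather than anything specified in the patent.

```python
from typing import Dict, NamedTuple

class AUParams(NamedTuple):
    """Prosodic parameters of one acoustic unit (AU)."""
    pitch: float     # Hz, roughly 180 ~ 330 in this embodiment
    duration: float  # ms, roughly 165 ~ 452
    energy: float    # RMS-derived value, roughly 770 ~ 7406

# Acoustic Unit Parameter Base: AU index -> prosodic parameters,
# mirroring entries such as Frame_AU_51_(254, 251, 3142).
AU_PARAMETER_BASE: Dict[str, AUParams] = {
    "Frame_AU_51":   AUParams(pitch=254, duration=251, energy=3142),  # Pinyin "Yu (2)"
    "Frame_AU_1001": AUParams(pitch=280, duration=190, energy=2519),  # Pinyin "Mao (1)"
}

if __name__ == "__main__":
    print(AU_PARAMETER_BASE["Frame_AU_51"])
```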
Each acoustic segment AS of each cluster CR of the utterance waveform corpus is mapped to the corresponding acoustic unit index of the acoustic unit parameter base. Each acoustic segment can thus be obtained through the acoustic unit AU representing its cluster CR of acoustic segments AS. The prosodic vector between the acoustic segment and its corresponding acoustic unit can be derived. The prosodic vector indicates the difference in parameters between the acoustic segments of each cluster and the acoustic unit representing the cluster; the parameter difference reflects their difference as physical instances. Therefore, an acoustic segment can be recovered from the representative acoustic unit and a particular prosodic vector. The utterance annotation corpus is thereby created from the phonetic symbol of each segment, its corresponding acoustic unit index, and its prosodic vector, in place of the acoustic segment waveforms.
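A sketch of how an utterance annotation corpus entry could be derived, under the simplifying assumption that a segment and its representative unit are each summarized by a single (pitch, duration, energy) triple; the measured segment values, the entry layout, and all names are illustrative.

```python
from typing import NamedTuple, Tuple

class AnnotationEntry(NamedTuple):
    phonetic_symbol: str                          # e.g. "Yu(2)"
    au_index: str                                 # e.g. "Frame_AU_51"
    prosodic_vector: Tuple[float, float, float]   # (pitch, duration, energy) offsets

def prosodic_vector(segment, unit):
    """Difference between an acoustic segment and its cluster's acoustic unit."""
    return (segment[0] - unit[0], segment[1] - unit[1], segment[2] - unit[2])

# A hypothetical acoustic segment of "Yu(2)" measured as (260, 240, 3300),
# compared against the representative unit Frame_AU_51_(254, 251, 3142):
segment = (260.0, 240.0, 3300.0)
unit = (254.0, 251.0, 3142.0)
entry = AnnotationEntry("Yu(2)", "Frame_AU_51", prosodic_vector(segment, unit))
print(entry)  # the offsets (6.0, -11.0, 158.0) are stored in place of the waveform
```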
Referring to Figure 2, the concatenative text-to-speech synthesis is explained. It includes three main portions: text processing, acoustic and prosodic control, and speech synthesis. Through text processing, the input text is converted into phonetic symbols used for acoustic and prosodic control. Through data-driven control, the acoustic and prosodic control portion uses the utterance annotation corpus to match the phonetic symbols and convert them into acoustic unit indices and prosodic vectors; then, through rule-driven control, the phonetic symbols left unmatched in the utterance annotation corpus are converted into the desired acoustic unit indices and prosodic vectors. In the speech synthesizer, the obtained acoustic unit indices and prosodic vectors are converted into frame parameters of the natural utterance waveform through the acoustic unit parameter base and the frame vector codebook, and then concatenated into synthetic speech.
First, the text processing is briefly explained. As in existing concatenative text-to-speech conversion, the input text of the present invention is first processed in a text processor 201. Through the text normalization unit 211, irregular input text is classified and converted into the normalized text format of the system. Then, a word segmentation unit 212 divides the normalized text into a series of word segments in accordance with a Lexicon 213 and a relevant rule base (not shown). After the segmentation, a phonetic symbol assignment unit 214 converts the characters and words of the input text into a sequence of phonetic symbols. For the Chinese language, the phonetic symbols would be represented by a Pinyin representation. Thus, if the character '鱼' (the Chinese character for fish) were received at unit 211, it would be converted into the Pinyin "Yu (2)" at unit 214, where (2) denotes the second tonal pronunciation of Yu.
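A toy sketch of this text-processing stage, with a two-character dictionary standing in for the Lexicon 213 and its rule base; real normalization, word segmentation, and tone assignment are far more involved, and the helper names are assumptions.

```python
# Toy lexicon standing in for Lexicon 213: character -> Pinyin with tone.
TOY_LEXICON = {
    "鱼": "Yu(2)",   # fish
    "猫": "Mao(1)",  # cat (assumed mapping, for illustration only)
}

def normalize(text: str) -> str:
    """Unit 211: classify and normalize irregular input text (here: trivial)."""
    return text.strip()

def to_phonetic_symbols(text: str) -> list:
    """Units 212/214: segment the text and assign each character a Pinyin symbol."""
    return [TOY_LEXICON.get(ch, "<unk>") for ch in normalize(text)]

print(to_phonetic_symbols("鱼"))  # ['Yu(2)']
```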
An acoustic and prosodic controller 202 of the present invention carries out the analysis and processing of the obtained sequence of phonetic symbols. The acoustic and prosodic controller 202 comprises an utterance annotation corpus 221, an acoustic unit index and prosodic vector selection unit 222, a prosodic rule base 223, and a prosodic refinement unit 224. The present invention uses multiple acoustic and prosodic controls to generate acoustic and prosodic information. The control includes two stages: data-driven control and rule-driven control.
In the prior art, for each phonetic symbol of the input text, the system must first search the utterance waveform corpus for the matching acoustic segment to use as output. The present invention does not use the utterance waveform corpus directly; instead, the utterance annotation corpus is searched for the parameters of the matching acoustic segments.
In the data-driven control stage, for the sequence of phonetic symbols obtained from the word segmentation, the acoustic unit index and prosodic vector selection unit 222 first finds a match from the utterance annotation corpus 221 by utilizing the text relationship or prosodic relationship. A matching phonetic symbol is replaced by the corresponding acoustic unit index and prosodic vector in the utterance annotation corpus. If the matched portion contains one or more breaks, a special acoustic unit representing the break is inserted accordingly such that the parameters of the acoustic unit include the break information.
For a phonetic symbol that remains unmatched during the data-driven stage, an approximate (the closest) sequence in the utterance annotation corpus is used. Alternatively, the rule-driven control stage of the present invention is used to process the unmatched sequence. During this stage, the phonetic symbols are used as a basis, and the unmatched phonetic symbols are assigned corresponding acoustic unit indices, prosodic vectors, and break acoustic units in accordance with the rules or tables in a prosodic rule base 223.
An output of the acoustic and prosodic controller 202 includes a series of control data reflecting the utterance characteristics of the acoustic unit, the prosodic vectors, and the necessary break symbols. For instance, for the Pinyin "Yu" the output control data includes an acoustic unit index of "Frame_AU_51".
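The two control stages can be sketched as a lookup that first tries a data-driven match of each phonetic symbol against the utterance annotation corpus and falls back to a rule-driven default otherwise; the corpus contents, the default entry, and the break handling shown here are illustrative simplifications (the patent describes matching of whole symbol sequences, not single symbols).

```python
# Utterance annotation corpus 221: phonetic symbol -> (AU index, prosodic vector).
ANNOTATION_CORPUS = {
    "Yu(2)": ("Frame_AU_51", (6.0, -11.0, 158.0)),
    "Mao(1)": ("Frame_AU_1001", (0.0, 5.0, -20.0)),
}

# Prosodic rule base 223 stand-in: fallback AU index and prosodic vector.
RULE_BASE_DEFAULT = ("Frame_AU_51", (0.0, 0.0, 0.0))
BREAK = ("Break_AU", (0.0, 0.0, 0.0))   # special acoustic unit carrying break info

def select_control_data(symbols):
    """Unit 222: produce (AU index, prosodic vector) control data per symbol."""
    control = []
    for sym in symbols:
        if sym == "<break>":
            control.append(BREAK)                    # insert the special break unit
        elif sym in ANNOTATION_CORPUS:
            control.append(ANNOTATION_CORPUS[sym])   # data-driven stage
        else:
            control.append(RULE_BASE_DEFAULT)        # rule-driven stage
    return control

print(select_control_data(["Yu(2)", "<break>", "Mao(1)"]))
```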
The system also has a speech waveform synthesizer 203 that includes the acoustic unit parameter base 231, the frame vector codebook 232, an acoustic unit parameter array generation unit 233, an acoustic unit parameter array modification unit 234, an acoustic segment array concatenation unit 235, and a waveform synthesizer 236. The speech waveform synthesis of the present invention converts the obtained acoustic unit indices and prosodic vectors into frame parameters of natural utterance waveforms by utilizing the acoustic unit parameter base 231 and the frame vector codebook 232, and then concatenates them into speech. The detailed procedure is described hereinafter.
Based on the acoustic and prosodic control data output from the acoustic and prosodic controller 202, the speech waveform synthesizer 203 of the present invention forms the speech waveform output one acoustic segment AS at a time. For each acoustic segment AS, the speech waveform synthesizer 203 works primarily from three aspects: acoustic unit indices, prosodic vectors, and break symbols.
As stated above, the acoustic unit parameter base 231 of the present invention maps an acoustic unit index to a composition of vector indices and frame parameters. Thus, using the acoustic unit indices, the vector indices and corresponding scalar parameters can be obtained from the acoustic unit parameter base 231.
In the frame vector codebook 232, a series of vector indices is mapped to the acoustic unit frame parameters and scalar parameters.
Therefore, the vector indices obtained from the acoustic unit parameter base 231 and the frame vector codebook may be used to acquire the frame parameters of the original utterance waveform. Hence, for the Pinyin "Yu" the acoustic unit entry "Frame_AU_51_(254, 251, 3142)" is accessed.
The acoustic unit parameter array generation unit 233 forms a vector array, that is, the array of acoustic unit parameters, by using the output of the acoustic unit parameter base 231 and the frame vector codebook 232. The components of each group of the vector array are the frame-based acoustic unit parameters. The size of the array depends on the number of frames of the acoustic unit. This array of acoustic unit parameters completely describes the acoustic characteristics of the acoustic unit.
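A sketch of this array-generation step, assuming the parameter base resolves an AU index to a short list of vector indices plus scalar prosody, and the codebook expands each vector index into a frame parameter vector; the two-frame data is invented for illustration.

```python
# Acoustic unit parameter base 231: AU index -> (vector indices, scalar prosody).
AU_PARAMETER_BASE = {
    "Frame_AU_51": ([3, 7], (254.0, 251.0, 3142.0)),   # two frames, illustrative
}

# Frame vector codebook 232: vector index -> frame parameter vector.
FRAME_VECTOR_CODEBOOK = {
    3: [0.12, -0.30, 0.05],
    7: [0.08, 0.21, -0.11],
}

def generate_au_parameter_array(au_index):
    """Unit 233: build the frame-by-frame acoustic-unit parameter array."""
    vector_indices, scalars = AU_PARAMETER_BASE[au_index]
    frames = [FRAME_VECTOR_CODEBOOK[i] for i in vector_indices]
    return frames, scalars

frames, scalars = generate_au_parameter_array("Frame_AU_51")
print(len(frames), scalars)  # the array size depends on the number of frames
```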
At this point, the acoustic characteristics representative of the acoustic segments are obtained. The desired array of acoustic segment parameters can be obtained using the difference between the acoustic segment and the acoustic unit, on the basis of acoustic characteristic parameters that are prosodic variances represented in the frame format Frame_AU_51_(offset1, offset2, offset3). In this regard, for each acoustic segment in a cluster CR having Frame_AU_51_(254, 251, 3142) as its representative acoustic unit AU, offset1 is a variance indicative of pitch, offset2 is a variance indicative of duration, and offset3 is a variance indicative of energy.
The acoustic unit parameter array modification unit 234 is used to accomplish this operation. During the above-stated data-driven and rule-driven stages, the prosodic vectors between the acoustic segments and the corresponding acoustic units are obtained. The acoustic unit parameter array modification unit 234 uses the prosodic vectors to modify the output array of the acoustic unit parameter array generation unit 233, thereby obtaining the acoustic segment parameter array. The acoustic segment parameter array describes the prosodic characteristics of the acoustic segment on a frame basis, and may extend to include lexical tone, pitch contour, duration, root mean square of amplitude, and phonetic/co-articulatory environment identity.
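A sketch of this modification step, reduced to adding the prosodic offsets to the unit's scalar prosody; a real implementation would also, for example, stretch or compress the frame sequence to realize the duration change, so the arithmetic below is illustrative only.

```python
def modify_au_parameters(scalars, prosodic_vector):
    """Unit 234: apply (pitch, duration, energy) offsets to the AU prosody
    to obtain the target prosody of the acoustic segment."""
    pitch, duration, energy = scalars
    d_pitch, d_duration, d_energy = prosodic_vector
    return (pitch + d_pitch, duration + d_duration, energy + d_energy)

# Frame_AU_51_(254, 251, 3142) with the offsets (6, -11, 158) sketched earlier:
print(modify_au_parameters((254.0, 251.0, 3142.0), (6.0, -11.0, 158.0)))
# -> (260.0, 240.0, 3300.0): the prosody of the target acoustic segment
```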
The purpose of synthesizing speech is to reproduce the acoustic segments in the utterance waveform corpus, or to generate acoustic segments with low distortion based on the prosodic rule base 223. The acoustic segment parameter array concatenation unit 235 sequentially concatenates the frame vector parameters obtained in the acoustic segment parameter array; when a break symbol (including break information) is detected, a zero vector is inserted. Finally, the arranged frame vector parameters are output to the waveform synthesizer 236. The waveform synthesizer 236 uses each frame vector to generate an acoustic segment waveform of a fixed duration, that is, one frame of the acoustic segment. Concatenation of all frames of acoustic waveforms yields the desired speech output.
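A sketch of the concatenation and synthesis stage: frame vectors are strung together in order, a zero vector is inserted at each break symbol, and every frame vector is rendered as a fixed-duration chunk of samples. The frame length, sample rate, and sine-based rendering are assumptions, not the synthesizer described here.

```python
import math

FRAME_MS = 10          # assumed fixed frame duration
SAMPLE_RATE = 8000     # assumed sample rate

def concatenate_frames(segment_arrays):
    """Unit 235: concatenate frame vectors, inserting a zero vector at breaks."""
    frames = []
    for item in segment_arrays:
        if item == "<break>":
            frames.append([0.0, 0.0, 0.0])   # zero vector encodes the pause
        else:
            frames.extend(item)
    return frames

def synthesize(frames):
    """Unit 236 stand-in: render each frame vector as a fixed-duration chunk."""
    samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000
    out = []
    for frame in frames:
        amp = frame[0]   # toy mapping: first component drives the amplitude
        out.extend(amp * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
                   for n in range(samples_per_frame))
    return out

frames = concatenate_frames([[[0.5, 0.1, 0.0]], "<break>", [[0.3, -0.2, 0.1]]])
print(len(synthesize(frames)))  # 3 frames * 80 samples each
```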
Data-driven control in the prior art permits a TTS system to select acoustic and prosodic information from a set of natural utterances. To obtain natural utterances, the existing TTS system uses a waveform corpus and thus requires a lot of memory space. To achieve a natural utterance effect, the present invention also uses data-driven control. The difference is that the present invention does not directly use the waveform corpus, which demands huge memory, but instead uses the utterance annotation corpus to save memory space. In the utterance annotation corpus, only the description of the syllables and the acoustic unit base are stored.
Referring to Figure 3, the present invention is further explained. Figure 3 illustrates a method S300 for text to speech conversion implemented on the system of Figure 2. After a start step S301, the method performs a receiving text step S302, in which the received text is normalized by the text normalization unit 211. The method S300 then performs partitioning S303 of the received text into segmented phonetic units. This is effected by the word segmentation unit 212, and the phonetic units are assigned a phonetic symbol by the phonetic symbol assignment unit 214. A segmented phonetic unit of text is typically a single phoneme, which in Chinese text is a single character such as '鱼'. This phonetic unit is assigned the phonetic symbol or Pinyin "Yu (2)", and at an identifying step S304 a suitable acoustic unit for each of the phonetic units is identified. For instance, for "Yu (2)" the acoustic unit AU is "Frame_AU_51". A determining step S305 determines variances between the prosodic parameters of the acoustic unit AU identified by Frame_AU_51 and the required prosodic parameters of the acoustic unit for "Yu(2)". This is effected by the prosodic rule base 223 and the prosodic refinement unit 224, and is based on the position of the character for "Yu(2)" in a phrase or sentence of the received text. The prosodic parameters may also be based on co-articulation, phrase length, and adjacent characters. The method expresses the variances as offset values or indices, typically in the following format: "Frame_AU_51_(offset1, offset2, offset3)".
After step S305, a generating step S306 generates acoustic parameters from the prosodic parameters of the acoustic unit AU and the associated variances (offset1, offset2, offset3). This is achieved by addressing the AU parameter base 231 with Frame_AU_51 and by using the codebook 232. The outputs from the codebook 232 and the AU parameter base 231 are combined to generate a vector matrix at an output of unit 233. The unit 234 determines the appropriate acoustic segment AS based on the variances (offset1, offset2, offset3) and the vector matrix for the acoustic unit Frame_AU_51. Once the selected acoustic segment AS is identified at the output of unit 234, a concatenative utterance signal results at a providing step S307, which provides an output speech signal based on the acoustic parameters of the selected acoustic segment AS. The concatenative utterance signal is based on the selected acoustic segment (accessing the required speech waveform in the corpus) and break information provided by unit 224. The method then effects a test step S308 to determine whether there is any more text to be processed, and either terminates at a step S309 or returns to step S302. Advantageously, the present invention allows a relatively small number of acoustic units to represent the clusters, which therefore reduces memory overheads.
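Tying the steps together, the S300 control flow can be sketched as the following loop over received text; every helper is a stub with invented return values, so only the ordering of steps S302 through S309 is meaningful.

```python
def receive_text():                              # S302 (stub: one sentence, then stop)
    yield "鱼"

def partition(text):                             # S303: phonetic units with Pinyin symbols
    return ["Yu(2)"]

def identify_au(symbol):                         # S304
    return "Frame_AU_51"

def prosodic_variances(symbol, au):              # S305: (pitch, duration, energy) offsets
    return (6.0, -11.0, 158.0)

def generate_acoustic_parameters(au, offsets):   # S306
    return [("frame", au, offsets)]

def output_speech(params):                       # S307
    print("speech for", params)

for text in receive_text():                      # loop until no more text (S308/S309)
    for symbol in partition(text):
        au = identify_au(symbol)
        offsets = prosodic_variances(symbol, au)
        output_speech(generate_acoustic_parameters(au, offsets))
```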
The detailed description provides a preferred exemplary embodiment only and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing a preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims

WE CLAIM:
1. A method for text to speech conversion, the method including:
partitioning text into segmented phonetic units;
identifying a suitable acoustic unit for each of the phonetic units, each acoustic unit being representative of acoustic segments forming a phonetic cluster determined by their acoustic similarity;
determining variances between prosodic parameters of an acoustic unit and each of the phonetic units;
generating acoustic parameters from the prosodic parameters of the acoustic unit and associated variances to select an acoustic segment; and
providing an output speech signal based on the acoustic segment.
2. A method for text to speech conversion, as claimed in claim 1, wherein the prosodic parameters include pitch.
3. A method for text to speech conversion, as claimed in claim 1, wherein the prosodic parameters include duration.
4. A method for text to speech conversion, as claimed in claim 1, wherein the prosodic parameters include energy.
5. A method for text to speech conversion, as claimed in claim 1, wherein the determining is based on position of the acoustic unit in a phrase or sentence.
6. A method for text to speech conversion, as claimed in claim 1, wherein the determining is based on co-articulation.
7. A method for text to speech conversion, as claimed in claim 1, wherein the determining is based on phrase length.
8. A method for text to speech conversion, as claimed in claim 1, wherein the determining is based on adjacent characters of the acoustic unit.
9. A method for text to speech conversion, as claimed in claim 1, wherein the partitioning is characterized by partitioning sentences of text into syllables.
10. A method for text to speech conversion, as claimed in claim 1, wherein the phonetic units are syllables.
11. A method for text to speech conversion, as claimed in claim 1, wherein the phonetic units are assigned a phonetic symbol.
12. A method for text to speech conversion, as claimed in claim 11, wherein the phonetic symbol is a pinyin representation.
13. A method for text to speech conversion, as claimed in claim 1, wherein the output speech signal is a selection of concatenated selected acoustic segments.
14. A method for text to speech conversion, as claimed in claim 11, wherein the output speech signal is a selection of concatenated selected acoustic segments.
PCT/IB2003/002965 2002-07-25 2003-07-24 Concatenative text-to-speech conversion WO2004012183A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU2003249493A AU2003249493A1 (en) 2002-07-25 2003-07-24 Concatenative text-to-speech conversion
JP2004524006A JP2005534070A (en) 2002-07-25 2003-07-24 Concatenated text-to-speech conversion

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN02127007.4 2002-07-25
CN 02127007 CN1259631C (en) 2002-07-25 2002-07-25 Chinese text to voice joint synthesis system and method using rhythm control

Publications (2)

Publication Number Publication Date
WO2004012183A2 true WO2004012183A2 (en) 2004-02-05
WO2004012183A3 WO2004012183A3 (en) 2004-05-13

Family

ID=30121481

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2003/002965 WO2004012183A2 (en) 2002-07-25 2003-07-24 Concatenative text-to-speech conversion

Country Status (4)

Country Link
JP (1) JP2005534070A (en)
CN (1) CN1259631C (en)
AU (1) AU2003249493A1 (en)
WO (1) WO2004012183A2 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100416651C (en) * 2005-01-28 2008-09-03 凌阳科技股份有限公司 Mixed parameter mode type speech sounds synthetizing system and method
CN1979636B (en) * 2005-12-07 2010-12-29 凌阳科技股份有限公司 Method for converting phonetic symbol to speech
JP2007334144A (en) * 2006-06-16 2007-12-27 Oki Electric Ind Co Ltd Speech synthesis method, speech synthesizer, and speech synthesis program
US8600447B2 (en) * 2010-03-30 2013-12-03 Flextronics Ap, Llc Menu icons with descriptive audio
CN102164318A (en) * 2011-03-11 2011-08-24 深圳创维数字技术股份有限公司 Voice prompting method, device and digital television receiving terminal
CN103577148A (en) * 2013-11-28 2014-02-12 南京奇幻通信科技有限公司 Voice reading method and device
CN105989833B (en) * 2015-02-28 2019-11-15 讯飞智元信息科技有限公司 Multilingual mixed this making character fonts of Chinese language method and system
GB2581032B (en) * 2015-06-22 2020-11-04 Time Machine Capital Ltd System and method for onset detection in a digital signal
CN105632484B (en) * 2016-02-19 2019-04-09 云知声(上海)智能科技有限公司 Speech database for speech synthesis pause information automatic marking method and system
CN107871495A (en) * 2016-09-27 2018-04-03 晨星半导体股份有限公司 Text-to-speech method and system
CN110797006B (en) * 2020-01-06 2020-05-19 北京海天瑞声科技股份有限公司 End-to-end speech synthesis method, device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0805433A2 (en) * 1996-04-30 1997-11-05 Microsoft Corporation Method and system of runtime acoustic unit selection for speech synthesis
EP0880127A2 (en) * 1997-05-21 1998-11-25 Nippon Telegraph and Telephone Corporation Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon
US20010051872A1 (en) * 1997-09-16 2001-12-13 Takehiko Kagoshima Clustered patterns for text-to-speech synthesis
EP1037195A2 (en) * 1999-03-15 2000-09-20 Matsushita Electric Industrial Co., Ltd. Generation and synthesis of prosody templates

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HIROKAWA T ET AL: "HIGH QUALITY SPEECH SYNTHESIS SYSTEM BASED ON WAVEFORM CONCATENATION OF PHONEME SEGMENT" IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS, COMMUNICATIONS AND COMPUTER SCIENCES, INSTITUTE OF ELECTRONICS INFORMATION AND COMM. ENG. TOKYO, JP, vol. 76A, no. 11, 1 November 1993 (1993-11-01), pages 1964-1970, XP000420615 ISSN: 0916-8508 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1668630A1 (en) * 2003-09-29 2006-06-14 Motorola, Inc. Improvements to an utterance waveform corpus
EP1668630A4 (en) * 2003-09-29 2008-04-23 Motorola Inc Improvements to an utterance waveform corpus
JP2006106741A (en) * 2004-10-01 2006-04-20 At & T Corp Method and apparatus for preventing speech comprehension by interactive voice response system
CN1811912B (en) * 2005-01-28 2011-06-15 北京捷通华声语音技术有限公司 Minor sound base phonetic synthesis method
GB2501062A (en) * 2012-03-14 2013-10-16 Toshiba Res Europ Ltd A Text to Speech System
GB2501062B (en) * 2012-03-14 2014-08-13 Toshiba Res Europ Ltd A text to speech method and system
US9454963B2 (en) 2012-03-14 2016-09-27 Kabushiki Kaisha Toshiba Text to speech method and system using voice characteristic dependent weighting
US9361722B2 (en) 2013-08-08 2016-06-07 Kabushiki Kaisha Toshiba Synthetic audiovisual storyteller

Also Published As

Publication number Publication date
CN1259631C (en) 2006-06-14
WO2004012183A3 (en) 2004-05-13
JP2005534070A (en) 2005-11-10
CN1471025A (en) 2004-01-28
AU2003249493A1 (en) 2004-02-16
AU2003249493A8 (en) 2004-02-16

Similar Documents

Publication Publication Date Title
KR100769033B1 (en) Method for synthesizing speech
US9761219B2 (en) System and method for distributed text-to-speech synthesis and intelligibility
US5636325A (en) Speech synthesis and analysis of dialects
JP4247564B2 (en) System, program, and control method
JP4536323B2 (en) Speech-speech generation system and method
US8015011B2 (en) Generating objectively evaluated sufficiently natural synthetic speech from text by using selective paraphrases
EP1100072A1 (en) Speech synthesizing system and speech synthesizing method
WO2004012183A2 (en) Concatenative text-to-speech conversion
CN1315809A (en) Apparatus and method for spelling speech recognition in mobile communication
JP2002530703A (en) Speech synthesis using concatenation of speech waveforms
US7069216B2 (en) Corpus-based prosody translation system
US10699695B1 (en) Text-to-speech (TTS) processing
US6477495B1 (en) Speech synthesis system and prosodic control method in the speech synthesis system
WO2006106182A1 (en) Improving memory usage in text-to-speech system
CN104899192A (en) Apparatus and method for automatic interpretation
CN112242134A (en) Speech synthesis method and device
JP3576066B2 (en) Speech synthesis system and speech synthesis method
WO2003098601A1 (en) Method and apparatus for processing numbers in a text to speech application
KR20050021567A (en) Concatenative text-to-speech conversion
CN1238805C (en) Method and apparatus for compressing voice library
Hendessi et al. A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM
Adeeba et al. Comparison of Urdu text to speech synthesis using unit selection and HMM based techniques
Dong et al. A Unit Selection-based Speech Synthesis Approach for Mandarin Chinese.
Narupiyakul et al. A stochastic knowledge-based Thai text-to-speech system
CN1979636A (en) Method for converting phonetic symbol to speech

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2004524006

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 1020057001367

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 1020057001367

Country of ref document: KR

122 Ep: pct application non-entry in european phase