WO2005034082A1 - Method for synthesizing speech - Google Patents

Method for synthesizing speech

Info

Publication number
WO2005034082A1
Authority
WO
WIPO (PCT)
Prior art keywords
match
speech
pitch
speech segment
segment
Prior art date
Application number
PCT/US2004/030467
Other languages
French (fr)
Inventor
Fang Chen
Gui-Lin Chen
Original Assignee
Motorola, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2003-09-29
Filing date
2004-09-17
Publication date
2005-04-14
Application filed by Motorola, Inc. filed Critical Motorola, Inc.
Priority to MXPA06003431A priority Critical patent/MXPA06003431A/en
Priority to EP04784355A priority patent/EP1668628A4/en
Publication of WO2005034082A1 publication Critical patent/WO2005034082A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/07 Concatenation rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules

Abstract

A method of performing speech synthesis that includes comparing a text segment (120) with an utterance waveform corpus (60) that contains numerous speech samples (140). The method determines whether there is a contextual best match between the text segment (120) and one speech sample (140). If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment (120) and a speech sample (140). A contextual phonetic hybrid match requires a match of all implicit prosodic features (210) in a defined prosodic feature group (220). If a match is still not found, the prosodic feature group (220) is redefined by deleting one of the implicit prosodic features (210) from the prosodic feature group (220). The prosodic feature group (220) is successively redefined by deleting one implicit prosodic feature (210) from the group (220) until a match is found between the input text segment (120) and a speech sample (140). When a match is found, the matched speech sample (140) is used to generate concatenative speech (110).

Description

METHOD FOR SYNTHESIZING SPEECH
FIELD OF THE INVENTION The present invention relates generally to Text-To-Speech (TTS) synthesis. The invention is particularly useful for, but not necessarily limited to, determining an appropriate synthesized pronunciation of a text segment using a non-exhaustive utterance corpus.
BACKGROUND OF THE INVENTION Text to Speech (TTS) conversion, often referred to as concatenated text to speech synthesis, allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech. However, a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech. That is because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, a pronunciation of a word at the beginning of a sentence
(input text string) may be drawn out or lengthened. The pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required. In most languages, the pronunciation of a word depends on at least tone (pitch), volume and duration. Furthermore, many languages include numerous possible pronunciations of individual syllables. Typically, a single syllable represented by a Chinese character (or other similar character-based script) may have up to 6 different pronunciations. Furthermore, in order to provide a realistic synthesized utterance of each pronunciation, a large pre-recorded utterance waveform corpus of sentences is required. This corpus typically requires on average about 500 variations of each pronunciation if realistic speech synthesis is to be achieved. Thus an utterance waveform corpus of all pronunciations for every character would be prohibitively large. In most TTS systems there is a need to determine the appropriate pronunciation of an input text string based on comparisons with a limited-size utterance waveform corpus. The size of the utterance waveform corpus may be particularly limited when it is embedded in a small electronic device having a low memory capacity, such as a radio telephone or a personal digital assistant. The algorithms used to compare the input text strings with the audio database also need to be efficient and fast so that the resulting synthesized and concatenated speech flows naturally and smoothly. Due to memory and processing speed limitations, existing TTS methods for embedded applications often result in speech that is unnatural or robotic sounding. There is therefore a need for an improved method for performing TTS to provide a natural-sounding synthesized speech whilst using a non-exhaustive utterance corpus.
SUMMARY OF THE INVENTION The present invention is a method of performing speech synthesis that includes comparing an input text segment with an utterance waveform corpus that contains numerous speech samples. The method determines whether there is a contextual best match between the text segment and one speech sample included in the utterance waveform corpus. If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment and a speech sample included in the utterance waveform corpus. A contextual phonetic hybrid match requires a match of all implicit prosodic features in a defined prosodic feature group. If a match is still not found, the prosodic feature group is redefined by deleting one of the implicit prosodic features from the prosodic feature group so as to redefine the prosodic feature group. The prosodic feature group is successively redefined by deleting one implicit prosodic feature from the group until a match is found between the input text segment and a speech sample. When a match is found, the matched speech sample is used to generate concatenative speech.
BRIEF DESCRIPTION OF THE DRAWINGS Other aspects of the present invention will become apparent from the following detailed description taken together with the drawings, wherein like reference characters designate like or corresponding elements or steps throughout the drawings, in which: Fig. 1 is a block diagram of an electronic device upon which the invention may be implemented; Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech in the Chinese language; and Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION Referring to Fig. 1, there is illustrated a block diagram of an electronic device 10 upon which the invention may be implemented. The device 10 includes a processor 30 operatively coupled, by a common bus 15, to a text memory module 20, a Read Only Memory (ROM) 40, a Random Access Memory (RAM) 50 and a waveform corpus 60. The processor 30 is also operatively coupled to a touch screen display 90 and an input of a speech synthesizer 70. An output of the speech synthesizer 70 is operatively coupled to a speaker 80. As will be apparent to a person skilled in the art, the text memory module 20 is a store for text obtained by any available receiving means, such as radio reception, the internet, or plug-in portable memory cards. The ROM 40 stores operating code for performing the invention as described in Figures 2 and 3. The corpus 60 is essentially a conventional corpus, as are the speech synthesizer 70 and speaker 80, and the touch screen display 90 is a user interface that allows for display of text stored in the text memory module 20.

Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech 110 from an input text segment 120 in the Chinese language. The text segment 120 is compared with an utterance waveform corpus 60, which includes a plurality of speech samples 140, to determine whether there is a contextual best match (step S110). If a contextual best match is found between a text segment 120 and a specific speech sample 140, that specific speech sample 140 is sent to a concatenating algorithm 150 for generating the concatenative speech 110. If no contextual best match is found between the text segment 120 and a specific speech sample 140, then the text segment 120 is compared again with the utterance waveform corpus 130 to determine whether there is a contextual phonetic hybrid match (step S120).

Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match. A contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. If no match is found, one of the implicit prosodic features 210 is deleted from the defined prosodic feature group 220 and the group 220 is redefined as including all of the previously included features 210 less the deleted feature 210 (e.g., Step S130). The redefined prosodic feature group 220 is then compared with the text segment 120 to determine whether there is a match. The process of deleting an implicit prosodic feature 210, redefining the prosodic feature group 220, and then redetermining whether there is a contextual phonetic hybrid match continues until a match is found (Steps S130, S140, etc. to S170). When a contextual phonetic hybrid match is found, the matched speech sample 140, which matches the text segment 120, is sent to the concatenating algorithm 150 for generating concatenative speech 110. As shown in Fig. 3, if all of the implicit prosodic features 210 except pinyin are successively deleted from the prosodic feature group 220 and still no match is found, then a basic phonetic match is performed matching only pinyin (Step S180).
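As an illustration only (the function names, the dictionary layout of the corpus, and the exact feature-deletion order beyond the first step are assumptions; the patent states only that the order is determined empirically and that length of phrase is deleted first), the layered matching of Figs. 2 and 3 could be sketched in Python as:

```python
# Illustrative sketch of the multi-layer matching described for Figs. 2 and 3.
# Each corpus entry is assumed to be a dict of implicit prosodic features
# plus the text it pronounces; the deletion order below is an assumption.
FEATURE_LAYERS = [
    ["pinyin", "tone_context", "co_articulation", "syllable_position",
     "phrase_position", "character_symbol", "length_of_phrase"],   # step S120
    ["pinyin", "tone_context", "co_articulation", "syllable_position",
     "phrase_position", "character_symbol"],                       # step S130
    ["pinyin", "tone_context", "co_articulation", "syllable_position",
     "phrase_position"],
    ["pinyin", "tone_context", "co_articulation", "syllable_position"],
    ["pinyin", "tone_context", "co_articulation"],
    ["pinyin", "tone_context"],
    ["pinyin"],                                                     # step S180
]

def contextual_best_match(segment, corpus):
    """Step S110: exact match of the characters and all implicit prosodic features."""
    for sample in corpus:
        if (sample["characters"] == segment["characters"]
                and all(sample[f] == segment[f] for f in FEATURE_LAYERS[0])):
            return sample
    return None

def contextual_phonetic_hybrid_match(segment, corpus):
    """Steps S120-S180: relax the prosodic feature group one feature at a time."""
    for features in FEATURE_LAYERS:
        candidates = [s for s in corpus
                      if all(s[f] == segment[f] for f in features)]
        if candidates:
            return candidates[0]   # selection among several candidates uses Eq. 1 below
    return None   # unreachable if the corpus always covers every pinyin

def select_sample(segment, corpus):
    return (contextual_best_match(segment, corpus)
            or contextual_phonetic_hybrid_match(segment, corpus))
```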
In one embodiment of the present invention, the utterance waveform corpus 60 is designed so that there is always at least one syllable included with the correct pinyin to match all possible input text segments 120. That basic phonetic match is then input into the concatenating algorithm 150. The invention is thus a multi-layer, data-driven method for controlling the prosody (rhythm and intonation) of the resulting synthesized, concatenative speech 110, where each layer of the method uses a redefined prosodic feature group 220.

For purposes of the present invention, a text segment 120 means any type of input text string or segment of coded language; it is not limited to only visible text that is scanned or otherwise entered into a TTS system. The utterance waveform corpus 130 of the present invention is annotated with information concerning each speech sample 140 (usually a word) that is included in the corpus 130. The speech samples 140 themselves are generally recordings of actual human speech, usually digitized or analog waveforms. Annotations are thus required to identify the samples 140. Such annotations may include the specific letters or characters (depending on the language) that define the sample 140 as well as the implicit prosodic features 210 of the speech sample 140. The implicit prosodic features 210 include context information concerning how the speech sample 140 is used in a sentence. For example, a speech sample 140 in the Chinese language may include the following implicit prosodic features 210:

Text context: the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
Pinyin: the phonetic representation of a speech sample 140. Pinyin is a standard romanization of the Chinese language using the Western alphabet.
Tone context: the tone context of the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
Co-articulation: the phonetic level representatives that immediately precede and immediately follow the annotated text of a speech sample 140, such as phonemes or sub-syllables.
Syllable position: the position of a syllable in a prosodic phrase.
Phrase position: the position of a prosodic phrase in a sentence. Usually the phrase position is identified as one of the three positions of sentence initial, sentence medial and sentence final.
Character symbol: the code (e.g., ASCII code) representing the Chinese character that defines a speech sample 140.
Length of phrase: the number of Chinese characters included in a prosodic phrase.

For an example of the specific values of the above implicit prosodic features 210, consider a Chinese sentence that begins with the two-character word 中国 (zhong1 guo2). If a spoken audio recording of that sentence were stored in an utterance waveform corpus 130, each character's sound could represent a speech sample 140 and could be annotated with the above implicit prosodic features 210. For example, the character 国 as found in that sentence could be annotated as follows: Text context: 中 and the character following 国; Pinyin: guo2; Tone context: 1, 3; Co-articulation: ong, h; Syllable position: 2; Phrase position: 1; Character symbol: the character code for 国; and Length of phrase: 2. In Fig. 2, step S110 determines whether there is a contextual best match between a text segment 120 and a speech sample 140.
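As a rough illustration of how such an annotation record might be represented in software (the Python representation, the field names, and the assumed following character 很 are illustrative and not taken from the patent), a minimal sketch:

```python
from dataclasses import dataclass

@dataclass
class SpeechSampleAnnotation:
    """Implicit prosodic features 210 attached to one speech sample 140."""
    characters: str         # text that the recorded sample pronounces
    text_context: tuple     # characters immediately before and after the sample
    pinyin: str             # phonetic (romanized) representation
    tone_context: tuple     # tones of the preceding and following characters
    co_articulation: tuple  # final of the preceding syllable, initial of the following one
    syllable_position: int  # position of the syllable within its prosodic phrase
    phrase_position: int    # position of the prosodic phrase within the sentence
    character_symbol: int   # character code identifying the annotated character
    length_of_phrase: int   # number of characters in the prosodic phrase
    waveform_id: int        # reference to the recorded waveform in the corpus

# Annotation for the character 国 in the example above; the following
# character 很 and the waveform id are placeholders, not from the patent.
guo_sample = SpeechSampleAnnotation(
    characters="国",
    text_context=("中", "很"),
    pinyin="guo2",
    tone_context=(1, 3),
    co_articulation=("ong", "h"),
    syllable_position=2,
    phrase_position=1,
    character_symbol=ord("国"),
    length_of_phrase=2,
    waveform_id=42,
)
```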
A contextual best match is generally defined as the closest, or an exact, match of both 1) the letters or characters (depending on the language) of an input text segment 120 with the corresponding letters or characters of an annotated speech sample 140, and 2) the implicit prosodic features 210 of the input text segment 120 with the implicit prosodic features 210 of the annotated speech sample 140. In more general terms, a best match is determined by identifying the greatest number of consecutive syllables in the input text segment that are identical to attributes and attribute positions in each of the waveform utterances (speech samples) in the waveform corpus 60. Only when both the letters or characters and the implicit prosodic features 210 match exactly is a speech sample 140 immediately selected as an element for use in the concatenating algorithm 150.

When a contextual best match is not found, the method of the present invention then determines whether there is a contextual phonetic hybrid match between an input text segment 120 and a speech sample 140. As described above, a contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. As shown in Fig. 3, one embodiment of the present invention used to synthesize speech in the Chinese language uses a first defined prosodic feature group 220 that includes the implicit prosodic features 210 of pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase (Step S120). If none of the annotated speech samples 140 found in the utterance waveform corpus 130 have identical values for each of the above features 210 as found in the input text segment 120, then the corpus 130 does not contain a speech sample 140 that is close enough to the input text segment 120 based on the matching rules as applied in Step S120. Therefore the constraints of the matching rules must be relaxed, and thus broadened, to include other speech samples 140 that possess the next most preferable features 210 found in the input text segment 120. In other words, the matching rules are broadened by deleting the one feature 210 in the defined prosodic feature group 220 that is least likely to affect the natural prosody of the input text segment 120. For example, as shown in Step S130 in both Fig. 2 and Fig. 3, the next most preferable features 210 in the illustrated embodiment of the present invention include all of the features 210 defined above less the length of phrase feature 210.

The order in which the implicit prosodic features 210 are deleted from the defined prosodic feature group 220 is determined empirically. When the features 210 are deleted in a proper order, the method of the present invention results in efficient and fast speech synthesis. The output speech therefore sounds more natural even though the utterance waveform corpus 130 may be relatively limited in size.

According to the present invention, after the utterance waveform corpus 130 has been compared with a text segment 120 using a particular defined prosodic feature group 220, it is possible that the annotations of multiple speech samples 140 will be found to match the analyzed text segment 120. In such a case, an optimal contextual phonetic hybrid match may be selected by using the following equation:

diff = Wp × ((pitch - BestPitch) / BestPitch)² + Wd × ((dur - BestDur) / BestDur)²  (Eq. 1)

where Wp = weight of the pitch of the text segment 120; Wd = weight of the duration of the text segment 120; diff = differential value for selecting an optimal contextual phonetic hybrid match; pitch = pitch of the text segment 120; BestPitch = pitch of an ideal text segment 120; dur = duration of the text segment 120; and BestDur = duration of the ideal text segment 120.
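A minimal sketch of candidate selection by Eq. 1, assuming equal default weights and a simple dictionary representation of the annotated samples (both assumptions, not values from the patent):

```python
def diff_score(pitch, dur, best_pitch, best_dur, w_pitch=1.0, w_dur=1.0):
    """Eq. 1: weighted squared relative deviation from the ideal pitch and duration."""
    return (w_pitch * ((pitch - best_pitch) / best_pitch) ** 2
            + w_dur * ((dur - best_dur) / best_dur) ** 2)

def pick_optimal_match(candidates, best_pitch, best_dur):
    """Among several matching samples, keep the one with the lowest differential value."""
    return min(candidates,
               key=lambda s: diff_score(s["pitch"], s["duration"],
                                        best_pitch, best_dur))
```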
In the above Equation 1, the variable BestPitch may be determined based on a statistical analysis of the utterance waveform corpus 130. For example, a corpus 130 may include five tones, each having an average pitch. Each annotated speech sample 140 in the corpus 130 may also include individual prosody information represented by the values of pitch, duration and energy, so the average values of pitch, duration and energy of the entire corpus 130 are available. The best pitch for a particular context may then be determined using the following formula:

BestPitch = pitchtone - nIndex × empiricalvalue  (Eq. 2)

where pitchtone = the average pitch, including tone, of the utterance waveform corpus; nIndex = the index of the text segment 120 in a prosody phrase; and empiricalvalue = an empirical value based on the utterance waveform corpus. The empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however, this number could vary depending on the content of a particular utterance waveform corpus 130.

Similarly, the duration of an ideal text segment 120 may be determined using the following equation:

BestDur = durs × fs - nIndex × empiricalvalue  (Eq. 3)

where durs = the average duration of the text segment 120 without tone; nIndex = the index of the text segment 120 in a prosody phrase; fs = a coefficient for prosody position; and empiricalvalue = an empirical value based on said utterance waveform corpus. Again, the empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however, this number could vary depending on the content of a particular utterance waveform corpus 130.

The differential value for a word, diffW, may be the summation of the differential values for each syllable in the word. That may be represented in mathematical terms by the following equation:

diffW = Σ_k diff_k  (Eq. 4)

where diff_k is the differential value (Eq. 1) of the k-th syllable in the word.
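The following sketch expresses Eqs. 2 to 4 directly; the function names and argument layout are assumptions, while the empirical value of 4 is the one cited for the Chinese embodiment:

```python
EMPIRICAL_VALUE = 4  # value cited for the Chinese embodiment; corpus dependent

def best_pitch(avg_pitch_for_tone, n_index, empirical_value=EMPIRICAL_VALUE):
    """Eq. 2: average pitch for the tone, offset by the syllable's index in its phrase."""
    return avg_pitch_for_tone - n_index * empirical_value

def best_dur(avg_dur_without_tone, fs, n_index, empirical_value=EMPIRICAL_VALUE):
    """Eq. 3: average toneless duration scaled by the prosody-position coefficient fs."""
    return avg_dur_without_tone * fs - n_index * empirical_value

def word_diff(syllable_diffs):
    """Eq. 4: a word's differential value is the sum over its syllables."""
    return sum(syllable_diffs)
```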
As described above, if several speech samples 140 are found to match a particular text segment 120, the system will choose the speech sample 140 whose differential value is lowest. That may be represented in mathematical terms by the following equation:

diffW_min = min_i (diffW_i)  (Eq. 5)

Further, the method of the present invention may include the use of preset thresholds for the differential value diffW. If the differential value for a matched speech sample 140 is below a particular threshold, the method will route the matched speech sample 140 to the concatenating algorithm 150 for generating the concatenative speech 110. Otherwise, the method may require relaxing the constraints on the contextual phonetic hybrid match by deleting one of the required implicit prosodic features 210 and continuing to search for a match.

Although the above description concerns a specific example of the method of the present invention for the Chinese language, the invention is suitable for many languages. For some languages the implicit prosodic features 210 would need to be deleted or redefined from the examples given hereinabove. For example, the feature 210 identified above as tone context would be deleted in an application of the present invention for the English language because English is not a tonal language. Also, the feature 210 identified above as pinyin would likely be redefined as simply a phonetic symbol when the present invention is applied to English.

The present invention is therefore a multi-layer, data-driven prosodic control scheme that utilizes the implicit prosodic information in an utterance waveform corpus 130. When searching for an appropriate speech sample 140 to match with a given input text segment 120, the method of the present invention employs a strategy based on multi-layer matching, where each layer is tried in turn until a sufficiently good match is found. By successively relaxing the constraints of each layer, the method efficiently determines whether the utterance waveform corpus 130 contains a match. The method is therefore particularly appropriate for embedded TTS systems where the size of the utterance waveform corpus 130 and the processing power of the system may be limited.

Although exemplary embodiments of a method of the present invention have been illustrated in the accompanying drawings and described in the foregoing description, it is to be understood that the invention is not limited to the embodiments disclosed; rather, the invention can be varied in numerous ways, particularly concerning applications in languages other than
Chinese. It should, therefore, be recognized that the invention should be limited only by the scope of the following claims.
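A hedged sketch of how the preset threshold on diffW described above could gate acceptance before further relaxation, reusing FEATURE_LAYERS, pick_optimal_match and diff_score from the earlier sketches (the threshold value itself is an assumption; the patent does not give one):

```python
DIFF_THRESHOLD = 0.5  # assumed value; the patent only refers to "a particular threshold"

def select_hybrid_with_threshold(segment, corpus, best_pitch_val, best_dur_val):
    """Relax the prosodic feature group layer by layer (Fig. 3); accept a candidate
    only if its differential value (Eq. 1) is below the preset threshold."""
    fallback = None
    for features in FEATURE_LAYERS:          # defined in the earlier sketch
        candidates = [s for s in corpus
                      if all(s[f] == segment[f] for f in features)]
        if not candidates:
            continue
        best = pick_optimal_match(candidates, best_pitch_val, best_dur_val)
        if diff_score(best["pitch"], best["duration"],
                      best_pitch_val, best_dur_val) < DIFF_THRESHOLD:
            return best
        fallback = best
    return fallback  # the pinyin-only layer always has a candidate by corpus design
```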

Claims

WE CLAIM:
1. A method for performing speech synthesis on a text segment, the method being performed on an electronic device, the method comprising: comparing a text segment with an utterance waveform corpus, said utterance waveform corpus comprising a plurality of speech waveform samples; determining a best match between consecutive syllables in the text segment and attributes associated with sampled speech waveform utterances, the best match being determined by identifying the greatest number of consecutive syllables that are identical to the attributes and attribute positions in each of the waveform utterances; ascertaining a suitable match for each unmatched syllable in the text segment, each unmatched syllable being a syllable that is not one of the consecutive syllables and the suitable match being determined from a comparison of prosodic features in a prosodic feature group with the attributes associated with sampled speech waveform utterances, wherein the ascertaining is characterized by successively removing the prosodic features from the prosodic feature group until there is said suitable match; and generating concatenated synthesized speech for the text segment by using the speech waveform samples in the corpus, the speech waveform samples being selected from the best match between consecutive syllables and the suitable match for each unmatched syllable.
2. The method of claim 1, wherein the prosodic features include features selected from the group consisting of text context, pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase.
3. The method of claim 1, wherein the prosodic features comprise tone context, co-articulation, syllable position, phrase position, and character symbol.
4. The method of claim 1, further comprising the step of performing a basic phonetic match based on only pinyin after all of said other prosodic features have been successively removed.
5. The method of claim 1, wherein the step of determining includes the step of selecting an optimal contextual phonetic hybrid match when numerous best matches are found by using the formula: diff = Wp × ((pitch - BestPitch) / BestPitch)² + Wd × ((dur - BestDur) / BestDur)² where Wp = weight of the pitch of said speech segment; Wd = weight of the duration of said speech segment; diff = differential value for selecting said optimal contextual phonetic hybrid match; pitch = pitch of said speech segment; BestPitch = pitch of an ideal speech segment; dur = duration of said speech segment; and BestDur = duration of said ideal speech segment.
6. The method of claim 5, wherein the BestPitch is determined using the formula: BestPitch = pitchtone - nIndex × empiricalvalue where pitchtone = the average pitch including tone of said utterance waveform corpus; nIndex = the index of said speech segment in a prosody phrase; and empiricalvalue = an empirical value based on said utterance waveform corpus.
7. The method of claim 5, wherein the BestDur is determined using the formula: BestDur = durs × fs - nIndex × empiricalvalue where durs = the average duration of said speech segment without tone; nIndex = the index of said speech segment in a prosody phrase; fs = the coefficient for prosody position; and empiricalvalue = an empirical value based on said utterance waveform corpus.
8. The method of claim 1, wherein the step of determining includes the step of selecting an optimal contextual phonetic hybrid match when numerous suitable matches are found by using the formula: diff = Wp × ((pitch - BestPitch) / BestPitch)² + Wd × ((dur - BestDur) / BestDur)² where Wp = weight of the pitch of said speech segment; Wd = weight of the duration of said speech segment; diff = differential value for selecting said optimal contextual phonetic hybrid match; pitch = pitch of said speech segment; BestPitch = pitch of an ideal speech segment; dur = duration of said speech segment; and BestDur = duration of said ideal speech segment.
9. The method of claim 8, wherein said optimal contextual phonetic hybrid match is the match having the lowest differential value (diff).
10. The method of claim 8, wherein said differential value (diff) for selecting said optimal contextual phonetic hybrid match is compared with a preset threshold.
11. The method of claim 8, wherein the BestPitch is determined using the formula: BestPitch = pitchtone - nIndex × empiricalvalue where pitchtone = the average pitch including tone of said utterance waveform corpus; nIndex = the index of said speech segment in a prosody phrase; and empiricalvalue = an empirical value based on said utterance waveform corpus.
12. The method of claim 8, wherein the BestDur is determined using the formula: BestDur = durs × fs - nIndex × empiricalvalue where durs = the average duration of said speech segment without tone; nIndex = the index of said speech segment in a prosody phrase; fs = the coefficient for prosody position; and empiricalvalue = an empirical value based on said utterance waveform corpus.
PCT/US2004/030467 2003-09-29 2004-09-17 Method for synthesizing speech WO2005034082A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
MXPA06003431A MXPA06003431A (en) 2003-09-29 2004-09-17 Method for synthesizing speech.
EP04784355A EP1668628A4 (en) 2003-09-29 2004-09-17 Method for synthesizing speech

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN03132698.6 2003-09-29
CNB031326986A CN1260704C (en) 2003-09-29 2003-09-29 Method for voice synthesizing

Publications (1)

Publication Number Publication Date
WO2005034082A1 true WO2005034082A1 (en) 2005-04-14

Family

ID=34398359

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/030467 WO2005034082A1 (en) 2003-09-29 2004-09-17 Method for synthesizing speech

Country Status (5)

Country Link
EP (1) EP1668628A4 (en)
KR (1) KR100769033B1 (en)
CN (1) CN1260704C (en)
MX (1) MXPA06003431A (en)
WO (1) WO2005034082A1 (en)

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI421857B (en) * 2009-12-29 2014-01-01 Ind Tech Res Inst Apparatus and method for generating a threshold for utterance verification and speech recognition system and utterance verification system
US9633660B2 (en) 2010-02-25 2017-04-25 Apple Inc. User profiling for voice input processing
US9668024B2 (en) 2014-06-30 2017-05-30 Apple Inc. Intelligent automated assistant for TV user interactions
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A kind of hybrid language phoneme synthesizing method and device
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US10169329B2 (en) 2014-05-30 2019-01-01 Apple Inc. Exemplar-based natural language processing
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10283110B2 (en) 2009-07-02 2019-05-07 Apple Inc. Methods and apparatuses for automatic speech recognition
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
CN109948124A (en) * 2019-03-15 2019-06-28 腾讯科技(深圳)有限公司 Voice document cutting method, device and computer equipment
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
CN113393829A (en) * 2021-06-16 2021-09-14 哈尔滨工业大学(深圳) Chinese speech synthesis method integrating rhythm and personal information
WO2021225267A1 (en) * 2020-05-06 2021-11-11 Samsung Electronics Co., Ltd. Electronic device for generating speech signal corresponding to at least one text and operating method of the electronic device
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140008870A (en) * 2012-07-12 2014-01-22 삼성전자주식회사 Method for providing contents information and broadcasting receiving apparatus thereof
CN105989833B (en) * 2015-02-28 2019-11-15 讯飞智元信息科技有限公司 Multilingual mixed this making character fonts of Chinese language method and system
CN106157948B (en) * 2015-04-22 2019-10-18 科大讯飞股份有限公司 A kind of fundamental frequency modeling method and system
CN105096934B (en) * 2015-06-30 2019-02-12 百度在线网络技术(北京)有限公司 Construct method, phoneme synthesizing method, device and the equipment in phonetic feature library
CN106534528A (en) * 2016-11-04 2017-03-22 广东欧珀移动通信有限公司 Processing method and device of text information and mobile terminal
CN110942765B (en) * 2019-11-11 2022-05-27 珠海格力电器股份有限公司 Method, device, server and storage medium for constructing corpus
CN111128116B (en) * 2019-12-20 2021-07-23 珠海格力电器股份有限公司 Voice processing method and device, computing equipment and storage medium
KR20210109222A (en) 2020-02-27 2021-09-06 주식회사 케이티 Device, method and computer program for synthesizing voice
CN112530406A (en) * 2020-11-30 2021-03-19 深圳市优必选科技股份有限公司 Voice synthesis method, voice synthesis device and intelligent equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5704007A (en) * 1994-03-11 1997-12-30 Apple Computer, Inc. Utilization of multiple voice sources in a speech synthesizer
US6134528A (en) * 1997-06-13 2000-10-17 Motorola, Inc. Method device and article of manufacture for neural-network based generation of postlexical pronunciations from lexical pronunciations

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6449622A (en) * 1987-08-19 1989-02-27 Jsp Corp Resin foaming particle containing crosslinked polyolefin-based resin and manufacture thereof
US5970454A (en) * 1993-12-16 1999-10-19 British Telecommunications Public Limited Company Synthesizing speech by converting phonemes to digital waveforms
KR100259777B1 (en) * 1997-10-24 2000-06-15 정선종 Optimal synthesis unit selection method in text-to-speech system
US7283964B1 (en) * 1999-05-21 2007-10-16 Winbond Electronics Corporation Method and apparatus for voice controlled devices with improved phrase storage, use, conversion, transfer, and recognition
EP1345207B1 (en) * 2002-03-15 2006-10-11 Sony Corporation Method and apparatus for speech synthesis program, recording medium, method and apparatus for generating constraint information and robot apparatus
JP2003295882A (en) * 2002-04-02 2003-10-15 Canon Inc Text structure for speech synthesis, speech synthesizing method, speech synthesizer and computer program therefor
KR100883649B1 (en) * 2002-04-04 2009-02-18 삼성전자주식회사 Text to speech conversion apparatus and method thereof
GB2388286A (en) * 2002-05-01 2003-11-05 Seiko Epson Corp Enhanced speech data for use in a text to speech system
CN1320482C (en) * 2003-09-29 2007-06-06 摩托罗拉公司 Natural voice pause in identification text strings

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5704007A (en) * 1994-03-11 1997-12-30 Apple Computer, Inc. Utilization of multiple voice sources in a speech synthesizer
US6134528A (en) * 1997-06-13 2000-10-17 Motorola, Inc. Method device and article of manufacture for neural-network based generation of postlexical pronunciations from lexical pronunciations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1668628A4 *

Cited By (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US10283110B2 (en) 2009-07-02 2019-05-07 Apple Inc. Methods and apparatuses for automatic speech recognition
TWI421857B (en) * 2009-12-29 2014-01-01 Ind Tech Res Inst Apparatus and method for generating a threshold for utterance verification and speech recognition system and utterance verification system
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US9633660B2 (en) 2010-02-25 2017-04-25 Apple Inc. User profiling for voice input processing
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US10169329B2 (en) 2014-05-30 2019-01-01 Apple Inc. Exemplar-based natural language processing
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US9668024B2 (en) 2014-06-30 2017-05-30 Apple Inc. Intelligent automated assistant for TV user interactions
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
CN107481713B (en) * 2017-07-17 2020-06-02 Tsinghua University Mixed language voice synthesis method and device
CN107481713A (en) * 2017-07-17 2017-12-15 Tsinghua University Mixed language voice synthesis method and device
CN109948124A (en) * 2019-03-15 2019-06-28 Tencent Technology (Shenzhen) Co., Ltd. Voice file segmentation method and device and computer equipment
CN109948124B (en) * 2019-03-15 2022-12-23 Tencent Technology (Shenzhen) Co., Ltd. Voice file segmentation method and device and computer equipment
WO2021225267A1 (en) * 2020-05-06 2021-11-11 Samsung Electronics Co., Ltd. Electronic device for generating speech signal corresponding to at least one text and operating method of the electronic device
CN113393829A (en) * 2021-06-16 2021-09-14 Harbin Institute of Technology (Shenzhen) Chinese speech synthesis method integrating rhythm and personal information
CN113393829B (en) * 2021-06-16 2023-08-29 Harbin Institute of Technology (Shenzhen) Chinese speech synthesis method integrating rhythm and personal information

Also Published As

Publication number Publication date
EP1668628A4 (en) 2007-01-10
KR20060066121A (en) 2006-06-15
CN1260704C (en) 2006-06-21
EP1668628A1 (en) 2006-06-14
KR100769033B1 (en) 2007-10-22
CN1604182A (en) 2005-04-06
MXPA06003431A (en) 2006-06-20

Similar Documents

Publication Publication Date Title
KR100769033B1 (en) Method for synthesizing speech
US5949961A (en) Word syllabification in speech synthesis system
US6823309B1 (en) Speech synthesizing system and method for modifying prosody based on match to database
US6029132A (en) Method for letter-to-sound in text-to-speech synthesis
US6684187B1 (en) Method and system for preselection of suitable units for concatenative speech
US6505158B1 (en) Synthesis-based pre-selection of suitable units for concatenative speech
US6243680B1 (en) Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
EP0833304B1 (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
KR100403293B1 (en) Speech synthesizing method, speech synthesis apparatus, and computer-readable medium recording speech synthesis program
JP3481497B2 (en) Method and apparatus using a decision tree to generate and evaluate multiple pronunciations for spelled words
EP1213705A2 (en) Method and apparatus for speech synthesis without prosody modification
WO1996023298A2 (en) System and method for generating and using context dependent sub-syllable models to recognize a tonal language
JPH0916602A (en) Translation system and its method
JP5198046B2 (en) Voice processing apparatus and program thereof
WO2006106182A1 (en) Improving memory usage in text-to-speech system
JP3576066B2 (en) Speech synthesis system and speech synthesis method
Hendessi et al. A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM
Akinwonmi Development of a prosodic read speech syllabic corpus of the Yoruba language
Kaur et al. Building a text-to-speech system for Punjabi language
JP2005534968A (en) Determination of kanji readings
Bharthi et al. Unit selection based speech synthesis for converting short text message into voice message in mobile phones
GB2292235A (en) Word syllabification.
CN114999447A (en) Speech synthesis model based on generative adversarial network and training method
JP2003345372A (en) Method and device for synthesizing voice

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase (Ref document number: 2004784355; Country of ref document: EP)
AK Designated states (Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW)
AL Designated countries for regional patents (Kind code of ref document: A1; Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase (Ref document number: PA/a/2006/003431; Country of ref document: MX)
WWE Wipo information: entry into national phase (Ref document number: 1020067006170; Country of ref document: KR)
WWP Wipo information: published in national office (Ref document number: 2004784355; Country of ref document: EP)
WWP Wipo information: published in national office (Ref document number: 1020067006170; Country of ref document: KR)