WO2005034083A1 - Letter to sound conversion for synthesized pronunciation of a text segment - Google Patents

Letter to sound conversion for synthesized pronunciation of a text segment

Info

Publication number
WO2005034083A1
WO2005034083A1 (PCT/US2004/030468)
Authority
WO
WIPO (PCT)
Prior art keywords
sub
word
text
words
speech synthesis
Prior art date
Application number
PCT/US2004/030468
Other languages
French (fr)
Inventor
Gui-Lin Chen
Jian-Cheng Huang
Original Assignee
Motorola, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola, Inc. filed Critical Motorola, Inc.
Priority to EP04784356A priority Critical patent/EP1668629B1/en
Priority to DE602004019949T priority patent/DE602004019949D1/en
Publication of WO2005034083A1 publication Critical patent/WO2005034083A1/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules

Definitions

  • the present invention relates generally to Text-To-Speech (TTS) synthesis.
  • the invention is particularly useful for letter to sound conversion for synthesized pronunciation of a text segment.
  • BACKGROUND OF THE INVENTION Text to Speech (TTS) conversion, often referred to as concatenated text to speech synthesis, allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech.
  • TTS Text to Speech
  • a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech.
  • a method for text to speech synthesis including: receiving a text string and selecting at least one word therefrom; segmenting the word into sub-words, the sub-words forming a sub-word sequence with at least one of the sub-words comprising at least two letters; identifying phonemes for the sub-words; concatenating the phonemes into a phoneme sequence; and performing speech synthesis on the phoneme sequence.
  • the sub-word sequence is determined by analysis of possible sub-words that could comprise the word.
  • each one of the possible sub-words has an associated predefined weight.
  • the sub-words with the maximum combined weights that form the selected word are chosen to provide the sub-word sequence.
  • the sub-word sequence is suitably determined from analysis of a Directed Acyclic Graph.
  • the identifying of phonemes uses a phoneme identifier table comprising phonemes corresponding to at least one said sub-word.
  • the identifier table also comprises a position relevance indicator that indicates the relevance of the position of the sub-word in the word. There may also suitably be a phoneme weight associated with the position relevance indicator.
  • Fig. 1 is a schematic block diagram of an electronic device in accordance with the present invention
  • Fig. 2 is a flow diagram illustrating a method for text to speech synthesis
  • Fig. 3 illustrates a Directed Acyclic Graph (DAG)
  • Fig. 4 is part of a mapping table that maps symbols with phonemes
  • Fig. 5 is part of a phoneme identifier table
  • Fig. 6 is part of a vowel pair table.
  • an electronic device 100 in the form of a radio-telephone, comprising a device processor 102 operatively coupled by a bus 103 to a user interface 104 that is typically a touch screen or alternatively a display screen and keypad.
  • the electronic device 100 also has an utterance corpus 106, a speech synthesizer 110, Non Volatile memory 120, Read Only Memory 118 and Radio communications module 116 all operatively coupled to the processor 102 by the bus 103.
  • the speech synthesizer 110 has an output coupled to drive a speaker 112.
  • the corpus 106 includes representations of words or phonemes and associated sampled, digitized and processed utterance waveforms PUWs.
  • the Non Volatile memory 120 is used for Text-To-Speech (TTS) synthesis (the text may be received by module 116 or otherwise).
  • the waveform utterance corpus comprises sampled and digitized utterance waveforms in the form of phonemes and stress/emphasis of prosodic features.
  • the radio frequency communications unit 116 is typically a combined receiver and transmitter having a common antenna.
  • the radio frequency communications unit 116 has a transceiver coupled to the antenna via a radio frequency amplifier.
  • the transceiver is also coupled to a combined modulator/demodulator that couples the communications unit 116 to the processor 102.
  • the non-volatile memory 120 stores a user programmable phonebook database Db and Read Only Memory 118 stores operating code (OC) for the device processor 102.
  • a step 220 of receiving a text string TS from the memory 120 is performed.
  • the text string TS may have originated from a text message received by module 116 or by any other means.
  • Step 230 provides for selecting at least one word from the text string TS, and a segmenting step 240 provides for segmenting the word into sub-words, the sub-words forming a sub-word sequence with at least one of the sub-words comprising at least two letters.
  • An identifying step 250 then provides for identifying phonemes for the sub-words.
  • a concatenating step 260 then provides for concatenating the phonemes into a phoneme sequence.
  • the sub-word sequence is determined by analysis of all possible sub-words that could comprise the selected word. For instance, referring briefly to the Directed Acyclic Graph (DAG) of Fig. 3, if the selected word was "mention", then the Directed Acyclic Graph DAG is constructed with all possible sub-words that could comprise the selected word "mention". Each sub-word is provided with a pre-defined weight WT; for example, as shown, the sub-words "ment", "men" and "tion" have respective weights of 88, 86 and 204.
  • DAG Directed Acyclic Graph
  • the concatenating step 260 traverses the DAG and selects the sub-words with the maximum combined (summed) weights WT that form the selected word. In the case of the word "mention" the sub-words "men" and "tion" would be selected.
  • the step 250 of identifying phonemes uses two tables stored in memory 120; one, part of which is illustrated in Fig. 4, is a mapping table.
  • the other table is a phoneme identifier table PIT, part of which is illustrated in Fig. 5.
  • the phoneme identifier table PIT comprises a sub-word field; a phoneme weight field; position relevance field(s) or indicators; and phoneme identifier field(s).
  • the first line is aa 120 A_C, where aa is the sub-word, 120 is the phoneme weight, the letter A is the position relevance and "C" is the phoneme identifier corresponding to the sub-word aa.
  • the position relevance may be labeled as: A, meaning relevant for all positions; I, meaning relevant for sub-words at the beginning of a word; M, meaning relevant for sub-words in the middle of a word; and F, meaning relevant for sub-words at the end of a word (see the sketch following this list).
  • a short morpheme-like string is always preferable. For instance, the word seeing would be segmented as s|ee|in|g instead of s|ee|ing.
  • affix: if one short string is a prefix or suffix of a long string, we add its occurrence count to the long string; other sub-strings are not considered.
  • ambiguity: one morpheme-like string can correspond to multiple phoneme strings; for instance, en can be pronounced as ehn or axn.
  • even with position information, the morpheme-like string can correspond to more than one phoneme string.
  • we choose the phoneme string with the maximal occurrence count and calculate the ratio r as follows: r = max_u{N_u,k} / Σ_u N_u,k (3), where u is the string index and k is the position index; if r < α (α is a threshold, α = 0.7), we exclude this morpheme-like string.
  • the method 200 next effects a step 265 of performing stress or emphasis assignment on the phonemes that represent vowels.
  • This step 265 identifies vowels from the phonemes identified in the previous step 250. Essentially, this step 265 searches a relative strength/weakness vowel pair table stored in the memory 120. Part of this vowel pair table is illustrated in Fig. 6.
  • the stress weights are determined by using a training lexicon. Each entry in this lexicon has a word form and its corresponding pronunciation, including stress, syllable boundaries and letter-to-phoneme alignment. Based on this lexicon, stress was determined by statistical analysis. In this regard, stress reflects the strong/weak relationship between vowels. To generate the required data, statistical analysis of all entries in the lexicon was therefore conducted. Specifically, within the scope of a word, if vowel vi is stressed and vowel vj is unstressed, we assign one point to the pair (vi, vj) and zero points to the pair (vj, vi). If both are unstressed, the point is also zero.
  • a test step 270 is then performed to determine if there are any more words in the text string TS that need to be processed. If yes, the method 200 returns to step 230; otherwise speech synthesis is performed on the phoneme sequence at a performing step 280.
  • the speech synthesis is effected by the synthesizer 110 on the phoneme sequence for each of the words.
  • the method 200 then ends at an end step 290.
  • during the speech synthesis of step 280, the stress (primary, secondary or no stress as appropriate) on the vowels is also used to provide improved synthesized speech quality by appropriate stress emphasis.
  • the present invention improves, or at least alleviates, letter to sound conversion in which identical letters or groups of letters may have different sounds and vowel stress/emphasis depending on other adjacent letters and position in a text segment to be synthesized.
  • the detailed description provides a preferred exemplary embodiment only, and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.
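As a concrete illustration of the phoneme identifier table lookup described in the bullets above, the following is a minimal sketch assuming a Python list-of-tuples representation of the PIT. Only the aa 120 A_C entry is taken from Fig. 5; the en entries, the function name identify_phoneme and the tie-breaking by highest phoneme weight are illustrative assumptions, not a procedure specified by the patent.

```python
# A minimal sketch of a PIT lookup; entries follow the "aa 120 A_C" layout
# from Fig. 5: sub-word, phoneme weight, position relevance (A/I/M/F) and
# phoneme identifier. All entries other than "aa" are hypothetical.

PIT = [
    # (sub_word, phoneme_weight, position_relevance, phoneme_identifier)
    ("aa", 120, "A", "C"),   # from Fig. 5: relevant in all positions
    ("en", 95, "F", "N"),    # hypothetical word-final entry
    ("en", 60, "I", "E"),    # hypothetical word-initial entry
]

def identify_phoneme(sub_word, position):
    """Look up the phoneme identifier for a sub-word at position
    'I' (initial), 'M' (middle) or 'F' (final) of the word."""
    candidates = [entry for entry in PIT
                  if entry[0] == sub_word and entry[2] in ("A", position)]
    if not candidates:
        return None
    # When several entries match, prefer the highest phoneme weight.
    return max(candidates, key=lambda entry: entry[1])[3]

print(identify_phoneme("aa", "M"))  # -> 'C' (the 'A' entry matches anywhere)
print(identify_phoneme("en", "F"))  # -> 'N'
```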

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

There is described a method (200) for text to speech synthesis. The method (200) includes receiving (220) a text string and selecting at least one word from the string. A segmenting step (240) then segments the word into sub-words forming a sub-word sequence, with at least one of the sub-words comprising at least two letters. An identifying step (250) identifies phonemes for the sub-words, and a step (260) concatenates the phonemes into a phoneme sequence. Speech synthesis (280) is then performed on the phoneme sequence.

Description

LETTER TO SOUND CONVERSION FOR SYNTHESIZED PRONUNCIATION OF A TEXT SEGMENT
FIELD OF THE INVENTION The present invention relates generally to Text-To-Speech (TTS) synthesis. The invention is particularly useful for letter to sound conversion for synthesized pronunciation of a text segment. BACKGROUND OF THE INVENTION Text to Speech (TTS) conversion, often referred to as concatenated text to speech synthesis, allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech. However, a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech. One difficulty is based on letter to sound conversion in which identical letters or groups of letters may have different sounds and vowel stress/emphasis depending on other adjacent letters and position in a text segment to be synthesized. In this specification, including the claims, the terms 'comprises', 'comprising' or similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may well include other elements not listed.
SUMMARY OF THE INVENTION According to one aspect of the invention there is provided a method for text to speech synthesis, the method including: receiving a text string and selecting at least one word therefrom; segmenting the word into sub-words, the sub-words forming a sub-word sequence with at least one of the sub-words comprising at least two letters; identifying phonemes for the sub-words; concatenating the phonemes into a phoneme sequence; and performing speech synthesis on the phoneme sequence. Suitably, the sub-word sequence is determined by analysis of possible sub-words that could comprise the word. Preferably, each one of the possible sub-words has an associated pre-defined weight. Suitably, the sub-words with the maximum combined weights that form the selected word are chosen to provide the sub-word sequence. The sub-word sequence is suitably determined from analysis of a Directed Acyclic Graph. Suitably, the identifying of phonemes uses a phoneme identifier table comprising phonemes corresponding to at least one said sub-word. Preferably, the identifier table also comprises a position relevance indicator that indicates the relevance of the position of the sub-word in the word. There may also suitably be a phoneme weight associated with the position relevance indicator.
BRIEF DESCRIPTION OF THE DRAWINGS In order that the invention may be readily understood and put into practical effect, reference will now be made to a preferred embodiment as illustrated with reference to the accompanying drawings in which: Fig. 1 is a schematic block diagram of an electronic device in accordance with the present invention; Fig. 2 is a flow diagram illustrating a method for text to speech synthesis; Fig. 3 illustrates a Directed Acyclic Graph (DAG); Fig. 4 is part of a mapping table that maps symbols with phonemes; Fig. 5 is part of a phoneme identifier table; and Fig. 6 is part of a vowel pair table. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION Referring to Fig. 1 there is illustrated an electronic device 100, in the form of a radio-telephone, comprising a device processor 102 operatively coupled by a bus 103 to a user interface 104 that is typically a touch screen or alternatively a display screen and keypad. The electronic device 100 also has an utterance corpus 106, a speech synthesizer 110, Non Volatile memory 120, Read Only Memory 118 and a Radio communications module 116, all operatively coupled to the processor 102 by the bus 103. The speech synthesizer 110 has an output coupled to drive a speaker 112. The corpus 106 includes representations of words or phonemes and associated sampled, digitized and processed utterance waveforms PUWs. In other words, and as described below, the Non Volatile memory 120 (memory module) is used for Text-To-Speech (TTS) synthesis (the text may be received by module 116 or otherwise). Also, the waveform utterance corpus comprises sampled and digitized utterance waveforms in the form of phonemes and stress/emphasis of prosodic features. As will be apparent to a person skilled in the art, the radio frequency communications unit 116 is typically a combined receiver and transmitter having a common antenna. The radio frequency communications unit 116 has a transceiver coupled to the antenna via a radio frequency amplifier. The transceiver is also coupled to a combined modulator/demodulator that couples the communications unit 116 to the processor 102. Also, in this embodiment the non-volatile memory 120 (memory module) stores a user programmable phonebook database Db and Read Only Memory 118 stores operating code (OC) for the device processor 102. Referring to Fig. 2 there is illustrated a method 200 for text to speech synthesis. After a start step 210, a step 220 of receiving a text string TS from the memory 120 is performed. The text string TS may have originated from a text message received by module 116 or by any other means. Step 230 provides for selecting at least one word from the text string TS, and a segmenting step 240 provides for segmenting the word into sub-words, the sub-words forming a sub-word sequence with at least one of the sub-words comprising at least two letters. An identifying step 250 then provides for identifying phonemes for the sub-words. A concatenating step 260 then provides for concatenating the phonemes into a phoneme sequence. The sub-word sequence is determined by analysis of all possible sub-words that could comprise the selected word. For instance, referring briefly to the Directed Acyclic Graph (DAG) of Fig. 3, if the selected word was "mention", then the Directed Acyclic Graph DAG is constructed with all possible sub-words that could comprise the selected word "mention".
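As an informal illustration of this maximum-weight segmentation, the following minimal sketch finds the best-scoring sub-word sequence by dynamic programming, which is equivalent to traversing such a DAG. Only the weights for "ment", "men" and "tion" are taken from Fig. 3; the remaining lexicon entries, their weights and the function name segment are invented for the example.

```python
# A minimal sketch of maximum-weight sub-word segmentation over a small
# weighted sub-word lexicon held in a dict.

WEIGHTS = {"ment": 88, "men": 86, "tion": 204, "me": 10, "i": 5, "on": 20}

def segment(word, weights, max_len=6):
    # best[i] holds (score, sub_words) for the best segmentation of word[:i];
    # this dynamic program is equivalent to a max-weight path through the DAG.
    best = [None] * (len(word) + 1)
    best[0] = (0, [])
    for i in range(1, len(word) + 1):
        for j in range(max(0, i - max_len), i):
            sub = word[j:i]
            if best[j] is not None and sub in weights:
                score = best[j][0] + weights[sub]
                if best[i] is None or score > best[i][0]:
                    best[i] = (score, best[j][1] + [sub])
    return best[len(word)]

print(segment("mention", WEIGHTS))  # -> (290, ['men', 'tion'])
```

As in the text, "men" + "tion" (86 + 204 = 290) beats segmentations built from "ment" and the shorter pieces.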
Each sub-word is provided with a pre-defined weight WT; for example, as shown, the sub-words "ment", "men" and "tion" have respective weights of 88, 86 and 204. Accordingly, the concatenating step 260 traverses the DAG and selects the sub-words with the maximum combined (summed) weights WT that form the selected word. In the case of the word "mention" the sub-words "men" and "tion" would be selected. The step 250 of identifying phonemes uses two tables stored in memory 120; one, part of which is illustrated in Fig. 4, is a mapping table
MT that maps symbols with phonemes. As shown, phoneme ae is identified by symbol @, whereas phoneme th is identified by symbol D. The other table is a phoneme identifier table PIT, part of which is illustrated in Fig. 5. The phoneme identifier table PIT comprises a sub-word field; a phoneme weight field; position relevance field(s) or indicators; and phoneme identifier field(s). For instance, in Fig. 5 the first line is aa 120 A_C, where aa is the sub-word, 120 is the phoneme weight, the letter A is the position relevance and "C" is the phoneme identifier corresponding to the sub-word aa. The position relevance may be labeled as: A, meaning relevant for all positions; I, meaning relevant for sub-words at the beginning of a word; M, meaning relevant for sub-words in the middle of a word; and F, meaning relevant for sub-words at the end of a word. Thus, using the phoneme identifier table PIT together with the sub-word's position in the word, the step 250 of identifying phonemes is effected. The phoneme weights and the DAG pre-defined weights WT are the same weights, obtained as in Fig. 5. These weights were determined bearing in mind that, if we simply choose the occurrence count as a weight, a sub-string has a higher weight than the string itself. As a consequence, if we select the segmentation with the maximal weight as the result, a short morpheme-like string is always preferable; for instance, the word seeing would be segmented as s|ee|in|g instead of s|ee|ing. But overall, the relationship between a long string and its phoneme sequence is more reliable. To ensure the high priority of long morpheme-like strings, we consider the following aspects. Affix: if one short string is a prefix or suffix of a long string, we add its occurrence count to the long string; other sub-strings are not considered. Ambiguity: in some cases, one morpheme-like string can correspond to multiple phoneme strings; for instance, en can be pronounced as ehn or axn. To decrease the uncertainty, we employ the string positions, namely word initial, word medial and word final. Even under this condition, the morpheme-like string can correspond to more than one phoneme string. To overcome this problem, we choose the phoneme string with the maximal occurrence count and calculate the ratio r as follows: r = max_u{N_u,k} / Σ_u N_u,k (3), where u is the string index and k is the position index. If r < α (α is a threshold, α = 0.7), we exclude this morpheme-like string. For example, the word-final en can be pronounced as ehn or axn; if the total count is 1000 and the count corresponding to axn is 800 (of course, the maximal count), then r = 0.8. Hence, we add the word-final en to the list. Minimal occurrence count: we also set a minimal occurrence count, min (min = 9), as a threshold; each string whose occurrence count is less than this value is discarded. Under these constraints, we assign each string a weight Ws in the following way: Ws = 10 ln Ns, where Ns is the adjusted occurrence count.
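The weighting constraints just described can be sketched as follows, assuming the training counts are supplied as a plain dict and omitting the affix-merging step; the names build_weight_table, ALPHA and MIN_N are hypothetical, and using the maximal count as Ns is a simplification of the "adjusted" occurrence count.

```python
import math

# A minimal sketch of the ambiguity ratio r (threshold 0.7), the minimal
# occurrence count (9) and the weight formula Ws = 10 * ln(Ns) from the text.

ALPHA = 0.7  # ambiguity threshold alpha from the description
MIN_N = 9    # minimal occurrence count

def build_weight_table(occurrences):
    """occurrences[(sub_word, position)] maps each candidate phoneme string
    to its occurrence count in the training lexicon."""
    table = {}
    for key, phoneme_counts in occurrences.items():
        total = sum(phoneme_counts.values())
        if total < MIN_N:
            continue  # discard rare strings
        phonemes, n_max = max(phoneme_counts.items(), key=lambda kv: kv[1])
        if n_max / total < ALPHA:
            continue  # too ambiguous: exclude this morpheme-like string
        table[key] = (phonemes, 10.0 * math.log(n_max))  # Ws = 10 ln Ns
    return table

# Worked example from the description: word-final "en" occurs 1000 times,
# 800 of them as "axn", so r = 0.8 >= 0.7 and the entry is kept.
counts = {("en", "F"): {"axn": 800, "ehn": 200}}
print(build_weight_table(counts))  # {('en', 'F'): ('axn', 66.84...)}
```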
The method 200 next effects a step 265 of performing stress or emphasis assignment on the phonemes that represent vowels. This step 265 identifies vowels from the phonemes identified in the previous step 250. Essentially, this step 265 searches a relative strength/weakness vowel pair table stored in the memory 120; part of this vowel pair table is illustrated in Fig. 6. For instance, consider three vowels that could be identified as phonemes in a word, these vowels being identified by the symbols (obtained from the mapping table MT) 'ax, aa and ae. Then, by analysis of the vowel pair table, when 'ax occurs before aa a stress weight of 368 is indicated, in contrast to a stress weight of 354 when aa occurs before 'ax. Hence, by analyzing the vowel pair table for 'ax, aa and ae, the following analysis results: the vowel identified by symbol ae has primary (the most) stress; the vowel identified by symbol 'ax has secondary stress; and the vowel identified by symbol aa has no stress. Essentially, the stress weights are determined by using a training lexicon. Each entry in this lexicon has a word form and its corresponding pronunciation, including stress, syllable boundaries and letter-to-phoneme alignment. Based on this lexicon, stress was determined by statistical analysis. In this regard, stress reflects the strong/weak relationship between vowels. To generate the required data, statistical analysis of all entries in the lexicon was therefore conducted. Specifically, within the scope of a word, if vowel vi is stressed and vowel vj is unstressed, we assign one point to the pair (vi, vj) and zero points to the pair (vj, vi). If both are unstressed, the point is also zero. A test step 270 is then performed to determine if there are any more words in the text string TS that need to be processed. If yes, the method 200 returns to step 230; otherwise speech synthesis is performed on the phoneme sequence at a performing step 280. The speech synthesis is effected by the synthesizer 110 on the phoneme sequence for each of the words. The method 200 then ends at an end step 290. During the speech synthesis of step 280, the stress (primary, secondary or no stress as appropriate) on the vowels is also used to provide improved synthesized speech quality by appropriate stress emphasis. Advantageously, the present invention improves, or at least alleviates, letter to sound conversion in which identical letters or groups of letters may have different sounds and vowel stress/emphasis depending on other adjacent letters and position in a text segment to be synthesized. The detailed description provides a preferred exemplary embodiment only, and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.
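To illustrate the vowel pair analysis of step 265 described above, here is a minimal sketch of one plausible pairwise ranking scheme. Only the 368/354 weights for ('ax, aa) and (aa, 'ax) are quoted from the description; the entries involving ae, and the scoring-and-ranking scheme itself, are assumptions chosen to reproduce the worked example (ae primary, 'ax secondary, aa unstressed).

```python
# A minimal sketch of pairwise stress ranking with a vowel pair table.
# Vowel symbols are assumed distinct within a word.

PAIR_WEIGHTS = {
    ("'ax", "aa"): 368, ("aa", "'ax"): 354,  # from the description
    ("ae", "'ax"): 400, ("'ax", "ae"): 300,  # hypothetical
    ("ae", "aa"): 410, ("aa", "ae"): 290,    # hypothetical
}

def assign_stress(vowels):
    # Each vowel accumulates the weight of every ordered pair in which it
    # comes first; rank by score: highest -> primary, next -> secondary.
    scores = {v: 0 for v in vowels}
    for i, vi in enumerate(vowels):
        for vj in vowels[i + 1:]:
            scores[vi] += PAIR_WEIGHTS.get((vi, vj), 0)
            scores[vj] += PAIR_WEIGHTS.get((vj, vi), 0)
    ranked = sorted(vowels, key=scores.get, reverse=True)
    stress = {v: "no stress" for v in vowels}
    stress[ranked[0]] = "primary"
    if len(ranked) > 1:
        stress[ranked[1]] = "secondary"
    return stress

print(assign_stress(["'ax", "aa", "ae"]))
# -> {"'ax": 'secondary', 'aa': 'no stress', 'ae': 'primary'}
```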

Claims

WE CLAIM:
1. A method for text to speech synthesis, the method including: receiving a text string and selecting at least one word therefrom; segmenting the word into sub-words, the sub-words forming a sub-word sequence with at least one of the sub-words comprising at least two letters; identifying phonemes for the sub-words; concatenating the phonemes into a phoneme sequence; and performing speech synthesis on the phoneme sequence.
2. A method for text to speech synthesis, as claimed in claim 1, wherein the sub-word sequence is determined by analysis of possible sub- words that could comprise the word.
3. A method for text to speech synthesis, as claimed in claim 1, wherein each one of the possible sub-words has an associated pre-defined weight.
4. A method for text to speech synthesis, as claimed in claim 1, wherein the sub-words with the maximum combined weights that form the selected word are chosen to provide the sub-word sequence.
5. A method for text to speech synthesis, as claimed in claim 4, wherein the sub-word sequence is suitably determined from analysis of a Directed Acyclic Graph.
6. A method for text to speech synthesis, as claimed in claim 1, wherein the identifying of phonemes uses a phoneme identifier table comprising phonemes corresponding to at least one said sub-word.
7. A method for text to speech synthesis, as claimed in claim 6, wherein the identifier table also comprises a position relevance indicator that indicates the relevance of the position of the sub-word in the word.
8. A method for text to speech synthesis, as claimed in claim 7, wherein there is a phoneme weight associated with the position relevance indicator.
PCT/US2004/030468 2003-09-29 2004-09-17 Letter to sound conversion for synthesized pronunciation of a text segment WO2005034083A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP04784356A EP1668629B1 (en) 2003-09-29 2004-09-17 Letter-to-sound conversion for synthesized pronunciation of a text segment
DE602004019949T DE602004019949D1 (en) 2003-09-29 2004-09-17 IMPLEMENTATION OF LETTERS ON SOUND FOR THE SYNTHETIZED SPEECH OF A TEXTSEGMENT

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN03132709.5 2003-09-29
CNB031327095A CN1308908C (en) 2003-09-29 2003-09-29 Transformation from characters to sound for synthesizing text paragraph pronunciation

Publications (1)

Publication Number Publication Date
WO2005034083A1 (en)

Family

ID=34398362

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/030468 WO2005034083A1 (en) 2003-09-29 2004-09-17 Letter to sound conversion for synthesized pronunciation of a text segment

Country Status (6)

Country Link
EP (1) EP1668629B1 (en)
KR (1) KR100769032B1 (en)
CN (1) CN1308908C (en)
DE (1) DE602004019949D1 (en)
RU (1) RU2320026C2 (en)
WO (1) WO2005034083A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8234116B2 (en) 2006-08-22 2012-07-31 Microsoft Corporation Calculating cost measures between HMM acoustic models
US10685644B2 (en) 2017-12-29 2020-06-16 Yandex Europe Ag Method and system for text-to-speech synthesis
WO2020118643A1 (en) * 2018-12-13 2020-06-18 Microsoft Technology Licensing, Llc Neural text-to-speech synthesis with multi-level text information

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100935014B1 (en) * 2008-01-29 2010-01-06 고려대학교 산학협력단 Method for prediction of symptom corresponding to analysis of coloring patterns in art therapy assessment and medium of recording its program
US9472182B2 (en) * 2014-02-26 2016-10-18 Microsoft Technology Licensing, Llc Voice font speaker and prosody interpolation
RU2606312C2 (en) * 2014-11-27 2017-01-10 Роман Валерьевич Мещеряков Speech synthesis device
CN105895076B (en) * 2015-01-26 2019-11-15 科大讯飞股份有限公司 A kind of phoneme synthesizing method and system
CN105895075B (en) * 2015-01-26 2019-11-15 科大讯飞股份有限公司 Improve the method and system of synthesis phonetic-rhythm naturalness
CN109002454B (en) * 2018-04-28 2022-05-27 陈逸天 Method and electronic equipment for determining spelling partition of target word
CN109376358B (en) * 2018-10-25 2021-07-16 陈逸天 Word learning method and device based on historical spelling experience and electronic equipment
CN112786002B (en) * 2020-12-28 2022-12-06 科大讯飞股份有限公司 Voice synthesis method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5682501A (en) * 1994-06-22 1997-10-28 International Business Machines Corporation Speech synthesis system
US6064960A (en) 1997-12-18 2000-05-16 Apple Computer, Inc. Method and apparatus for improved duration modeling of phonemes
US20020185030A1 (en) * 2000-05-20 2002-12-12 Reese James Warren Shaped charges having enhanced tungsten liners

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5748840A (en) * 1990-12-03 1998-05-05 Audio Navigation Systems, Inc. Methods and apparatus for improving the reliability of recognizing words in a large database when the words are spelled or spoken
KR100236961B1 (en) * 1997-07-23 2000-01-15 정선종 Method for word grouping by its vowel-consonant structure
US6347295B1 (en) * 1998-10-26 2002-02-12 Compaq Computer Corporation Computer method and apparatus for grapheme-to-phoneme rule-set-generation
JP2002535728A (en) * 1999-01-05 2002-10-22 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Speech recognition device including sub-word memory
KR100373329B1 (en) * 1999-08-17 2003-02-25 한국전자통신연구원 Apparatus and method for text-to-speech conversion using phonetic environment and intervening pause duration
US8744835B2 (en) * 2001-03-16 2014-06-03 Meaningful Machines Llc Content conversion method and apparatus
US7143353B2 (en) * 2001-03-30 2006-11-28 Koninklijke Philips Electronics, N.V. Streaming video bookmarks
GB0113587D0 (en) * 2001-06-04 2001-07-25 Hewlett Packard Co Speech synthesis apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5682501A (en) * 1994-06-22 1997-10-28 International Business Machines Corporation Speech synthesis system
US6064960A (en) 1997-12-18 2000-05-16 Apple Computer, Inc. Method and apparatus for improved duration modeling of phonemes
US20020185030A1 (en) * 2000-05-20 2002-12-12 Reese James Warren Shaped charges having enhanced tungsten liners

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1668629A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8234116B2 (en) 2006-08-22 2012-07-31 Microsoft Corporation Calculating cost measures between HMM acoustic models
US10685644B2 (en) 2017-12-29 2020-06-16 Yandex Europe Ag Method and system for text-to-speech synthesis
WO2020118643A1 (en) * 2018-12-13 2020-06-18 Microsoft Technology Licensing, Llc Neural text-to-speech synthesis with multi-level text information
US12094447B2 (en) 2018-12-13 2024-09-17 Microsoft Technology Licensing, Llc Neural text-to-speech synthesis with multi-level text information

Also Published As

Publication number Publication date
RU2006114705A (en) 2007-11-10
EP1668629A4 (en) 2007-01-10
EP1668629B1 (en) 2009-03-11
RU2320026C2 (en) 2008-03-20
CN1604184A (en) 2005-04-06
DE602004019949D1 (en) 2009-04-23
EP1668629A1 (en) 2006-06-14
KR100769032B1 (en) 2007-10-22
CN1308908C (en) 2007-04-04
KR20060056404A (en) 2006-05-24

Similar Documents

Publication Publication Date Title
WO2005034085A1 (en) Identifying natural speech pauses in a text string
KR100769033B1 (en) Method for synthesizing speech
US6505158B1 (en) Synthesis-based pre-selection of suitable units for concatenative speech
EP1168299B1 (en) Method and system for preselection of suitable units for concatenative speech
JP4473193B2 (en) Mixed language text speech synthesis method and speech synthesizer
US6029132A (en) Method for letter-to-sound in text-to-speech synthesis
WO2005059894A1 (en) Multi-lingual speech synthesis
EP1668629B1 (en) Letter-to-sound conversion for synthesized pronunciation of a text segment
KR100593757B1 (en) Foreign language studying device for improving foreign language studying efficiency, and on-line foreign language studying system using the same
KR20150105075A (en) Apparatus and method for automatic interpretation
JPH05143093A (en) Method and apparatus for forming model of uttered word
EP1668630B1 (en) Improvements to an utterance waveform corpus
JP3655808B2 (en) Speech synthesis apparatus, speech synthesis method, portable terminal device, and program recording medium
JP2000056789A (en) Speech synthesis device and telephone set
JP3366253B2 (en) Speech synthesizer
JP3626398B2 (en) Text-to-speech synthesizer, text-to-speech synthesis method, and recording medium recording the method
JP2015060038A (en) Voice synthesizer, language dictionary correction method, language dictionary correction computer program
KR200412740Y1 (en) Foreign language studying device for improving foreign language studying efficiency, and on-line foreign language studying system using the same
JPH09237096A (en) Kanji (chinese character) explaining method and device
JP5301376B2 (en) Speech synthesis apparatus and program
Görmez et al. TTTS: Turkish text-to-speech system
CN114327090A (en) Japanese input method and related device and equipment
KR20050006936A (en) Method of selective prosody realization for specific forms in dialogical text for Korean TTS system
JP2006284700A (en) Voice synthesizer and voice synthesizing processing program
Gakuru Development of a kenyan english text to speech system: A method of developing a TTS for a previously undefined english dialect

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2004784356

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 1020067006095

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: 2006114705

Country of ref document: RU

WWP Wipo information: published in national office

Ref document number: 1020067006095

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 2004784356

Country of ref document: EP