EP1668628A1 - Method for synthesizing speech (Verfahren zum Synthetisieren von Sprache) - Google Patents

Method for synthesizing speech (Verfahren zum Synthetisieren von Sprache)

Info

Publication number
EP1668628A1
Authority
EP
European Patent Office
Prior art keywords
match
speech
pitch
speech segment
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP04784355A
Other languages
English (en)
French (fr)
Other versions
EP1668628A4 (de)
Inventor
Fang Chen
Gui-Lin Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc filed Critical Motorola Inc
Publication of EP1668628A1
Publication of EP1668628A4


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/07 Concatenation rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules

Definitions

  • The present invention relates generally to Text-To-Speech (TTS) synthesis.
  • The invention is particularly useful for, but not necessarily limited to, determining an appropriate synthesized pronunciation of a text segment using a non-exhaustive utterance corpus.
  • Concatenative text-to-speech synthesis allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech.
  • A device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty providing high-quality, realistic synthesized speech, because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, the pronunciation of a word at the beginning of a sentence in an input text string may be drawn out or lengthened.
  • The pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required.
  • The pronunciation of a word depends on at least tone (pitch), volume and duration.
  • Many languages include numerous possible pronunciations of individual syllables.
  • A single syllable represented by a Chinese character (or other similar character-based script) may have up to 6 different pronunciations.
  • A large pre-recorded utterance waveform corpus of sentences is required. This corpus typically requires on average about 500 variations of each pronunciation if realistic speech synthesis is to be achieved.
  • An utterance waveform corpus of all pronunciations for every character would be prohibitively large.
  • The size of the utterance waveform corpus may be particularly limited when it is embedded in a small electronic device having a low memory capacity, such as a radio telephone or a personal digital assistant.
  • The algorithms used to compare the input text strings with the audio database also need to be efficient and fast so that the resulting synthesized and concatenated speech flows naturally and smoothly. Due to memory and processing speed limitations, existing TTS methods for embedded applications often result in speech that is unnatural or robotic sounding. There is therefore a need for an improved method of performing TTS that provides natural-sounding synthesized speech whilst using a non-exhaustive utterance corpus.
  • The present invention is a method of performing speech synthesis that includes comparing an input text segment with an utterance waveform corpus that contains numerous speech samples. The method determines whether there is a contextual best match between the text segment and one speech sample included in the utterance waveform corpus. If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment and a speech sample included in the utterance waveform corpus.
  • A contextual phonetic hybrid match requires a match of all implicit prosodic features in a defined prosodic feature group.
  • If there is no contextual phonetic hybrid match, the prosodic feature group is redefined by deleting one of the implicit prosodic features from the group.
  • The prosodic feature group is successively redefined by deleting one implicit prosodic feature at a time until a match is found between the input text segment and a speech sample. When a match is found, the matched speech sample is used to generate concatenative speech.
  • Fig. 1 is a block diagram of an electronic device upon which the invention may be implemented;
  • Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech in the Chinese language; and
  • Fig. 3 is a flow chart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
  • Referring to Fig. 1, there is illustrated a block diagram of an electronic device 10 upon which the invention may be implemented.
  • The device 10 includes a processor 30 operatively coupled, by a common bus 15, to a text memory module 20, a Read Only Memory (ROM) 40, a Random Access Memory (RAM) 50 and a waveform corpus 60.
  • The processor 30 is also operatively coupled to a touch screen display 90 and an input of a speech synthesizer 70.
  • An output of the speech synthesizer 70 is operatively coupled to a speaker 80.
  • The text memory module 20 is a store for text obtained by any available receiving means, such as radio reception, the internet, or plug-in portable memory cards.
  • The ROM 40 stores operating code for performing the invention as described in Figures 2 and 3.
  • The corpus 60 is essentially a conventional corpus, as are the speech synthesizer 70 and speaker 80; the touch screen display 90 is a user interface that allows display of text stored in the text memory module 20.
  • Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech 110 from an input text segment 120 in the Chinese language.
  • The text segment 120 is compared with an utterance waveform corpus 60, which includes a plurality of speech samples 140, to determine whether there is a contextual best match (step S110). If a contextual best match is found between the text segment 120 and a specific speech sample 140, that specific speech sample 140 is sent to a concatenating algorithm 150 for generating the concatenative speech 110. If no contextual best match is found between the text segment 120 and a specific speech sample 140, then the text segment 120 is compared again with the utterance waveform corpus 130 to determine whether there is a contextual phonetic hybrid match (step S120).
  • Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
  • A contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. If no match is found, one of the implicit prosodic features 210 is deleted from the defined prosodic feature group 220 and the group 220 is redefined as including all of the previously included features 210 less the deleted feature 210 (e.g., Step S130). The redefined prosodic feature group 220 is then compared with the text segment 120 to determine whether there is a match. The process of deleting an implicit prosodic feature 210, redefining the prosodic feature group 220, and then redetermining whether there is a contextual phonetic hybrid match, continues until a match is found (Steps S130, S140, etc.).
  • The matched speech sample 140, which matches the text segment 120, is sent to the concatenating algorithm 150 for generating concatenative speech 110.
  • If no contextual phonetic hybrid match is found after the feature group has been reduced, a basic phonetic match is performed, matching only pinyin (Step S180).
  • The utterance waveform corpus 60 is designed so that there is always at least one syllable included with the correct pinyin to match all possible input text segments 120. That basic phonetic match is then input into the concatenating algorithm 150.
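To make the flow of Figs. 2 and 3 concrete, the following Python sketch walks through the cascade of Step S110, Steps S120 through S140, and Step S180. It is illustrative only: the function names, the representation of the corpus as a list of annotated dictionaries, and the exact matching predicates are assumptions made for this sketch, not details taken from the patent.

```python
# Hypothetical sketch of the matching cascade of Figs. 2 and 3.
# A corpus entry is assumed to be a dict of implicit prosodic features plus
# the annotated text and a reference to the recorded waveform.

# Feature group ordered so that the feature least likely to affect natural
# prosody (length of phrase) is listed last and deleted first.
FEATURE_PRIORITY = [
    "pinyin", "tone_context", "co_articulation",
    "syllable_position", "phrase_position",
    "character_symbol", "length_of_phrase",
]

def contextual_best_match(segment, corpus):
    """Step S110: exact match of the text and all implicit prosodic features."""
    for sample in corpus:
        if sample.get("text") == segment.get("text") and all(
            sample.get(f) == segment.get(f) for f in FEATURE_PRIORITY
        ):
            return sample
    return None

def hybrid_match(segment, corpus, features):
    """Return the first sample whose annotations equal the segment's values
    for every feature in the (possibly reduced) prosodic feature group."""
    for sample in corpus:
        if all(sample.get(f) == segment.get(f) for f in features):
            return sample
    return None

def select_sample(segment, corpus):
    """Cascade: best match, then successively relaxed hybrid matches,
    then a basic pinyin-only match (which always succeeds by design)."""
    sample = contextual_best_match(segment, corpus)          # Step S110
    features = list(FEATURE_PRIORITY)
    while sample is None and len(features) > 1:              # Steps S120, S130, S140, ...
        sample = hybrid_match(segment, corpus, features)
        if sample is None:
            features.pop()                                   # redefine the feature group 220
    if sample is None:
        sample = hybrid_match(segment, corpus, ["pinyin"])   # Step S180: basic phonetic match
    return sample

def synthesize(segments, corpus, concatenate):
    """Select one speech sample per input text segment and concatenate them
    (concatenating algorithm 150)."""
    return concatenate([select_sample(s, corpus) for s in segments])
```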
  • The invention is thus a multi-layer, data-driven method for controlling the prosody (rhythm and intonation) of the resulting synthesized, concatenative speech 110.
  • Each layer of the method includes a redefined prosodic feature group 220.
  • A text segment 120 means any type of input text string or segment of coded language. It should not be limited to only visible text that is scanned or otherwise entered into a TTS system.
  • The utterance waveform corpus 130 of the present invention is annotated with information concerning each speech sample 140 (usually a word) that is included in the corpus 130.
  • The speech samples 140 themselves are generally recordings of actual human speech, usually digitized or analog waveforms. Annotations are thus required to identify the samples 140.
  • Such annotations may include the specific letters or characters (depending on the language) that define the sample 140 as well as the implicit prosodic features 210 of the speech sample 140.
  • The implicit prosodic features 210 include context information concerning how the speech sample 140 is used in a sentence.
  • A speech sample 140 in the Chinese language may include the following implicit prosodic features 210:
  • Text context: the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
  • Pinyin: the phonetic representation of a speech sample 140. Pinyin is a standard romanization of the Chinese language using the Western alphabet.
  • Tone context: the tone context of the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
  • Co-articulation: the phonetic-level representations that immediately precede and immediately follow the annotated text of a speech sample 140, such as phonemes or sub-syllables.
  • Syllable position: the position of a syllable in a prosodic phrase.
  • Phrase position: the position of a prosodic phrase in a sentence. Usually the phrase position is identified as one of the three positions of sentence initial, sentence medial and sentence final.
  • Character symbol: the code (e.g., ASCII code) representing the Chinese character that defines a speech sample 140.
  • Length of phrase: the number of Chinese characters included in a prosodic phrase.
  • Each character's sound could represent a speech sample 140 and could be annotated with the above implicit prosodic features 210.
  • The character "H" as found in the above sentence could be annotated as follows: Text context: ⁇, nowadays; Pinyin: guo2; Tone context: 1, 3; Co-articulation: ong, h; Syllable position: 2; Phrase position: 1; Character symbol: ASCII code for H; and Length of phrase: 2.
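As a concrete illustration of such an annotation record, the sketch below captures the example values listed above in a small Python dataclass. The field names, types, and coding conventions are assumptions introduced for this sketch; placeholders are used where the annotated character and its left context appear only as symbols in the text above.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AnnotatedSample:
    """One speech sample 140 together with its implicit prosodic features 210.
    Field names mirror the annotations listed above but are otherwise assumed."""
    text: str                         # annotated character(s) of the sample
    text_context: Tuple[str, str]     # characters immediately preceding / following
    pinyin: str                       # phonetic (pinyin) representation
    tone_context: Tuple[int, int]     # tones of the preceding / following characters
    co_articulation: Tuple[str, str]  # phonetic units immediately preceding / following
    syllable_position: int            # position of the syllable in its prosodic phrase
    phrase_position: int              # 1 = initial, 2 = medial, 3 = final (assumed coding)
    character_symbol: int             # character code of the annotated character
    length_of_phrase: int             # number of characters in the prosodic phrase

# The example annotation given above, with placeholders for the character
# and preceding context that are shown only as symbols in the text.
example = AnnotatedSample(
    text="<char>",
    text_context=("<prev>", "nowadays"),
    pinyin="guo2",
    tone_context=(1, 3),
    co_articulation=("ong", "h"),
    syllable_position=2,
    phrase_position=1,
    character_symbol=0,               # placeholder character code
    length_of_phrase=2,
)
```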
  • Step S110 determines whether there is a contextual best match between a text segment 120 and a speech sample 140.
  • A contextual best match is generally defined as the closest, or an exact, match of both 1) the letters or characters (depending on the language) of an input text segment 120 with the corresponding letters or characters of an annotated speech sample 140, and 2) the implicit prosodic features 210 of the input text segment 120 with the implicit prosodic features 210 of the annotated speech sample 140.
  • A best match is determined by identifying the greatest number of consecutive syllables in the input text segment that are identical to attributes and attribute positions in each of the waveform utterances (speech samples) in the waveform corpus 60.
  • If a contextual best match is found, the matching speech sample 140 is selected immediately as an element for use in the concatenating algorithm 150.
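One plausible reading of the "greatest number of consecutive syllables" criterion for Step S110 is sketched below; it is a syllable-level alternative to the exact-match version shown earlier. The syllable representation, the scoring function, and the toy corpus are assumptions; the patent does not prescribe a particular implementation of this search.

```python
def longest_consecutive_run(segment_syllables, utterance_syllables):
    """Longest run of consecutive syllables shared, in order, by the input
    text segment and a corpus utterance (a simplified matching criterion)."""
    best = 0
    for i in range(len(segment_syllables)):
        for j in range(len(utterance_syllables)):
            k = 0
            while (i + k < len(segment_syllables)
                   and j + k < len(utterance_syllables)
                   and segment_syllables[i + k] == utterance_syllables[j + k]):
                k += 1
            best = max(best, k)
    return best

def contextual_best_match(segment_syllables, corpus):
    """Step S110: pick the utterance with the greatest number of consecutive
    matching syllables; return None if no syllable matches at all."""
    scored = [(longest_consecutive_run(segment_syllables, u["syllables"]), u)
              for u in corpus]
    score, utterance = max(scored, key=lambda pair: pair[0], default=(0, None))
    return utterance if score > 0 else None

# Toy usage with invented pinyin syllables:
corpus = [{"syllables": ["zhong1", "guo2", "ren2"]},
          {"syllables": ["guo2", "jia1"]}]
print(contextual_best_match(["zhong1", "guo2"], corpus))  # picks the first utterance
```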
  • If no contextual best match is found, the method of the present invention determines whether there is a contextual phonetic hybrid match between the input text segment 120 and a speech sample 140.
  • A contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220.
  • As shown in Fig. 3, one embodiment of the present invention used to synthesize speech in the Chinese language uses a first defined prosodic feature group 220 that includes the implicit prosodic features 210 of pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase (Step S120). If none of the annotated speech samples 140 found in the utterance waveform corpus 130 have identical values for each of the above features 210 as found in the input text segment 120, then the corpus 130 does not contain a speech sample 140 that is close enough to the input text segment 120 based on the matching rules as applied in Step S120. Therefore the constraints of the matching rules must be relaxed, and thus broadened, to include other speech samples 140 that possess the next most preferable features 210 found in the input text segment 120.
  • The matching rules are broadened by deleting the one feature 210 in the defined prosodic feature group 220 that is least likely to affect the natural prosody of the input text segment 120.
  • The next most preferable features 210 in the illustrated embodiment of the present invention include all of the features 210 defined above less the length of phrase feature 210.
  • The order in which the implicit prosodic features 210 are deleted from the defined prosodic feature group 220 is determined empirically. When the features 210 are deleted in a proper order, the method of the present invention results in efficient and fast speech synthesis. The output speech therefore sounds more natural even though the utterance waveform corpus 130 may be relatively limited in size.
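Viewed another way, the successive redefinition of the prosodic feature group produces a fixed sequence of layers, each obtained from the previous one by deleting one feature. Only the first deletion (length of phrase) is stated above; the remainder of the deletion order in this Python sketch is an assumption for illustration, since the actual order is determined empirically.

```python
# First defined prosodic feature group 220 for the Chinese-language
# embodiment (Step S120).  The feature assumed least important is last.
FIRST_FEATURE_GROUP = [
    "pinyin", "tone_context", "co_articulation",
    "syllable_position", "phrase_position",
    "character_symbol", "length_of_phrase",
]

def relaxation_layers(feature_group):
    """Yield the successively redefined prosodic feature groups, from the
    full group down to pinyin alone, deleting the last feature each time."""
    group = list(feature_group)
    while group:
        yield tuple(group)
        group.pop()

for layer, features in enumerate(relaxation_layers(FIRST_FEATURE_GROUP), start=1):
    print(f"layer {layer}: {', '.join(features)}")
```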
  • The variable BestPitch may be determined based on a statistical analysis of the utterance waveform corpus 130.
  • A corpus 130 may include five tones, each having an average pitch.
  • Each annotated speech sample 140 in the corpus 130 may also include individual prosody information represented by the values of pitch, duration and energy. So the average values of pitch, duration and energy of the entire corpus 130 are available.
  • The best pitch for a particular context may then be determined using the following formula: BestPitch = pitch_tone - nIndex × empirical_value (Eq. 1), where pitch_tone is the average pitch, including tone, of the utterance waveform corpus; nIndex is the index of the text segment 120 in a prosody phrase; and empirical_value is an empirical value based on the utterance waveform corpus.
  • The empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however, this number could vary depending on the content of a particular utterance waveform corpus 130.
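A worked reading of Eq. 1 is sketched below. The per-tone average pitch values are invented placeholders, and the treatment of nIndex as the position of the segment within its prosody phrase is an assumption; only the form of the formula and the empirical value of 4 come from the description above.

```python
# Hypothetical per-tone average pitch of the corpus, in Hz (placeholder values;
# the description only states that five tones each have an average pitch).
AVERAGE_PITCH_BY_TONE = {1: 220.0, 2: 200.0, 3: 180.0, 4: 210.0, 5: 170.0}

EMPIRICAL_VALUE = 4  # empirical value used in the Chinese-language embodiment

def best_pitch(tone, n_index, empirical_value=EMPIRICAL_VALUE):
    """BestPitch = pitch_tone - nIndex * empirical_value  (Eq. 1), where
    pitch_tone is the corpus average pitch for the given tone and n_index is
    the index of the text segment 120 within its prosody phrase."""
    return AVERAGE_PITCH_BY_TONE[tone] - n_index * empirical_value

# Example: the second segment (index 1) of a phrase carrying tone 2.
print(best_pitch(tone=2, n_index=1))  # 200.0 - 1 * 4 = 196.0
```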
  • The invention is suitable for many languages.
  • For languages other than Chinese, some of the implicit prosodic features 210 would need to be deleted or redefined relative to the examples given hereinabove.
  • For example, the feature 210 identified above as tone context would be deleted in an application of the present invention for the English language, because English is not a tonal language.
  • Similarly, the feature 210 identified above as pinyin would likely be redefined as simply a phonetic symbol when the present invention is applied to English.
  • The present invention is therefore a multi-layer, data-driven prosodic control scheme that utilizes the implicit prosodic information in an utterance waveform corpus 130.
  • When searching for an appropriate speech sample 140 to match with a given input text segment 120, the method of the present invention employs a strategy based on multi-layer matching, where each layer is tried in turn until a sufficiently good match is found. By successively relaxing the constraints of each layer, the method efficiently determines whether the utterance waveform corpus 130 contains a match.
  • The method is therefore particularly appropriate for embedded TTS systems, where the size of the utterance waveform corpus 130 and the processing power of the system may be limited.
EP04784355A 2003-09-29 2004-09-17 Verfahren zum synthetisieren von sprache Withdrawn EP1668628A4 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNB031326986A CN1260704C (zh) 2003-09-29 2003-09-29 语音合成方法
PCT/US2004/030467 WO2005034082A1 (en) 2003-09-29 2004-09-17 Method for synthesizing speech

Publications (2)

Publication Number Publication Date
EP1668628A1 true EP1668628A1 (de) 2006-06-14
EP1668628A4 EP1668628A4 (de) 2007-01-10

Family

ID=34398359

Family Applications (1)

Application Number Title Priority Date Filing Date
EP04784355A Withdrawn EP1668628A4 (de) 2003-09-29 2004-09-17 Verfahren zum synthetisieren von sprache

Country Status (5)

Country Link
EP (1) EP1668628A4 (de)
KR (1) KR100769033B1 (de)
CN (1) CN1260704C (de)
MX (1) MXPA06003431A (de)
WO (1) WO2005034082A1 (de)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530406A (zh) * 2020-11-30 2021-03-19 深圳市优必选科技股份有限公司 一种语音合成方法、语音合成装置及智能设备

Families Citing this family (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
TWI421857B (zh) * 2009-12-29 2014-01-01 Ind Tech Res Inst 產生詞語確認臨界值的裝置、方法與語音辨識、詞語確認系統
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
KR20140008870A (ko) * 2012-07-12 2014-01-22 삼성전자주식회사 컨텐츠 정보 제공 방법 및 이를 적용한 방송 수신 장치
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
CN105989833B (zh) * 2015-02-28 2019-11-15 讯飞智元信息科技有限公司 多语种混语文本字音转换方法及系统
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
CN106157948B (zh) * 2015-04-22 2019-10-18 科大讯飞股份有限公司 一种基频建模方法及系统
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
CN105096934B (zh) * 2015-06-30 2019-02-12 百度在线网络技术(北京)有限公司 构建语音特征库的方法、语音合成方法、装置及设备
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK179309B1 (en) 2016-06-09 2018-04-23 Apple Inc Intelligent automated assistant in a home environment
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
DK179343B1 (en) 2016-06-11 2018-05-14 Apple Inc Intelligent task discovery
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
DK179049B1 (en) 2016-06-11 2017-09-18 Apple Inc Data driven natural language event detection and classification
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
CN106534528A (zh) * 2016-11-04 2017-03-22 广东欧珀移动通信有限公司 一种文本信息的处理方法、装置及移动终端
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
DK179560B1 (en) 2017-05-16 2019-02-18 Apple Inc. FAR-FIELD EXTENSION FOR DIGITAL ASSISTANT SERVICES
CN107481713B (zh) * 2017-07-17 2020-06-02 清华大学 一种混合语言语音合成方法及装置
CN109948124B (zh) * 2019-03-15 2022-12-23 腾讯科技(深圳)有限公司 语音文件切分方法、装置及计算机设备
CN110942765B (zh) * 2019-11-11 2022-05-27 珠海格力电器股份有限公司 一种构建语料库的方法、设备、服务器和存储介质
CN111128116B (zh) * 2019-12-20 2021-07-23 珠海格力电器股份有限公司 一种语音处理方法、装置、计算设备及存储介质
KR20210109222A (ko) 2020-02-27 2021-09-06 주식회사 케이티 음성을 합성하는 장치, 방법 및 컴퓨터 프로그램
US20210350788A1 (en) * 2020-05-06 2021-11-11 Samsung Electronics Co., Ltd. Electronic device for generating speech signal corresponding to at least one text and operating method of the electronic device
CN113393829B (zh) * 2021-06-16 2023-08-29 哈尔滨工业大学(深圳) 一种融合韵律和个人信息的中文语音合成方法

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5970454A (en) * 1993-12-16 1999-10-19 British Telecommunications Public Limited Company Synthesizing speech by converting phonemes to digital waveforms

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6449622A (en) * 1987-08-19 1989-02-27 Jsp Corp Resin foaming particle containing crosslinked polyolefin-based resin and manufacture thereof
US5704007A (en) * 1994-03-11 1997-12-30 Apple Computer, Inc. Utilization of multiple voice sources in a speech synthesizer
US6134528A (en) * 1997-06-13 2000-10-17 Motorola, Inc. Method device and article of manufacture for neural-network based generation of postlexical pronunciations from lexical pronunciations
KR100259777B1 (ko) * 1997-10-24 2000-06-15 정선종 텍스트/음성변환기에서의최적합성단위열선정방법
US7283964B1 (en) * 1999-05-21 2007-10-16 Winbond Electronics Corporation Method and apparatus for voice controlled devices with improved phrase storage, use, conversion, transfer, and recognition
DE60215296T2 (de) * 2002-03-15 2007-04-05 Sony France S.A. Verfahren und Vorrichtung zum Sprachsyntheseprogramm, Aufzeichnungsmedium, Verfahren und Vorrichtung zur Erzeugung einer Zwangsinformation und Robotereinrichtung
JP2003295882A (ja) * 2002-04-02 2003-10-15 Canon Inc 音声合成用テキスト構造、音声合成方法、音声合成装置及びそのコンピュータ・プログラム
KR100883649B1 (ko) * 2002-04-04 2009-02-18 삼성전자주식회사 텍스트/음성 변환 장치 및 방법
GB2388286A (en) * 2002-05-01 2003-11-05 Seiko Epson Corp Enhanced speech data for use in a text to speech system
CN1320482C (zh) * 2003-09-29 2007-06-06 摩托罗拉公司 标识文本串中的自然语音停顿的方法

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5970454A (en) * 1993-12-16 1999-10-19 British Telecommunications Public Limited Company Synthesizing speech by converting phonemes to digital waveforms

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HELEN M MENG ET AL: "CU VOCAL: CORPUS-BASED SYLLABLE CONCATENATION FOR CHINESE SPEECH SYNTHESIS ACROSS DOMAINS AND DIALECTS" ICSLP 2002 : 7TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING. DENVER, COLORADO, SEPT. 16 - 20, 2002, vol. 4 OF 4, 16 September 2002 (2002-09-16), pages 2373-2376, XP007011576 ISBN: 1-876346-40-X *
HIROKAWA T ET AL: "HIGH QUALITY SPEECH SYNTHESIS SYSTEM BASED ON WAVEFORM CONCATENATION OF PHONEME SEGMENT" IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS, COMMUNICATIONS AND COMPUTER SCIENCES, ENGINEERING SCIENCES SOCIETY, TOKYO, JP, vol. 76A, no. 11, 1 November 1993 (1993-11-01), pages 1964-1970, XP000420615 ISSN: 0916-8508 *
REN-HUA WANG ET AL.: "A CORPUS-BASED CHINESE SPEECH SYNTHESIS WITH CONTEXTUAL DEPENDENT UNIT SELECTION" IEEE INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP), vol. 2, 16 October 2000 (2000-10-16), pages 391-394, XP007010255 *
See also references of WO2005034082A1 *
WEIBIN ZHU ET AL: "Corpus building for data-driven tts systems" SPEECH SYNTHESIS, 2002. PROCEEDINGS OF 2002 IEEE WORKSHOP ON 11-13 SEPT. 2002, PISCATAWAY, NJ, USA,IEEE, 11 September 2002 (2002-09-11), pages 199-202, XP010653645 ISBN: 0-7803-7395-2 *
WOEI-LUEN PERNG ET AL: "Image Talk: a real time synthetic talking head using one single image with Chinese text-to-speech capability" COMPUTER GRAPHICS AND APPLICATIONS, 1998. PACIFIC GRAPHICS '98. SIXTH PACIFIC CONFERENCE ON SINGAPORE 26-29 OCT. 1998, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 26 October 1998 (1998-10-26), pages 140-148, XP010315487 ISBN: 0-8186-8620-0 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530406A (zh) * 2020-11-30 2021-03-19 深圳市优必选科技股份有限公司 一种语音合成方法、语音合成装置及智能设备

Also Published As

Publication number Publication date
KR20060066121A (ko) 2006-06-15
CN1604182A (zh) 2005-04-06
KR100769033B1 (ko) 2007-10-22
CN1260704C (zh) 2006-06-21
EP1668628A4 (de) 2007-01-10
WO2005034082A1 (en) 2005-04-14
MXPA06003431A (es) 2006-06-20

Similar Documents

Publication Publication Date Title
KR100769033B1 (ko) 스피치 합성 방법
US5949961A (en) Word syllabification in speech synthesis system
US6823309B1 (en) Speech synthesizing system and method for modifying prosody based on match to database
US6029132A (en) Method for letter-to-sound in text-to-speech synthesis
US6684187B1 (en) Method and system for preselection of suitable units for concatenative speech
US6505158B1 (en) Synthesis-based pre-selection of suitable units for concatenative speech
US6243680B1 (en) Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
EP0833304B1 (de) Grundfrequenzmuster enthaltende Prosodie-Datenbanken für die Sprachsynthese
KR100403293B1 (ko) 음성합성방법, 음성합성장치 및 음성합성프로그램을기록한 컴퓨터판독 가능한 매체
JP3481497B2 (ja) 綴り言葉に対する複数発音を生成し評価する判断ツリーを利用する方法及び装置
EP1213705A2 (de) Verfahren und Anordnung zur Sprachsysnthese ohne prosodische Veränderung
WO1996023298A2 (en) System amd method for generating and using context dependent sub-syllable models to recognize a tonal language
JPH0916602A (ja) 翻訳装置および翻訳方法
JP5198046B2 (ja) 音声処理装置及びそのプログラム
WO2006106182A1 (en) Improving memory usage in text-to-speech system
JP3576066B2 (ja) 音声合成システム、および音声合成方法
Hendessi et al. A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM
Akinwonmi Development of a prosodic read speech syllabic corpus of the Yoruba language
Kaur et al. BUILDING AText-TO-SPEECH SYSTEM FOR PUNJABI LANGUAGE
JP2005534968A (ja) 漢字語の読みの決定
Bharthi et al. Unit selection based speech synthesis for converting short text message into voice message in mobile phones
GB2292235A (en) Word syllabification.
CN114999447A (zh) 一种基于对抗生成网络的语音合成模型及训练方法
JP2003345372A (ja) 音声合成装置及び音声合成方法

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20060323

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE FR GB IT

DAX Request for extension of the european patent (deleted)
RBV Designated contracting states (corrected)

Designated state(s): DE FR GB IT

A4 Supplementary search report drawn up and despatched

Effective date: 20061208

17Q First examination report despatched

Effective date: 20070907

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20080118

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230520