EP1668628A1 - Method for synthesizing speech - Google Patents
Method for synthesizing speech
- Publication number
- EP1668628A1 (application number EP04784355A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- match
- speech
- pitch
- speech segment
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
- G10L13/07—Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
Definitions
- the present invention relates generally to Text-To-Speech (TTS) synthesis.
- TTS Text-To-Speech
- the invention is particularly useful for, but not necessarily limited to, determining an appropriate synthesized pronunciation of a text segment using a non-exhaustive utterance corpus.
- concatenated text to speech synthesis allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech.
- a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty providing high-quality, realistic synthesized speech, because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, the pronunciation of a word at the beginning of a sentence or input text string may be drawn out or lengthened.
- the pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required.
- the pronunciation of a word depends on at least tone (pitch), volume and duration.
- many languages include numerous possible pronunciations of individual syllables.
- a single syllable represented by a Chinese character (or other similar character based script) may have up to 6 different pronunciations.
- a large pre-recorded utterance waveform corpus of sentences is required. This corpus typically requires on average about 500 variations of each pronunciation if realistic speech synthesis is to be achieved.
- an utterance waveform corpus of all pronunciations for every character would be prohibitively large.
- the size of the utterance waveform corpus may be particularly limited when it is embedded in a small electronic device having a low memory capacity such as a radio telephone or a personal digital assistant.
- the algorithms used to compare the input text strings with the audio database also need to be efficient and fast so that the resulting synthesized and concatenated speech flows naturally and smoothly. Due to memory and processing speed limitations, existing TTS methods for embedded applications often result in speech that is unnatural or robotic sounding. There is therefore a need for an improved method for performing TTS to provide a natural sounding synthesized speech whilst using a non-exhaustive utterance corpus.
- the present invention is a method of performing speech synthesis that includes comparing an input text segment with an utterance waveform corpus that contains numerous speech samples. The method determines whether there is a contextual best match between the text segment and one speech sample included in the utterance waveform corpus. If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment and a speech sample included in the utterance waveform corpus.
- a contextual phonetic hybrid match requires a match of all implicit prosodic features in a defined prosodic feature group.
- if no match is found, one of the implicit prosodic features is deleted from the prosodic feature group, thereby redefining the group.
- the prosodic feature group is successively redefined by deleting one implicit prosodic feature from the group until a match is found between the input text segment and a speech sample. When a match is found, the matched speech sample is used to generate concatenative speech.
- Fig. 1 is a block diagram of an electronic device upon which the invention may be implemented
- Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech in the Chinese language
- Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
- Fig. 1 there is illustrated a block diagram of an electronic device 10 upon which the invention may be implemented.
- the device 10 includes a processor 30 operatively coupled, by a common bus 15, to a text memory module 20, a Read Only Memory (ROM) 40, a Random Access Memory (RAM) 50 and a waveform corpus 60.
- the processor 30 is also operatively coupled to a touch screen display 90 and an input of a speech synthesizer 70.
- An output of the speech synthesizer 70 is operatively coupled to a speaker 80.
- the text memory module 20 stores text obtained by any available receiving means, such as radio reception, the internet, or plug-in portable memory cards.
- the ROM stores operating code for performing the invention as described in Figures 2 and 3.
- the corpus 60 is essentially a conventional corpus, as are the speech synthesizer 70 and speaker 80; the touch screen display 90 is a user interface that allows display of text stored in the text memory module 20.
- Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech 110 from an input text segment 120 in the Chinese language.
- the text segment 120 is compared with an utterance waveform corpus 60, which includes a plurality of speech samples 140, to determine whether there is a contextual best match (step S110). If a contextual best match is found between a text segment 120 and a specific speech sample 140, that specific speech sample 140 is sent to a concatenating algorithm 150 for generating the concatenative speech 110. If no contextual best match is found between the text segment 120 and a specific speech sample 140, then the text segment 120 is compared again with the utterance waveform corpus 130 to determine whether there is a contextual phonetic hybrid match (step S120).
- Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
- a contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. If no match is found, one of the implicit prosodic features 210 is deleted from the defined prosodic feature group 220 and the group 220 is redefined as including all of the previously included features 210 less the deleted feature 210 (e.g., Step S130). The redefined prosodic feature group 220 is then compared with the text segment 120 to determine whether there is a match. The process of deleting an implicit prosodic feature 210, redefining the prosodic feature group 220, and then redetermining whether there is a contextual phonetic hybrid match, continues until a match is found (Steps S130, S140, etc.).
- the matched speech sample 140 which matches the text segment 120, is sent to the concatenating algorithm 150 for generating concatenative speech 110.
- if no match is found after all other implicit prosodic features 210 have been deleted, a basic phonetic match is performed, matching only pinyin (Step S180).
- the utterance waveform corpus 60 is designed so that there is always at least one syllable included with the correct pinyin to match all possible input text segments 120. That basic phonetic match is then input into the concatenating algorithm 150.
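The layered matching described above, from a full prosodic-feature match down through successive relaxation to the pinyin-only fallback of Step S180, can be sketched as follows. This is an illustrative reconstruction rather than the patented implementation; the feature names, the dict-based corpus representation, and the exact deletion order are assumptions (the patent states the order is determined empirically):

```python
# Implicit prosodic features, ordered so that the feature least likely to
# affect natural prosody is deleted first (popped from the end of the list).
FEATURES = ["pinyin", "tone_context", "co_articulation", "syllable_position",
            "phrase_position", "character_symbol", "length_of_phrase"]

def find_match(segment, corpus):
    """Return the first speech sample matching the segment, relaxing the
    prosodic feature group one feature at a time (multi-layer matching).

    segment: dict of feature values for the input text segment.
    corpus:  list of dicts, one per annotated speech sample.
    """
    group = list(FEATURES)
    while group:
        for speech_sample in corpus:
            if all(speech_sample.get(f) == segment.get(f) for f in group):
                return speech_sample   # contextual phonetic hybrid match
        group.pop()                    # relax: delete one feature, redefine the group
    return None                        # not reached if the corpus covers every pinyin
```

With a corpus designed so that every pinyin occurs at least once, the loop always terminates by the final pinyin-only layer with a basic phonetic match.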
- the invention is thus a multi-layer, data-driven method for controlling the prosody (rhythm and intonation) of the resulting synthesized, concatenative speech 110.
- each layer of the method includes a redefined prosodic feature group 220.
- a text segment 120 means any type of input text string or segment of coded language; it is not limited to visible text that is scanned or otherwise entered into a TTS system.
- the utterance waveform corpus 130 of the present invention is annotated with information concerning each speech sample 140 (usually a word) that is included in the corpus 130.
- the speech samples 140 themselves are generally recordings of actual human speech, usually digitized or analog waveforms. Annotations are thus required to identify the samples 140.
- Such annotations may include the specific letters or characters (depending on the language) that define the sample 140 as well as the implicit prosodic features 210 of the speech sample 140.
- the implicit prosodic features 210 include context information concerning how the speech sample 140 is used in a sentence.
- a speech sample 140 in the Chinese language may include the following implicit prosodic features 210: Text context: the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
- Pinyin: the phonetic representation of a speech sample 140. Pinyin is a standard romanization of the Chinese language using the Western alphabet.
- Tone context: the tone context of the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
- Co-articulation: the phonetic-level representatives that immediately precede and immediately follow the annotated text of a speech sample 140, such as phonemes or sub-syllables.
- Syllable position: the position of a syllable in a prosodic phrase.
- Phrase position: the position of a prosodic phrase in a sentence. Usually the phrase position is identified as one of the three positions of sentence initial, sentence medial and sentence final.
- Character symbol: the code (e.g., ASCII code) representing the Chinese character that defines a speech sample 140.
- Length of phrase: the number of Chinese characters included in a prosodic phrase.
- each character's sound could represent a speech sample 140 and could be annotated with the above implicit prosodic features 210.
- the character "H" as found in the above sentence could be annotated as follows: Text context: ⁇, nowadays; Pinyin: guo2; Tone context: 1, 3; Co-articulation: ong, h; Syllable position: 2; Phrase position: 1; Character symbol: ASCII code for H; and Length of phrase: 2.
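The annotation scheme above can be modeled as a simple record. The following is a minimal sketch; the class and field names are illustrative (not taken from the patent), and the example values mirror the annotated character above, with placeholders where the original text is garbled:

```python
from dataclasses import dataclass

@dataclass
class SampleAnnotation:
    """Implicit prosodic features annotated on one speech sample."""
    text_context: tuple       # characters immediately before / after the sample
    pinyin: str               # phonetic representation, e.g. "guo2" (tone digit appended)
    tone_context: tuple       # tones of the preceding / following characters
    co_articulation: tuple    # adjacent phonetic units (phonemes or sub-syllables)
    syllable_position: int    # position of the syllable in its prosodic phrase
    phrase_position: int      # 1 = sentence initial, 2 = medial, 3 = final
    character_symbol: str     # character code identifying the annotated character
    length_of_phrase: int     # number of characters in the prosodic phrase

# Example mirroring the annotation in the text (context values are placeholders).
sample = SampleAnnotation(text_context=("?", "?"), pinyin="guo2",
                          tone_context=(1, 3), co_articulation=("ong", "h"),
                          syllable_position=2, phrase_position=1,
                          character_symbol="?", length_of_phrase=2)
```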
- step S110 determines whether there is a contextual best match between a text segment 120 and a speech sample 140.
- a contextual best match is generally defined as the closest, or an exact, match of both 1 ) the letters or characters (depending on the language) of an input text segment 120 with the corresponding letters or characters of an annotated speech sample 140, and 2) the implicit prosodic features 210 of the input text segment 120 with the implicit prosodic features 210 of the annotated speech sample 140.
- a best match is determined by identifying the greatest number of consecutive syllables in the input text segment that are identical to attributes and attribute positions in each of the waveform utterances (speech samples) in the waveform corpus 60.
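One way to score such a best match is to find the longest run of consecutive syllables shared between the input segment and each corpus utterance. The following is a hedged sketch, with syllables represented as plain strings; the patent does not specify the search algorithm, so the brute-force scan here is an assumption:

```python
def longest_common_run(segment, utterance):
    """Length of the longest run of consecutive identical syllables."""
    best = 0
    for i in range(len(segment)):
        for j in range(len(utterance)):
            k = 0
            # Extend the run while syllables keep matching at both positions.
            while (i + k < len(segment) and j + k < len(utterance)
                   and segment[i + k] == utterance[j + k]):
                k += 1
            best = max(best, k)
    return best

def contextual_best_match(segment, corpus):
    """Pick the utterance sharing the most consecutive syllables with the segment."""
    return max(corpus, key=lambda utterance: longest_common_run(segment, utterance))
```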
- if a contextual best match is found, the speech sample 140 is selected immediately as an element for use in the concatenating algorithm 150.
- the method of the present invention determines whether there is a contextual phonetic hybrid match between an input text segment 120 and a speech sample 140.
- a contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. As shown in Fig.
- one embodiment of the present invention used to synthesize speech in the Chinese language uses a first defined prosodic feature group 220 that includes the implicit prosodic features 210 of pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase (Step S120). If none of the annotated speech samples 140 found in the utterance waveform corpus 130 have identical values for each of the above features 210 as found in the input text segment 120, then the corpus 130 does not contain a speech sample 140 that is close enough to the input text segment 120 based on the matching rules as applied in Step S120. Therefore the constraints of the matching rules must be relaxed and thus broadened to include other speech samples 140 that possess the next most preferable features 210 found in the input text segment 120.
- the matching rules are broadened by deleting the one feature 210 found in the defined prosodic feature group 220 that is least likely to affect the natural prosody of the input text segment 120.
- the next most preferable features 210 found in the illustrated embodiment of the present invention include all of the features 210 defined above less the length of phrase feature 210.
- the order in which the implicit prosodic features 210 are deleted from the defined prosodic feature group 220 is determined empirically. When the features 210 are deleted in a proper order, the method of the present invention results in efficient and fast speech synthesis. The output speech therefore sounds more natural even though the utterance waveform corpus 130 may be relatively limited in size.
- the variable BestPitch may be determined based on a statistical analysis of the utterance waveform corpus 130.
- a corpus 130 may include five tones, each having an average pitch.
- Each annotated speech sample 140 in the corpus 130 may also include individual prosody information represented by the values of pitch, duration and energy. So the average values of pitch, duration and energy of the entire corpus 130 are available.
- the best pitch for a particular context may then be determined using the following formula: BestPitch = pitch_tone - nIndex × empirical_value (Eq. 1), where:
- pitch_tone is the average pitch, including tone, of the utterance waveform corpus;
- nIndex is the index of the text segment 120 in a prosody phrase; and
- empirical_value is an empirical value based on the utterance waveform corpus.
- the empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however this number could vary depending on the content of a particular utterance waveform corpus 130.
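Equation 1 reduces to a single line of arithmetic. A sketch follows, with the corpus-dependent constant defaulting to the empirical value of 4 mentioned above; the function and parameter names are assumptions, not taken from the patent:

```python
def best_pitch(pitch_tone, n_index, empirical_value=4.0):
    """Eq. 1: BestPitch = pitch_tone - nIndex * empirical_value.

    pitch_tone:      corpus-average pitch for the syllable's tone
    n_index:         index of the text segment within its prosody phrase
    empirical_value: corpus-dependent constant (4 for the corpus described)
    """
    return pitch_tone - n_index * empirical_value
```

Later syllables in a prosody phrase thus receive a progressively lower target pitch, modeling natural declination.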
- the invention is suitable for many languages.
- for languages other than Chinese, some of the implicit prosodic features 210 would need to be deleted or redefined relative to the examples given hereinabove.
- the feature 210 identified above as tone context would be deleted in an application of the present invention for the English language because English is not a tonal language.
- the feature 210 identified above as pinyin would likely be redefined as simply a phonetic symbol when the present invention is applied to English.
- the present invention is therefore a multi-layer, data-driven prosodic control scheme that utilizes the implicit prosodic information in an utterance waveform corpus 130.
- When searching for an appropriate speech sample 140 to match with a given input text segment 120, the method of the present invention employs a strategy based on multi-layer matching, where each layer is tried in turn until a sufficiently good match is found. By successively relaxing the constraints of each layer, the method efficiently determines whether the utterance waveform corpus 130 contains a match.
- the method is therefore particularly appropriate for embedded TTS systems where the size of the utterance waveform corpus 130 and the processing power of the system may be limited.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB031326986A CN1260704C (zh) | 2003-09-29 | 2003-09-29 | 语音合成方法 |
PCT/US2004/030467 WO2005034082A1 (en) | 2003-09-29 | 2004-09-17 | Method for synthesizing speech |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1668628A1 true EP1668628A1 (de) | 2006-06-14 |
EP1668628A4 EP1668628A4 (de) | 2007-01-10 |
Family
ID=34398359
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP04784355A Withdrawn EP1668628A4 (de) | 2003-09-29 | 2004-09-17 | Verfahren zum synthetisieren von sprache |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1668628A4 (de) |
KR (1) | KR100769033B1 (de) |
CN (1) | CN1260704C (de) |
MX (1) | MXPA06003431A (de) |
WO (1) | WO2005034082A1 (de) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112530406A (zh) * | 2020-11-30 | 2021-03-19 | 深圳市优必选科技股份有限公司 | 一种语音合成方法、语音合成装置及智能设备 |
Families Citing this family (59)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
TWI421857B (zh) * | 2009-12-29 | 2014-01-01 | Ind Tech Res Inst | 產生詞語確認臨界值的裝置、方法與語音辨識、詞語確認系統 |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
KR20140008870A (ko) * | 2012-07-12 | 2014-01-22 | 삼성전자주식회사 | 컨텐츠 정보 제공 방법 및 이를 적용한 방송 수신 장치 |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
CN105989833B (zh) * | 2015-02-28 | 2019-11-15 | 讯飞智元信息科技有限公司 | 多语种混语文本字音转换方法及系统 |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
CN106157948B (zh) * | 2015-04-22 | 2019-10-18 | 科大讯飞股份有限公司 | 一种基频建模方法及系统 |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
CN105096934B (zh) * | 2015-06-30 | 2019-02-12 | 百度在线网络技术(北京)有限公司 | 构建语音特征库的方法、语音合成方法、装置及设备 |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
DK179309B1 (en) | 2016-06-09 | 2018-04-23 | Apple Inc | Intelligent automated assistant in a home environment |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
CN106534528A (zh) * | 2016-11-04 | 2017-03-22 | 广东欧珀移动通信有限公司 | 一种文本信息的处理方法、装置及移动终端 |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | FAR-FIELD EXTENSION FOR DIGITAL ASSISTANT SERVICES |
CN107481713B (zh) * | 2017-07-17 | 2020-06-02 | 清华大学 | 一种混合语言语音合成方法及装置 |
CN109948124B (zh) * | 2019-03-15 | 2022-12-23 | 腾讯科技(深圳)有限公司 | 语音文件切分方法、装置及计算机设备 |
CN110942765B (zh) * | 2019-11-11 | 2022-05-27 | 珠海格力电器股份有限公司 | 一种构建语料库的方法、设备、服务器和存储介质 |
CN111128116B (zh) * | 2019-12-20 | 2021-07-23 | 珠海格力电器股份有限公司 | 一种语音处理方法、装置、计算设备及存储介质 |
KR20210109222A (ko) | 2020-02-27 | 2021-09-06 | 주식회사 케이티 | 음성을 합성하는 장치, 방법 및 컴퓨터 프로그램 |
US20210350788A1 (en) * | 2020-05-06 | 2021-11-11 | Samsung Electronics Co., Ltd. | Electronic device for generating speech signal corresponding to at least one text and operating method of the electronic device |
CN113393829B (zh) * | 2021-06-16 | 2023-08-29 | 哈尔滨工业大学(深圳) | 一种融合韵律和个人信息的中文语音合成方法 |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5970454A (en) * | 1993-12-16 | 1999-10-19 | British Telecommunications Public Limited Company | Synthesizing speech by converting phonemes to digital waveforms |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6449622A (en) * | 1987-08-19 | 1989-02-27 | Jsp Corp | Resin foaming particle containing crosslinked polyolefin-based resin and manufacture thereof |
US5704007A (en) * | 1994-03-11 | 1997-12-30 | Apple Computer, Inc. | Utilization of multiple voice sources in a speech synthesizer |
US6134528A (en) * | 1997-06-13 | 2000-10-17 | Motorola, Inc. | Method device and article of manufacture for neural-network based generation of postlexical pronunciations from lexical pronunciations |
KR100259777B1 (ko) * | 1997-10-24 | 2000-06-15 | 정선종 | 텍스트/음성변환기에서의최적합성단위열선정방법 |
US7283964B1 (en) * | 1999-05-21 | 2007-10-16 | Winbond Electronics Corporation | Method and apparatus for voice controlled devices with improved phrase storage, use, conversion, transfer, and recognition |
DE60215296T2 (de) * | 2002-03-15 | 2007-04-05 | Sony France S.A. | Verfahren und Vorrichtung zum Sprachsyntheseprogramm, Aufzeichnungsmedium, Verfahren und Vorrichtung zur Erzeugung einer Zwangsinformation und Robotereinrichtung |
JP2003295882A (ja) * | 2002-04-02 | 2003-10-15 | Canon Inc | 音声合成用テキスト構造、音声合成方法、音声合成装置及びそのコンピュータ・プログラム |
KR100883649B1 (ko) * | 2002-04-04 | 2009-02-18 | 삼성전자주식회사 | 텍스트/음성 변환 장치 및 방법 |
GB2388286A (en) * | 2002-05-01 | 2003-11-05 | Seiko Epson Corp | Enhanced speech data for use in a text to speech system |
CN1320482C (zh) * | 2003-09-29 | 2007-06-06 | 摩托罗拉公司 | 标识文本串中的自然语音停顿的方法 |
-
2003
- 2003-09-29 CN CNB031326986A patent/CN1260704C/zh not_active Expired - Lifetime
-
2004
- 2004-09-17 WO PCT/US2004/030467 patent/WO2005034082A1/en active Application Filing
- 2004-09-17 EP EP04784355A patent/EP1668628A4/de not_active Withdrawn
- 2004-09-17 MX MXPA06003431A patent/MXPA06003431A/es not_active Application Discontinuation
- 2004-09-17 KR KR1020067006170A patent/KR100769033B1/ko active IP Right Grant
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5970454A (en) * | 1993-12-16 | 1999-10-19 | British Telecommunications Public Limited Company | Synthesizing speech by converting phonemes to digital waveforms |
Non-Patent Citations (6)
Title |
---|
HELEN M MENG ET AL: "CU VOCAL: CORPUS-BASED SYLLABLE CONCATENATION FOR CHINESE SPEECH SYNTHESIS ACROSS DOMAINS AND DIALECTS" ICSLP 2002 : 7TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING. DENVER, COLORADO, SEPT. 16 - 20, 2002, vol. 4 OF 4, 16 September 2002 (2002-09-16), pages 2373-2376, XP007011576 ISBN: 1-876346-40-X * |
HIROKAWA T ET AL: "HIGH QUALITY SPEECH SYNTHESIS SYSTEM BASED ON WAVEFORM CONCATENATION OF PHONEME SEGMENT" IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS, COMMUNICATIONS AND COMPUTER SCIENCES, ENGINEERING SCIENCES SOCIETY, TOKYO, JP, vol. 76A, no. 11, 1 November 1993 (1993-11-01), pages 1964-1970, XP000420615 ISSN: 0916-8508 * |
REN-HUA WANG ET AL.: "A CORPUS-BASED CHINESE SPEECH SYNTHESIS WITH CONTEXTUAL DEPENDENT UNIT SELECTION" IEEE INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP), vol. 2, 16 October 2000 (2000-10-16), pages 391-394, XP007010255 * |
See also references of WO2005034082A1 * |
WEIBIN ZHU ET AL: "Corpus building for data-driven tts systems" SPEECH SYNTHESIS, 2002. PROCEEDINGS OF 2002 IEEE WORKSHOP ON 11-13 SEPT. 2002, PISCATAWAY, NJ, USA,IEEE, 11 September 2002 (2002-09-11), pages 199-202, XP010653645 ISBN: 0-7803-7395-2 * |
WOEI-LUEN PERNG ET AL: "Image Talk: a real time synthetic talking head using one single image with Chinese text-to-speech capability" COMPUTER GRAPHICS AND APPLICATIONS, 1998. PACIFIC GRAPHICS '98. SIXTH PACIFIC CONFERENCE ON SINGAPORE 26-29 OCT. 1998, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 26 October 1998 (1998-10-26), pages 140-148, XP010315487 ISBN: 0-8186-8620-0 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112530406A (zh) * | 2020-11-30 | 2021-03-19 | 深圳市优必选科技股份有限公司 | 一种语音合成方法、语音合成装置及智能设备 |
Also Published As
Publication number | Publication date |
---|---|
KR20060066121A (ko) | 2006-06-15 |
CN1604182A (zh) | 2005-04-06 |
KR100769033B1 (ko) | 2007-10-22 |
CN1260704C (zh) | 2006-06-21 |
EP1668628A4 (de) | 2007-01-10 |
WO2005034082A1 (en) | 2005-04-14 |
MXPA06003431A (es) | 2006-06-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR100769033B1 (ko) | 스피치 합성 방법 | |
US5949961A (en) | Word syllabification in speech synthesis system | |
US6823309B1 (en) | Speech synthesizing system and method for modifying prosody based on match to database | |
US6029132A (en) | Method for letter-to-sound in text-to-speech synthesis | |
US6684187B1 (en) | Method and system for preselection of suitable units for concatenative speech | |
US6505158B1 (en) | Synthesis-based pre-selection of suitable units for concatenative speech | |
US6243680B1 (en) | Method and apparatus for obtaining a transcription of phrases through text and spoken utterances | |
US6910012B2 (en) | Method and system for speech recognition using phonetically similar word alternatives | |
EP0833304B1 (de) | Grundfrequenzmuster enthaltende Prosodie-Datenbanken für die Sprachsynthese | |
KR100403293B1 (ko) | 음성합성방법, 음성합성장치 및 음성합성프로그램을기록한 컴퓨터판독 가능한 매체 | |
JP3481497B2 (ja) | 綴り言葉に対する複数発音を生成し評価する判断ツリーを利用する方法及び装置 | |
EP1213705A2 (de) | Verfahren und Anordnung zur Sprachsysnthese ohne prosodische Veränderung | |
WO1996023298A2 (en) | System amd method for generating and using context dependent sub-syllable models to recognize a tonal language | |
JPH0916602A (ja) | 翻訳装置および翻訳方法 | |
JP5198046B2 (ja) | 音声処理装置及びそのプログラム | |
WO2006106182A1 (en) | Improving memory usage in text-to-speech system | |
JP3576066B2 (ja) | 音声合成システム、および音声合成方法 | |
Hendessi et al. | A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM | |
Akinwonmi | Development of a prosodic read speech syllabic corpus of the Yoruba language | |
Kaur et al. | BUILDING A TEXT-TO-SPEECH SYSTEM FOR PUNJABI LANGUAGE | |
JP2005534968A (ja) | 漢字語の読みの決定 | |
Bharthi et al. | Unit selection based speech synthesis for converting short text message into voice message in mobile phones | |
GB2292235A (en) | Word syllabification. | |
CN114999447A (zh) | 一种基于对抗生成网络的语音合成模型及训练方法 | |
JP2003345372A (ja) | 音声合成装置及び音声合成方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20060323 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): DE FR GB IT |
|
DAX | Request for extension of the european patent (deleted) | ||
RBV | Designated contracting states (corrected) |
Designated state(s): DE FR GB IT |
|
A4 | Supplementary search report drawn up and despatched |
Effective date: 20061208 |
|
17Q | First examination report despatched |
Effective date: 20070907 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20080118 |
|
P01 | Opt-out of the competence of the unified patent court (upc) registered |
Effective date: 20230520 |