US8155963B2 - Autonomous system and method for creating readable scripts for concatenative text-to-speech synthesis (TTS) corpora - Google Patents
- Publication number
- US8155963B2 (application US11/332,292, also identified as US33229206A)
- Authority
- US
- United States
- Prior art keywords
- script
- words
- template
- cohesive
- phonemes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Images
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Definitions
- the present invention generally relates to a method and system for providing an improved ability to create a cohesive script for generating a speech corpus (e.g., voice database) for concatenative Text-To-Speech synthesis (“concatenative TTS”), and more particularly, for providing improved quality of that speech corpus resulting from greater fluency and more-natural prosody in the recordings based on the cohesive script.
- phoneme means the smallest unit of speech used in linguistic analysis.
- the sound represented by “s” is a phoneme.
- phoneme can refer to shorter units, such as fractions of a phoneme (e.g., the "burst portion of t" or the "first 1/3 of s"), or longer units, such as syllables.
- sounds represented by “sh” or “k” are examples of phonemes which have unambiguous pronunciations. It is noted that phonemes (e.g., “sh”) are not equivalent to the number of letters. That is, two letters (e.g., “sh”) can make one phoneme, and one letter, “x”, can make two phonemes, “k” and “s”.
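The letter/phoneme mismatch noted above can be made concrete with a toy transcription table (the phoneme spellings here are informal assumptions for illustration, not a standard phone set):

```python
# Toy transcriptions (invented for illustration): letter count and
# phoneme count are independent.
toy_phonemes = {
    "ship": ["sh", "ih", "p"],      # 4 letters -> 3 phonemes ("sh" is one)
    "box":  ["b", "aa", "k", "s"],  # 3 letters -> 4 phonemes ("x" -> "k"+"s")
}

for word, phones in toy_phonemes.items():
    print(f"{word}: {len(word)} letters, {len(phones)} phonemes")
```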
- English speakers generally have a repertoire of about 40 phonemes and utter about 10 phonemes per second.
- the ordinarily skilled artisan would understand that the present invention is not limited to any particular language (e.g., English) or number of phonemes (e.g., 40).
- the exemplary features described herein with reference to the English language are for exemplary purposes only.
- "concatenative" means joining together sequences of recordings of phonemes.
- Phonemes are linguistic units; e.g., there is one phoneme "k".
- a concatenative system will employ many recordings of “k”, such as one from the beginning of “kook” and another from “keep”, which sound considerably different.
- a “text database” means any collection of text, for example, a collection of existing sentences, phrases, words, etc., or combinations thereof.
- a “script” generally means a written text document, or collection of words, sentences, etc., which can be read by a professional speaker to generate a speech database, or a speech corpus (or corpora).
- a “speech corpus” (or “speech corpora”) generally means a collection of speech recordings or audio recordings (e.g., which are generated by reading a script).
- the conventional script generally is made up largely of words and phrases that are chosen for their diverse phoneme content, to ensure ample representation of most or all of the English phoneme sequences.
- a conventional method of generating the script is by data mining.
- data mining generally includes, for example, searching through a very large text database to find words or word sequences containing the required phoneme sequences.
- a database sufficiently large to deliver the required phonemic content generally may contain many sentences with grammatical errors, poor writing, non-English words, and other impediments to smooth oral delivery by the speaker.
- a rare phoneme sequence may be found embedded in a 20-word sentence.
- including this 20-word sentence in the script provides one useful word but also drags 19 superfluous words along with it.
- thus, the length of the script is undesirably increased; yet omitting the superfluous words would preclude smooth reading of the sentence.
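The bloat described above can be quantified with a small accounting sketch (the word counts and phoneme-sequence labels are invented for illustration):

```python
# Each mined sentence is kept for the few required phoneme sequences it
# covers, but contributes all of its words to the script.
def script_cost(mined, required):
    """Return (total_words, covered_sequences) for mined sentences given
    as (word_count, set_of_phoneme_sequences) pairs."""
    covered, words = set(), 0
    for word_count, sequences in mined:
        covered |= (sequences & required)
        words += word_count
    return words, covered

required = {"k-t-k", "oo-t", "oo-l"}
mined = [
    (20, {"k-t-k"}),  # a 20-word sentence kept for one rare triphone
    (18, {"oo-t"}),
    (22, {"oo-l"}),
]
words, covered = script_cost(mined, required)
print(words, "script words to cover", len(covered), "sequences")  # 60 ... 3
```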
- Scripts that are generated by conventional methods and systems contain numerous examples of this problem. That is, a script is generated by conventional means to include a long difficult sentence solely for the purpose of providing one essential word (or phrase, etc.).
- a script developed according to the conventional methods and systems can read more like a hodgepodge of often awkward sentences that are stripped of their original context.
- professional speakers who are called upon to read these conventional scripts, for example, for three hours or more in a single stretch of time, usually consider the task to be an onerous one, which can affect the quality of the reading.
- an exemplary feature of the present invention is to provide a method and system for providing an improved ability to create a script, and the speech corpus (i.e., a voice or speech database) for concatenative Text-To-Speech generated by reading such a script.
- the present invention more particularly provides improved quality of the speech corpus resulting from greater fluency and more-natural prosody in the recordings.
- the present invention exemplarily begins with the assumption that it generally would be desirable (e.g., necessary) for a speaker to read about 10,000 English phoneme sequences.
- Applicants have recognized that those sounds preferably can be embedded in real sentences which have some meaning.
- a list of sounds (e.g., "oot," "ool," "oop," etc.) can be provided.
- it can be difficult to easily make sentences that assimilate such a list of sounds.
- the present invention preferably can consult a pronunciation dictionary and find a list of words, or in some cases word sequences (e.g., pairs), that contain the desired (e.g., required) sounds. Thus, a list of 10,000 words or word sequences could be provided. However, a fluently-readable script still may not be produced.
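A minimal sketch of that dictionary lookup, with invented pronunciations (not entries from a real pronunciation dictionary):

```python
def words_with_sound(sound_seq, dictionary):
    """Return words whose pronunciation contains sound_seq as a
    contiguous run of phonemes."""
    n = len(sound_seq)
    return [
        word
        for word, phones in dictionary.items()
        if any(phones[i:i + n] == sound_seq
               for i in range(len(phones) - n + 1))
    ]

# Toy pronunciation dictionary (entries invented for illustration).
pron_dict = {
    "boot": ["b", "oo", "t"],
    "cool": ["k", "oo", "l"],
    "loop": ["l", "oo", "p"],
    "keep": ["k", "iy", "p"],
}

print(words_with_sound(["oo", "t"], pron_dict))  # ['boot']
```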
- an intelligent software system preferably can be provided that can take as its input an unstructured vocabulary list and autonomously produce one or more cohesive written text documents (i.e., cohesive scripts), which can be read by a professional speaker to generate a speech corpus (or corpora) having greater fluency and more-natural prosody in the recordings.
- a series of pre-written templates preferably can imbue the document with ideas, concepts, and characters that can be used to form the basis of its storyline or content.
- the exemplary features of the invention preferably can include script structural templates which can be thought of as grammars for generating different types of scripts that satisfy predetermined structural properties.
- the script structural templates may cascade, for example, into paragraph and sentence templates.
- the exemplary invention preferably can include templates to produce conceptual coherence such as a story line, plot, or theme for selecting characters and events to describe, and the order in which they will be introduced. These templates preferably can be used to populate the script with content.
- the exemplary invention preferably provides a script that can meet many (or all) of the requirements of conventional scripts by containing many (or all) of the required phoneme sequences in a far more efficient way, namely by providing a script which may contain a higher concentration of required phoneme sequences in each sentence.
- a script provided according to the exemplary invention preferably may be much easier to read than a script provided according to the conventional methods and systems.
- the exemplary aspects of the present invention can improve the recording process by making the recording process faster and cheaper; and also can improve the resulting speech corpus, for example, because the script may be read with a more natural inflection.
- a method of generating a speech corpus for concatenative text-to-speech includes autonomously generating a cohesive script from a text database.
- the method preferably includes selecting a word or a word sequence from the text database based on an enumerated phoneme sequence, and then generating a coherent script including the selected word or word sequence.
- the enumerated phoneme sequence preferably includes a diphone, a triphone, a quadphone, a syllable, and/or a bisyllable.
- the method preferably includes extracting at least one predetermined sequence of phonemes from the text database, associating the predetermined sequence of phonemes with a plurality of words included in the text database that include the predetermined sequence of phonemes, selecting N words that include the predetermined sequence of phonemes, and generating the cohesive script based on the N words.
- the text database preferably includes an unstructured vocabulary list, an inventory of occurrences of at least one phonemic unit, an inventory of occurrences of at least one phonemic sequence, a dictionary, and/or a word pronunciation guide.
- the autonomous generation of the cohesive script preferably includes extracting a plurality of triphones from the text database, associating each of the plurality of triphones with a plurality of words included in the text database that include the each of the plurality of triphones, selecting N words that include each of the plurality of triphones, and generating the cohesive script based on the N words.
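The extract/associate/select steps above can be sketched as follows (the toy dictionary, pronunciations, and choice of N are assumptions for illustration, not the patent's data):

```python
from collections import defaultdict

def extract_triphones(phones):
    """All contiguous three-phoneme sequences in a pronunciation."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def associate(dictionary):
    """Map each triphone to the dictionary words containing it."""
    index = defaultdict(list)
    for word, phones in dictionary.items():
        for tri in sorted(set(extract_triphones(phones))):
            index[tri].append(word)
    return index

def select_words(index, n):
    """Select up to N words per triphone as input to script generation."""
    return {tri: words[:n] for tri, words in index.items()}

toy_dictionary = {
    "kook":   ["k", "uw", "k"],
    "cuckoo": ["k", "uw", "k", "uw"],
    "keeper": ["k", "iy", "p", "er"],
}
index = associate(toy_dictionary)
print(select_words(index, n=2)[("k", "uw", "k")])  # ['kook', 'cuckoo']
```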
- a system for generating a speech corpus for concatenative text-to-speech includes an extracting unit that extracts a plurality of triphones from a text database, an associating unit that associates each of the plurality of triphones with a plurality of words included in the text database that include the each of the plurality of triphones, a selecting unit that selects N words that include each of the plurality of triphones, and an input unit that inputs the N selected words into an autonomous language generating unit, wherein the autonomous language generating unit generates the cohesive script.
- FIG. 1 illustrates an exemplary system 100 according to the present invention
- FIG. 2 illustrates another exemplary system 200 according to the present invention
- FIG. 3 illustrates an exemplary method 300 , according to the present invention
- FIG. 4 illustrates an exemplary hardware/information handling system 400 for incorporating the present invention therein.
- FIG. 5 illustrates a recordable signal bearing medium 500 (e.g., recordable storage medium) for storing steps of a program of a method according to the present invention.
- FIGS. 1-5 there are shown exemplary aspects of the method and structures according to the present invention.
- the unique and unobvious features of the exemplary aspects of the present invention are directed to a novel system and method for providing an improved ability to create a voice database for concatenative Text-To-Speech. More particularly, the exemplary aspects of the invention provide improved quality of that database resulting from greater fluency and more-natural prosody in the script used to make the recordings, as well as more compactness of coverage of a plurality of phonetic events.
- the exemplary invention preferably provides an extracting unit that extracts (e.g., see 115 ), for example, all triphones from an unabridged English dictionary including a word pronunciation guide (e.g., see 110 ).
- the term “triphone” generally means, for example, any phonetic sequence, which might include a diphone, etc.
- a “triphone” can be a sequence of (or phrase having) three phonemes.
- the ordinarily skilled artisan would understand, however, that the present invention is not limited to triphones, and also may include diphones, quadphones, syllables, bisyllables, etc.
- diphone generally means, for example, a unit of speech that includes the second half of one phoneme followed by the first half of the next phoneme, cut out of the words in which they were originally articulated. In this way, diphones contain the transitions from one sound to the next. Thus, diphones form building blocks for synthetic speech.
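Under that definition, the diphone inventory of a phoneme string can be listed by pairing adjacent phonemes (a sketch only; actual diphone cutting operates on audio, splitting each phoneme near its midpoint):

```python
def diphone_labels(phones):
    """Label the diphone spanning each adjacent phoneme pair."""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

print(diphone_labels(["k", "uw", "k"]))  # ['k-uw', 'uw-k']
```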
- the phrase “picked carrots” includes a triphone (e.g., the phonetic sequence of phonemes k-t-k).
- this triphone, or phonetic sequence of phonemes could be included in a sentence or phrase in the script.
- most, or preferably, all of the possible triphones may be included in the script.
- the triphones can be bordered by the middle of the phone or syllable (as typically done for diphones) or bordered by the edge.
- the present invention is not limited to triphones, and also may include diphones, quadphones, syllables, bisyllables, etc.
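Because sequences such as k-t-k in "picked carrots" span a word boundary, triphones are taken over the concatenated phoneme stream of the whole phrase. A sketch with invented transcriptions:

```python
def phrase_triphones(words_phones):
    """Triphones over the concatenated phoneme stream of a phrase,
    including those that cross word boundaries."""
    stream = [p for phones in words_phones for p in phones]
    return [tuple(stream[i:i + 3]) for i in range(len(stream) - 2)]

# Toy transcriptions of "picked carrots" (invented for illustration).
picked = ["p", "ih", "k", "t"]
carrots = ["k", "ae", "r", "ah", "t", "s"]

tris = phrase_triphones([picked, carrots])
print(("k", "t", "k") in tris)  # True: spans the word boundary
```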
- the triphones preferably can be associated with dictionary words that contain such triphones (e.g., see 120 ).
- the exemplary invention preferably selects N words that contain each triphone (e.g., see 125 ).
- the N selected words are then input into an autonomous language generating unit (e.g., 130 ), which performs the steps according to autonomous language generating software.
- the autonomous language generating unit preferably receives an input from a character template unit including one or more character templates (e.g., 135 ), a concept template unit including one or more concept templates (e.g., 140 ), a location template unit including one or more location templates (e.g., 145 ), a story line template unit including one or more story line templates (e.g., 150 ), a script template unit including one or more script templates (e.g., 155 ), etc.
- the exemplary invention also preferably includes a control unit (e.g., 160 ) that controls format mechanics (e.g., script size, sentence structure, target sentence length, etc.) of the autonomous language generated by the autonomous language generating unit (e.g., 130 ).
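A heavily hedged sketch of how the template units and the format-mechanics control might interact: sentence templates (invented here; the patent does not disclose concrete templates) are filled with selected words, and a control step enforces a target sentence length:

```python
import random

# Invented sentence templates; a real system would draw these from the
# character/concept/location/story-line template units.
TEMPLATES = [
    "The {character} found a {noun} near the {location}.",
    "At the {location}, the {character} spoke at length about the {noun}.",
]

def generate_sentence(slots, target_len, rng):
    """Fill a template, retrying until the target sentence length
    (a format-mechanics constraint) is satisfied."""
    for _ in range(50):
        sentence = rng.choice(TEMPLATES).format(**slots)
        if len(sentence.split()) <= target_len:
            return sentence
    return sentence  # fall back to the last attempt

slots = {"character": "keeper", "noun": "kook", "location": "loop"}
sentence = generate_sentence(slots, target_len=9, rng=random.Random(0))
print(sentence)
```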
- the resulting data output from the autonomous language generating unit (e.g., 130 ) and the control unit (e.g., 160 ) provides a TTS script (or script) (e.g., 165 ), which solves the aforementioned problems of the conventional methods and systems.
- the present invention exemplarily begins with the assumption that it generally would be desirable (e.g., necessary) for a speaker to read about 10,000 English phoneme sequences.
- Applicants have recognized that those sounds preferably can be embedded in real sentences which have some meaning.
- a list of sounds (e.g., "oot," "ool," "oop," etc.) can be provided.
- it can be difficult to easily make sentences that assimilate such a list of sounds.
- the present invention preferably can consult a pronunciation dictionary and find a list of words, or in some cases word sequences, that contain the preferred or required sounds. Thus, a list of 10,000 words or word sequences could be provided. However, a fluently-readable script still may not be produced.
- an intelligent software system preferably can be provided that can take as its input a text database, including, for example, an unstructured vocabulary list, and autonomously produce one or more cohesive written text documents (i.e., cohesive scripts).
- a series of pre-written templates preferably can imbue the cohesive script with ideas, concepts, and characters that can be used to form the basis of the storyline or content of the cohesive script.
- the exemplary features of the invention preferably can include script structural templates which can be considered to be grammars for generating different types of scripts that satisfy predetermined structural properties.
- the script structural templates may cascade, for example, into paragraph and sentence templates.
- the exemplary invention preferably can include templates to produce conceptual coherence such as a story line, plot, or theme for selecting characters and events to describe, and the order in which they will be introduced. These templates preferably can be used to populate the cohesive script with content.
- a cohesive script provided according to the exemplary invention preferably would meet many (or all) of the requirements of conventional scripts (i.e., it would contain many (or all) of the required phoneme sequences) in a far more efficient way because the present invention would contain a higher concentration of required phoneme sequences in each sentence.
- the cohesive script, and the resulting speech corpus preferably would be shorter as compared to the conventional systems and methods.
- the time to read such a cohesive script and therefore, the time to generate the speech corpus, preferably would be reduced as compared to the conventional systems and methods.
- a cohesive script provided according to the exemplary invention preferably would be much easier to read than a script provided according to the conventional methods and systems.
- an exemplary system for generating a speech corpus for concatenative text-to-speech preferably includes an extracting unit (e.g., 210 ) that extracts an enumerated phoneme sequence (e.g., a triphone, diphone, quadphone, syllable, bisyllable, etc., or a plurality thereof; e.g., 215 ) from a text database (e.g., 220 ).
- the text database preferably may include one or more dictionary databases (e.g., 280 ), word pronunciation guide databases (e.g., 275 ), word databases (e.g., 220 ), enumerated phoneme sequence database (e.g., a triphone, diphone, quadphone, syllable, and/or bisyllable database, etc., or a plurality thereof; e.g., 215 ), vocabulary lists or databases (e.g., 216 ), inventory of occurrences of phonemic units or sequences (e.g., 217 ), etc.
- the system preferably may include an associating unit (e.g., 225 ) that associates each of the enumerated phoneme sequences (e.g., a triphone, diphone, quadphone, syllable, bisyllable, etc., or a plurality thereof; e.g., 215 ) with a plurality of words (e.g., 222 ) included in the text database (e.g., 220 ) that include each of the enumerated phoneme sequences (e.g., a triphone, diphone, quadphone, syllable, bisyllable, etc., or a plurality thereof; e.g., 215 ).
- the system preferably can include a selecting unit (e.g., 230 ) that selects N words (e.g., 224 ) that include each of the enumerated phoneme sequences, as well as an input unit (e.g., 235 ) that inputs the N selected words (e.g., 224 ) into an autonomous language generating unit (e.g., 240 ), which generates a cohesive script (e.g., 250 ).
- the cohesive script may be read by a user (e.g., a professional speaker) to generate a speech corpus (or corpora)(e.g., 251 ) for concatenative TTS.
- the autonomous language generating unit preferably receives input from at least one of a character template unit (e.g., 241 ), a concept template unit (e.g., 242 ), a location template unit (e.g., 243 ), a story line template unit (e.g., 244 ), and a script template unit (e.g., 245 ).
- the system preferably includes a control unit (e.g., 255 ) that controls format mechanics (e.g., at least one of a script size (e.g., 260 ), a sentence structure (e.g., 261 ), a target sentence length (e.g., 262 ), etc.) of the autonomous language generated by the autonomous language generating unit.
- the system preferably includes an output unit (e.g., 270 ) that outputs the script (e.g., 250 ), which can be used to generate an improved speech corpus (e.g., 251 ) for concatenative TTS.
- an exemplary method 300 of generating a speech corpus for concatenative text-to-speech preferably includes extracting a plurality of triphones from a text database (e.g., see step 305 ), associating each of the enumerated phoneme sequences (e.g., a triphone, diphone, quadphone, syllable, bisyllable, etc., or a plurality thereof) with a plurality of words included in the text database that include each of the enumerated phoneme sequences (e.g., see step 310 ), selecting N words that include each of the enumerated phoneme sequences (e.g., see step 315 ), generating a cohesive script based on the N selected words (e.g., see step 320 ), and outputting the cohesive script to a first user (e.g., a user/person who reads the cohesive script; e.g., see step 325 ).
- the cohesive script (and thus, the resulting speech corpus) preferably is generated based on at least one of a character template, a concept template, a location template, a story line template, and a script template.
- the method also preferably controls format mechanics (e.g., at least one of a script size, a sentence structure, a target sentence length of the script, etc.), and thus, the resulting speech corpus.
- the resulting script can then be output (e.g., see step 325 ) to a user (e.g., professional speaker) to generate an improved speech corpus according to the present invention (e.g., see steps 330 , 335 ).
- Another exemplary aspect of the invention is directed to a method of deploying computing infrastructure in which computer-readable code is integrated into a computing system, and combines with the computing system to perform the method described above.
- Yet another exemplary aspect of the invention is directed to a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform the exemplary method described above.
- FIG. 4 illustrates a typical hardware configuration of an information handling/computer system for use with the invention and which preferably has at least one processor or central processing unit (CPU) 411 .
- the CPUs 411 are interconnected via a system bus 412 to a random access memory (RAM) 414 , read-only memory (ROM) 416 , input/output (I/O) adapter 418 (for connecting peripheral devices such as disk units 421 and tape drives 440 to the bus 412 ), user interface adapter 422 (for connecting a keyboard 424 , mouse 426 , speaker 428 , microphone 432 , and/or other user interface device to the bus 412 ), a communication adapter 434 for connecting an information handling system to a data processing network, the Internet, an Intranet, a personal area network (PAN), etc., and a display adapter 436 for connecting the bus 412 to a display device 438 and/or printer.
- a different aspect of the invention includes a computer-implemented method for performing the above method. As an example, this method may be implemented in the particular environment discussed above.
- Such a method may be implemented, for example, by operating a computer, as embodied by a digital data processing apparatus, to execute a sequence of machine-readable instructions. These instructions may reside in various types of signal-bearing media.
- the signal-bearing media may include, for example, a RAM contained within the CPU 411 , as represented by fast-access storage.
- the instructions may be contained in another signal-bearing medium, such as a magnetic data storage diskette or CD-ROM 500 ( FIG. 5 ), directly or indirectly accessible by the CPU 411 .
- the instructions may be stored on a variety of machine-readable data storage media, such as DASD storage (e.g., a conventional "hard drive" or a RAID array), magnetic tape, electronic read-only memory (e.g., ROM, EPROM, or EEPROM), an optical storage device (e.g., CD-ROM, WORM, DVD, digital optical tape, etc.), paper "punch" cards, or other suitable signal-bearing media, including transmission media such as digital and analog communication links and wireless links.
- the machine-readable instructions may comprise software object code, compiled from a language such as “C”, etc.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
Description
Claims (21)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/332,292 US8155963B2 (en) | 2006-01-17 | 2006-01-17 | Autonomous system and method for creating readable scripts for concatenative text-to-speech synthesis (TTS) corpora |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/332,292 US8155963B2 (en) | 2006-01-17 | 2006-01-17 | Autonomous system and method for creating readable scripts for concatenative text-to-speech synthesis (TTS) corpora |
Publications (2)
Publication Number | Publication Date |
---|---|
US20070168193A1 US20070168193A1 (en) | 2007-07-19 |
US8155963B2 true US8155963B2 (en) | 2012-04-10 |
Family
ID=38264342
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/332,292 Active 2029-02-21 US8155963B2 (en) | 2006-01-17 | 2006-01-17 | Autonomous system and method for creating readable scripts for concatenative text-to-speech synthesis (TTS) corpora |
Country Status (1)
Country | Link |
---|---|
US (1) | US8155963B2 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120150534A1 (en) * | 2010-12-08 | 2012-06-14 | Educational Testing Service | Computer-Implemented Systems and Methods for Determining a Difficulty Level of a Text |
US9251782B2 (en) | 2007-03-21 | 2016-02-02 | Vivotext Ltd. | System and method for concatenate speech samples within an optimal crossing point |
US10685644B2 (en) * | 2017-12-29 | 2020-06-16 | Yandex Europe Ag | Method and system for text-to-speech synthesis |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2140448A1 (en) * | 2007-03-21 | 2010-01-06 | Vivotext Ltd. | Speech samples library for text-to-speech and methods and apparatus for generating and using same |
US8966369B2 (en) * | 2007-05-24 | 2015-02-24 | Unity Works! Llc | High quality semi-automatic production of customized rich media video clips |
US8893171B2 (en) * | 2007-05-24 | 2014-11-18 | Unityworks! Llc | Method and apparatus for presenting and aggregating information related to the sale of multiple goods and services |
TWI336879B (en) * | 2007-06-23 | 2011-02-01 | Ind Tech Res Inst | Speech synthesizer generating system and method |
WO2014197592A2 (en) * | 2013-06-04 | 2014-12-11 | Ims Solutions Inc. | Enhanced human machine interface through hybrid word recognition and dynamic speech synthesis tuning |
JP6934848B2 (en) * | 2018-09-27 | 2021-09-15 | 株式会社Kddi総合研究所 | Learning data creation device, classification model learning device, and categorization device |
CN114464161A (en) * | 2022-01-29 | 2022-05-10 | 上海擎朗智能科技有限公司 | Voice broadcasting method, mobile device, voice broadcasting device and storage medium |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5737725A (en) * | 1996-01-09 | 1998-04-07 | U S West Marketing Resources Group, Inc. | Method and system for automatically generating new voice files corresponding to new text from a script |
US5758323A (en) * | 1996-01-09 | 1998-05-26 | U S West Marketing Resources Group, Inc. | System and Method for producing voice files for an automated concatenated voice system |
US6144938A (en) * | 1998-05-01 | 2000-11-07 | Sun Microsystems, Inc. | Voice user interface with personality |
US6173263B1 (en) * | 1998-08-31 | 2001-01-09 | At&T Corp. | Method and system for performing concatenative speech synthesis using half-phonemes |
US20020010584A1 (en) * | 2000-05-24 | 2002-01-24 | Schultz Mitchell Jay | Interactive voice communication method and system for information and entertainment |
US20030028380A1 (en) * | 2000-02-02 | 2003-02-06 | Freeland Warwick Peter | Speech system |
US6539354B1 (en) * | 2000-03-24 | 2003-03-25 | Fluent Speech Technologies, Inc. | Methods and devices for producing and using synthetic visual speech based on natural coarticulation |
US20030158734A1 (en) * | 1999-12-16 | 2003-08-21 | Brian Cruickshank | Text to speech conversion using word concatenation |
US20050108013A1 (en) * | 2003-11-13 | 2005-05-19 | International Business Machines Corporation | Phonetic coverage interactive tool |
US6990451B2 (en) * | 2001-06-01 | 2006-01-24 | Qwest Communications International Inc. | Method and apparatus for recording prosody for fully concatenated speech |
US6990449B2 (en) * | 2000-10-19 | 2006-01-24 | Qwest Communications International Inc. | Method of training a digital voice library to associate syllable speech items with literal text syllables |
US7174295B1 (en) * | 1999-09-06 | 2007-02-06 | Nokia Corporation | User interface for text to speech conversion |
US7308407B2 (en) * | 2003-03-03 | 2007-12-11 | International Business Machines Corporation | Method and system for generating natural sounding concatenative synthetic speech |
US7328157B1 (en) * | 2003-01-24 | 2008-02-05 | Microsoft Corporation | Domain adaptation for TTS systems |
US7693719B2 (en) * | 2004-10-29 | 2010-04-06 | Microsoft Corporation | Providing personalized voice font for text-to-speech applications |
- 2006-01-17: US application 11/332,292 filed; issued as US8155963B2 (status: Active)
Non-Patent Citations (5)
Title |
---|
Eide et al., "A Corpus-Based Approach to Expressive Speech Synthesis," Jun. 2004, Fifth ISCA ITRW on Speech Synthesis, pp. 79-84. * |
Haiping et al., "Generating Script Using Statistical Information of the Context Variation Unit Vector," Sep. 2002, ISCA Archive, ICSLP2002, pp. 1-4. * |
Hamza et al., "Data-Driven Segment Preselection in the IBM Trainable Speech Synthesis System," Sep. 2002, ISCA Archive, ICSLP2002, pp. 1-4. * |
Zhu et al., "Corpus Building For Data-Driven TTS Systems," Sep. 2002, Proceedings of the 2002 IEEE Workshop on Speech Synthesis, pp. 199-202. * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9251782B2 (en) | 2007-03-21 | 2016-02-02 | Vivotext Ltd. | System and method for concatenate speech samples within an optimal crossing point |
US20120150534A1 (en) * | 2010-12-08 | 2012-06-14 | Educational Testing Service | Computer-Implemented Systems and Methods for Determining a Difficulty Level of a Text |
US8892421B2 (en) * | 2010-12-08 | 2014-11-18 | Educational Testing Service | Computer-implemented systems and methods for determining a difficulty level of a text |
US10685644B2 (en) * | 2017-12-29 | 2020-06-16 | Yandex Europe Ag | Method and system for text-to-speech synthesis |
Also Published As
Publication number | Publication date |
---|---|
US20070168193A1 (en) | 2007-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8155963B2 (en) | Autonomous system and method for creating readable scripts for concatenative text-to-speech synthesis (TTS) corpora | |
US8244534B2 (en) | HMM-based bilingual (Mandarin-English) TTS techniques | |
US20120191457A1 (en) | Methods and apparatus for predicting prosody in speech synthesis | |
Kasuriya et al. | Thai speech corpus for Thai speech recognition | |
US20090138266A1 (en) | Apparatus, method, and computer program product for recognizing speech | |
US20080120093A1 (en) | System for creating dictionary for speech synthesis, semiconductor integrated circuit device, and method for manufacturing semiconductor integrated circuit device | |
Panda et al. | A survey on speech synthesis techniques in Indian languages | |
Bharadwaj et al. | Analysis of Prosodic features for the degree of emotions of an Assamese Emotional Speech | |
Van Bael et al. | Automatic phonetic transcription of large speech corpora | |
Hansakunbuntheung et al. | Thai tagged speech corpus for speech synthesis | |
Demenko et al. | JURISDIC: Polish Speech Database for Taking Dictation of Legal Texts. | |
Bijankhan et al. | Tfarsdat-the telephone farsi speech database. | |
JP4964695B2 (en) | Speech synthesis apparatus, speech synthesis method, and program | |
Gebreegziabher et al. | An amharic syllable-based speech corpus for continuous speech recognition | |
Awino et al. | Phonemic Representation and Transcription for Speech to Text Applications for Under-resourced Indigenous African Languages: The Case of Kiswahili | |
Evdokimova et al. | Automatic phonetic transcription for Russian: Speech variability modeling | |
Levow | Adaptations in spoken corrections: Implications for models of conversational speech | |
Iyanda et al. | Development of a Yorúbà Textto-Speech System Using Festival | |
Marasek et al. | Multi-level annotation in SpeeCon Polish speech database | |
Nguyen | Hmm-based vietnamese text-to-speech: Prosodic phrasing modeling, corpus design system design, and evaluation | |
Sudhakar et al. | Development of Concatenative Syllable-Based Text to Speech Synthesis System for Tamil | |
Mustafa et al. | EM-HTS: real-time HMM-based Malay emotional speech synthesis. | |
Ekpenyong et al. | A Template-Based Approach to Intelligent Multilingual Corpora Transcription | |
Klabbers | Text-to-Speech Synthesis | |
Mesa et al. | Development of Tagalog speech corpus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:AARON, ANDREW STEPHEN;FERRUCCI, DAVID ANGELO;PITRELLI, JOHN FERDINAND;SIGNING DATES FROM 20051219 TO 20060111;REEL/FRAME:018561/0773 |
|
AS | Assignment |
Owner name: NUANCE COMMUNICATIONS, INC., MASSACHUSETTS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:022689/0317 Effective date: 20090331 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |
|
AS | Assignment |
Owner name: CERENCE INC., MASSACHUSETTS Free format text: INTELLECTUAL PROPERTY AGREEMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:050836/0191 Effective date: 20190930 |
|
AS | Assignment |
Owner name: CERENCE OPERATING COMPANY, MASSACHUSETTS Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE ASSIGNEE NAME PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE INTELLECTUAL PROPERTY AGREEMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:050871/0001 Effective date: 20190930 |
|
AS | Assignment |
Owner name: BARCLAYS BANK PLC, NEW YORK Free format text: SECURITY AGREEMENT;ASSIGNOR:CERENCE OPERATING COMPANY;REEL/FRAME:050953/0133 Effective date: 20191001 |
|
AS | Assignment |
Owner name: CERENCE OPERATING COMPANY, MASSACHUSETTS Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:BARCLAYS BANK PLC;REEL/FRAME:052927/0335 Effective date: 20200612 |
|
AS | Assignment |
Owner name: WELLS FARGO BANK, N.A., NORTH CAROLINA Free format text: SECURITY AGREEMENT;ASSIGNOR:CERENCE OPERATING COMPANY;REEL/FRAME:052935/0584 Effective date: 20200612 |
|
AS | Assignment |
Owner name: CERENCE OPERATING COMPANY, MASSACHUSETTS Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE REPLACE THE CONVEYANCE DOCUMENT WITH THE NEW ASSIGNMENT PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:059804/0186 Effective date: 20190930 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 12 |