WO2023242609A1 - Speech synthesis with foreign fragments - Google Patents

Speech synthesis with foreign fragments

Info

Publication number
WO2023242609A1
Authority
WO
WIPO (PCT)
Prior art keywords
foreign
phonetic representation
words
native
nativized
Application number
PCT/IB2022/000417
Other languages
French (fr)
Inventor
Corinne BOS-PLACHEZ
Vito QUINCI
Alina LENHARDT
Benjamin Vincent Marcel PICART
Martine Marguerite STAESSEN
Athos TONIOLO
Original Assignee
Cerence Operating Company
Application filed by Cerence Operating Company
Priority to PCT/IB2022/000417
Publication of WO2023242609A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

A method for synthesizing speech from a textual input includes receiving the textual input, the textual input including native words in a native language and foreign words in a foreign language, and processing the textual input to determine a phonetic representation of the textual input. The processing includes determining a native phonetic representation of the native words, and determining a nativized phonetic representation of the foreign words. Determining the nativized phonetic representation includes forming a foreign phonetic representation of the foreign words using a foreign phoneme set, and mapping the foreign phonetic representation to the nativized phonetic representation according to a model of a native speaker's pronunciation of foreign words.

Description

SPEECH SYNTHESIS WITH FOREIGN FRAGMENTS
[0001] This invention relates to synthesis of speech with foreign fragments.
[0002] Text-to-speech (TTS) systems synthesize speech waveforms from textual input. Some conventional TTS systems process textual input to determine phonetic transcriptions of the words or subsets of the words (i.e., letters or groups of letters termed “graphemes”) in the textual input. The phonetic transcriptions are provided to a synthesizer that converts the phonetic transcriptions into speech waveforms that can be output using, for example, a loudspeaker.
[0003] Certain textual inputs to TTS systems include fragments of one or more words in a second language. An example of such a textual input is “J’ai lu Harry Potter hier,” where the TTS system is configured to synthesize native French-language text as would be spoken by a French speaker, but “Harry Potter” is foreign-language text because it is the proper name of a fictional British wizard. One approach may be to ignore the change of language and use the same synthesis rules for the whole input. Another approach may be to switch between French and English synthesis rules (i.e., as if switching between a native French and a native English speaker mid-sentence). However, such approaches may not produce natural sounding synthesized speech.
SUMMARY OF THE INVENTION
[0004] In a general aspect, an approach to speech synthesis accepts text that has a foreign language fragment and produces a synthesized waveform that pronounces the native language text as it would be spoken by a native speaker and pronounces the foreign language fragment in the manner in which the native speaker would pronounce it, which may not correspond to a “correct” pronunciation of the foreign language fragment by a speaker of that foreign language.
[0005] Some TTS systems may, for example, identify that a text input includes both French and English words and then use French phonemes to synthesize the French words and English (e.g., British English) phonemes to synthesize the English words using pronunciation rules that are native to each of the languages. One drawback of this technique is that the synthesized speech may sound unnatural because midway through the sentence it seems that a French speaker suddenly switches over to speaking perfect English and then switches back to speaking perfect French. Other techniques may replace English phonemes for the textual input with their closest French equivalents, or may use French grapheme-to-phoneme rules to synthesize the English fragment. The resulting synthesized speech may sound unnatural because aspects of the text may be unnaturally omitted from the English part of the synthesized speech.
[0006] Aspects described herein address the above-described drawbacks of conventional techniques by synthesizing foreign language text embedded in a native language textual input in a way that imitates the way a native language speaker would pronounce the foreign language text. For example, many native French speakers are at least somewhat proficient English speakers. Even though the French language doesn’t include the h phoneme, those French speakers recognize that the “H” in the name “Harry” is pronounced in the English language and make a (likely imperfect) attempt to pronounce the “H” when saying “Harry.” In effect, those native French speakers are using an augmented French phoneme set to pronounce English words. Aspects described herein would imitate the French speaker’s attempt at pronouncing the “H” in “Harry” by using an augmented French phoneme set that includes extra phonemes that French speakers use when pronouncing English words, resulting in more natural and realistic sounding synthesized speech.
[0007] In a general aspect, a method for synthesizing speech from a textual input includes receiving the textual input, the textual input including native words in a native language and foreign words in a foreign language, and processing the textual input to determine a phonetic representation of the textual input. The processing includes determining a native phonetic representation of the native words, and determining a nativized phonetic representation of the foreign words. Determining the nativized phonetic representation includes forming a foreign phonetic representation of the foreign words using a foreign phoneme set, and mapping the foreign phonetic representation to the nativized phonetic representation according to a model of a native speaker’s pronunciation of foreign words.
[0008] Aspects may include one or more of the following features.
[0009] The native phonetic representation may include phonemes from a native phoneme set. The nativized phonetic representation may include phonemes from a native phoneme set. The nativized phonetic representation may include phonemes from a second set of phonemes different from the native phoneme set. The second set of phonemes may include a phoneme set for the foreign language.
[0010] The mapping of the foreign phonetic representation to the nativized phonetic representation may use contextual information associated with the foreign phonetic representation. The contextual information may include a grapheme representation of the foreign text associated with the foreign phonetic representation. The contextual information may include an alignment of graphemes in the grapheme representation of the foreign text to phonemes in the foreign phonetic representation. The contextual information may include location information of phonemes in the foreign phonetic representation.
[0011] The model of the native speaker’s pronunciation of foreign words may be based on training data comprising foreign textual phrases and phonetic transcriptions of a native speaker’s pronunciation of the foreign textual phrases. At least some of the native speaker’s pronunciations of foreign textual phrases may be mispronunciations. The method may include providing a combination of the native phonetic representation of the native words and the nativized phonetic representation of the foreign words to a waveform synthesizer for synthesis of a speech waveform. The method may include synthesizing the speech waveform based on the combination of the native phonetic representation of the native words and the nativized phonetic representation of the foreign words using the waveform synthesizer.
[0012] The method may include configuring a neural network according to the model of a native speaker’s pronunciation of foreign words. Mapping the foreign phonetic representation to the nativized phonetic representation according to the model of a native speaker’s pronunciation of foreign words may include applying one or more mapping rules. The method may include identifying the foreign words in the textual input.
[0013] In another general aspect, a system for synthesizing speech from a textual input includes an input for receiving the textual input, the textual input including native words in a native language and foreign words in a foreign language, and one or more processors configured to process the textual input to determine a phonetic representation of the textual input. The processing includes determining a native phonetic representation of the native words, and determining a nativized phonetic representation of the foreign words. Determining the nativized phonetic representation includes forming a foreign phonetic representation of the foreign words using a foreign phoneme set, and mapping the foreign phonetic representation to the nativized phonetic representation according to a model of a native speaker’s pronunciation of foreign words.
[0014] In another general aspect, software stored in a non-transitory form on a computer-readable medium includes instructions for causing a computing system to synthesize speech from a textual input, including to receive the textual input, the textual input including native words in a native language and foreign words in a foreign language, and process the textual input to determine a phonetic representation of the textual input. The processing includes determining a native phonetic representation of the native words, and determining a nativized phonetic representation of the foreign words. Determining the nativized phonetic representation includes forming a foreign phonetic representation of the foreign words using a foreign phoneme set and mapping the foreign phonetic representation to the nativized phonetic representation according to a model of a native speaker’s pronunciation of foreign words.
[0015] In another general aspect, a method for determining configuration data for configuring a module for mapping a foreign phonetic representation to a nativized phonetic representation includes receiving training data comprising a plurality of textual representations of foreign language words or phrases and a corresponding plurality of reference phonetic representations of the foreign language words or phrases and processing the training data to form the configuration data.
[0016] Aspects may include one or more of the following features.
[0017] Processing the training data may include, for each textual representation of a foreign language word or phrase, forming a foreign phonetic representation of the foreign word or phrase using a foreign phoneme set, mapping the foreign phonetic representation to the nativized phonetic representation using a model of a native speaker’s pronunciation of foreign words, determining a difference between the nativized phonetic representation and a reference phonetic representation corresponding to the foreign language word or phrase, and updating the model of the native speaker’s pronunciation of foreign words based at least in part on the determined difference.
[0018] An advantage of training a mapping from a correct foreign pronunciation to a pronunciation in the target language, as compared for example to training a full grapheme-to-phoneme mapping that maps foreign text to the target phoneme set, may be that relatively less training data is needed to represent the manner in which native speakers “nativize” the foreign fragments. Another advantage arises when using graphemes as part of the method of nativizing the foreign fragments because attributes of the foreign language pronunciation such as vowel coloring (e.g., very short vowel phonemes such as schwa $) may be lost in the English phonetic transcription. Such attributes can be restored by using the graphemes as part of nativizing the foreign fragments.
[0019] Another advantage may be that different mappings can be trained for different target users. For example, speakers that live in a generally bilingual area may pronounce the foreign fragments more closely to their correct foreign form as compared to speakers from an area where that foreign language is not spoken. To accommodate “natural” synthesis of foreign fragments to the users from such different areas, different mappings may be used.
[0020] Other features and advantages of the invention are apparent from the following description, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] FIG. 1 is a text-to-speech synthesis system.
[0022] FIG. 2 is a detailed example of the text-to-speech synthesis system’s operation.
[0023] FIG. 3 is a phoneme mapping module.
[0024] FIG. 4 is an override mapping module.
[0025] FIG. 5 is a system for training the phoneme mapping module.
[0026] FIG. 6 is a table of phoneme definitions.
DETAILED DESCRIPTION
1 OVERVIEW
[0027] Referring to FIG. 1, a text-to-speech synthesis system 100 is configured to receive textual input 102 and to process the textual input to generate a synthesized speech waveform 104 for presentation using a loudspeaker 106. The textual input to the system 100 includes both words in a native language and words in a foreign language. Very generally, the system 100 is configured to generate a synthesized speech waveform from such textual input that pronounces the native language words in the textual input as a native speaker would and imitates how a native speaker would pronounce (and possibly mispronounce) the foreign language words in the textual input. As a result, the synthesized speech waveform 104 sounds more natural to a native language listener. For the sake of simplicity, the remainder of this document uses the French language as the native language and the English language as the foreign language. However, it is noted that the described examples are not limited to any two languages.
[0028] In some examples, the system 100 includes a language parser 108, a French language processing pipeline 110, an English language processing pipeline 112, and a waveform generator 114. The language parser identifies French words or phrases 114 in the textual input 102 and identifies English words or phrases 116 in the textual input 102. The French words or phrases 114 are provided to and processed by the French language processing pipeline 110 and the English words or phrases 116 are provided to and processed by the English language processing pipeline 112. The outputs of the French language processing pipeline 110 and the English language processing pipeline 112 are combined and provided to the waveform generator 114, which processes the combined output to generate the synthesized speech waveform 104.
[0029] The French language processing pipeline 110 includes a French grapheme-to-phoneme (G2P) module 118, which identifies graphemes (i.e., letters or groups of letters) in the French words or phrases 114 and determines French language phonemes associated with the identified graphemes. The output of the graphemes to French phonemes module 118 is a sequence of French phonemes 121 corresponding to the French words or phrases 114.
[0030] The English language processing pipeline 112 processes the English words or phrases 116 to ultimately generate a sequence of phonemes in a “French+” phoneme set. In general, this phoneme set is denoted “French+” to reflect that the phoneme set includes French phonemes and possibly additional phonemes that a native French speaker would use when attempting to pronounce English words but that may not be required to pronounce French words. In some examples, the additional phonemes are existing English phonemes and/or new phonemes unique to French speakers attempting to pronounce English words. The English language processing pipeline 112 includes a graphemes to English phonemes module 120, an English to French+ phoneme mapping module 123, and an optional override mapping module 124.
[0031] The graphemes to English phonemes module 120 identifies graphemes in the English words or phrases 116 and determines English language phonemes associated with the identified graphemes. The output of the graphemes to English phonemes module 120 is a sequence of English phonemes 122 corresponding to the English words or phrases 116.
[0032] The sequence of English phonemes 122 (and optionally the graphemes associated with the English words or phrases 116 and aligned to the English phonemes) are provided to the English to French+ phoneme mapping module 123, which maps the sequence of English phonemes 122 into a sequence of French+ phonemes 126 (as is described in greater detail below). In some examples, the English to French+ phoneme mapping module 123 is implemented as a neural network that is parameterized by mapping parameters θ 125. Optionally, the sequence of French+ phonemes is processed by the override mapping module 124 to override certain phoneme mappings using override rules (as is described in greater detail below).
[0033] The sequence of French phonemes 121 and the sequence of French+ phonemes 126 are provided to a combiner 128, which combines the two sequences of phonemes according to the order of the words and phrases in the textual input 102 to generate the combined French and French+ phonemes 130. The combined French and French+ phonemes 130 are provided to the waveform generator 114, which processes the phoneme sequence to generate the synthesized speech waveform 104.
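The dataflow of paragraphs [0028]-[0033] can be summarized in a short Python sketch. This is purely illustrative; every function name is a hypothetical stand-in for the corresponding numbered module and is not an interface defined by this disclosure.

def synthesize(text: str):
    # Language parser 108: split the input into (fragment, language) spans,
    # preserving the original word order, e.g.
    # [("J'ai lu", "fr"), ("Harry Potter", "en"), ("hier", "fr")].
    spans = parse_languages(text)

    phonemes = []
    for fragment, lang in spans:
        if lang == "fr":
            # French pipeline 110: graphemes to French phonemes (module 118).
            phonemes.extend(french_g2p(fragment))
        else:
            # English pipeline 112: graphemes to English phonemes (module 120),
            # mapping into the augmented French+ set (module 123), and the
            # optional rule-based overrides (module 124).
            english = english_g2p(fragment)
            mapped = english_to_french_plus(english, graphemes=fragment)
            phonemes.extend(apply_overrides(mapped, graphemes=fragment))

    # Appending in input order plays the role of the combiner 128; the
    # combined sequence feeds the waveform generator 114.
    return generate_waveform(phonemes)

The per-fragment branching mirrors FIG. 1: only the foreign fragments pass through the nativizing mapper before the two phoneme streams are recombined in input order.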
2 EXAMPLE
[0034] Referring to FIG. 2, the system 100 receives the textual input 102 “J’ai lu Harry Potter hier” and processes the textual input using the language parser 108 to identify the words “J’ai lu” and “hier” as the French words or phrases 110 and identify the words “Harry Potter” as the fragment of English words or phrases 116.
[0035] The French words or phrases 110, “J’ai lu” and “hier” are provided to the graphemes to French phonemes module 118 (shown twice in FIG. 2 for the sake of simplicity). The graphemes to French phonemes module 118 processes “J’ai lu” to generate the sequence of French phonemes 121 ‘Z e > ‘I y (FIG. 6 includes the definition of the English, French, and French+ phonemes used herein for reference). The graphemes to French phonemes module 118 processes “hier” to generate the sequence of French phonemes z . ‘j E R.
[0036] The English words or phrases 116, “Harry Potter” are provided to the graphemes to English phonemes module 120. The graphemes to English phonemes module 120 processes “Harry Potter” to generate the sequence of English phonemes 122 ‘h @ R+ I > ‘p A+ . t $ R+. The sequence of English phonemes 122 (and optionally the associated English graphemes) are provided to the English to French+ phoneme mapping module 123. The English to French+ phoneme mapping module 123 maps the sequence of English phonemes 122 to a sequence of French+ phonemes 126 h a . ‘R+ I _p O . 7 E’+ R+. Note that the sequence of French+ phonemes 126 includes phonemes h and R+, which are not present in the native French phoneme set, but instead are borrowed from the English phoneme set.
[0037] When the English graphemes are provided to the English to French+ phoneme mapping module 123, the module recognizes a mapping between certain English graphemes and French+ phonemes and inserts French+ phonemes when corresponding English graphemes are present in the textual input.
[0038] The combiner 128 combines the sequence of French phonemes 121 and the sequence of French+ phonemes 126 to form the combined French and French+ phonemes 130, which is provided to the waveform generator 114 to generate the synthesized speech waveform 104.
3 ENGLISH TO FRENCH+ MAPPING
[0039] Referring to FIG. 3, in some examples the English to French+ phoneme mapping module 123 receives the sequence of English phonemes 122 and, for each English phoneme, determines a corresponding mapped French+ phoneme (e.g., the mapping module 123 is a sliding mapper that is repeatedly applied to phonemes in the sequence of English phonemes 122). The module 123 includes a phoneme mapper 232 (e.g., a neural network) that is configured using mapping parameters θ 125. The phoneme mapper 232 receives a single English phoneme 234 ($ in FIG. 3) and maps the English phoneme 234 to a (possibly Null) French+ phoneme (E+ in FIG. 3). It is noted that the mapping module 123 does not necessarily generate a single French+ phoneme based on a single English phoneme but instead may consider a context of the English phoneme and its corresponding grapheme. For example, the mapping module 123 may apply a sliding window to the sequence of English phonemes 122 and to the graphemes corresponding to the sequence of English phonemes. The information in the sliding window provides context that is used by the mapping module 123 to determine the mapping to one or more French+ phonemes.
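A minimal sketch of this sliding-window operation follows, assuming one grapheme is aligned to each English phoneme and one phoneme of context on each side. The toy lookup table at the bottom stands in for the trained phoneme mapper 232 and is not from this disclosure.

PAD = "<pad>"

def map_phonemes(english_phonemes, aligned_graphemes, mapper, radius=1):
    """Apply the mapper to each phoneme together with its context window."""
    padded = [PAD] * radius + list(english_phonemes) + [PAD] * radius
    french_plus = []
    for i, grapheme in enumerate(aligned_graphemes):
        window = padded[i : i + 2 * radius + 1]  # centre phoneme plus context
        out = mapper(window, grapheme)           # one (possibly Null) phoneme
        if out is not None:                      # Null mappings emit nothing
            french_plus.append(out)
    return french_plus

# Toy stand-in mapper keyed on the centre phoneme only: "$" -> "E+" follows
# the example of [0041], and "h" -> None reproduces the Null mapping of [0042].
toy_table = {"$": "E+", "h": None}
def toy_mapper(window, grapheme):
    centre = window[len(window) // 2]
    return toy_table.get(centre, centre)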
[0040] In other examples, the English to French+ phoneme mapping module 123 implements a sequence-to-sequence mapping where a sequence of English phonemes is mapped to a sequence of French+ phonemes in a way that accounts for a context of each phoneme (and possibly grapheme) in the greater sequence of phonemes.
[0041] As is noted above, when the English graphemes are provided to the English to French+ phoneme mapping module 123, the phoneme mapper 232 is configured to identify certain English graphemes that should be replaced by French+ phonemes. For example, the phoneme mapper 232 may use the “e” grapheme 239 in its determination that the $ English phoneme 234 is mapped to the E+ French+ phoneme 235.
4 OVERRIDE MAPPING
[0042] Referring to FIG. 4, in some examples, the override mapping module 124 implements rules or other heuristic techniques to identify phonemes in the sequence of French+ phonemes 126 (or lack of phonemes) that should be replaced with other, different phonemes. For example, in FIG. 4, the sequence of French+ phonemes 126 has a Null phoneme 436 mapped for an “h” English grapheme 437. A rule 438 is applied to recognize the mapping of a Null phoneme to an “h” grapheme and replace the Null phoneme with an h French+ phoneme. In this way, rules can be used to override the output of the English to French+ phoneme mapping module 123 in certain circumstances.
[0043] In FIG. 4, a decision point 440 determines whether to use the output of the rule 438 or the mapped phoneme from the sequence of French+ phonemes 126. For example, if the output of the rule 438 is Null, then the decision point 440 uses the mapped phoneme 436 from the sequence of French+ phonemes 126. Otherwise, the decision point 440 uses the output of the rule 438 and overrides the mapped phoneme 436.
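The override logic of [0042]-[0043] can be sketched as follows. The rule and decision-point structure is an assumption; only the “h” behavior comes from the example above.

def h_rule(grapheme, mapped_phoneme):
    # Rule 438: a Null phoneme mapped for an "h" grapheme is replaced by
    # the borrowed h French+ phoneme.
    if grapheme == "h" and mapped_phoneme is None:
        return "h"
    return None  # rule does not fire

def apply_override_rules(pairs, rules=(h_rule,)):
    """pairs: (grapheme, mapped French+ phoneme or None) in sequence order."""
    result = []
    for grapheme, phoneme in pairs:
        for rule in rules:
            override = rule(grapheme, phoneme)
            if override is not None:  # decision point 440: the rule wins
                phoneme = override
                break
        if phoneme is not None:       # remaining Nulls emit nothing
            result.append(phoneme)
    return result

# apply_override_rules([("h", None), ("a", "a")]) returns ["h", "a"]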
5 TRAINING
[0044] Referring to FIG. 5, a training system 500 receives as input a set of textual English phrases 542 and a corresponding set of reference phrases transcribed into sequences of French+ phonemes 544 and processes the input to determine the mapping parameters θ 125.
[0045] In operation, the training system 500 processes the set of textual English phrases 542 one at a time by reading a textual English phrase 546 from the set of textual English phrases 542. The textual English phrase 546 is processed by a graphemes to English phonemes module 520 to generate a sequence of English phonemes 522 corresponding to the textual English phrase 546.
[0046] The sequence of English phonemes 522 (and optionally the associated English graphemes) are provided to an English to French+ phoneme mapping module 523. The English to French+ phoneme mapping module 523 maps the sequence of English phonemes 522 to a mapped sequence of French+ phonemes 526 based on the current mapping parameters θ 125.
[0047] The mapped sequence of French+ phonemes 526 and the reference sequence of French+ phonemes 548 associated with the textual English phrase 546 are provided to a loss computation module 550, which computes a loss value Cθ characterizing a difference between the mapped sequence of French+ phonemes 526 and the reference sequence of French+ phonemes 548. The loss value Cθ is provided to an optimization module 552, which updates the mapping parameters θ 125 to minimize the loss value over the set of training data.
[0048] The training system 500 repeats this parameter update procedure for all of the English phrases 542, causing the mapping parameters θ to converge.
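The FIG. 5 loop might be sketched as below, assuming the reference French+ sequences are pre-aligned one-to-one with the English phoneme sequences (see [0049]) so that a per-position cross-entropy loss applies; mapper, english_g2p, and encode are hypothetical stand-ins for modules 523, 520, and a phoneme-to-integer table.

import torch.nn.functional as F

def train_epoch(mapper, optimizer, phrases, references, english_g2p, encode):
    for phrase, reference in zip(phrases, references):  # sets 542 and 544
        english_ids = encode(english_g2p(phrase))       # module 520 output
        logits = mapper(english_ids)                    # module 523, parameters theta 125
        target_ids = encode(reference)                  # reference sequence 548
        loss = F.cross_entropy(logits, target_ids)      # loss computation 550
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                # optimization module 552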
[0049] In some examples, the reference sequences of French+ phonemes are aligned (automatically or manually) to the textual English phrases (or the corresponding sequences of English phonemes) to improve training performance.
[0050] In some examples, each reference sequence of French+ phonemes is hand transcribed. In other examples, each reference sequence of French+ phonemes is generated by performing phoneme recognition using a French+ phoneme set on speech from typical French speakers. In such examples, the mapping parameters θ will configure the mapper to mimic the population of French speakers used to generate the training data (i.e., a population of French speakers who are fluent in English would result in different mapping behavior than a population of French speakers who are only somewhat familiar with the English language).
[0051] In other examples, each reference sequence of French+ phonemes is generated by performing phoneme recognition using a French+ phoneme set on speech spoken by a French speaker reading the English phrases.
6 ALTERNATIVES
[0052] The examples described above use a neural network and optionally rules to map English phonemes into an augmented French phoneme set (French+). However, in some examples, no neural network is used, and a rule-based approach is sufficient. Alternatively, other machine learning approaches, such as use of a decision tree, may be used instead of a neural network.
[0053] The examples described above map English phonemes into phonemes in an augmented French phoneme set. However, augmenting the French phoneme set is not required; the native French phoneme set may include sufficient phonemes for training a mapper to generate speech that imitates a native French speaker who is attempting to pronounce English words.
[0054] While the system described above produces a waveform for presentation as an acoustic signal, the output may comprise a representation from which the waveform may be generated. For example, the phoneme sequence may be processed to form a spectrogram (energy vs. frequency vs. time) from which the waveform may be determined in a language-independent manner.
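A minimal sketch of that two-stage arrangement follows; both function names are hypothetical. The acoustic model is trained per voice and language, while inverting a spectrogram to a waveform (a vocoder) can be done in a language-independent manner.

def phonemes_to_waveform(phonemes):
    spectrogram = acoustic_model(phonemes)  # energy vs. frequency vs. time
    return vocoder(spectrogram)             # language-independent inversion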
[0055] Alternative mapping approaches may be used to process the foreign phoneme sequence to form the target phoneme sequence. For example, a sequence-to-sequence neural network (e.g., a Transformer Neural Network or a Recurrent Neural Network, such as an LSTM) may be used. Alternative training approaches may also be used, with the target phoneme sequence determined by one or a combination of phoneme recognition applied to native speakers uttering the foreign fragments and hand transcription of such native speakers’ utterances. Furthermore, the mapping may be trained to directly match the audio of the native speakers.
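As one purely illustrative instance of such a sequence model, a bidirectional LSTM that emits one French+ label per input position could be defined as follows; a full encoder-decoder (e.g., a Transformer) would remove the equal-length assumption. Vocabulary sizes are placeholders.

import torch.nn as nn

class PhonemeMapper(nn.Module):
    def __init__(self, n_english=60, n_french_plus=50, dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_english, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, n_french_plus)

    def forward(self, phoneme_ids):  # (batch, seq_len) integer tensor
        hidden, _ = self.lstm(self.embed(phoneme_ids))
        return self.out(hidden)      # (batch, seq_len, n_french_plus) logits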
[0056] In an alternative approach, different foreign fragments may be processed differently, for example, according to their length. For example, it may be that short fragments are mapped more closely to the native language while longer fragments retain more of their proper foreign pronunciation. Similarly, named entities may retain their proper pronunciation more than common expressions. Therefore, an alternative may use a mapping process that can vary the degree to which the foreign fragment is mapped to the native pronunciation. Furthermore, users may have a preference for the degree of “nativization” of a pronunciation, which may be settable by the user.
[0057] In some examples, lexica are used to override at least some part of the mapping of the English phonemes into the augmented French phoneme set (French+) in the case of exceptions that are not properly captured by the neural network or rule-based mapping.
[0058] As is noted above, in some examples, graphemes and phonemes are aligned prior to processing using a classification model or rule-based mapper. Different types of grapheme-phoneme alignment techniques can be used, such as Expectation Maximization-based alignment, phonetic similarity-based alignment, constraint-based alignment, alignment using an Integer Programming framework, alignment by aggregation (i.e., a combination of integer programming and expectation maximization), or any other suitable alignment technique.
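As a simple stand-in for those techniques, the following sketch aligns graphemes to phonemes with Needleman-Wunsch-style dynamic programming. The similarity function here is a trivial placeholder; a real system would use EM-estimated association scores or phonetic-feature similarity.

```python
# Sketch: dynamic-programming alignment of a grapheme sequence to a phoneme
# sequence, allowing either side to align to nothing (a gap).

def align(graphemes, phonemes, sim=lambda g, p: 1 if g == p else 0, gap=-1):
    n, m = len(graphemes), len(phonemes)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(graphemes[i - 1], phonemes[j - 1]),
                score[i - 1][j] + gap,   # grapheme aligned to nothing
                score[i][j - 1] + gap,   # phoneme aligned to nothing
            )
    # Trace back to recover the aligned (grapheme, phoneme) pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim(graphemes[i - 1], phonemes[j - 1]):
            pairs.append((graphemes[i - 1], phonemes[j - 1])); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            pairs.append((graphemes[i - 1], None)); i -= 1
        else:
            pairs.append((None, phonemes[j - 1])); j -= 1
    while i > 0: pairs.append((graphemes[i - 1], None)); i -= 1
    while j > 0: pairs.append((None, phonemes[j - 1])); j -= 1
    return list(reversed(pairs))
```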
7 IMPLEMENTATIONS
[0059] The approaches described above can be implemented, for example, using a programmable computing system executing suitable software instructions, or they can be implemented in suitable hardware such as a field-programmable gate array (FPGA) or in some hybrid form. For example, in a programmed approach the software may include procedures in one or more computer programs that execute on one or more programmed or programmable computing systems (which may be of various architectures such as distributed, client/server, or grid), each including at least one processor, at least one data storage system (including volatile and/or non-volatile memory and/or storage elements), and at least one user interface (for receiving input using at least one input device or port, and for providing output using at least one output device or port). The software may include one or more modules of a larger program, for example, that provides services related to the design, configuration, and execution of dataflow graphs. The modules of the program (e.g., elements of a dataflow graph) can be implemented as data structures or other organized data conforming to a data model stored in a data repository.
[0060] The software may be stored in non-transitory form, such as being embodied in a volatile or non-volatile storage medium, or any other non-transitory medium, using a physical property of the medium (e.g., surface pits and lands, magnetic domains, or electrical charge) for a period of time (e.g., the time between refresh periods of a dynamic memory device such as a dynamic RAM). In preparation for loading the instructions, the software may be provided on a tangible, non-transitory medium, such as a CD-ROM or other computer-readable medium (e.g., readable by a general or special purpose computing system or device), or may be delivered (e.g., encoded in a propagated signal) over a communication medium of a network to a tangible, non-transitory medium of a computing system where it is executed. Some or all of the processing may be performed on a special purpose computer, or using special-purpose hardware, such as coprocessors or field-programmable gate arrays (FPGAs) or dedicated, application-specific integrated circuits (ASICs). The processing may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computing elements. Each such computer program is preferably stored on or downloaded to a computer-readable storage medium (e.g., solid state memory or media, or magnetic or optical media) of a storage device accessible by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium is read by the computer to perform the processing described herein. The inventive system may also be considered to be implemented as a tangible, non-transitory medium, configured with a computer program, where the medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the processing steps described herein.

[0061] A number of embodiments of the invention have been described.
Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order independent, and thus can be performed in an order different from that described.

Claims

WHAT IS CLAIMED IS:
1. A method for synthesizing speech from a textual input, the method comprising:
    receiving the textual input, the textual input including native words in a native language and foreign words in a foreign language; and
    processing the textual input to determine a phonetic representation of the textual input, the processing including:
        determining a native phonetic representation of the native words, and
        determining a nativized phonetic representation of the foreign words, determining the nativized phonetic representation including:
            forming a foreign phonetic representation of the foreign words using a foreign phoneme set, and
            mapping the foreign phonetic representation to the nativized phonetic representation according to a model of a native speaker’s pronunciation of foreign words.
2. The method of claim 1 wherein the native phonetic representation comprises phonemes from a native phoneme set.
3. The method of any one of claims 1 to 2 wherein the nativized phonetic representation comprises phonemes from a native phoneme set.
4. The method of claim 3 wherein the nativized phonetic representation further comprises phonemes from a second set of phonemes different from the native phoneme set.
5. The method of claim 4 wherein the second set of phonemes includes a phoneme set for the foreign language.
6. The method of any one of claims 1 to 5 wherein the mapping of the foreign phonetic representation to the nativized phonetic representation uses contextual information associated with the foreign phonetic representation.
7. The method of claim 6 wherein the contextual information includes a grapheme representation of the foreign text associated with the foreign phonetic representation.
8. The method of claim 7 wherein the contextual information further includes an alignment of graphemes in the grapheme representation of the foreign text to phonemes in the foreign phonetic representation.
9. The method of any one of claims 6 to 8 wherein the contextual information includes location information of phonemes in the foreign phonetic representation.
10. The method of any one of claims 1 to 9 wherein the model of the native speaker’s pronunciation of foreign words is based on training data comprising foreign textual phrases and phonetic transcriptions of a native speaker’s pronunciation of the foreign textual phrases.
11. The method of claim 10 wherein at least some of the native speaker’s pronunciations of foreign textual phrases are mispronunciations.
12. The method of any one of claims 1 to 11 further comprising providing a combination of the native phonetic representation of the native words and the nativized phonetic representation of the foreign words to a waveform synthesizer for synthesis of a speech waveform.
13. The method of claim 12 further comprising synthesizing the speech waveform based on the combination of the native phonetic representation of the native words and the nativized phonetic representation of the foreign words using the waveform synthesizer.
14. The method of any one of claims 1 to 13 further comprising configuring a neural network according to the model of a native speaker’s pronunciation of foreign words.
15. The method of any one of claims 1 to 14 wherein mapping the foreign phonetic representation to the nativized phonetic representation according to the model of a native speaker’s pronunciation of foreign words includes applying one or more mapping rules.
16. The method of any one of claims 1 to 15 further comprising identifying the foreign words in the textual input.
17. A system for synthesizing speech from a textual input, the system comprising:
    an input for receiving the textual input, the textual input including native words in a native language and foreign words in a foreign language; and
    one or more processors configured to process the textual input to determine a phonetic representation of the textual input, the processing including:
        determining a native phonetic representation of the native words, and
        determining a nativized phonetic representation of the foreign words, determining the nativized phonetic representation including:
            forming a foreign phonetic representation of the foreign words using a foreign phoneme set, and
            mapping the foreign phonetic representation to the nativized phonetic representation according to a model of a native speaker’s pronunciation of foreign words.
18. Software stored in a non-transitory form on a computer-readable medium, the software including instructions for causing a computing system to synthesize speech from a textual input, including instructions to:
    receive the textual input, the textual input including native words in a native language and foreign words in a foreign language; and
    process the textual input to determine a phonetic representation of the textual input, the processing including:
        determining a native phonetic representation of the native words, and
        determining a nativized phonetic representation of the foreign words, determining the nativized phonetic representation including:
            forming a foreign phonetic representation of the foreign words using a foreign phoneme set, and
            mapping the foreign phonetic representation to the nativized phonetic representation according to a model of a native speaker’s pronunciation of foreign words.
19. A method for determining configuration data for configuring a module for mapping a foreign phonetic representation to a nativized phonetic representation, the method comprising:
    receiving training data comprising a plurality of textual representations of foreign language words or phrases and a corresponding plurality of reference phonetic representations of the foreign language words or phrases; and
    processing the training data to form the configuration data.
20. The method of claim 19 wherein processing the training data includes, for each textual representation of a foreign language word or phrase:
    forming a foreign phonetic representation of the foreign word or phrase using a foreign phoneme set,
    mapping the foreign phonetic representation to the nativized phonetic representation using a model of a native speaker’s pronunciation of foreign words,
    determining a difference between the nativized phonetic representation and a reference phonetic representation corresponding to the foreign language word or phrase, and
    updating the model of the native speaker’s pronunciation of foreign words based at least in part on the determined difference.