WO2023035261A1 - An end-to-end neural system for multi-speaker and multi-lingual speech synthesis - Google Patents

An end-to-end neural system for multi-speaker and multi-lingual speech synthesis

Info

Publication number: WO2023035261A1
Authority: WO (WIPO/PCT)
Prior art keywords: data, target, training, spectrogram, speech
Application number: PCT/CN2021/117919
Other languages: French (fr)
Inventors: Yanqing Liu, Zhihang Xu, Sheng Zhao, Bohan LI, Xu Tan, Runnan LI
Original assignee: Microsoft Technology Licensing, LLC
Application filed by Microsoft Technology Licensing, LLC
Priority to CN202180080711.2A (publication CN116601702A)
Priority to PCT/CN2021/117919 (publication WO2023035261A1)
Publication of WO2023035261A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Definitions

  • the field of the invention relates to text-to-speech (TTS) models and, even more particularly, to generating, training, and utilizing models that facilitate end-to-end neural processing of multi-speaker and multi-lingual speech data.
  • a text-to-speech (TTS) model is a machine learned or a machine learning model that is configured to convert arbitrary text into human-sounding speech data.
  • a TTS model usually consists of a front-end natural language processing (NLP) module, an acoustic model, and a vocoder.
  • the front-end NLP module is typically configured to perform text normalization (e.g., converting a unit symbol into readable words) and to convert the text into corresponding phonemes and phoneme sequences.
  • Conventional acoustic models are configured to convert input text (or the converted phonemes) into a spectrum sequence, such as a Mel spectrogram sequence, while the vocoder is configured to convert the spectrum sequence into speech waveform data.
  • Many acoustic models are also configured and trained to specify the particular manner in which the text will be uttered (e.g., in what prosody, timbre, etc. ) to be used by the vocoder to render the speech data.
  • Prosody typically refers to the patterns of rhythm and sound and/or intonation used by a speaker when speaking, while timbre (i.e., tone quality) typically refers to the character or quality of a musical sound or a speaker’s voice.
  • the characteristics associated with timbre can sometimes be viewed as distinct from other additional attributes associated with the presentation of speech data, such as pitch and intensity.
  • the term prosody should be broadly interpreted to refer to any speech attributes (other than language) , including, but not limited to timbre and the additional attributes mentioned above (e.g., pitch and intensity) .
  • Source acoustic models are often configured as multi-speaker models trained on multi-speaker data and, in some cases, for multiple languages, which may also include different dialects of a same language.
  • the source acoustic models can also be further refined or adapted using target speaker data.
  • some acoustic models are speaker dependent, meaning that they are either directly trained on speaker data from a particular target speaker, or that they are refined by using speaker data from a particular target speaker.
  • An acoustic model, if well trained, can convert any text into speech that closely mimics the prosody of a target speaker, so that it sounds similar to speech uttered by the target speaker, i.e., rendered with the same rhythm, sound, intonation, timbre, pitch, intensity and/or other speech attributes associated with the uttered speech of the target speaker.
  • Training data for TTS models usually comprises audio data obtained by recording the particular target speaker while they speak and a set of textual data sets corresponding to the audio data (i.e., the textual representation of what the target speaker is saying to produce the audio data) .
  • the text used for training a TTS model is generated by a speech recognition model and/or natural language understanding model which is specifically configured to recognize and interpret speech and provide the textual representation of the words that are recognized in the audio data.
  • the speaker is given a pre-determined script from which to read aloud, wherein the pre-determined script and the corresponding audio data are used to train the TTS model.
  • TTS models typically utilize a one-to-many mapping for the different prosody/target outputs corresponding to a single text input.
  • the one-to-many mapping can also make training of existing models relatively cumbersome and slow.
  • TTS models are typically not trained to utilize or generate speech data at the higher sampling rates mentioned (e.g., 40+ kHz), because it has been found that the higher frequencies require processing of longer structures within the models (e.g., long mel spectrogram and wave sample sequences), which can make it difficult to disambiguate between the different multi-modal outputs.
  • Disclosed embodiments are directed towards embodiments for generating, training, and utilizing text-to-speech (TTS) models.
  • the disclosed embodiments include generating and utilizing a variance adapter incorporated into new and unique TTS models for facilitating improved accuracy and efficiency in training the TTS models and for utilizing the TTS models to perform language synthesis in a variety of languages and prosody styles.
  • Some embodiments include methods and systems for obtaining, configuring and/or otherwise utilizing a text-to-speech (TTS) model that is configured for generating speech data from textual data in a target speaker language and a target speaker prosody style.
  • the disclosed embodiments include a system obtaining an encoder conformer that is configured to generate encoded phoneme data from received embedded phoneme data, the embedded phoneme data being based on the textual data, the encoder conformer including at least a convolution module, a self-attention module, the encoder conformer being further configured to process the embedded phoneme data by the convolution module prior to processing the embedded phoneme data with the self-attention module.
  • the disclosed embodiments also include a system obtaining a variance adapter trained to process the encoded phoneme data to generate corresponding variation information data and that is based at least in part on implicit data and explicit data corresponding to a target spectrogram associated with the target speaker language and the target speaker prosody style.
  • Disclosed embodiments also include a system obtaining a decoder conformer that is configured to generate a predicted spectrogram of speech data based on the corresponding variation information data and which corresponds directly to the textual data used to generate the embedded phoneme data, the predicted spectrogram of speech data being structured for rendering by a vocoder as speech data in the target speaker language and the target speaker prosody style.
  • Disclosed embodiments also include a system configuring the TTS model to perform TTS processing with (i) the encoder conformer, (ii) the variance adaptor , (iii) the decoder conformer, and (iv) computer-executable instructions which are executable by one or more hardware processors for causing: the encoder conformer to generate encoded phoneme data from the received embedded phoneme data and to provide the encoded phoneme data to the variance adaptor; the variance adaptor to process the encoded phoneme data with the implicit data and explicit data corresponding to the target spectrogram and to generate the corresponding variation information based on the target spectrogram; the variance adaptor to provide the corresponding variation information to the decoder conformer; and for the decoder conformer to generate the predicted spectrogram of speech data based on the refined encoded phoneme data.
  • Other disclosed embodiments include systems training the TTS models for generating speech data from textual data in a target speaker language and a target speaker prosody style by at least: identifying training data comprising textual data for which the TTS model is to be trained to generate corresponding speech data, as well as target spectrogram training data that is generated from spoken utterances of the textual data.
  • These other embodiments include the systems training a variance adaptor of the TTS model with the training data to generate variation information based on the training data and at least in part on the target spectrogram, as well as for configuring and training the variance adaptor to generate the variation information data by receiving and processing encoded phoneme data from an encoder conformer that creates the encoded phoneme data from phonemes obtained from the training data and for generating the variation information data based at least in part on implicit data comprising duration data and pitch features, as well as explicit data comprising prosody features and global tokens.
  • the disclosed embodiments also include configuring and/or training the variance adaptor to provide the variation information data to a decoder conformer that generates a predicted spectrogram that is structured to be rendered by a vocoder as the speech data in the target speaker language and the target speaker prosody style.
  • Some embodiments also include causing a vocoder to render the speech data (received at the vocoder as the predicted spectrogram) at a relatively higher sample rate (e.g., 40+kHz) than the relatively lower sample rate of the predicted spectrogram (e.g., 16kHz or 24 kHz) and to thereby cause the speech data to be output or rendered at a higher sample rate than the TTS model is configured to generate by itself without the vocoder.
  • Figure 1 illustrates a computing environment in which a computing system incorporates and/or is utilized to perform disclosed aspects of the disclosed embodiments.
  • the illustrated computing system is configured for facilitating TTS functionality and includes hardware storage device (s) and a plurality of machine learning engines.
  • the computing system is in communication with remote/third party system (s) .
  • Figure 2A illustrates one embodiment of system model components and flow diagram components corresponding to a TTS model with a multi-speaker and multi-lingual variance adapter.
  • Figure 2B illustrates another embodiment of system model components and flow diagram components corresponding to a TTS model with a multi-speaker and multi-lingual variance adaptor and which includes implicit and explicit information being provided to the acoustic model component of the TTS model.
  • Figure 2C illustrates one embodiment of system model components and flow diagram components corresponding to a TTS model with a multi-speaker and multi-lingual variance adapter and which includes model components for generating the various implicit and explicit information referenced in Figure 2B.
  • Figure 3 illustrates an embodiment of a TTS model having a multi-speaker and multi-lingual variance adapter.
  • Figure 4 illustrates an embodiment of a diagram of an improved conformer block to be used with a TTS model having a multi-speaker and multi-lingual variance adapter.
  • Figure 5 illustrates one embodiment of a flow diagram having a plurality of acts associated with methods for generating and training a TTS model with a multi-speaker and multi-lingual variance adapter and which can be refined for target speakers and languages.
  • Figure 6 illustrates one embodiment of a flow diagram having a plurality of acts associated with methods for synthesizing speech from text with a TTS model with a multi-speaker and multi-lingual variance adapter.
  • Disclosed embodiments include systems, methods and devices configured for generating, training, and utilizing TTS (text-to-speech) models, including some models configured with multi-lingual and multi-speaker variance adaptor components that generate and apply implicit and explicit data associated with target spectrograms corresponding to target speaker prosody styles and/or target languages.
  • the corresponding benefit of such improvements includes the ability to train TTS models for greater varieties of speaker prosodies and languages, as well as for facilitating the rendering of speech data in a desired target language and a target speaker prosody style corresponding to the textual data being processed.
  • variance adapter components can also be further augmented by using convolution layers/processing instead of linear layers for the encoder/decoders and by altering the processing ordering of encoding and decoding conformer layers, relative to existing systems.
  • the inventors were able to observe improvements in naturalness, similarity, and performance of speech synthesis with the disclosed TTS models, relative to baseline/conventional models, as measured with the standardized MOS, SMOS and WER speech synthesis measurement scales.
  • the inventors observed more than a 5% improvement in the measured MOS scoring accuracy of synthesizing speech from textual data into target speaking prosody styles and languages with the disclosed inventive multi-lingual and multi-speaker TTS models, as compared to corresponding use of conventional baseline multi-lingual and multi-speaker models that do not include the uniquely structured encoder/decoder conformers and variance adaptor components described herein for applying unique sets of implicit and explicit data and that fail to cause the vocoders to output/render the speech output at high frequency sample rates (e.g., 40+ kHz).
  • Figure 1 illustrates components of a computing system 110 which may include and/or be used to implement aspects of the disclosed invention.
  • the computing system 110 is illustrated as being incorporated within a broader computing environment 100 that also includes one or more remote system (s) 120 communicatively connected to the computing system 110 through a network 130 (e.g., the Internet, cloud, or other network connection (s) ) .
  • the remote system (s) 120 comprise one or more processor (s) 122 and one or more computer-executable instruction (s) stored in corresponding hardware storage device (s) 124, for facilitating processing/functionality at the remote system (s) , such as when the computing system 110 is distributed to include remote system (s) 120.
  • the computing system 110 incorporates and/or utilizes various components that enable the disclosed TTS functionality for obtaining, training, and utilizing the disclosed TTS models for performing TTS speech processing.
  • the TTS functionality processing performed by and/or incorporated into the computing system 110 and the corresponding computing system components includes, but is not limited to, identification of training data and the identification of textual data to be converted into speech data, parsing and splitting of textual data into utterance segments, words, phonemes and other language elements, normalization of the textual utterances, phonemes and other speech elements, identification and sequencing of phonemes and other language elements, encoding and decoding of phoneme and other language elements, NLP (natural language processing), audio generation and speech recognition of speech utterances and other language elements, conversion of phonemes and other language elements into spectrum space, such as mel spectrograms (e.g., the disclosed target and predicted spectrograms), natural language understanding, language translation, language prosody transforming, target prosody and target language identification, and vocoder interfacing.
  • the disclosed TTS functionality also includes generating and/or obtaining training data and training the disclosed models to generate spectrograms configured for and/or structured for rendering speech data by a vocoder in a variety of different target speaker prosody styles and target speaker languages.
  • the training functionality also includes training the disclosed individual variance adaptor components/models of the disclosed TTS models specifically configured to generate, obtain and/or apply the disclosed implicit and explicit prosody features and other speech attributes into the speech synthesis processing performed by the TTS models and, even more specifically, the variance adapter, as will be described in more detail throughout this disclosure.
  • Figure 1 illustrates the computing system 110 and some of the referenced components that incorporate and/or that enable the disclosed TTS functionalities.
  • the computing system 110 is shown to include one or more processor (s) 112 (such as one or more hardware processor (s) ) and a storage 140 (i.e., hardware storage device (s) ) storing computer-executable instructions 118 wherein the storage 140 is able to house any number of data types and any number of computer-executable instructions 118 by which the computing system 110 is configured to implement one or more aspects of the disclosed embodiments when the computer-executable instructions 118 are executed by the one or more processor (s) 112.
  • the computing system 110 is also shown to include one or more user interface (s) 114 and input/output (I/O) device (s) 116.
  • the one or more user interface (s) 114 and input/output (I/O) device (s) 116 include, but are not limited to, speakers, microphones, vocoders, display devices, browsers and application displays and controls for receiving and displaying/rendering user inputs.
  • These user inputs include inputs for identifying and/or generating the referenced training data, inputs for identifying textual data to be processed, and inputs for identifying target prosody speaking styles and/or target languages for which speech data is to be synthesized based on the textual data.
  • the user inputs can be entered in various formats (audio commands or other audio formats, typed characters or other textual formats, gestures or other visual formats) through any combination of the referenced interfaces.
  • interface menus with selectable control elements, stand-alone control objects and other control features are also provided within the interface (s) 114 (not presently shown) for receiving user input that is operable, when received, to trigger access to the referenced models and model components, as well as for triggering controlled execution of instructions for implementing any of the disclosed functionality by the disclosed models and model components.
  • the storage 140 is shown as a single storage unit. However, it will be appreciated that the storage 140 is, in some embodiments, a distributed storage that is distributed to several separate and sometimes remote systems 120.
  • the system 110 will comprise a distributed system, in some embodiments, with one or more of the system 110 components being maintained/run by different discrete systems that are remote from each other and that each perform different tasks. In some instances, a plurality of distributed systems performs similar and/or shared tasks for implementing the disclosed functionality, such as in a distributed cloud environment.
  • storage 140 is configured to store one or more of the following: training data 142, target prosody and language profiles 144 and target and predicted spectrogram data 146 (which includes the spectrograms and information used for generating the spectrograms, as well as up-sampled versions of the spectrograms for rendering and up-sampling of the predicted spectrograms, for example, by a corresponding vocoder) .
  • the target prosody and language profiles 144 will include a plurality of different profiles for different speakers having different prosody styles and profiles corresponding to different languages. These language profiles can also include lexicons for each language/dialect that is profiled.
  • the target prosody and language profiles 144 correspond to training data, in some instances, which is used to refine the disclosed models to convert detected text into a target prosody and language associated with one of the stored profiles.
  • the storage 140 includes computer-executable instructions 118 for instantiating or executing one or more of the models and/or engines shown in computing system 110.
  • the one or more models are configured as machine learning models or machine learned models.
  • the one or more models are configured as a pipeline of different models and/or corresponding algorithms that are configured/trained to collectively generate predicted spectrograms corresponding to identified text in a target prosody and language.
  • one or more of the disclosed models are configured as engines or discrete processing systems (e.g., computing systems integrated within computing system 110), wherein each engine (i.e., model) comprises one or more processors (e.g., hardware processor (s) 112) and corresponding computer-executable instructions 118 for implementing the disclosed functionality associated with the associated model (s).
  • target speaker prosody and language profiles 144 comprises electronic content/data obtained from a target speaker in the form of audio data, text data and/or visual data. Additionally, or alternatively, in some embodiments, the target speaker prosody and language profiles 144 comprise metadata (i.e., speech prosody and other speech attributes, speaker identifiers, etc. ) corresponding to the particular speaker from which the data is collected. In some embodiments, the metadata comprises attributes associated with the identity of the speaker, characteristics of the speaker and/or the speaker’s voice and/or information about where, when and/or how the speaker data is obtained.
  • the speaker prosody and language profiles 144 includes raw data (e.g., direct recordings) . Additionally, or alternatively, in some embodiments, the speaker prosody and language profiles 144 comprise processed data (e.g., waveform format of the speaker data and/or PPG data (e.g., posteriorgram data) corresponding to a target speaker) .
  • the frame length for each piece of phonetic information comprises whole phrases of speech, whole words of speech, particular phonemes of speech and/or a pre-determined time duration.
  • the frame comprises a time duration selected between 1 millisecond to 10 seconds, or more preferably between 1 millisecond to 1 second, or even more preferably between 1 millisecond to 50 milliseconds, or yet even more preferably, a duration of approximately 12.5 milliseconds.
  • the prosody attributes extracted from the mel spectrum are included in the prosody feature data used by the fine-grained prosody module 316 (described in reference to Figure 3) .
  • the prosody feature data comprises additional prosody features or prosody attributes.
  • the additional prosody features comprise attributes corresponding to the pitch and/or energy contours of the speech waveform data.
  • the storage device (s) 140 can also store the pitch and intensity data in one or more storage containers.
  • the spectrogram data comprises a plurality of spectrograms.
  • spectrograms are a visual representation of the spectrum of frequencies of a signal as it varies with time (e.g., the spectrum of frequencies that make up the speaker data) .
  • spectrograms are sometimes called sonographs, voiceprints or voicegrams.
  • the spectrograms included in the spectrogram data are characterized by the prosody style of a target speaker and in a target language.
  • the spectrograms are converted to the mel-scale.
  • the mel-scale is a non-linear scale of pitches determined by listeners to be equidistant from each-other, and more closely mimics human response/human recognition of sound versus a linear scale of frequencies.
  • the spectrogram data comprises the mel-frequency cepstrum (MFC), i.e., a representation of the short-term power spectrum of a sound based on a linear cosine transformation of a log power spectrum on a nonlinear mel scale of frequency, the coefficients of which are referred to as mel-frequency cepstral coefficients (MFCCs).
  • the frequency bands are equally spaced on the mel scale for an MFC.
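  • As a non-limiting illustration of how such mel-scale spectrogram data and MFCCs might be computed from recorded target speaker audio, the following sketch uses the librosa library; the 16 kHz sample rate, 80 mel bins and 12.5 millisecond frame shift are assumptions drawn from values mentioned elsewhere in this disclosure, not required settings.

```python
import librosa
import numpy as np

def extract_mel_features(wav_path, sr=16000, n_mels=80, frame_shift_ms=12.5):
    """Illustrative sketch: compute a log-power mel spectrogram (and MFCCs)
    from a target speaker recording, using the mel scale described above."""
    audio, sr = librosa.load(wav_path, sr=sr)
    hop_length = int(sr * frame_shift_ms / 1000)       # 12.5 ms -> 200 samples at 16 kHz
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1024, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                  # log-power mel spectrogram
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)   # mel-frequency cepstral coefficients
    return log_mel, mfcc

def hz_to_mel(f_hz):
    """Common (HTK-style) formula mapping a frequency in Hz onto the mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)
```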
  • the hardware storage device 140 stores various machine learned or learning models (ML Models 148) .
  • ML Models 148 include the TTS models and model components described throughout this paper and any other models required to implement the functionality described herein.
  • These models include, for example, the variance adaptor (including the corresponding pitch module, duration predictor, fine-grained prosody module, and global style token module), the encoder and decoder conformer blocks, and the spectrogram, ASR and NLP models that are described herein.
  • the described TTS models are trainable or trained to convert input text to speech data. For example, a portion of an email containing one or more sentences (e.g., a particular number of machine-recognizable utterances/words) is applied to the disclosed TTS model (s) to cause the TTS model (s) to recognize words or parts of words (e.g., phonemes) and to produce a corresponding spectrogram and/or sound (with a corresponding vocoder) to the phonemes or utterances/words.
  • the neural TTS model is adapted for a particular target speaker and language to generate speech output (e.g., audio and/or spectrogram) having prosody and other attributes of the target speaker and in the target language (which may be a same or different language than the source textual data being converted to speech data) .
  • Figures 2A-2B show overview examples 200A-200B of the disclosed TTS systems and TTS processing flows.
  • the high-level overview example 200A shows how text 201 is received by an NLP Module 202 (or another one of the disclosed language processing modules/models) to parse, normalize and sequence a plurality of corresponding phonemes 203.
  • these phoneme strings are encoded and processed by the acoustic module 204 to generate synthesized speech data corresponding to the text 201, preferably by applying/considering implicit and explicit data associated with target prosody and language attributes, and are ultimately decoded into corresponding mel spectrograms 205 that are converted/rendered by a vocoder 206 into audible speech in the prosody and language of a target speaker.
  • the vocoder up-samples the mel spectrograms (configured in a relatively low 16 kHz or 24 kHz sample rate) into a waveform output having a relatively higher 40+ kHz sample rate (e.g., 41000 Hz or 48000 Hz) .
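  • The overall flow of example 200A can be summarized with the following minimal sketch; the function and module names (normalize_and_phonemize, predict_mel, render) are hypothetical stand-ins used only to make the data flow concrete, not components defined by this disclosure.

```python
# Minimal, illustrative sketch of the flow in example 200A (assumed names).
def synthesize(text, nlp_module, acoustic_module, vocoder):
    phonemes = nlp_module.normalize_and_phonemize(text)   # text 201 -> phoneme sequence 203
    mel = acoustic_module.predict_mel(phonemes)            # phonemes -> mel spectrogram 205 (e.g., 16/24 kHz features)
    waveform = vocoder.render(mel)                          # mel -> up-sampled waveform (e.g., 48 kHz audio)
    return waveform
```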
  • Figure 2B illustrates a related example 200B of a TTS system and flow in which it is shown how the acoustic module 204 is configured to identify and apply/process implicit information/data 210 and explicit information/data 220 to the language data being processed by the acoustic module 204 to generate predicted spectrogram data structured to be rendered by the vocoder 206 in the target speaker prosody style and language.
  • the target speaker prosody style and language, which is used as a basis for selecting the implicit and explicit data by the acoustic model of the TTS system/model to process the synthesized speech data from source textual data into speech data in the target speaker prosody style and language, is automatically and/or manually identified/selected, based on preconfigured settings, based on contextual environmental conditions (e.g., detected user, geography, mood, etc.) and/or based on explicit user input and/or system settings.
  • the referenced prosody style refers to a set, or a sub-set, of prosody attributes.
  • the prosody attributes correspond to a particular speaker (e.g., a target speaker or a source speaker) .
  • a particular prosody style is assigned an identifier, for example, a name identifier.
  • the prosody styles are associated with a name identifier that identifies the speaker from which the prosody style is generated/obtained.
  • the prosody styles comprise descriptive identifiers, such as story-telling style (e.g., a speaking manner typically employed when reading a novel aloud or relating a story as part of a speech or conversation) , newscaster style (e.g., a speaking manner typically employed by a newscaster, in delivering news in a factual, unemotional, direct style) , presentation style (e.g., a formal speaking style typically employed when a person is giving a presentation) , conversational style (e.g., a colloquial speaking style typically employed by a person when speaking to a friend or relative) , etc.
  • Additional styles include, but are not limited to a serious style, a casual style, and a customer service style. It will be appreciated that any other type of speaking style, besides those listed, can also be used for training an acoustic model with corresponding training data of said style (s) .
  • the prosody styles are attributed to typical human-expressed emotions such as a happy emotion, a sad emotion, an excited emotion, a nervous emotion, or other emotion.
  • a particular speaker is feeling a particular emotion and thus the way the speaker talks is affected by the particular emotion in ways that would indicate to a listener that the speaker is feeling such an emotion.
  • a speaker who is feeling angry may speak in a highly energized manner, at a loud volume, and/or in truncated speech.
  • a speaker may wish to convey a particular emotion to an audience, wherein the speaker will consciously choose to speak in a certain manner.
  • a speaker may wish to instill a sense of awe into an audience and will speak in a hushed, reverent tone with slower, smoother speech.
  • the prosody styles are not further categorized or defined by descriptive identifiers.
  • TTS systems/models described herein are configured/trained with training data 142 to generate speech (in any of a combination of languages and/or prosody styles) from arbitrary text, based on training data that is specific to and that corresponds to the aforementioned target speaker prosody style and language profiles 144.
  • TTS systems for enabling the disclosed processing are described and shown in more detail in the example 200C of Figure 2C.
  • internal tools are used to extract a target spectrogram from audio sample 232 provided to the system (e.g., by a target speaker) for training the model (s) to perform the TTS processing described herein and/or that are automatically generated in a target prosody style and language from arbitrary textual data when processing speech output for the arbitrary textual data.
  • the internal tools can include an ASR model, for example, or other programming modules for performing the described functionality.
  • the target spectrogram is then used by or applied to a fine-grained prosody module 270 and a global style token module 280 to generate the implicit information 210 or data comprising phoneme-level prosody vectors and utterance level global tokens, respectively.
  • explicit information 220 or data comprising target pitch, speaker ID, language ID and/or duration for the audio sample 232 and/or a target speaker prosody style and/or target language are also obtained using different configurable modules/models.
  • the duration predictor 260 is used to predict durations of different utterances and/or phonemes for speech data associated with the target speech prosody style and/or language.
  • the pitch module 250 is used to process target pitch information from a target pitch that is obtained by various modules/tools of the system based on the audio sample 232 and/or a target speaker prosody style and/or target language.
  • the aforementioned prosody and language profiles 144 are also usable to help identify and provide information used by the various modules to generate the implicit and/or explicit information (210, 220) .
  • the implicit and/or explicit information is provided by the modules and/or processed by the various modules as feature sets or vectors of attributes associated with the implicit and/or explicit information and which is used to influence the processing of the encoded phoneme sequences and to facilitate the acoustic model generating predicted spectrogram data (e.g., mel spectrograms) information for the textual data to be converted to speech data that conforms to the desired target speaker prosody styles and/or target languages.
  • Figure 3 illustrates an example 300 of a TTS system that includes a variance adaptor 310 that is configured to apply implicit and explicit data associated with a target speaker prosody style and/or language.
  • the implicit and explicit data is obtained by the variance adaptor 310 by using various independent modules/models, including the duration predictor 312, the pitch module 314, the fine-grained prosody module 316 and the global style token module 318, and which may utilize target spectrogram data 320 associated with or generated using the target speaker prosody style and/or language profile data.
  • some additional explicit data comprising language embeddings 324 and speaker embeddings 328 is applied in some instances, which corresponds to the identified language ID 322 and speaker ID 326 (e.g., which can be obtained from the aforementioned speaker profiles) , to further improve the efficiency and accuracy of the TTS models that use the additional data to render TTS speech data in a target speaker prosody style and language.
  • the illustrated example 300 also shows how phoneme data 302 is processed to generate token embeddings for or by encoder conformer blocks 306 prior to being processed by the variance adaptor 310.
  • Additional length regulator 330 data and/or modules can be used by decoder conformer blocks, while decoding the output from the variance adaptor 310, to identify lengths of the speech data utterances to be converted into the predicted spectrogram 360 that is structured to be rendered by a vocoder in the target speaking prosody style and language.
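  • A minimal sketch of the kind of length regulation referenced above is shown below, assuming per-phoneme integer frame durations; it simply repeats each phoneme-level hidden vector for its duration so that the decoder sees a frame-level sequence. This is an illustrative implementation, not necessarily the length regulator 330 of Figure 3.

```python
import torch

def length_regulate(encoder_outputs, durations):
    """Illustrative length regulator: expand each phoneme-level hidden vector
    according to its (predicted or ground-truth) frame duration so the sequence
    length matches the target spectrogram length.

    encoder_outputs: (batch, phonemes, hidden)
    durations:       (batch, phonemes) integer frame counts
    """
    expanded = []
    for hidden, dur in zip(encoder_outputs, durations):
        # repeat_interleave repeats each phoneme vector dur[i] times along the time axis
        expanded.append(torch.repeat_interleave(hidden, dur, dim=0))
    # pad to the longest expanded sequence in the batch
    return torch.nn.utils.rnn.pad_sequence(expanded, batch_first=True)
```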
  • Figure 4 shows a more detailed view of the encoder/decoder conformer block structure 400, as well as Figures 5 and 6, which show flow diagrams of acts associated with methods for configuring and/or utilizing the disclosed TTS systems/models (500) and for training (600) the disclosed TTS systems/models.
  • the disclosed TTS systems/models utilize uniquely configured encoders/decoders that include stacks of self-attention (420) , depth-wise convolution (430) and convolution feed forward layers (410 and 440) .
  • the ordering of the encoder/decoder blocks or layers is different than used by conventional encoders and decoders.
  • the current encoder/decoder conformer blocks are sequenced in the processing pipelines to perform depth-wise convolution processing to the encoded data with the convolution module (430) prior to the application of self-attention processing to the encoded data.
  • convolution processing and self-attention processing in encoders and decoders for TTS modules are known to those of skill in the art and will not be described in detail at this time.
  • current encoder/decoder conformer blocks utilize a plurality of feed forward modules/layers for performing additional convolution to the encoded data, rather than using linear processing layers that are used by conventional systems.
  • These modifications and this combination of self-attention and convolution enable the models/systems to use the self-attention to learn global interactions while the convolution processing efficiently captures the local correlations.
  • the end result of using these unique structures, as observed, is that they enable the current models to outperform conventional TTS models in terms of improved prosody and audio fidelity relative to the conventional systems.
  • the improved conformer block structure shown in Figure 4 is composed of only four modules/layers/blocks that are stacked together in the processing pipeline, i.e., a convolutional feed-forward module, a light-weight convolution module, a self-attention module and a second convolutional feed-forward module at the end.
  • Both the text encoder (encoder conformer blocks 306) and the mel decoder (decoder conformer blocks 340) are composed with this structure and order of sequenced blocks.
  • one or more of the text encoder (encoder conformer blocks 306) and the mel decoder (decoder conformer blocks 340) are composed of multiple sets of the sequenced conformer blocks (410, 420, 430, 440), such as two, three, four, five, six, or more sets of the sequenced conformer blocks in a single processing pipeline.
  • the output of every conformer layer is then projected, separately, to an 80-bin mel spectrogram, so that an iterative loss can be applied between the predicted mel spectrogram and the target mel spectrogram.
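  • The following PyTorch sketch illustrates the block ordering described above (convolutional feed-forward, then convolution, then self-attention, then a second convolutional feed-forward) together with a per-layer 80-bin mel projection for the iterative loss. The layer sizes, kernel widths, and the use of a standard depth-wise convolution in place of the light-weight convolution module are illustrative assumptions, not the specific configuration of the disclosed models.

```python
import torch
import torch.nn as nn

class ConvFeedForward(nn.Module):
    """Convolutional feed-forward module (used in place of linear layers)."""
    def __init__(self, dim, kernel_size=3, expansion=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim * expansion, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(dim * expansion, dim, kernel_size, padding=kernel_size // 2),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                  # x: (batch, time, dim)
        y = self.net(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + y)                            # residual connection

class ConformerBlock(nn.Module):
    """Block order per Figure 4: conv feed-forward -> (depth-wise) convolution
    -> self-attention -> second conv feed-forward."""
    def __init__(self, dim=384, heads=4, kernel_size=3):
        super().__init__()
        self.ff1 = ConvFeedForward(dim, kernel_size)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff2 = ConvFeedForward(dim, kernel_size)
        self.to_mel = nn.Linear(dim, 80)                   # per-layer 80-bin projection for the iterative loss

    def forward(self, x):
        x = self.ff1(x)
        x = x + self.depthwise(x.transpose(1, 2)).transpose(1, 2)  # local correlations first
        attn_out, _ = self.attn(x, x, x)
        x = x + attn_out                                            # then global interactions
        x = self.ff2(x)
        return x, self.to_mel(x)                                    # hidden states + mel projection
```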
  • the disclosed TTS models/systems model speech conditions with both explicit and implicit perspectives to facilitate handling of the variety of information conditions associated with different input textual data and target speaker prosody styles and languages to use for the predicted/generated output speech data.
  • explicit information such as speaker ID, language ID, pitch, and duration is used along with implicit information such as utterance-level global style tokens and phoneme-level fine-grained prosody vectors extracted from the referenced target mel spectrums associated with the target prosody styles/languages.
  • Such processes and information can be used to fine tune the pre-trained multi-speaker and multi-lingual model with data of one or more different target speakers and for one or more different target languages.
  • the systems also predict explicit and implicit information for different granularity with separate predictors. Then, by unifying explicit and implicit information with different granularity, it is possible to further facilitate improvements in the naturalness, accuracy, and expressiveness of the generated speech data. Additionally, to better trade off training/decoding efficiency, modelling stability and voice quality, the systems are configured to predict 16 kHz mel spectrograms with the multi-speaker and multi-lingual acoustic models, and to thereafter model 48 kHz wave samples with a vocoder (e.g., a HiFiNet vocoder) with larger receptive field size than the underlying spectrograms.
  • the vocoder is configured/conditioned with the mel spectrograms and trained with the multi-speaker and multi-lingual internal data profiles to further improve the performance of the vocoder to render speech outputs in the target speaker prosody style (s) and language (s) .
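  • As a simple illustration of the sample-rate relationship described above (and not the actual HiFiNet configuration), a vocoder that renders 48 kHz audio from mel frames predicted at an assumed 12.5 millisecond hop must generate 600 waveform samples per mel frame, so its up-sampling factors must multiply to 600:

```python
# Illustrative arithmetic only; the hop size and factor split are assumptions.
frame_shift_s = 0.0125                                # assumed mel frame hop (12.5 ms)
target_sr = 48000                                     # 48 kHz output wave samples
samples_per_frame = int(target_sr * frame_shift_s)    # 600 waveform samples per mel frame
upsample_factors = [5, 5, 4, 3, 2]                    # example transposed-conv up-sampling factors
product = 1
for f in upsample_factors:
    product *= f
assert product == samples_per_frame == 600
```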
  • the disclosed systems configure and train the acoustic model used by the TTS model/system as an encoder-decoder based multi-speaker, multi-lingual variance adapter, with duration predictor, pitch predictor, global style token (GST) predictor and fine-grained prosody predictor.
  • the disclosed systems also configure and train language processing models to generate phoneme sequences from normalized text that are subsequently encoded by the uniquely configured encoder conformer blocks and then processed by the variance adaptor with implicit and explicit information associated with a target speaker prosody style and/or language.
  • Some embodiments include configuring and/or training the TTS model with a speech recognition model (e.g., an HMM based speech recognition model) to extract forced alignment for duration prediction of the speech outputs being generated.
  • tools internal to the TTS system are used to extract frame-level pitch and mel spectrogram for the target prosody/language based on stored profile information. Additionally, the systems are further configured/trained to average the frame-level pitch to phone level according to the duration extracted from forced alignment.
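  • A minimal sketch of averaging frame-level pitch to the phone level according to forced-alignment durations might look like the following; ignoring unvoiced (zero-F0) frames during averaging is an illustrative choice, not a requirement of this disclosure.

```python
import numpy as np

def average_pitch_to_phone_level(frame_pitch, phone_durations):
    """Average frame-level F0 to phone level using forced-alignment durations.

    frame_pitch:     1-D array of per-frame pitch values (e.g., from an internal tool)
    phone_durations: list of frame counts per phone from forced alignment
    """
    frame_pitch = np.asarray(frame_pitch, dtype=float)
    phone_pitch = []
    start = 0
    for dur in phone_durations:
        segment = frame_pitch[start:start + dur]
        voiced = segment[segment > 0]                 # ignore unvoiced (zero-F0) frames
        phone_pitch.append(voiced.mean() if voiced.size else 0.0)
        start += dur
    return np.asarray(phone_pitch)
```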
  • the explicit information used by the system includes language ID, speaker ID, and duration. Additionally, acoustic residual information of different granularity can be used by the system to describe different aspects of the acoustic features. Accordingly, the systems are able to extract unsupervised prosody vectors directly from the mel spectrums to provide information that supervised signals lack, such that the explicit and implicit information modelling is complementary for predicting fine-grained frame-level mel spectrums.
  • explicit information modeling performed by the system will be based on and/or include one or more of a language ID, a speaker ID, a pitch, and a duration for the target prosody style and/or language.
  • phone-level pitch, when used, can provide for stable prediction.
  • Implicit information modelling includes global style tokens (GST) at the utterance level and fine-grained prosody at the phone level. Pitch, duration, GST, and fine-grained prosody require prediction at inference time, while language/speaker IDs do not.
  • the GST and fine-grained prosody vectors are extracted from the target mel spectrums while pitch and duration use ground truth target.
  • the GST is configured with 3 sub modules, namely, a reference encoder, a style token layer, and a text predictor (not shown) .
  • the reference encoder is made up of convolution and RNN layers. It takes a mel spectrogram as input, and the last GRU state serves as the global reference embedding, which is then fed as input to the style token layer.
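  • A simplified PyTorch sketch of such a reference encoder is shown below; the channel counts and embedding size are illustrative assumptions. The style token layer (not sketched) would then use this reference embedding as a query to attend over a bank of learnable style tokens.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Illustrative reference encoder: convolution layers over the target mel
    spectrogram followed by a GRU; the last GRU state serves as the global
    reference embedding fed to the style token layer."""
    def __init__(self, n_mels=80, conv_channels=32, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, conv_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(conv_channels, conv_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        reduced_mels = n_mels // 4                    # two stride-2 convolutions
        self.gru = nn.GRU(conv_channels * reduced_mels, embed_dim, batch_first=True)

    def forward(self, mel):                           # mel: (batch, time, n_mels)
        x = self.conv(mel.unsqueeze(1))               # (batch, channels, time', mels')
        b, c, t, m = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * m)
        _, last_state = self.gru(x)                   # last GRU state = global reference embedding
        return last_state.squeeze(0)                  # (batch, embed_dim)
```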
  • Fine-grained phoneme-level prosody vector aligns the ground truth spectrogram with the encoder outputs using attention.
  • the disclosed systems also directly use the corresponding latent representations as phone level vectors for training stability.
  • the explicit and implicit prosody vector dimension is sixteen. In other embodiments, the explicit and implicit vector dimensions are greater than or less than sixteen.
  • the text predictor for explicit and implicit information contains a GRU layer followed by a bottleneck module to predict the final prosody vector.
  • This fine-grained prosody predictor takes both the text encoder output and the GST embedding as input; with the GST's help, it is not necessary to provide an auto-regressive predictor, as is required by some conventional models, which provides for faster inference.
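  • A minimal sketch of such a predictor, assuming illustrative hidden sizes and the sixteen-dimensional prosody vector mentioned above, might look like the following:

```python
import torch
import torch.nn as nn

class ProsodyTextPredictor(nn.Module):
    """Illustrative fine-grained prosody predictor: a GRU layer followed by a
    bottleneck to predict the (e.g., 16-dimensional) phone-level prosody vector
    from the text encoder output concatenated with the GST embedding."""
    def __init__(self, encoder_dim=384, gst_dim=128, hidden=256, prosody_dim=16):
        super().__init__()
        self.gru = nn.GRU(encoder_dim + gst_dim, hidden, batch_first=True)
        self.bottleneck = nn.Sequential(
            nn.Linear(hidden, hidden // 4),
            nn.ReLU(),
            nn.Linear(hidden // 4, prosody_dim),
        )

    def forward(self, encoder_out, gst_embedding):     # (batch, phones, enc), (batch, gst)
        gst = gst_embedding.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        hidden, _ = self.gru(torch.cat([encoder_out, gst], dim=-1))
        return self.bottleneck(hidden)                  # (batch, phones, prosody_dim)
```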
  • Figure 5 illustrates a flow diagram 500 that includes various acts associated with exemplary methods that can be implemented by computing systems, such as computing system 110 of Figure 1.
  • the flow diagram 500 includes a plurality of acts (act 510, act 520, act 530, act 540, and act 550) which are associated with various methods for obtaining, generating, configuring and/or otherwise utilizing the disclosed TTS models to perform TTS processing.
  • the first illustrated act includes the system obtaining an encoder conformer that is configured to generate encoded phoneme data from received embedded phoneme data, the embedded phoneme data being based on the textual data, the encoder conformer including at least a convolution module, a self-attention module, the encoder conformer being further configured to process the embedded phoneme data by the convolution module prior to processing the embedded phoneme data with the self-attention module.
  • the system also obtains a variance adaptor trained to process the encoded phoneme data to generate corresponding variation information data and that is based at least in part on implicit data and explicit data corresponding to a target spectrogram associated with the target speaker language and the target speaker prosody style (act 520) .
  • the system also obtains a decoder conformer that is configured to generate a predicted spectrogram of speech data based on the corresponding variation information data and which corresponds directly to the textual data used to generate the embedded phoneme data, the predicted spectrogram of speech data being structured for rendering by a vocoder as speech data in the target speaker language and the target speaker prosody style (act 530) .
  • the system also configures the TTS model to perform TTS processing with (i) the encoder conformer, (ii) the variance adaptor, (iii) the decoder conformer, and (iv) computer-executable instructions which are executable by one or more hardware processors for causing (a) the encoder conformer to generate encoded phoneme data from the received embedded phoneme data and to provide the encoded phoneme data to the variance adaptor, (b) the variance adaptor to process the encoded phoneme data with the implicit data and explicit data corresponding to the target spectrogram and to generate the corresponding variation information data based on the target spectrogram, (c) the variance adaptor to provide the corresponding variation information data to the decoder conformer, and (d) for the decoder conformer to generate the predicted spectrogram of speech data based on the refined encoded phoneme data.
  • the encoder conformer is configured as one or more sets of ordered encoder conformer blocks or stacks of blocks that collectively include, for each of the one or more sets, (i) a first convolutional feed-forward module, (ii) the convolution module, (iii) the self-attention module, and (iv) a second convolutional feed-forward module.
  • the variance adaptor includes, in some configurations, a fine-grained prosody module trained for identifying prosody features associated with corresponding speech utterances of the textual data and corresponding target spectrogram training data, and a global style token module trained for generating global tokens associated with the spoken utterances of the textual data and corresponding target spectrogram training data, wherein the referenced explicit data or information includes the reference prosody features and the global tokens.
  • the implicit data includes the duration data and the pitch data obtained from a duration predictor of the variance adaptor that is configured or trained for predicting duration data comprising durations of phonemes for speech utterances identified in the textual data and corresponding target spectrogram data, and a pitch module of the variance adaptor that is configured or trained for identifying pitch features associated with the speech utterances of the textual data and corresponding target spectrogram training data.
  • Processed language embeddings and speaker embeddings for different languages and speakers associated with the textual speech can also be included in the implicit data for generating the corresponding variation information data based on the target spectrogram for the target speaker in the target language.
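  • The following simplified sketch illustrates one way a variance adaptor could fuse the explicit data (speaker/language embeddings, phone-level pitch, durations) and implicit data (fine-grained prosody vectors, a GST embedding) into the encoded phoneme sequence before length regulation; the additive fusion scheme and the dimensions are assumptions for illustration only, not the specific architecture of the disclosed variance adaptor.

```python
import torch
import torch.nn as nn

class VarianceAdaptorSketch(nn.Module):
    """Illustrative fusion of explicit and implicit data into encoded phonemes."""
    def __init__(self, dim=384, n_speakers=100, n_languages=10, prosody_dim=16, gst_dim=128):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, dim)
        self.language_emb = nn.Embedding(n_languages, dim)
        self.pitch_proj = nn.Linear(1, dim)              # phone-level pitch -> hidden
        self.prosody_proj = nn.Linear(prosody_dim, dim)  # fine-grained prosody vectors -> hidden
        self.gst_proj = nn.Linear(gst_dim, dim)          # utterance-level GST embedding -> hidden

    def forward(self, encoded, speaker_id, language_id, pitch, prosody_vec, gst, durations):
        x = encoded                                                # (batch, phones, dim)
        x = x + self.speaker_emb(speaker_id).unsqueeze(1)          # explicit: speaker ID
        x = x + self.language_emb(language_id).unsqueeze(1)        # explicit: language ID
        x = x + self.pitch_proj(pitch.unsqueeze(-1))               # explicit: phone-level pitch
        x = x + self.prosody_proj(prosody_vec)                     # implicit: fine-grained prosody
        x = x + self.gst_proj(gst).unsqueeze(1)                    # implicit: global style tokens
        # explicit: durations drive length regulation up to the frame level
        expanded = [torch.repeat_interleave(h, d, dim=0) for h, d in zip(x, durations)]
        return torch.nn.utils.rnn.pad_sequence(expanded, batch_first=True)
```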
  • the system provides a remote system (e.g., 120) with controlled access to the TTS model by at least providing the remote system with access to the encoder conformer, the variance adaptor, and the decoder conformer, as well as selective control for triggering execution of the computer-executable instructions and for identifying and/or providing the textual data that is processed by the TTS model.
  • These controls can be provided with application interfaces, selectable control objects that, when selected, trigger the execution of the model (s) and/or selection of data to be processed.
  • Other controls include input fields for receiving commands and/or explicit information that identifies the data to be processed. These controls can also be used to provide user control for selectively identifying the target speaker language and the target speaker prosody style to base the output on.
  • the disclosed embodiments also include acts for identifying the target speaker language and the target speaker prosody style, identifying the textual data, using the TTS model to generate the predicted spectrogram based on the textual data, as well as the target speaker language and the target speaker prosody style, and for causing a vocoder to render the predicted spectrogram as the speech data in the target speaker language and the target speaker prosody style.
  • the vocoder up-samples the output to render the speech data in a higher sample rate spectrum (40+kHz) than an underlying predicted spectrogram.
  • Figure 6 illustrates a flow diagram 600 that includes various acts (act 610, act 620, act 630, act 640, act 650 and act 660) associated with exemplary methods that can be implemented by computing systems, such as computing system 110, for training and/or utilizing the TTS models described herein.
  • Act 610 includes the system identifying training data comprising textual data for which the TTS model is to be trained to generate corresponding speech data, as well as target spectrogram training data that is generated from spoken utterances of the textual data.
  • Act 620 includes the system training a variance adaptor of the TTS model with the training data to generate variation information data based on the training data and at least in part on the target spectrogram.
  • This training includes training a duration predictor for predicting duration data comprising durations of phonemes identified in the training data for the speech utterances of the textual data and corresponding target spectrogram training data.
  • the training also includes training a pitch module for identifying pitch features associated with the speech utterances of the textual data and corresponding target spectrogram training data.
  • the training also includes training a fine-grained prosody module for identifying prosody features associated with corresponding speech utterances of the textual data and corresponding target spectrogram training data.
  • the training also includes training a global style token module for generating global tokens associated with the speech utterances of the textual data and corresponding target spectrogram training data.
  • Act 630 includes the system configuring and training the variance adaptor to generate the variation information data by receiving and processing encoded phoneme data from an encoder conformer that creates the encoded phoneme data from phonemes obtained from the training data.
  • Act 640 includes the system configuring and training the variance adaptor to generate the variation information data based at least in part on implicit data comprising the duration data and the pitch features, as well as explicit data comprising the prosody features and the global tokens that correspond to the training data used to generate the encoded phoneme data, the encoded phoneme data comprising phonemes that are encoded by the encoder conformer.
  • Act 650 includes the system further configuring and training the variance adaptor to provide the variation information data to a decoder conformer that generates a predicted spectrogram that is structured to be rendered by a vocoder as the speech data in the target speaker language and the target speaker prosody style.
  • Disclosed methods also include training the decoder conformer to generate the predicted spectrogram of speech data based on the variation information data received from the variance adaptor and by training the decoder conformer to generate the predicted spectrogram for rendering by the vocoder as the speech data in the target speaker language and the target speaker prosody style.
  • Disclosed methods also include training the variance adaptor to generate the variation information data based at least in part on language embeddings and speaker embeddings corresponding to the target speaker language and the target speaker prosody style, in addition to multiple different speaker languages and speaker prosody styles.
  • Disclosed methods also include configuring encoder and/or decoder conformers with layers or blocks that are stacked/sequenced in a processing pipeline with a first processing layer of a first convolutional feed-forward module, next a convolution module, followed by a self-attention module, and finally by a subsequent second convolutional feed-forward module, and such that the encoder/decoder conformer is configured to process any embedded phoneme data that is received with the convolution module prior to processing the embedded phoneme data with the self-attention module while generating the encoded phoneme data.
  • Disclosed methods also include configuring the decoder conformer to provide the predicted spectrogram to the vocoder for rendering by the vocoder and, in some instances, at a higher sample rate than the predicted spectrogram.
  • Disclosed methods also include configuring the TTS model to identify user input identifying the target speaker language and the target speaker prosody style, as well as user-identified textual data to be processed by the TTS model during run-time synthesis of speech data from the user-identified textual data.
  • the model also obtains and/or generates the target spectrogram based on the target speaker language and the target speaker prosody style and using the TTS model during run-time, subsequent to training the TTS model, to generate the predicted spectrogram for the user-identified textual data, and based on phonemes obtained from the user-identified textual data, and based on the target spectrogram and for providing the predicted spectrogram to the vocoder for causing the vocoder to render the speech data in the target speaker language and the target speaker prosody style.
  • the system further identifies/receives user input specifying the speaker language and the target speaker prosody style to use for the underlying/supplemental implicit and explicit data.
  • the disclosed embodiments can be utilized to provide technical benefits over conventional TTS systems and corresponding methods for generating, training, and using TTS models for text-to-speech data generation.
  • by utilizing the disclosed variance adapter with the TTS models/systems, it is possible to improve the accuracy and stability of the TTS models trained to perform speech synthesis in a variety of target languages and prosody styles.
  • Embodiments of the present invention may comprise or utilize a special purpose or general-purpose computer (e.g., computing system 110) including computer hardware, as discussed in greater detail below.
  • Embodiments within the scope of the present invention also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures.
  • Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system.
  • Computer-readable media e.g., storage 140 of Figure 1 that store computer-executable instructions (e.g., component 118 of Figure 1) are physical storage media.
  • Computer-readable media that carry computer-executable instructions are transmission media.
  • embodiments of the invention can comprise at least two distinctly different kinds of computer-readable media: physical computer-readable storage media and transmission computer-readable media.
  • Physical computer-readable storage media, which are distinct and distinguished from transmission computer-readable media, include physical and tangible hardware.
  • Examples of physical computer-readable storage media include hardware storage devices such as RAM, ROM, EEPROM, CD-ROM or other optical disk storage (such as CDs, DVDs, etc. ) , magnetic disk storage or other magnetic storage devices, or any other hardware which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer and which are distinguished from merely transitory carrier waves and other transitory media that are not configured as physical and tangible hardware.
  • a “network” (e.g., network 130 of Figure 1) is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices.
  • Transmission media can include any network links and/or data links, including transitory carrier waves, which can be used to carry desired program code means in the form of computer-executable instructions or data structures, and which can be accessed by a general purpose or special purpose computer. Combinations of the above are also included within the scope of computer-readable media.
  • program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission computer-readable media to physical computer-readable storage media (or vice versa) .
  • program code means in the form of computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC” ) , and then eventually transferred to computer system RAM and/or to less volatile computer-readable physical storage media at a computer system.
  • computer-readable physical storage media can be included in computer system components that also (or even primarily) utilize transmission media.
  • Computer-executable instructions comprise, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
  • the computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
  • the invention may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like.
  • the invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks.
  • program modules may be located in both local and remote memory storage devices.
  • the functionality described herein can be performed, at least in part, by one or more hardware logic components.
  • illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

Abstract

Systems are configured for generating, training, and utilizing TTS (text-to-speech) models configured with variance adaptor components. The variance adaptor components generate and apply implicit and explicit data for refining and improving the processing of the encoded phoneme data by the acoustic model portion of the TTS models, such that the predicted spectrograms generated by the TTS models are efficiently and accurately created for rendering by vocoders in a desired target language and a target speaker prosody style corresponding to the textual data being processed. The efficiencies and accuracies realized by the variance adaptor components can be further improved by altering the encoder and decoder conformers used by the TTS models, such as by applying the convolution processing prior to the self-attention processing in the encoding/decoding conformer stack (s).

Description

AN END-TO-END NEURAL SYSTEM FOR MULTI-SPEAKER AND MULTI-LINGUAL SPEECH SYNTHESIS TECHNICAL FIELD
The field of the invention relates to text-to-speech (TTS) models and, even more particularly, to generating, training, and utilizing models that facilitate end-to-end neural processing of multi-speaker and multi-lingual speech data.
BACKGROUND
A text-to-speech (TTS) model is a machine learned or a machine learning model that is configured to convert arbitrary text into human-sounding speech data. A TTS model usually consists of a front-end natural language processing (NLP) module, an acoustic model, and a vocoder. The front-end NLP model is typically configured to do text normalization (e.g., convert a unit symbol into readable words) and to convert the text into corresponding phonemes and phoneme sequences.
Conventional acoustic models are configured to convert input text (or the converted phonemes) into a spectrum sequence, such as a Mel spectrogram sequence, while the vocoder is configured to convert the spectrum sequence into speech waveform data. Many acoustic models are also configured and trained to specify the particular manner in which the text will be uttered (e.g., in what prosody, timbre, etc. ) to be used by the vocoder to render the speech data.
Prosody typically refers to the patterns of rhythm and sound and/or intonation used by a speaker when speaking, while timbre (i.e., tone quality) typically refers to the character or quality of a musical sound or a speaker’s voice. Notably, the characteristics associated with timbre can sometimes be viewed as distinct from other additional attributes associated with the presentation of speech data, such as pitch and intensity. However, for  simplicity, herein, the term prosody should be broadly interpreted to refer to any speech attributes (other than language) , including, but not limited to timbre and the additional attributes mentioned above (e.g., pitch and intensity) .
Source acoustic models are often configured as multi-speaker models trained on multi-speaker data and, in some cases, for multiple languages, which may also include different dialects of a same language.
In some instances, the source acoustic models can also be further refined or adapted using target speaker data. For instance, some acoustic models are speaker dependent, meaning that they are either directly trained on speaker data from a particular target speaker, or that they are refined by using speaker data from a particular target speaker. There are various techniques for personalizing an acoustic model for a particular target speaker.
An acoustic model, if well trained, can convert any text into speech that closely mimics the prosody of a target speaker, to sound similar to speech uttered by the target speaker, i.e., rendered in a same rhythm, sound, intonation, timbre, pitch, intensity and/or according to other speech attributes associated with the uttered speech of the target speaker.
Training data for TTS models usually comprises audio data obtained by recording the particular target speaker while they speak and a set of textual data sets corresponding to the audio data (i.e., the textual representation of what the target speaker is saying to produce the audio data) .
In some instances, the text used for training a TTS model is generated by a speech recognition model and/or natural language understanding model which is specifically configured to recognize and interpret speech and provide the textual representation of the words that are recognized in the audio data. In other instances, the speaker is given a pre-determined script from which to read aloud, wherein the pre-determined script and the corresponding audio data are used to train the TTS model.
Recently, there has been significant focus on training acoustic models to synthesize text into more realistic and accurate speech data. However, the synthesis quality of many models lacks the desired accuracy and can sometimes be unstable. One reason for this is that conventional TTS models typically utilize a one-to-many mapping for the different prosody/target outputs corresponding to a single text input. The one-to-many mapping can also make training of existing models relatively cumbersome and slow.
Additionally, many models are also only configured for use with vocoders that render predicted speech data at relatively low sample rates (e.g., 16 kHz or 24 kHz), rather than a preferred 40+ kHz rate. Higher sample rates, such as 44,100 Hz or 48,000 Hz, are preferred for accuracy, because the full spectrum of human hearing spans roughly 20 Hz to 20 kHz, while digital audio can only represent frequencies up to one half of the underlying sample rate. Accordingly, it would be desirable to configure and train TTS models to operate with vocoders to generate output in the higher sample-rate ranges.
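The sample-rate reasoning above follows from the Nyquist relationship, illustrated in the minimal sketch below; the specific rates shown are common audio sample rates chosen for illustration, not values mandated by this disclosure.

```python
# Illustrative Nyquist arithmetic: digital audio can only represent frequencies
# up to one half of its sample rate.
for sample_rate_hz in (16_000, 24_000, 44_100, 48_000):
    nyquist_hz = sample_rate_hz / 2
    covers_hearing = nyquist_hz >= 20_000  # human hearing tops out near 20 kHz
    print(f"{sample_rate_hz} Hz -> representable up to {nyquist_hz:.0f} Hz, "
          f"covers 20 Hz - 20 kHz range: {covers_hearing}")
```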
However, conventional TTS models are typically not trained to utilize or generate speech data at the higher sample rates mentioned (e.g., 40+ kHz), because it has been found that the higher rates require processing of longer structures within the models (e.g., long Mel spectrogram and wave sample sequences), which can make it difficult to disambiguate between the different multi-modal outputs.
In view of the foregoing, and notwithstanding any improvements that have thus far been made to existing TTS model technologies, there is still an ongoing need for additional improvements to TTS systems and methods for generating improved TTS models, and for acoustic models that are configured to select and apply unique combinations of targeted training data for training the TTS and acoustic models to produce speech data in an efficient and accurate manner, and particularly for accommodating the synthesis of speech data in a variety of target prosody styles and a variety of target speaker languages which can be rendered by vocoders within the higher sample rate ranges.
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.
BRIEF SUMMARY
Disclosed embodiments are directed towards embodiments for generating, training, and utilizing text-to-speech (TTS) models. In some instances, the disclosed embodiments include generating and utilizing a variance adapter incorporated into new and unique TTS models for facilitating improved accuracy and efficiency in training the TTS models and for utilizing the TTS models to perform language synthesis in a variety of languages and prosody styles.
Some embodiments include methods and systems for obtaining, configuring and/or otherwise utilizing a text-to-speech (TTS) model that is configured for generating speech data from textual data in a target speaker language and a target speaker prosody style.
The disclosed embodiments include a system obtaining an encoder conformer that is configured to generate encoded phoneme data from received embedded phoneme data, the embedded phoneme data being based on the textual data, the encoder conformer including at least a convolution module and a self-attention module, the encoder conformer being further configured to process the embedded phoneme data with the convolution module prior to processing the embedded phoneme data with the self-attention module.
The disclosed embodiments also include a system obtaining a variance adapter trained to process the encoded phoneme data to generate corresponding variation information data and that is based at least in part on implicit data and explicit data corresponding to a target spectrogram associated with the target speaker language and the target speaker prosody style.
Disclosed embodiments also include a system obtaining a decoder conformer that is configured to generate a predicted spectrogram of speech data based on the  corresponding variation information data and which corresponds directly to the textual data used to generate the embedded phoneme data, the predicted spectrogram of speech data being structured for rendering by a vocoder as speech data in the target speaker language and the target speaker prosody style.
Disclosed embodiments also include a system configuring the TTS model to perform TTS processing with (i) the encoder conformer, (ii) the variance adaptor , (iii) the decoder conformer, and (iv) computer-executable instructions which are executable by one or more hardware processors for causing: the encoder conformer to generate encoded phoneme data from the received embedded phoneme data and to provide the encoded phoneme data to the variance adaptor; the variance adaptor to process the encoded phoneme data with the implicit data and explicit data corresponding to the target spectrogram and to generate the corresponding variation information based on the target spectrogram; the variance adaptor to provide the corresponding variation information to the decoder conformer; and for the decoder conformer to generate the predicted spectrogram of speech data based on the refined encoded phoneme data.
Other disclosed embodiments include systems training the TTS models for generating speech data from textual data in a target speaker language and a target speaker prosody style by at least: identifying training data comprising textual data for which the TTS model is to be trained to generate corresponding speech data for, as well as target spectrogram training data that is generated from spoken utterances of the textual data.
These other embodiments include the systems training a variance adaptor of the TTS model with the training data to generate variation information based on the training data and at least in part on the target spectrogram, as well as for configuring and training the variance adaptor to generate the variation information data by receiving and processing  encoded phoneme data from an encoder conformer that creates the encoded phoneme data from phonemes obtained from the training data and for generating the variation information data based at least in part on implicit data comprising duration data and pitch features, as well as explicit data comprising prosody features and global tokens.
The disclosed embodiments also include configuring and/or training the variance adaptor to provide the variation information data to a decoder conformer that generates a predicted spectrogram that is structured to be rendered by a vocoder as the speech data in the target speaker language and the target speaker prosody style.
Some embodiments also include causing a vocoder to render the speech data (received at the vocoder as the predicted spectrogram) at a relatively higher sample rate (e.g., 40+ kHz) than the relatively lower sample rate of the predicted spectrogram (e.g., 16 kHz or 24 kHz) and to thereby cause the speech data to be output or rendered at a higher sample rate than the TTS model is configured to generate by itself without the vocoder.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify all of the key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Additional features and advantages will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the teachings herein. Features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. Features of the present invention will become more fully apparent from  the following description and appended claims or may be learned by the practice of the invention as set forth hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to describe the manner in which the above-recited and other advantages and features can be obtained, a more particular description of the subject matter briefly described above will be rendered by reference to specific embodiments which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments and are not therefore to be considered to be limiting in scope, embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
Figure 1 illustrates a computing environment in which a computing system incorporates and/or is utilized to perform disclosed aspects of the disclosed embodiments. The illustrated computing system is configured for facilitating TTS functionality and includes hardware storage device (s) and a plurality of machine learning engines. The computing system is in communication with remote/third party system (s) .
Figure 2A illustrates one embodiment of system model components and flow diagram components corresponding to a TTS model with a multi-speaker and multi-lingual variance adapter.
Figure 2B illustrates another embodiment of system model components and flow diagram components corresponding to a TTS model with a multi-speaker and multi-lingual variance adaptor and which includes implicit and explicit information being provided to the acoustic model component of the TTS model.
Figure 2C illustrates one embodiment of system model components and flow diagram components corresponding to a TTS model with a multi-speaker and multi-lingual variance adapter and which includes model components for generating the various implicit and explicit information referenced in Figure 2B.
Figure 3 illustrates an embodiment of a TTS model having a multi-speaker and multi-lingual variance adapter.
Figure 4 illustrates an embodiment of a diagram of an improved conformer block to be used with a TTS model having a multi-speaker and multi-lingual variance adapter.
Figure 5 illustrates one embodiment of a flow diagram having a plurality of acts associated with methods for generating and training a TTS model with a multi-speaker and multi-lingual variance adapter and which can be refined for target speakers and languages.
Figure 6 illustrates one embodiment of a flow diagram having a plurality of acts associated with methods for synthesizing speech from text with a TTS model with a multi-speaker and multi-lingual variance adapter.
DETAILED DESCRIPTION
Disclosed embodiments include systems, methods and devices configured for generating, training, and utilizing TTS (text-to-speech) models, including some models configured with multi-lingual and multi-speaker variance adaptor components that generate and apply implicit and explicit data associated with target spectrograms corresponding to target speaker prosody styles and/or target languages.
There are many technical benefits associated with the disclosed embodiments. For example, by applying additional implicit and explicit data, it is possible to further refine and improve the accuracy in processing of the encoded phoneme data with the acoustic model portion of the TTS models and such that the predicted spectrograms generated by the TTS models can be more efficiently and accurately created, as compared to some existing systems.
The corresponding benefit of such improvements includes the ability to train TTS models for greater varieties of speaker prosodies and languages, as well as for facilitating the rendering of speech data in a desired target language and a target speaker prosody style corresponding to the textual data being processed.
The foregoing efficiencies and accuracies realized by the variance adapter components can also be further augmented by using convolution layers/processing instead of linear layers for the encoder/decoders and by altering the processing ordering of encoding and decoding conformer layers, relative to existing systems. For instance, it has been observed by the inventors that by applying convolution processing prior to self-attention processing in the encoding/decoding conformer stack (s) of the TTS models, it is possible to achieve improved prosody and audio fidelity when compared to conventional systems that do not incorporate the unique combination of conformer and acoustic model  features described herein, and even while processing training data, utilizing target spectrogram data, and while generating predicted spectrogram outputs that are caused to be processed and output at relatively higher sample rates by the corresponding vocoders (e.g., 40+ kHz) .
In some experiments, for instance, the inventors were able to observe improvements in naturalness, similarity, and performance of speech synthesis with disclosed TTS models, relative to baseline/conventional models, as scaled with the standardized MOS, SMOS and WER speech synthesis measurement scales. In some instances, for example, the inventors observed more than a 5% improvement in the measured MOS scoring accuracy of synthesizing speech from textual data into target speaking prosody styles and languages with the disclosed inventive multi-lingual and multi-speaker TTS models, as compared to corresponding use of conventional baseline multi-lingual and multi-speaker models that do not include the uniquely structured encoder/decoder conformers and variance adaptor components described herein for applying unique sets of implicit and explicit data and that fail to cause the vocoders to output/render the speech output at high sample rates (e.g., 40+ kHz).
Attention will now be directed to Figure 1, which illustrates components of a computing system 110 which may include and/or be used to implement aspects of the disclosed invention.
The computing system 110 is illustrated as being incorporated within a broader computing environment 100 that also includes one or more remote system (s) 120 communicatively connected to the computing system 110 through a network 130 (e.g., the Internet, cloud, or other network connection (s) ) . The remote system (s) 120 comprise one or more processor (s) 122 and one or more computer-executable instruction (s) stored in  corresponding hardware storage device (s) 124, for facilitating processing/functionality at the remote system (s) , such as when the computing system 110 is distributed to include remote system (s) 120.
The computing system 110, as described herein, incorporates and/or utilizes various components that enable the disclosed TTS functionality for obtaining, training, and utilizing the disclosed TTS models for performing TTS speech processing. The TTS functionality processing performed by and/or incorporated into the computing system 110 and the corresponding computing system components includes, but is not limited to, identification of training data and the identification of textual data to be converted into speech data, parsing and splitting of textual data into utterance segments, words, phonemes and other language elements, normalization of the textual utterances, phonemes and other speech elements, identification and sequencing of phonemes and other language elements, encoding and decoding of phoneme and other language elements, NLP (natural language processing) , audio generation and speech recognition of speech utterances and other language elements, conversion of phonemes and other language elements into spectrum space, such as mel spectrograms (e.g., the disclosed target and predicted spectrograms) , natural language understanding, language translation, language prosody transforming, target prosody and target language identification, vocoder interfacing and processing such as, but not limited to, conversion of spectrum sequences into waveform data, re-sampling of audio, up-sampling of audio, and/or any other processing required to perform the text-to-speech functionalities described herein.
The disclosed TTS functionality also includes generating and/or obtaining training data and training the disclosed models to generate spectrograms configured for and/or structured for rendering speech data by a vocoder in a variety of different target speaker  prosody styles and target speaker languages. The training functionality also includes training the disclosed individual variance adaptor components/models of the disclosed TTS models specifically configured to generate, obtain and/or apply the disclosed implicit and explicit prosody features and other speech attributes into the speech synthesis processing performed by the TTS models and, even more specifically, the variance adapter, as will be described in more detail throughout this disclosure.
Figure 1 illustrates the computing system 110 and some of the referenced components that incorporate and/or that enable the disclosed TTS functionalities. For example, the computing system 110, is shown to include one or more processor (s) 112 (such as one or more hardware processor (s) ) and a storage 140 (i.e., hardware storage device (s) ) storing computer-executable instructions 118 wherein the storage 140 is able to house any number of data types and any number of computer-executable instructions 118 by which the computing system 110 is configured to implement one or more aspects of the disclosed embodiments when the computer-executable instructions 118 are executed by the one or more processor (s) 112. The computing system 110 is also shown to include one or more user interface (s) 114 and input/output (I/O) device (s) 116.
The one or more user interface (s) 114 and input/output (I/O) device (s) 116 include, but are not limited to, speakers, microphones, vocoders, display devices, browsers and application displays and controls for receiving and displaying/rendering user inputs. These user inputs include inputs for identifying and/or generating the referenced training data, inputs for identifying textual data to be processed, and inputs for identifying target prosody speaking styles and/or target languages for which speech data is to be synthesized based on the textual data. The user inputs can be entered in various formats (audio commands or other audio formats, typed characters or other textual formats, gestures or other visual formats) through any combination of the referenced interfaces.
Various interface menus with selectable control elements, stand-alone control objects and other control features are also provided within the interface (s) 114 (not presently shown) for receiving user input that is operable, when received, to trigger access to the referenced models and model components, as well as for triggering controlled execution of instructions for implementing any of the disclosed functionality by the disclosed models and model components.
The storage 140 is shown as a single storage unit. However, it will be appreciated that the storage 140 is, in some embodiments, a distributed storage that is distributed to several separate and sometimes remote systems 120. In this regard, it will be appreciated that the system 110 will comprise a distributed system, in some embodiments, with one or more of the system 110 components being maintained/run by different discrete systems that are remote from each other and that each perform different tasks. In some instances, a plurality of distributed systems performs similar and/or shared tasks for implementing the disclosed functionality, such as in a distributed cloud environment.
In some embodiments, storage 140 is configured to store one or more of the following: training data 142, target prosody and language profiles 144 and target and predicted spectrogram data 146 (which includes the spectrograms and information used for generating the spectrograms, as well as up-sampled versions of the spectrograms for rendering and up-sampling of the predicted spectrograms, for example, by a corresponding vocoder) . It is noted that the target prosody and language profiles 144 will include a plurality of different profiles for different speakers having different prosody styles and profiles corresponding to different languages. These language profiles can also include  lexicons for each language/dialect that is profiled. The target prosody and language profiles 144 correspond to training data, in some instances, which is used to refine the disclosed models to convert detected text into a target prosody and language associated with one of the stored profiles.
In some instances, the storage 140 includes computer-executable instructions 118 for instantiating or executing one or more of the models and/or engines shown in computing system 110. In some instances, the one or more models are configured as machine learning models or machine learned models. In some instances, the one or more models are configured as a pipeline of different models and/or corresponding algorithms that are configured/trained to collectively generate predicted spectrograms corresponding to identified text in a target prosody and language.
In some instances, one or more of the disclosed models are configured as engines or discrete processing systems (e.g., computing systems integrated within computing system 110), wherein each engine (i.e., model) comprises one or more processors (e.g., hardware processor (s) 112) and corresponding computer-executable instructions 118 for implementing the disclosed functionality associated with the associated model (s).
In some embodiments, target speaker prosody and language profiles 144 comprises electronic content/data obtained from a target speaker in the form of audio data, text data and/or visual data. Additionally, or alternatively, in some embodiments, the target speaker prosody and language profiles 144 comprise metadata (i.e., speech prosody and other speech attributes, speaker identifiers, etc. ) corresponding to the particular speaker from which the data is collected. In some embodiments, the metadata comprises attributes associated with the identity of the speaker, characteristics of the speaker and/or the  speaker’s voice and/or information about where, when and/or how the speaker data is obtained.
In some embodiments, the speaker prosody and language profiles 144 includes raw data (e.g., direct recordings) . Additionally, or alternatively, in some embodiments, the speaker prosody and language profiles 144 comprise processed data (e.g., waveform format of the speaker data and/or PPG data (e.g., posteriorgram data) corresponding to a target speaker) .
In some embodiments, the frame length for each piece of phonetic information comprises whole phrases of speech, whole words of speech, particular phonemes of speech and/or a pre-determined time duration. In some examples, the frame comprises a time duration selected between 1 millisecond and 10 seconds, or more preferably between 1 millisecond and 1 second, or even more preferably between 1 millisecond and 50 milliseconds, or yet even more preferably, a duration of approximately 12.5 milliseconds.
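As a minimal sketch of how such frame durations map to sample counts, the conversion below uses the 12.5 millisecond example duration and sample rates assumed from the rates discussed elsewhere in this disclosure.

```python
# Illustrative conversion of a frame duration to a hop size in samples;
# the 12.5 ms duration and the sample rates are example values only.
def frame_ms_to_samples(frame_ms: float, sample_rate_hz: int) -> int:
    return round(frame_ms * sample_rate_hz / 1000)

print(frame_ms_to_samples(12.5, 16_000))  # 200 samples per frame at 16 kHz
print(frame_ms_to_samples(12.5, 48_000))  # 600 samples per frame at 48 kHz
```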
In some embodiments, the prosody attributes extracted from the mel spectrum are included in the prosody feature data used by the fine-grained prosody module 316 (described in reference to Figure 3). Additionally, or alternatively, the prosody feature data comprises additional prosody features or prosody attributes. For example, in some instances, the additional prosody features comprise attributes corresponding to the pitch and/or energy contours of the speech waveform data. The storage device (s) 140 can also store the pitch and intensity data in one or more storage containers.
In some embodiments, the spectrogram data comprises a plurality of spectrograms. Typically, spectrograms are a visual representation of the spectrum of frequencies of a signal as it varies with time (e.g., the spectrum of frequencies that make up the speaker data) . In some instances, spectrograms are sometimes called sonographs,  voiceprints or voicegrams. In some embodiments, the spectrograms included in the spectrogram data are characterized by the prosody style of a target speaker and in a target language.
In some embodiments, the spectrograms are converted to the mel-scale. The mel-scale is a non-linear scale of pitches determined by listeners to be equidistant from each other, and more closely mimics human response/human recognition of sound versus a linear scale of frequencies. In such embodiments, the spectrogram data comprises the mel-frequency cepstrum (MFC) (i.e., the representation of the short-term power spectrum of a sound, based on the linear cosine transformation of a log power spectrum on a nonlinear mel scale of frequency). Thus, mel-frequency cepstral coefficients (MFCCs) are the coefficients that comprise an MFC. For example, the frequency bands are equally spaced on the mel scale for an MFC.
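For reference, one common formulation of the hertz-to-mel conversion is sketched below; the constants follow a widely used variant of the mel scale and are assumptions for illustration rather than values fixed by this disclosure.

```python
import math

# A common (assumed) mel-scale formulation; other variants exist.
def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Filterbank band edges equally spaced on the mel scale, e.g. 80 bands up to an
# assumed 8 kHz upper frequency.
n_bands, f_max = 80, 8000.0
mel_edges = [i * hz_to_mel(f_max) / (n_bands + 1) for i in range(n_bands + 2)]
hz_edges = [mel_to_hz(m) for m in mel_edges]
```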
In some embodiments, the hardware storage device 140 stores various machine learned or learning models (ML Models 148). These models include the TTS models and model components described throughout this disclosure and any other models required to implement the functionality described herein. These models include, for example, the variance adaptor (including the corresponding pitch model, duration predictor, fine-grained prosody module, and global style token module), the encoder and decoder conformer blocks, and the spectrogram, ASR and NLP models that are described herein.
The described TTS models are trainable or trained to convert input text to speech data. For example, a portion of an email containing one or more sentences (e.g., a particular number of machine-recognizable utterances/words) is applied to the disclosed TTS model (s) to cause the TTS model (s) to recognize words or parts of words (e.g., phonemes) and to  produce a corresponding spectrogram and/or sound (with a corresponding vocoder) to the phonemes or utterances/words.
In some embodiments, the neural TTS model is adapted for a particular target speaker and language to generate speech output (e.g., audio and/or spectrogram) having prosody and other attributes of the target speaker and in the target language (which may be a same or different language than the source textual data being converted to speech data) .
Figures 2A-2B show overview examples 200A-200B of the disclosed TTS systems and TTS processing flows. For instance, the high-level overview example 200A shows how text 201 is received by an NLP Module 202 (or another one of the disclosed language processing modules/models) to parse, normalize and sequence a plurality of corresponding phonemes 203. Then, these phoneme strings are encoded and processed by the acoustic module 204 to generate synthesized speech data corresponding to the text 201, preferably by applying/considering implicit and explicit data associated with target prosody and language attributes, and the result is ultimately decoded into corresponding mel spectrograms 205 that are converted/rendered by a vocoder 206 into audible speech in the prosody and language of a target speaker.
In some instances, the vocoder up-samples the mel spectrograms (configured at a relatively low 16 kHz or 24 kHz sample rate) into a waveform output having a relatively higher 40+ kHz sample rate (e.g., 44,100 Hz or 48,000 Hz). This enables the system to leverage the benefits of the higher sample rate (s) for rendering the speech data without having to carry the computational cost and stability risks associated with processing the higher sample rates within the acoustic model.
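A minimal sketch of this overall flow is shown below; the module objects and method names are hypothetical placeholders used only to show the order of operations, not APIs defined by this disclosure.

```python
# Sketch of the flow in examples 200A/200B (all names are illustrative placeholders).
def synthesize(text, nlp_module, acoustic_module, vocoder):
    phonemes = nlp_module.normalize_and_phonemize(text)  # text 201 -> phonemes 203
    mel = acoustic_module.predict_mel(phonemes)          # acoustic module 204 -> mel spectrograms 205
    waveform = vocoder.generate(mel)                     # vocoder 206 up-samples to a 40+ kHz waveform
    return waveform
```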
Figure 2B illustrates a related example 200B of a TTS system and flow in which it is shown how the acoustic module 204 is configured to identify and apply/process implicit information/data 210 and explicit information/data 220 to the language data being  processed by the acoustic module 204 to generate predicted spectrogram data structured to be rendered by the vocoder 206 in the target speaker prosody style and language.
Additional information regarding the target speaker prosody style and language, which the acoustic model of the TTS system/model uses as a basis for selecting the implicit and explicit data when converting source textual data into speech data in the target speaker prosody style and language, is identified/selected automatically and/or manually, based on preconfigured settings, contextual environmental conditions (e.g., detected user, geography, mood, etc.), explicit user input, and/or system settings.
In some embodiments, the referenced prosody style refers to a set, or a sub-set, of prosody attributes. In some instances, the prosody attributes correspond to a particular speaker (e.g., a target speaker or a source speaker) . In some instances, a particular prosody style is assigned an identifier, for example, a name identifier. For example, the prosody styles are associated with a name identifier that identifies the speaker from which the prosody style is generated/obtained. In some examples, the prosody styles comprise descriptive identifiers, such as story-telling style (e.g., a speaking manner typically employed when reading a novel aloud or relating a story as part of a speech or conversation) , newscaster style (e.g., a speaking manner typically employed by a newscaster, in delivering news in a factual, unemotional, direct style) , presentation style (e.g., a formal speaking style typically employed when a person is giving a presentation) , conversational style (e.g., a colloquial speaking style typically employed by a person when speaking to a friend or relative) , etc. Additional styles include, but are not limited to a serious style, a casual style, and a customer service style. It will be appreciated that any other type of speaking style,  besides those listed, can also be used for training an acoustic model with corresponding training data of said style (s) .
In some embodiments, the prosody styles are attributed to typical human-expressed emotions such as a happy emotion, a sad emotion, an excited emotion, a nervous emotion, or other emotion. Oftentimes, a particular speaker is feeling a particular emotion and thus the way the speaker talks is affected by the particular emotion in ways that would indicate to a listener that the speaker is feeling such an emotion. As an example, a speaker who is feeling angry may speak in a highly energized manner, at a loud volume, and/or in truncated speech. In some embodiments, a speaker may wish to convey a particular emotion to an audience, wherein the speaker will consciously choose to speak in a certain manner. For example, a speaker may wish to instill a sense of awe into an audience and will speak in a hushed, reverent tone with slower, smoother speech. It should be appreciated that in some embodiments, the prosody styles are not further categorized or defined by descriptive identifiers.
TTS systems/models described herein are configured/trained with training data 142 to generate speech (in any combination of languages and/or prosody styles) from arbitrary text based on training data that is specific to and that corresponds to the aforementioned target speaker prosody styles and language profiles 144.
Additional components of the TTS systems for enabling the disclosed processing are described and shown in more detail in the example 200C of Figure 2C. In this example, internal tools are used to extract a target spectrogram from audio sample 232 provided to the system (e.g., by a target speaker) for training the model (s) to perform the TTS processing described herein and/or that are automatically generated in a target prosody style and language from arbitrary textual data when processing speech output for the  arbitrary textual data. The internal tools can include an ASR model, for example, or other programming modules for performing the described functionality.
The target spectrogram is then used by or applied to a fine-grained prosody module 270 and a global style token module 280 to generate the implicit information 210 or data comprising phoneme-level prosody vectors and utterance level global tokens, respectively.
Similarly, explicit information 220 or data comprising target pitch, speaker ID, language ID and/or duration for the audio sample 232 and/or a target speaker prosody style and/or target language are also obtained using different configurable modules/models. For instance, the duration predictor 260 is used to predict durations of different utterances and/or phonemes for speech data associated with the target speech prosody style and/or language. The pitch module 250 is used to process target pitch information from a target pitch that is obtained by various modules/tools of the system based on the audio sample 232 and/or a target speaker prosody style and/or target language. The aforementioned prosody and language profiles 144 are also usable to help identify and provide information used by the various modules to generate the implicit and/or explicit information (210, 220).
In some instances, the implicit and/or explicit information is provided by the modules and/or processed by the various modules as feature sets or vectors of attributes associated with the implicit and/or explicit information, which are used to influence the processing of the encoded phoneme sequences and to facilitate the acoustic model in generating predicted spectrogram data (e.g., mel spectrograms) for the textual data to be converted into speech data that conforms to the desired target speaker prosody styles and/or target languages.
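For illustration, the two groups of conditioning signals described above could be organized as follows; the field names and tensor shapes are assumptions for this sketch, not identifiers defined by this disclosure.

```python
from dataclasses import dataclass
import torch

@dataclass
class ExplicitInfo:                      # explicit information 220
    speaker_id: torch.Tensor             # [batch]
    language_id: torch.Tensor            # [batch]
    pitch: torch.Tensor                  # [batch, n_phonemes] phone-level pitch
    duration: torch.Tensor               # [batch, n_phonemes] frames per phoneme

@dataclass
class ImplicitInfo:                      # implicit information 210
    global_style_tokens: torch.Tensor    # [batch, gst_dim] utterance-level embedding
    prosody_vectors: torch.Tensor        # [batch, n_phonemes, prosody_dim] phoneme-level vectors
```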
Attention is now directed to Figure 3, which illustrates an example 300 of a TTS system that includes a variance adaptor 310 that is configured to apply implicit and explicit data associated with a target speaker prosody style and/or language. As suggested in Figure 2C, the implicit and explicit data is obtained by the variance adaptor 310 by using various independent modules/models, including the duration predictor 312, the pitch module 314, the fine-grained prosody module 316 and the global style token module 318, which may utilize target spectrogram data 320 associated with or generated using the target speaker prosody style and/or language profile data.
As further illustrated, some additional explicit data comprising language embeddings 324 and speaker embeddings 328 is applied in some instances, which corresponds to the identified language ID 322 and speaker ID 326 (e.g., which can be obtained from the aforementioned speaker profiles) , to further improve the efficiency and accuracy of the TTS models that use the additional data to render TTS speech data in a target speaker prosody style and language.
The illustrated example 300 also shows how phoneme data 302 is processed to generate token embeddings for or by encoder conformer blocks 306 prior to being processed by the variance adaptor 310.
Additional length regulator 330 data and/or modules can be used by the decoder conformer blocks 340, while decoding the output from the variance adaptor 310, to identify lengths of the speech data utterances to be converted into the predicted spectrogram 360 that is structured to be rendered by a vocoder in the target speaking prosody style and language.
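A simplified sketch of how such a variance adaptor could be assembled is given below. The layer sizes, the sub-module internals, and the additive way the conditioning signals are combined are assumptions made for illustration; they are not the exact configuration claimed in this disclosure.

```python
import torch
import torch.nn as nn

class VarianceAdaptorSketch(nn.Module):
    """Illustrative sketch of variance adaptor 310; sizes and internals are assumed."""
    def __init__(self, hidden: int = 256, n_speakers: int = 100, n_languages: int = 10,
                 prosody_dim: int = 16):
        super().__init__()
        self.duration_predictor = nn.Sequential(          # duration predictor 312
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1))
        self.pitch_embedding = nn.Linear(1, hidden)       # pitch module 314 (phone-level pitch)
        self.prosody_projection = nn.Linear(prosody_dim, hidden)    # fine-grained prosody 316
        self.gst_projection = nn.Linear(prosody_dim, hidden)        # global style tokens 318
        self.speaker_embedding = nn.Embedding(n_speakers, hidden)   # speaker embeddings 328
        self.language_embedding = nn.Embedding(n_languages, hidden) # language embeddings 324

    def forward(self, encoded_phonemes, pitch, prosody_vectors, gst, speaker_id, language_id):
        # encoded_phonemes: [batch, n_phonemes, hidden] from encoder conformer blocks 306
        x = encoded_phonemes
        x = x + self.pitch_embedding(pitch.unsqueeze(-1))
        x = x + self.prosody_projection(prosody_vectors)
        x = x + self.gst_projection(gst).unsqueeze(1)
        x = x + self.speaker_embedding(speaker_id).unsqueeze(1)
        x = x + self.language_embedding(language_id).unsqueeze(1)
        log_durations = self.duration_predictor(x.transpose(1, 2)).squeeze(1)
        # variation information for the decoder, plus durations for the length regulator 330
        return x, log_durations
```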
Additional details regarding the foregoing components will be provided with regard to the following Figures, including Figure 4, which shows a more detailed view of the  encoder/decoder conformer block structure 400, as well as Figures 5 and 6, which show flow diagrams of acts associated with methods for configuring and/or utilizing the disclosed TTS systems/models (500) and for training (600) the disclosed TTS systems/models.
As illustrated in Figure 4, the disclosed TTS systems/models utilize uniquely configured encoders/decoders that include stacks of self-attention (420) , depth-wise convolution (430) and convolution feed forward layers (410 and 440) .
Notably, the ordering of the encoder/decoder blocks or layers is different from that used by conventional encoders and decoders. In particular, the current encoder/decoder conformer blocks are sequenced in the processing pipelines to perform depth-wise convolution processing on the encoded data with the convolution module (430) prior to the application of self-attention processing to the encoded data. It will be appreciated that convolution processing and self-attention processing in encoders and decoders for TTS models are known to those of skill in the art and will not be described in detail at this time.
It is also noted that the current encoder/decoder conformer blocks utilize a plurality of feed-forward modules/layers for performing additional convolution on the encoded data, rather than using the linear processing layers that are used by conventional systems. These modifications and this combination of self-attention and convolution enable the models/systems to use self-attention to learn the global interactions while the convolution processing efficiently captures the local correlations. The end result of using these unique structures, as observed, is that they enable the current models to outperform conventional TTS models in terms of improved prosody and audio fidelity relative to the conventional systems.
It is noted that the referenced global and local interactions from the implicit information are especially important for the TTS models, relative to other types of language processing, considering the TTS output sequences are longer than those used by machine translation and speech recognition models. Additional modifications to the model processing layers include, in some instances, replacing swish processing layers/modules (known to those of skill in the art for TTS models) with ReLU processing layers/modules (also known to those of skill in the art) to facilitate better generalization, particularly for long sentences.
It is noted that the improved conformer block structure shown in Figure 4 is composed of only four modules/layers/blocks that are stacked together in the processing pipeline, i.e., a convolutional feed-forward module, a light-weight convolution module, a self-attention module, and a second convolutional feed-forward module at the end. Both the text encoder (encoder conformer blocks 306) and the mel decoder (decoder conformer blocks 340) are composed with this structure and order of sequenced blocks. Additionally, in some instances, one or more of the text encoder (encoder conformer blocks 306) and the mel decoder (decoder conformer blocks 340) are composed of multiple sets of the sequenced conformer blocks (410, 420, 430, 440), such as two, three, four, five, six, or more sets of the sequenced conformer blocks in a single processing pipeline.
Additionally, in some instances, the output of every conformer layer is separately projected to an 80-bin mel spectrogram, so that an iterative loss can be applied between each predicted mel spectrogram and the target mel spectrogram.
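A minimal sketch of the described block ordering (convolutional feed-forward, depth-wise convolution, self-attention, second convolutional feed-forward) is given below; the kernel sizes, head count, and residual/normalization placement are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Sketch of the modified conformer ordering in Figure 4; hyperparameters are assumed."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel_size: int = 7):
        super().__init__()
        self.ff1 = nn.Sequential(                        # first convolutional feed-forward 410
            nn.Conv1d(d_model, 4 * d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(4 * d_model, d_model, kernel_size=3, padding=1))
        self.depthwise_conv = nn.Conv1d(                 # light-weight convolution module 430
            d_model, d_model, kernel_size, padding=kernel_size // 2, groups=d_model)
        self.self_attention = nn.MultiheadAttention(     # self-attention module 420
            d_model, n_heads, batch_first=True)
        self.ff2 = nn.Sequential(                        # second convolutional feed-forward 440
            nn.Conv1d(d_model, 4 * d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(4 * d_model, d_model, kernel_size=3, padding=1))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                                # x: [batch, time, d_model]
        x = x + self.ff1(x.transpose(1, 2)).transpose(1, 2)
        x = x + self.depthwise_conv(x.transpose(1, 2)).transpose(1, 2)  # convolution first
        attn_out, _ = self.self_attention(x, x, x)                      # then self-attention
        x = x + attn_out
        x = x + self.ff2(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x)
```

Under the iterative-loss training described above, the output of each such block could additionally be projected to an 80-bin mel spectrogram and compared against the target mel spectrogram.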
A brief review and additional overview of the functionality and operation of the disclosed TTS models/systems that are experienced during runtime implementation and training of the TTS models/systems will now be provided. Then, a more detailed review of key features and functionality will be provided with specific reference to the flow diagrams  500 and 600 that are shown in Figures 5 and 6, respectively, for utilizing and training the TTS models/systems.
As described, the disclosed TTS models/systems model speech conditions from both explicit and implicit perspectives to facilitate handling of the variety of information conditions associated with different input textual data and the target speaker prosody styles and languages to use for the predicted/generated output speech data. Specifically, during training, explicit information like speaker ID, language ID, pitch, and duration is used along with implicit information such as utterance-level global style tokens and phoneme-level fine-grained prosody vectors extracted from the referenced target mel spectrums associated with the target prosody styles/languages. Such processes and information can be used to fine-tune the pre-trained multi-speaker and multi-lingual model with data of one or more different target speakers and for one or more different target languages.
The systems also predict explicit and implicit information at different granularities with separate predictors. Then, by unifying explicit and implicit information of different granularities, it is possible to further facilitate improvements in the naturalness, accuracy, and expressiveness of the generated speech data. Additionally, to better trade off training/decoding efficiency, modelling stability and voice quality, the systems are configured to predict 16 kHz mel spectrograms with the multi-speaker and multi-lingual acoustic models, and to thereafter model 48 kHz wave samples with a vocoder (e.g., a HiFiNet vocoder) having a larger receptive field size than the underlying spectrograms. In some instances, during training, the vocoder is configured/conditioned with the mel spectrograms and trained with the multi-speaker and multi-lingual internal data profiles to further improve the performance of the vocoder to render speech outputs in the target speaker prosody style (s) and language (s).
The disclosed systems configure and train the acoustic model used by the TTS model/system as an encoder-decoder based multi-speaker, multi-lingual variance adapter, with duration predictor, pitch predictor, global style token (GST) predictor and fine-grained prosody predictor.
The disclosed systems also configure and train language processing models to generate phoneme sequences from normalized text that are subsequently encoded by the uniquely configured encoder conformer blocks and then processed by the variance adaptor with implicit and explicit information associated with a target speaker prosody style and/or language.
Some embodiments include configuring and/or training the TTS model with a speech recognition model (e.g., an HMM based speech recognition model) to extract forced alignment for duration prediction of the speech outputs being generated.
In some embodiments, tools internal to the TTS system are used to extract frame-level pitch and mel spectrograms for the target prosody/language based on stored profile information. Additionally, the systems are further configured/trained to average the frame-level pitch to the phone level according to the durations extracted from forced alignment.
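A minimal sketch of averaging frame-level pitch to the phone level using forced-alignment durations might look as follows; the array names, the use of zero to mark unvoiced frames, and the NumPy implementation are assumptions for illustration.

```python
import numpy as np

def average_pitch_to_phone_level(frame_pitch: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """frame_pitch: [n_frames] pitch values; durations: [n_phones] frames per phone."""
    phone_pitch = np.zeros(len(durations), dtype=np.float32)
    start = 0
    for i, dur in enumerate(durations):
        segment = frame_pitch[start:start + dur]
        voiced = segment[segment > 0]               # ignore unvoiced frames, if marked as 0
        phone_pitch[i] = voiced.mean() if len(voiced) else 0.0
        start += dur
    return phone_pitch
```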
As discussed, the explicit information used by the system includes language ID, speaker ID, and duration. Additionally, acoustic residual information of different granularities can be used by the system to describe different aspects of the acoustic features. Accordingly, the systems are able to extract unsupervised prosody vectors directly from the mel spectrums to provide information that supervised signals lack, such that the explicit and implicit information modelling is complementary for predicting fine-grained frame-level mel spectrums.
It is noted that explicit information modeling performed by the system will be based on and/or include one or more of a language ID, a speaker ID, a pitch, and a duration for the target prosody style and/or language. It is noted that the phone-level pitch, when used, can provide for stable prediction. Implicit information modelling includes global style tokens (GST) at the utterance level and fine-grained prosody at the phone level. Pitch, duration, GST, and fine-grained prosody need prediction at inference time, while language/speaker IDs do not.
During training, the GST and fine-grained prosody vectors are extracted from the target mel spectrums, while pitch and duration use ground truth targets.
In some instances, the GST is configured with three sub-modules, namely, a reference encoder, a style token layer, and a text predictor (not shown). The reference encoder is made up of convolution and RNN layers. It takes a mel spectrogram as input, and the last GRU state serves as the global reference embedding, which is then fed as input to the style token layer. The fine-grained phoneme-level prosody vector aligns the ground truth spectrogram with the encoder outputs using attention. The disclosed systems also directly use the corresponding latent representations as phone-level vectors for training stability. In some instances, the explicit and implicit prosody vector dimension is sixteen. In other embodiments, the explicit and implicit vector dimensions are greater than or less than sixteen.
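A rough sketch of such a reference encoder and style token layer is shown below; the channel sizes, the number of style tokens, and the attention configuration are assumptions, not values fixed by this disclosure.

```python
import torch
import torch.nn as nn

class ReferenceEncoderSketch(nn.Module):
    """Sketch of a GST reference encoder (convolution + GRU) and style token layer."""
    def __init__(self, n_mels: int = 80, hidden: int = 128, n_tokens: int = 10, token_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1), nn.ReLU())
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.style_tokens = nn.Parameter(torch.randn(n_tokens, token_dim))  # style token layer
        self.attention = nn.MultiheadAttention(token_dim, num_heads=4, batch_first=True)
        self.query_proj = nn.Linear(hidden, token_dim)

    def forward(self, mel):                        # mel: [batch, frames, n_mels]
        h = self.conv(mel.transpose(1, 2)).transpose(1, 2)
        _, last_state = self.gru(h)                # last GRU state = global reference embedding
        query = self.query_proj(last_state[-1]).unsqueeze(1)       # [batch, 1, token_dim]
        tokens = self.style_tokens.unsqueeze(0).expand(mel.size(0), -1, -1)
        gst_embedding, _ = self.attention(query, tokens, tokens)   # attend over learned tokens
        return gst_embedding.squeeze(1)            # [batch, token_dim] utterance-level GST
```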
In some instances, the text predictor for explicit and implicit information contains a GRU layer followed by a bottleneck module to predict the final prosody vector. This fine-grained prosody predictor takes both the text encoder output and the GST embedding as input. With the GST's help, it is not necessary to provide an auto-regressive predictor, as is required by some conventional models, which provides for faster inference.
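A minimal sketch of that text predictor, under the assumption of a single GRU layer and a two-layer bottleneck, follows; the 16-dimensional output mirrors the example dimension mentioned above, while the remaining sizes are assumptions.

```python
import torch
import torch.nn as nn

class ProsodyTextPredictorSketch(nn.Module):
    """Sketch of a GRU + bottleneck prosody predictor conditioned on the GST embedding."""
    def __init__(self, hidden: int = 256, gst_dim: int = 256, prosody_dim: int = 16):
        super().__init__()
        self.gru = nn.GRU(hidden + gst_dim, hidden, batch_first=True)
        self.bottleneck = nn.Sequential(
            nn.Linear(hidden, hidden // 4), nn.ReLU(),
            nn.Linear(hidden // 4, prosody_dim))

    def forward(self, encoder_out, gst_embedding):
        # encoder_out: [batch, n_phonemes, hidden]; gst_embedding: [batch, gst_dim]
        gst = gst_embedding.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        h, _ = self.gru(torch.cat([encoder_out, gst], dim=-1))
        return self.bottleneck(h)                  # [batch, n_phonemes, prosody_dim]
```

Because this predictor is not auto-regressive, all phone-level prosody vectors can be produced in a single forward pass, which is consistent with the faster inference noted above.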
Attention will now be directed to Figure 5, which illustrates a flow diagram 500 that includes various acts associated with exemplary methods that can be implemented by computing systems, such as computing system 110 of Figure 1.
As shown, the flow diagram 500 includes a plurality of acts (act 510, act 520, act 530, act 540, and act 550) which are associated with various methods for obtaining, generating, configuring and/or otherwise utilizing the disclosed TTS models to perform TTS processing.
As shown, the first illustrated act (act 510) includes the system obtaining an encoder conformer that is configured to generate encoded phoneme data from received embedded phoneme data, the embedded phoneme data being based on the textual data, the encoder conformer including at least a convolution module and a self-attention module, the encoder conformer being further configured to process the embedded phoneme data with the convolution module prior to processing the embedded phoneme data with the self-attention module.
The system also obtains a variance adaptor trained to process the encoded phoneme data to generate corresponding variation information data and that is based at least in part on implicit data and explicit data corresponding to a target spectrogram associated with the target speaker language and the target speaker prosody style (act 520) .
The system also obtains a decoder conformer that is configured to generate a predicted spectrogram of speech data based on the corresponding variation information data and which corresponds directly to the textual data used to generate the embedded phoneme data, the predicted spectrogram of speech data being structured for rendering by a vocoder as speech data in the target speaker language and the target speaker prosody style (act 530) .
The system also configures the TTS model to perform TTS processing with (i) the encoder conformer, (ii) the variance adaptor, (iii) the decoder conformer, and (iv) computer-executable instructions which are executable by one or more hardware processors for causing (a) the encoder conformer to generate encoded phoneme data from the received embedded phoneme data and to provide the encoded phoneme data to the variance adaptor, (b) the variance adaptor to process the encoded phoneme data with the implicit data and explicit data corresponding to the target spectrogram and to generate the corresponding variation information data based on the target spectrogram, (c) the variance adaptor to provide the corresponding variation information data to the decoder conformer, and (d) for the decoder conformer to generate the predicted spectrogram of speech data based on the refined encoded phoneme data.
In some instances, the encoder conformer is configured as one or more sets of ordered encoder conformer blocks or stacks of blocks that collectively include, for each of the one or more sets, (i) a first convolutional feed-forward module, (ii) the convolution module, (iii) the self-attention module, and (iv) a second convolutional feed-forward module.
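A minimal sketch of such an ordered block is shown below, with residual connections around each of the four sub-modules. The kernel sizes, head count, and the simplified convolution module (a single length-preserving convolution rather than a full pointwise/depthwise stack) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvFeedForward(nn.Module):
    """Convolutional feed-forward sub-module (expansion then projection)."""
    def __init__(self, dim, kernel_size=3, expansion=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim * expansion, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(dim * expansion, dim, kernel_size, padding=kernel_size // 2))

    def forward(self, x):                              # x: (batch, time, dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

class ConformerBlock(nn.Module):
    """Ordering per the text: feed-forward -> convolution -> self-attention -> feed-forward."""
    def __init__(self, dim=256, heads=4, conv_kernel=7):
        super().__init__()
        self.ff1 = ConvFeedForward(dim)
        self.conv = nn.Conv1d(dim, dim, conv_kernel, padding=conv_kernel // 2)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff2 = ConvFeedForward(dim)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

    def forward(self, x):
        x = x + self.ff1(self.norms[0](x))                                    # (i) first feed-forward
        x = x + self.conv(self.norms[1](x).transpose(1, 2)).transpose(1, 2)   # (ii) convolution
        a = self.norms[2](x)
        x = x + self.attn(a, a, a, need_weights=False)[0]                     # (iii) self-attention
        return x + self.ff2(self.norms[3](x))                                 # (iv) second feed-forward
```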
As described, the variance adaptor includes, in some configurations, a fine-grained prosody module trained for identifying prosody features associated with corresponding speech utterances of the textual data and corresponding target spectrogram training data, and a global style token module trained for generating global tokens associated with the spoken utterances of the textual data and corresponding target spectrogram training data, wherein the referenced explicit data or information includes the prosody features and the global tokens.
The implicit data, on the other hand, includes the duration data and the pitch data, which are obtained, respectively, from a duration predictor of the variance adaptor that is configured or trained for predicting duration data comprising durations of phonemes for speech utterances identified in the textual data and corresponding target spectrogram data, and from a pitch module of the variance adaptor that is configured or trained for identifying pitch features associated with the speech utterances of the textual data and corresponding target spectrogram training data.
Language embeddings and speaker embeddings for different languages and speakers associated with the textual speech can also be included in the implicit data for generating the corresponding variation information data based on the target spectrogram for the target speaker in the target language.
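By way of illustration only, the following sketch combines these implicit and explicit inputs in one possible variance adaptor: duration and pitch predictors over the encoded phonemes, additive speaker, language, and prosody conditioning, and length regulation that expands phoneme vectors to frame-level variation information. All dimensions, the additive conditioning, and the exponential duration parameterization are assumptions; during training, ground-truth durations extracted from the target spectrogram would typically replace the predicted ones.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Shared shape assumed for the duration and pitch predictors: small conv stack + linear head."""
    def __init__(self, dim=256, hidden=256, kernel=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2), nn.ReLU())
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                          # x: (batch, phonemes, dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.out(h).squeeze(-1)             # one scalar per phoneme

class VarianceAdaptor(nn.Module):
    def __init__(self, dim=256, prosody_dim=16, n_speakers=100, n_languages=10):
        super().__init__()
        self.duration_predictor = VariancePredictor(dim)
        self.pitch_predictor = VariancePredictor(dim)
        self.pitch_embed = nn.Linear(1, dim)
        self.prosody_embed = nn.Linear(prosody_dim, dim)
        self.speaker_embed = nn.Embedding(n_speakers, dim)
        self.language_embed = nn.Embedding(n_languages, dim)

    def forward(self, encoded, prosody_vec, speaker_id, language_id):
        # encoded: (batch, phonemes, dim); prosody_vec: (batch, phonemes, prosody_dim)
        x = encoded \
            + self.speaker_embed(speaker_id).unsqueeze(1) \
            + self.language_embed(language_id).unsqueeze(1) \
            + self.prosody_embed(prosody_vec)
        log_durations = self.duration_predictor(x)
        pitch = self.pitch_predictor(x)
        x = x + self.pitch_embed(pitch.unsqueeze(-1))
        # Length-regulate: repeat each phoneme vector by its predicted duration (in frames).
        durations = torch.clamp(torch.round(torch.exp(log_durations)).long(), min=1)
        frames = [torch.repeat_interleave(x[b], durations[b], dim=0)
                  for b in range(x.size(0))]
        return nn.utils.rnn.pad_sequence(frames, batch_first=True)   # variation information data
```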
In some additional embodiments, the system provides a remote system (e.g., 120) with controlled access to the TTS model by at least providing the remote system with access to the encoder conformer, the variance adaptor, and the decoder conformer, as well as selective control for triggering execution of the computer-executable instructions and for identifying and/or providing the textual data that is processed by the TTS model. These controls can be provided with application interfaces, selectable control objects that, when selected, trigger the execution of the model (s) and/or selection of data to be processed. Other controls include input fields for receiving commands and/or explicit information that identifies the data to be processed. These controls can also be used to provide user control for selectively identifying the target speaker language and the target speaker prosody style to base the output on.
The disclosed embodiments also include acts for identifying the target speaker language and the target speaker prosody style, identifying the textual data, using the TTS model to generate the predicted spectrogram based on the textual data, as well as the target speaker language and the target speaker prosody style, and for causing a vocoder  to render the predicted spectrogram as the speech data in the target speaker language and the target speaker prosody style. In some instances, the vocoder up-samples the output to render the speech data in a higher sample rate spectrum (40+kHz) than an underlying predicted spectrogram.
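One way a vocoder can render at a higher sample rate than the analysis rate of the predicted spectrogram is to give its generator a larger total upsampling factor, as in the following sketch. The 12.5 ms frame spacing, 48 kHz output rate, and transposed-convolution stack are assumed values for illustration and are not the vocoder required by the disclosed embodiments.

```python
import torch
import torch.nn as nn

class UpsamplingVocoder(nn.Module):
    """Toy generator whose transposed convolutions yield 600 output samples per input frame."""
    def __init__(self, n_mels=80, channels=256, strides=(10, 10, 6)):
        super().__init__()
        layers, in_ch = [nn.Conv1d(n_mels, channels, 7, padding=3)], channels
        for s in strides:                     # total upsampling factor: 10 * 10 * 6 = 600
            out_ch = max(in_ch // 2, 32)
            layers += [nn.LeakyReLU(0.1),
                       nn.ConvTranspose1d(in_ch, out_ch, 2 * s, stride=s, padding=s // 2)]
            in_ch = out_ch
        layers += [nn.LeakyReLU(0.1), nn.Conv1d(in_ch, 1, 7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):                   # mel: (batch, frames, n_mels)
        return self.net(mel.transpose(1, 2)).squeeze(1)   # (batch, frames * 600) waveform

# With predicted-spectrogram frames spaced 12.5 ms apart, 600 samples per frame
# corresponds to a 48 kHz waveform, above the 40 kHz threshold mentioned above.
```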
Attention will now be directed to Figure 6, which illustrates a flow diagram 600 that includes various acts (act 610, act 620, act 630, act 640, act 650 and act 660) associated with exemplary methods that can be implemented by computing systems, such as computing system 110, for training and/or utilizing the TTS models described herein.
Act 610 includes the system identifying training data comprising textual data for which the TTS model is to be trained to generate corresponding speech data for, as well as target spectrogram training data that is generated from spoken utterances of the textual data.
Act 620 includes the system training a variance adaptor of the TTS model with the training data to generate variation information data based on the training data and at least in part on the target spectrogram.
This training includes training a duration predictor for predicting duration data comprising durations of phonemes identified in the training data for the speech utterances of the textual data and corresponding target spectrogram training data.
The training also includes training a pitch module for identifying pitch features associated with the speech utterances of the textual data and corresponding target spectrogram training data.
The training also includes training a fine-grained prosody module for identifying prosody features associated with corresponding speech utterances of the textual data and corresponding target spectrogram training data.
The training also includes training a global style token module for generating global tokens associated with the speech utterances of the textual data and corresponding target spectrogram training data.
Act 630 includes the system configuring and training the variance adaptor to generate the variation information data by receiving and processing encoded phoneme data from an encoder conformer that creates the encoded phoneme data from phonemes obtained from the training data.
Act 640 includes the system configuring and training the variance adaptor to generate the variation information data based at least in part on implicit data comprising the duration data and the pitch features, as well as explicit data comprising the prosody features and the global tokens that correspond to the training data used to generate the encoded phoneme data, the encoded phoneme data comprising phonemes that are encoded by the encoder conformer.
Act 650 includes the system further configuring and training the variance adaptor to provide the variation information data to a decoder conformer that generates a predicted spectrogram that is structured to be rendered by a vocoder as the speech data in the target speaker language and the target speaker prosody style.
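By way of illustration only, a joint training step covering acts 620 through 650 might resemble the following sketch, in which the duration and pitch predictors are supervised against values derived from the target spectrogram training data and the decoder output is compared against the target spectrogram itself. The assumed model interface (returning the predicted spectrogram, log-durations, and pitch), the batch keys, and the loss weights are illustrative assumptions rather than values required by the disclosed methods.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """Assumed batch keys: phoneme_embeddings, target_mel, gt_durations, gt_pitch,
    speaker_id, language_id. The model is assumed to return (mel, log_durations, pitch)."""
    optimizer.zero_grad()
    pred_mel, pred_log_dur, pred_pitch = model(
        batch["phoneme_embeddings"],
        speaker_id=batch["speaker_id"],
        language_id=batch["language_id"],
        target_spectrogram=batch["target_mel"])   # ground-truth durations teacher-force training

    dur_loss = F.mse_loss(pred_log_dur, torch.log(batch["gt_durations"].float() + 1.0))
    pitch_loss = F.mse_loss(pred_pitch, batch["gt_pitch"])
    mel_loss = F.l1_loss(pred_mel, batch["target_mel"])

    loss = mel_loss + 0.1 * dur_loss + 0.1 * pitch_loss
    loss.backward()
    optimizer.step()
    return {"mel": mel_loss.item(), "duration": dur_loss.item(), "pitch": pitch_loss.item()}
```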
Disclosed methods also include training the decoder conformer to generate the predicted spectrogram of speech data based on the variation information data received from the variance adaptor and by training the decoder conformer to generate the predicted spectrogram for rendering by the vocoder as the speech data in the target speaker language and the target speaker prosody style.
Disclosed methods also include training the variance adaptor to generate the variation information data based at least in part on language embeddings and speaker  embeddings corresponding to the target speaker language and the target speaker prosody style, in addition to multiple different speaker languages and speaker prosody styles.
Disclosed methods also include configuring encoder and/or decoder conformers with layers or blocks that are stacked/sequenced in a processing pipeline with a first processing layer of a first convolutional feed-forward module, next a convolution module, followed by a self-attention module, and finally by a subsequent second convolutional feed-forward module, and such that the encoder/decoder conformer is configured to process any embedded phoneme data that is received with the convolution module prior to processing the embedded phoneme data with the self-attention module while generating the encoded phoneme data.
Disclosed methods also include configuring the decoder conformer to provide the predicted spectrogram to the vocoder for rendering by the vocoder and, in some instances, at a higher sample rate than the predicted spectrogram.
Disclosed methods also include configuring the TTS model to identify user input identifying the target speaker language and the target speaker prosody style, as well as user-identified textual data to be processed by the TTS model during run-time synthesis of speech data from the user-identified textual data.
Disclosed methods also include obtaining and/or generating the target spectrogram based on the target speaker language and the target speaker prosody style, and using the TTS model during run-time, subsequent to training the TTS model, to generate the predicted spectrogram for the user-identified textual data based on phonemes obtained from the user-identified textual data and on the target spectrogram, and providing the predicted spectrogram to the vocoder for causing the vocoder to render the speech data in the target speaker language and the target speaker prosody style. In some instances, the system further identifies/receives user input specifying the target speaker language and the target speaker prosody style to use for the underlying/supplemental implicit and explicit data.
As should be apparent from the foregoing disclosure, the disclosed embodiments can be utilized to provide technical benefits over conventional TTS systems and corresponding methods for generating, training, and using TTS models for text-to-speech data generation. For example, by providing and utilizing the disclosed variance adaptor with the TTS models/systems, it is possible to improve the accuracy and stability of the TTS models trained to perform speech synthesis in a variety of target languages and prosody styles. Furthermore, by modifying the ordering of the processing layers of the encoder and decoder conformers within the TTS models, it is possible to further enhance and improve the prosody and audio fidelity, even while utilizing and synthesizing speech data at higher sample rates (e.g., 40+kHz) , particularly as compared to conventional models that do not incorporate this unique combination of features.
Embodiments of the present invention may comprise or utilize a special purpose or general-purpose computer (e.g., computing system 110) including computer hardware, as discussed in greater detail below. Embodiments within the scope of the present invention also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media (e.g., storage 140 of Figure 1) that store computer-executable instructions (e.g., component 118 of Figure 1) are physical storage media. Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the invention can  comprise at least two distinctly different kinds of computer-readable media: physical computer-readable storage media and transmission computer-readable media.
Physical computer-readable storage media, which are distinct and distinguished from transmission computer-readable media, include physical and tangible hardware. Examples of physical computer-readable storage media include hardware storage devices such as RAM, ROM, EEPROM, CD-ROM or other optical disk storage (such as CDs, DVDs, etc. ) , magnetic disk storage or other magnetic storage devices, or any other hardware which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer and which are distinguished from merely transitory carrier waves and other transitory media that are not configured as physical and tangible hardware.
A “network” (e.g., network 130 of Figure 1) is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include any network links and/or data links, including transitory carrier waves, which can be used to carry or transport desired program code means in the form of computer-executable instructions or data structures, and which can be accessed by a general purpose or special purpose computer. Combinations of the above are also included within the scope of computer-readable media.
Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission computer-readable media to physical  computer-readable storage media (or vice versa) . For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC” ) , and then eventually transferred to computer system RAM and/or to less volatile computer-readable physical storage media at a computer system. Thus, computer-readable physical storage media can be included in computer system components that also (or even primarily) utilize transmission media.
Computer-executable instructions comprise, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like. The invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed  system environment, program modules may be located in both local and remote memory storage devices.
Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs) , Application-Specific Integrated Circuits (ASICs) , Application-Specific Standard Products (ASSPs) , System-on-a-Chip systems (SOCs) , Complex Programmable Logic Devices (CPLDs) , etc.
The present invention may be embodied in other specific forms without departing from its spirit or characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. Additionally, it will be appreciated that the scope of the invention also includes combinations of the disclosed features that are not explicitly stated, but which are contemplated, and which can include any combination of the disclosed features that are not antithetical to the utility and functionality of the disclosed models and techniques for performing TTS processing.
The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (20)

  1. A method implemented by a computing system for utilizing a text-to-speech (TTS) model configured for generating speech data from textual data in a target speaker language and a target speaker prosody style, the method comprising:
    obtaining an encoder conformer that is configured to generate encoded phoneme data from received embedded phoneme data, the embedded phoneme data being based on the textual data, the encoder conformer including at least a convolution module and a self-attention module, the encoder conformer being further configured to process the embedded phoneme data by the convolution module prior to processing the embedded phoneme data with the self-attention module;
    obtaining a variance adaptor trained to process the encoded phoneme data to generate corresponding variation information data and that is based at least in part on implicit data and explicit data corresponding to a target spectrogram associated with the target speaker language and the target speaker prosody style;
    obtaining a decoder conformer that is configured to generate a predicted spectrogram of speech data based on the corresponding variation information data and which corresponds directly to the textual data used to generate the embedded phoneme data, the predicted spectrogram of speech data being structured for rendering by a vocoder as speech data in the target speaker language and the target speaker prosody style;
    configuring the TTS model to perform TTS processing with (i) the encoder conformer, (ii) the variance adaptor, (iii) the decoder conformer, and (iv) computer- executable instructions which are executable by one or more hardware processors for causing:
    the encoder conformer to generate encoded phoneme data from the received embedded phoneme data and to provide the encoded phoneme data to the variance adaptor;
    the variance adaptor to process the encoded phoneme data with the implicit data and explicit data corresponding to the target spectrogram and to generate the corresponding variation information data based on the target spectrogram;
    the variance adaptor to provide the corresponding variation information data to the decoder conformer; and
    the decoder conformer to generate the predicted spectrogram of speech data based on the refined encoded phoneme data.
  2. The method of claim 1, the encoder conformer comprising an encoder conformer block that collectively includes (i) a first convolutional feed-forward module, (ii) the convolution module, (iii) the self-attention module, and (iv) a second convolutional feed-forward module.
  3. The method of claim 2, the encoder conformer comprising a plurality of the encoder conformer blocks.
  4. The method of claim 1, the variance adaptor including:
    a fine-grained prosody module trained for identifying prosody features associated with corresponding speech utterances of the textual data and corresponding target spectrogram training data; and
    a global style token module trained for generating global tokens associated with the spoken utterances of the textual data and corresponding target spectrogram training data,
    the prosody features and the global tokens comprising the explicit data.
  5. The method of claim 1, the variance adaptor including:
    a duration predictor trained for predicting duration data comprising durations of phonemes for speech utterances identified in the textual data and corresponding target spectrogram data; and
    a pitch module trained for identifying pitch features associated with the speech utterances of the textual data and corresponding target spectrogram training data,
    the duration data and the pitch features comprising the implicit data.
  6. The method of claim 5, the variance adaptor being further trained to receive and process language embeddings and speaker embeddings for different languages and speakers associated with the textual speech, respectively, along with the encoded phoneme data to further generate the corresponding variation information data based on the target spectrogram for the target speaker in the target language, the implicit data further including the language embeddings and speaker embeddings.
  7. The method of claim 1, the method further including:
    providing a remote system with controlled access to the TTS model by at least providing the remote system with:
    access to the encoder conformer, variance adaptor, and the decoder conformer;
    selective control for triggering execution of the computer-executable instructions; and
    control for identifying and/or providing the textual data that is processed by the TTS model.
  8. The method of claim 1, the method further including:
    providing a remote system with controlled access to the TTS model by at least providing the remote system with control for identifying the target speaker language and the target speaker prosody style.
  9. The method of claim 8, wherein the method further includes:
    identifying the target speaker language and the target speaker prosody style;
    identifying the textual data;
    using the TTS model to generate the predicted spectrogram based on the textual data, as well as the target speaker language and the target speaker prosody style;
    causing a vocoder to render the predicted spectrogram as the speech data in the target speaker language and the target speaker prosody style.
  10. A method implemented by a computing system for training a text-to-speech (TTS) model configured for generating speech data from textual data in a target speaker language and a target speaker prosody style, the method comprising:
    identifying training data comprising textual data for which the TTS model is to be trained to generate corresponding speech data for, as well as target spectrogram training data that is generated from spoken utterances of the textual data;
    training a variance adaptor of the TTS model with the training data to generate variation information data based on the training data and at least in part on the target spectrogram by at least:
    (i) training a duration predictor for predicting duration data comprising durations of phonemes identified in the training data for the speech utterances of the textual data and corresponding target spectrogram training data;
    (ii) training a pitch module for identifying pitch features associated with the speech utterances of the textual data and corresponding target spectrogram training data;
    (iii) training a fine-grained prosody module for identifying prosody features associated with corresponding speech utterances of the textual data and corresponding target spectrogram training data; and
    (iv) training a global style token module for generating global tokens associated with the speech utterances of the textual data and corresponding target spectrogram training data;
    configuring and training the variance adaptor to generate the variation information data by receiving and processing encoded phoneme data from an encoder conformer that creates the encoded phoneme data from phonemes obtained from the training data;
    further configuring and training the variance adaptor to generate the variation information data based at least in part on implicit data comprising the duration data and the pitch features, as well as explicit data comprising the prosody features and the global tokens that correspond to the training data used to generate the encoded phoneme data, the encoded phoneme data comprising phonemes that are encoded by the encoder conformer; and
    further configuring and training the variance adaptor to provide the variation information data to a decoder conformer that generates a predicted spectrogram that is structured to be rendered by a vocoder as the speech data in the target speaker language and the target speaker prosody style.
  11. The method of claim 10, wherein the method further includes training the decoder conformer to generate the predicted spectrogram of speech data based on the variation information data received from the variance adaptor and by training the decoder conformer to generate the predicted spectrogram for rendering by the vocoder as the speech data in the target speaker language and the target speaker prosody style.
  12. The method of claim 11, wherein the method further includes:
    training the variance adaptor to generate the variation information data based at least in part on language embeddings and speaker embeddings corresponding to the target speaker language and the target speaker prosody style, in addition to multiple different speaker languages and speaker prosody styles.
  13. The method of claim 12, wherein the method further includes configuring the encoder conformer with:
    (i) a first convolutional feed-forward module;
    (ii) a convolution module;
    (iii) a self-attention module; and
    (iv) a second convolutional feed-forward module.
  14. The method of claim 13, wherein the method further includes:
    configuring the encoder conformer to process any embedded phoneme data that is received with the convolution module prior to processing the embedded phoneme data with the self-attention module while generating the encoded phoneme data.
  15. The method of claim 11, wherein the method further includes:
    configuring the decoder conformer to provide the predicted spectrogram to the vocoder.
  16. The method of claim 15, wherein the method further includes:
    configuring the TTS model to identify user input identifying the target speaker language and the target speaker prosody style, as well as user-identified textual data to be processed by the TTS model during run-time synthesis of speech data from the user-identified textual data.
  17. The method of claim 10, wherein the method further includes causing the TTS model to obtain the target spectrogram based on the target speaker language and the target speaker prosody style.
  18. The method of claim 17, wherein the method further includes:
    using the TTS model during run-time, subsequent to training the TTS model, to generate the predicted spectrogram for the user-identified textual data, and based on phonemes obtained from the user-identified textual data, and based on the target spectrogram which is obtained for the target speaker language and the target speaker prosody style, the predicted spectrogram being structured to be rendered by the vocoder as the speech data in the target speaker language and the target speaker prosody style; and
    providing the predicted spectrogram to the vocoder.
  19. The method of claim 18, wherein the method further includes causing the vocoder to render the predicted spectrogram as the speech data in the target speaker language and the target speaker prosody style.
  20. The method of claim 18, wherein the method further includes receiving user input identifying at least one of the target speaker language and the target speaker prosody style.
PCT/CN2021/117919 2021-09-13 2021-09-13 An end-to-end neural system for multi-speaker and multi-lingual speech synthesis WO2023035261A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180080711.2A CN116601702A (en) 2021-09-13 2021-09-13 End-to-end neural system for multi-speaker and multi-language speech synthesis
PCT/CN2021/117919 WO2023035261A1 (en) 2021-09-13 2021-09-13 An end-to-end neural system for multi-speaker and multi-lingual speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/117919 WO2023035261A1 (en) 2021-09-13 2021-09-13 An end-to-end neural system for multi-speaker and multi-lingual speech synthesis

Publications (1)

Publication Number Publication Date
WO2023035261A1 true WO2023035261A1 (en) 2023-03-16

Family

ID=78232234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/117919 WO2023035261A1 (en) 2021-09-13 2021-09-13 An end-to-end neural system for multi-speaker and multi-lingual speech synthesis

Country Status (2)

Country Link
CN (1) CN116601702A (en)
WO (1) WO2023035261A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHIEN CHUNG-MING ET AL: "Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech", ICASSP 2021 - 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 6 June 2021 (2021-06-06), pages 8588 - 8592, XP033955591, DOI: 10.1109/ICASSP39728.2021.9413880 *
NGUYEN HUU-KIM ET AL: "A Fast and Lightweight Speech Synthesis Model based on FastSpeech2", 2021 36TH INTERNATIONAL TECHNICAL CONFERENCE ON CIRCUITS/SYSTEMS, COMPUTERS AND COMMUNICATIONS (ITC-CSCC), IEEE, 27 June 2021 (2021-06-27), pages 1 - 4, XP033958225, DOI: 10.1109/ITC-CSCC52171.2021.9501479 *
YI REN ET AL: "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 October 2020 (2020-10-16), XP081787811 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116895273A (en) * 2023-09-11 2023-10-17 南京硅基智能科技有限公司 Output method and device for synthesized audio, storage medium and electronic device
CN116895273B (en) * 2023-09-11 2023-12-26 南京硅基智能科技有限公司 Output method and device for synthesized audio, storage medium and electronic device
CN117275458A (en) * 2023-11-20 2023-12-22 深圳市加推科技有限公司 Speech generation method, device and equipment for intelligent customer service and storage medium
CN117275458B (en) * 2023-11-20 2024-03-05 深圳市加推科技有限公司 Speech generation method, device and equipment for intelligent customer service and storage medium

Also Published As

Publication number Publication date
CN116601702A (en) 2023-08-15

Similar Documents

Publication Publication Date Title
US20230012984A1 (en) Generation of automated message responses
US11990118B2 (en) Text-to-speech (TTS) processing
US10140973B1 (en) Text-to-speech processing using previously speech processed data
US11361753B2 (en) System and method for cross-speaker style transfer in text-to-speech and training data generation
US8825486B2 (en) Method and apparatus for generating synthetic speech with contrastive stress
US20160379638A1 (en) Input speech quality matching
US10692484B1 (en) Text-to-speech (TTS) processing
US11763797B2 (en) Text-to-speech (TTS) processing
US20140025384A1 (en) Method and apparatus for generating synthetic speech with contrastive stress
WO2023035261A1 (en) An end-to-end neural system for multi-speaker and multi-lingual speech synthesis
US11600261B2 (en) System and method for cross-speaker style transfer in text-to-speech and training data generation
Balyan et al. Speech synthesis: a review
Dua et al. Spectral warping and data augmentation for low resource language ASR system under mismatched conditions
Mullah A comparative study of different text-to-speech synthesis techniques
Abdullaeva et al. Uzbek Speech synthesis using deep learning algorithms
Sharma et al. Polyglot speech synthesis: a review
Maciel et al. Five–framework for an integrated voice environment
Deketelaere et al. Speech Processing for Communications: what's new?
Houidhek et al. Dnn-based speech synthesis for arabic: modelling and evaluation
Karabetsos et al. HMM-based speech synthesis for the Greek language
Louw Neural speech synthesis for resource-scarce languages
Kayte Text-To-Speech Synthesis System for Marathi Language Using Concatenation Technique
Georgila 19 Speech Synthesis: State of the Art and Challenges for the Future
Kostek et al. Synthesizing medical terms–quality and naturalness of the deep Text-to-Speech algorithm
Ahmad et al. GPU Accelerated Speech Recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21794086

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180080711.2

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE