US11017761B2 - Parallel neural text-to-speech - Google Patents


Info

Publication number
US11017761B2
US11017761B2 (application US16/654,955; US201916654955A)
Authority
US
United States
Prior art keywords
decoder
block
attention
autoregressive
vocoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US16/654,955
Other versions
US20200066253A1 (en
Inventor
Kainan PENG
Wei PING
Zhao SONG
Kexin ZHAO
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu USA LLC
Original Assignee
Baidu USA LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/058,265 external-priority patent/US10796686B2/en
Priority claimed from US16/277,919 external-priority patent/US10872596B2/en
Application filed by Baidu USA LLC filed Critical Baidu USA LLC
Priority to US16/654,955 priority Critical patent/US11017761B2/en
Assigned to BAIDU USA LLC reassignment BAIDU USA LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: PENG, KAINAN, PING, Wei, SONG, Zhao, ZHAO, KEXIN
Publication of US20200066253A1 publication Critical patent/US20200066253A1/en
Priority to CN202010518795.0A priority patent/CN112669809A/en
Application granted granted Critical
Publication of US11017761B2 publication Critical patent/US11017761B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Classifications

    • G10L13/047 Architecture of speech synthesisers
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/0475 Generative networks
    • G06N3/048 Activation functions
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06N3/09 Supervised learning
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Definitions

  • the present disclosure relates generally to systems and methods for computer learning that can provide improved computer performance, features, and uses. More particularly, the present disclosure relates to systems and methods for text-to-speech through deep neural networks.
  • TTS text-to-speech
  • Traditional TTS systems are based on complex multi-stage hand-engineered pipelines. Typically, these systems first transform text into a compact audio representation, and then convert this representation into audio using an audio waveform synthesis method called a vocoder.
  • FIG. 1A depicts an autoregressive sequence-to-sequence model, according to embodiments of the present disclosure.
  • FIG. 1B depicts a non-autoregressive model, which distills the attention from a pretrained autoregressive model, according to embodiments of the present disclosure.
  • FIG. 2 graphically depicts an autoregressive architecture 200 , according to embodiments of the present disclosure.
  • FIG. 3 graphically depicts an alternative autoregressive model architecture, according to embodiments of the present disclosure.
  • FIG. 4 depicts a general overview methodology for using a text-to-speech architecture, according to embodiments of the present disclosure.
  • FIG. 5 graphically depicts a convolution block comprising a one-dimensional (1D) convolution with gated linear unit, and residual connection, according to embodiments of the present disclosure.
  • FIG. 6 graphically depicts an attention block, according to embodiments of the present disclosure.
  • FIG. 7 depicts a non-autoregressive model architecture (i.e., a ParaNet embodiment), according to embodiments of the present disclosure.
  • FIG. 8 graphically depicts a convolution block, according to embodiments of the present disclosure.
  • FIG. 9 graphically depicts an attention block, according to embodiments of the present disclosure.
  • FIG. 10 depicts a ParaNet embodiment iteratively refining the attention alignment in a layer-by-layer way, according to embodiments of the present disclosure.
  • FIG. 11 depicts a simplified block diagram of a variational autoencoder (VAE) framework, according to embodiments of the present disclosure.
  • VAE variational autoencoder
  • FIG. 12 depicts a general method for using a ParaNet embodiment for synthesizing a speech representation from input text, according to embodiments of the present disclosure.
  • FIG. 13 depicts a simplified block diagram of a computing device/information handling system, according to embodiments of the present disclosure.
  • connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.
  • a service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated.
  • a “layer” may comprise one or more operations.
  • Text-to-speech also called speech synthesis
  • speech synthesis has long been a vital tool in a variety of applications, such as human-computer interaction, virtual assistants, and content creation.
  • Traditional TTS systems are based on multi-stage hand-engineered pipelines.
  • deep neural networks based autoregressive models have attained state-of-the-art results, including high-fidelity audio synthesis, and much simpler sequence-to-sequence (seq2seq) pipelines.
  • embodiments of one of the most popular neural TTS pipelines comprise two components (embodiments of which are disclosed in U.S. patent application Ser. No. 16/058,265, filed on Aug.
  • Embodiments of the first non-autoregressive attention-based architecture for TTS which is fully convolutional and converts text to mel spectrogram.
  • the various embodiments may be referred to generally as “ParaNet.”
  • ParaNet iteratively refines the attention alignment between text and spectrogram in a layer-by-layer manner.
  • the non-autoregressive ParaNet embodiments are compared with their autoregressive counterpart in terms of speech quality, synthesis speed, and attention stability.
  • a ParaNet embodiment achieves a ~46.7 times speed-up over an autoregressive model embodiment at synthesis, while maintaining comparable speech quality using a WaveNet vocoder.
  • the non-autoregressive ParaNet embodiments produce fewer attention errors on challenging test sentences than an autoregressive model embodiment, because they do not suffer the troublesome discrepancy between teacher-forced training and autoregressive inference.
  • the first fully parallel neural TTS system was built by combining a non-autoregressive ParaNet embodiment with the inverse autoregressive flow (IAF)-based neural vocoder (e.g., ClariNet embodiments). It generates speech from text through a single feed-forward pass.
  • IAF inverse autoregressive flow
  • WaveVAE variational autoencoder
  • Section B discusses related work.
  • Embodiments of the non-autoregressive ParaNet architecture are described in Section C.
  • WaveVAE embodiments are presented in Section D.
  • Implementation details and experimental results are provided in Section E, and some conclusions are provided in Section F.
  • Neural speech synthesis has obtained state-of-the-art results and attracted considerable attention.
  • Several neural TTS systems were proposed, including: novel architectures disclosed in commonly-assigned U.S. patent application Ser. No. 15/882,926, filed on 29 Jan. 2018, entitled “SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH,” and U.S. Prov. Pat. App. No. 62/463,482, filed on 24 Feb.
  • Tacotron, Char2Wav, and embodiments of Deep Voice 3 employ a seq2seq framework with an attention mechanism, yielding a much simpler pipeline compared to the traditional multi-stage pipeline.
  • Their excellent extensibility leads to promising results for several challenging tasks, such as voice cloning. All of these state-of-the-art TTS systems are based on autoregressive models.
  • RNN-based autoregressive models such as Tacotron and WaveRNN, lack parallelism at both training and synthesis.
  • CNN-based autoregressive models such as WaveNet and embodiments of Deep Voice 3, enable parallel processing at training, but they still operate sequentially at synthesis since each output element must be generated before it can be passed in as input at the next time-step.
  • a non-autoregressive model plays a particularly important role in text-to-speech, where the output speech spectrogram can consist of hundreds of time-steps even for a short text with a few words.
  • this work is the first non-autoregressive seq2seq model for TTS and provides as much as 46.7 times speed-up at synthesis over its autoregressive counterpart.
  • Normalizing flows are a family of generative models, in which a simple initial distribution is transformed into a more complex one by applying a series of invertible transformations.
  • Inverse autoregressive flow is a special type of normalizing flow where each invertible transformation is based on an autoregressive neural network. IAF performs synthesis in parallel and can easily reuse the expressive autoregressive architecture, such as WaveNet, which leads to the state-of-the-art results for speech synthesis.
  • Likelihood evaluation in IAF is autoregressive and slow, thus previous training methods rely on probability density distillation from a pretrained autoregressive model.
  • RealNVP and Glow are different types of normalizing flows, in which both synthesis and likelihood evaluation can be performed in parallel by enforcing bipartite architecture constraints. Most recently, both methods were applied as parallel neural vocoders. These models are less expressive than autoregressive and IAF models, because half of the variables are unchanged after each transformation. As a result, these bipartite flows usually require deeper layers, larger hidden sizes, and a huge number of parameters. For example, WaveGlow has ~200M parameters, whereas WaveNet and ClariNet embodiments only use ~1.7M parameters, making the latter preferable for production deployment. In this patent document, one focus is on autoregressive and IAF-based neural vocoders.
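The bipartite constraint described above can be made concrete with a small sketch. The snippet below is an illustrative affine coupling step in the RealNVP style, not the patent's implementation; the function names and the toy scale/shift parameterization (a single matrix `w` and bias `b`) are assumptions chosen only to show why half of the variables pass through unchanged and why the transform remains exactly invertible.

```python
import numpy as np

def coupling_forward(x, w, b):
    """One affine coupling step (bipartite flow, illustrative only).
    The first half of x is copied unchanged; the second half is
    affine-transformed with a scale/shift computed from the first half."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    log_s = np.tanh(x1 @ w)            # scale, predicted from the unchanged half
    t = x1 @ w + b                     # shift, predicted from the unchanged half
    y2 = x2 * np.exp(log_s) + t
    return np.concatenate([x1, y2], axis=-1)

def coupling_inverse(y, w, b):
    """Exact inverse: recompute scale/shift from the unchanged half and undo."""
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    log_s = np.tanh(y1 @ w)
    t = y1 @ w + b
    x2 = (y2 - t) * np.exp(-log_s)
    return np.concatenate([y1, x2], axis=-1)
```

Because only half the dimensions change per step, several such steps (with the halves swapped in between) are needed to mix all variables, which is one reason bipartite flows tend to need more depth and parameters than IAF-based models.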
  • VAE Variational autoencoder
  • VAE has been applied for representation learning of natural speech for years. It models either the generative process of waveform samples or spectrograms.
  • Autoregressive or recurrent neural networks have been employed as the decoder of VAE, but they can be quite slow at synthesis.
  • the feed-forward IAF is employed as the decoder, which enables parallel waveform synthesis.
  • Embodiments of a parallel TTS system comprise two components: 1) a feed-forward text-to-spectrogram model, and 2) a parallel waveform synthesizer conditioned on spectrogram.
  • an autoregressive text-to-spectrogram model such as one derived from Deep Voice 3 is first presented.
  • ParaNet embodiments non-autoregressive text-to-spectrogram models—are presented.
  • FIG. 1A depicts an autoregressive seq2seq model, according to embodiments of the present disclosure.
  • the dashed line 145 depicts the autoregressive decoding of the mel spectrogram at inference.
  • FIG. 1B depicts a non-autoregressive model, which distills the attention from a pretrained autoregressive model, according to embodiments of the present disclosure.
  • Embodiments of the autoregressive model may be based on a Deep Voice 3 embodiment or embodiments—a fully-convolutional text-to-spectrogram model, which comprises three components:
  • Encoder 115
  • a convolutional encoder which takes text inputs and encodes them into an internal hidden representation.
  • Decoder 125
  • a causal convolutional decoder which decodes the encoder representation with an attention mechanism 120 to log-mel spectrograms 135 in an autoregressive manner, in which the output of the decoder at a timestep is used as an input for the next timestep for the decoder, with an L1 loss. It starts with 1×1 convolutions to preprocess the input log-mel spectrograms.
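The feedback loop just described, in which each predicted frame becomes the input at the next timestep, can be sketched in a few lines. This is a generic illustration of autoregressive decoding, not the patent's decoder; `step_fn` is a hypothetical stand-in for the causal convolution and attention stack.

```python
import numpy as np

def autoregressive_decode(step_fn, go_frame, num_steps):
    """Greedy autoregressive decoding loop (illustrative sketch).
    step_fn maps the previous output frame to the next one; each
    predicted frame is fed back in as the next input, so synthesis
    is inherently sequential."""
    frames = []
    prev = go_frame
    for _ in range(num_steps):
        prev = step_fn(prev)       # next frame depends on the previous output
        frames.append(prev)
    return np.stack(frames)
```

The sequential dependency in this loop is exactly what the non-autoregressive ParaNet decoder removes.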
  • a non-causal convolutional post-processing network which processes the hidden representation from the decoder using both past and future context information and predicts the log-linear spectrograms with an L1 loss. It enables bidirectional processing.
  • all these components use the same 1-D convolution with a gated linear unit.
  • a major difference between embodiments of ParaNet model and the DV3 embodiment is the decoder architecture.
  • the decoder 125 of the DV3 embodiment 100 has multiple attention-based layers, where each layer comprises a causal convolution block followed by an attention block.
  • each layer comprises a causal convolution block followed by an attention block.
  • embodiments of the autoregressive decoder herein have one attention block at its first layer. It was found that reducing the number of attention blocks did not hurt the generated speech quality in general.
  • FIG. 2 graphically depicts an example autoregressive architecture 200 , according to embodiments of the present disclosure.
  • the architecture 200 uses residual convolutional layers in an encoder 205 to encode text into per-timestep key and value vectors 220 for an attention-based decoder 230 .
  • the decoder 230 uses these to predict the mel-scale log magnitude spectrograms 242 that correspond to the output audio.
  • the dotted arrow 246 depicts the autoregressive synthesis process during inference (during training, mel-spectrogram frames from the ground truth audio corresponding to the input text are used).
  • the hidden states of the decoder 230 are then fed to a converter network 250 to predict the vocoder parameters for waveform synthesis to produce an output wave 260 .
  • the overall objective function to be optimized may be a linear combination of the losses from the decoder and the converter.
  • the decoder 210 and converter 215 are separated and multi-task training is applied, because it makes attention learning easier in practice.
  • the loss for mel-spectrogram prediction guides training of the attention mechanism, because the attention is trained with the gradients from mel-spectrogram prediction (e.g., using an L1 loss for the mel-spectrograms) besides vocoder parameter prediction.
  • trainable speaker embeddings 270 as in Deep Voice 2 embodiments may be used across encoder 205 , decoder 230 , and converter 250 .
  • FIG. 3 graphically depicts an alternative autoregressive model architecture, according to embodiments of the present disclosure.
  • the model 300 uses a deep residual convolutional network to encode text and/or phonemes into per-timestep key 320 and value 322 vectors for an attentional decoder 330 .
  • the decoder 330 uses these to predict the mel-band log magnitude spectrograms 342 that correspond to the output audio.
  • the dotted arrows 346 depict the autoregressive synthesis process during inference.
  • the hidden state of the decoder is fed to a converter network 350 to output linear spectrograms for Griffin-Lim 352 A or parameters for WORLD 352 B, which can be used to synthesize the final waveform.
  • weight normalization is applied to all convolution filters and fully connected layer weight matrices in the model. As illustrated in the embodiment depicted in FIG. 3 , WaveNet 352 does not require a separate converter as it takes as input mel-band log magnitude spectrograms.
  • FIG. 4 depicts a general overview methodology for using a text-to-speech architecture, such as depicted in FIG. 1A , FIG. 2 , or FIG. 3 , according to embodiments of the present disclosure.
  • an input text is converted ( 405 ) into trainable embedding representations using an embedding model, such as text embedding model 210 .
  • the embedding representations are converted ( 410 ) into attention key representations 220 and attention value representations 220 using an encoder network 205 , which comprises a series 214 of one or more convolution blocks 216 .
  • attention key representations 220 and attention value representations 220 are used by an attention-based decoder network, which comprises a series 234 of one or more decoder blocks 234 , in which a decoder block 234 comprises a convolution block 236 that generates a query 238 and an attention block 240 , to generate ( 415 ) low-dimensional audio representations (e.g., 242 ) of the input text.
  • the low-dimensional audio representations of the input text may undergo additional processing by a post-processing network (e.g., 250 A/ 252 A, 250 B/ 252 B, or 252 C) that predicts ( 420 ) final audio synthesis of the input text.
  • speaker embeddings 270 may be used in the process 105 , 200 , or 300 to cause the synthesized audio to exhibit one or more audio characteristics (e.g., a male voice, a female voice, a particular accent, etc.) associated with a speaker identifier or speaker embedding.
  • one or more audio characteristics e.g., a male voice, a female voice, a particular accent, etc.
  • Text preprocessing can be important for good performance. Feeding raw text (characters with spacing and punctuation) yields acceptable performance on many utterances. However, some utterances may have mispronunciations of rare words, or may yield skipped words and repeated words. In one or more embodiments, these issues may be alleviated by normalizing the input text as follows:
  • the pause durations may be obtained through either manual labeling or estimated by a text-audio aligner such as Gentle.
  • a text-audio aligner such as Gentle.
  • the single-speaker dataset was labeled by hand, and the multi-speaker datasets were annotated using Gentle.
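A minimal sketch of the kind of input normalization discussed above is given below. The exact rules here (uppercasing, dropping non-final punctuation, forcing a terminal period or question mark) are assumptions for illustration; a deployed system would also handle numbers, abbreviations, and pause markers.

```python
import re

def normalize_text(text):
    """Toy text-normalization pass (rules are illustrative assumptions):
    uppercase the text, keep only letters, spaces, and terminal marks,
    drop punctuation that is not utterance-final, and make sure the
    utterance ends with a period or question mark."""
    text = text.upper()
    text = re.sub(r"[^A-Z ?.]", "", text)   # keep letters, spaces, '.', '?'
    text = re.sub(r"[?.](?=.)", "", text)   # strip non-final punctuation
    text = text.strip()
    if not text.endswith((".", "?")):
        text += "."
    return text
```

For example, `normalize_text("Hello, world")` yields `"HELLO WORLD."`.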
  • Deployed TTS systems may, in one or more embodiments, preferably include a way to modify pronunciations to correct common mistakes (which typically involve, for example, proper nouns, foreign words, and domain-specific jargon).
  • a conventional way to do this is to maintain a dictionary to map words to their phonetic representations.
  • the model can directly convert characters (including punctuation and spacing) to acoustic features, and hence learns an implicit grapheme-to-phoneme model. This implicit conversion can be difficult to correct when the model makes mistakes.
  • phoneme-only models and/or mixed character-and-phoneme models may be trained by allowing phoneme input option explicitly.
  • these models may be identical to character-only models, except that the input layer of the encoder sometimes receives phoneme and phoneme stress embeddings instead of character embeddings.
  • a phoneme-only model requires a preprocessing step to convert words to their phoneme representations (e.g., by using an external phoneme dictionary or a separately trained grapheme-to-phoneme model). For embodiments, Carnegie Mellon University Pronouncing Dictionary, CMUDict 0.6b, was used.
  • a mixed character-and-phoneme model requires a similar preprocessing step, except for words not in the phoneme dictionary. These out-of-vocabulary/out-of-dictionary words may be input as characters, allowing the model to use its implicitly learned grapheme-to-phoneme model.
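The mixed character-and-phoneme preprocessing step can be sketched as a simple dictionary lookup with a character fallback. The function name and the toy dictionary below are assumptions; in the embodiments, CMUDict 0.6b plays the role of `phoneme_dict`.

```python
def to_mixed_representation(words, phoneme_dict):
    """Mixed character-and-phoneme preprocessing (illustrative sketch).
    In-dictionary words become phoneme sequences; out-of-dictionary
    words are passed through as characters, so the model can fall back
    on its implicitly learned grapheme-to-phoneme mapping."""
    out = []
    for word in words:
        key = word.upper()
        if key in phoneme_dict:
            out.append(phoneme_dict[key])   # list of phonemes
        else:
            out.append(list(word))          # character fallback
    return out
```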
  • the text embedding model may comprise a phoneme-only model and/or a mixed character-and-phoneme model.
  • stacked convolutional layers can utilize long-term context information in sequences without introducing any sequential dependency in computation.
  • a convolution block is used as a main sequential processing unit to encode hidden representations of text and audio.
  • FIG. 5 graphically depicts a convolution block comprising a one-dimensional (1D) convolution with gated linear unit, and residual connection, according to embodiments of the present disclosure.
  • the convolution block 500 comprises a one-dimensional (1D) convolution filter 510 , a gated-linear unit 515 as a learnable nonlinearity, a residual connection 520 to the input 505 , and a scaling factor 525 .
  • the scaling factor is √0.5, although different values may be used. The scaling factor helps ensure that the input variance is preserved early in training.
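The convolution block's data flow (1D convolution producing 2·c channels, a gated linear unit, a residual connection, and the √0.5 scaling) can be sketched as follows. This is a numpy illustration under assumed shapes, not the patent's implementation: `w` has shape (k, c, 2c) and 'same' padding is used so the time length is preserved.

```python
import numpy as np

def glu(x):
    """Gated linear unit: split channels in half, gate one half with the other."""
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))   # a * sigmoid(b)

def conv_block(x, w):
    """Convolution block sketch: 1D conv to 2*c channels, GLU, residual
    connection, and the sqrt(0.5) variance-preserving scale.
    x: (t, c), w: (k, c, 2*c) with odd filter width k."""
    k = w.shape[0]
    pad = (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))    # 'same' padding in time
    conv = np.stack([sum(xp[i + j] @ w[j] for j in range(k))
                     for i in range(x.shape[0])])   # (t, 2*c)
    return (glu(conv) + x) * np.sqrt(0.5)
```

Note how the GLU halves the channel count back to c so the residual addition is well-defined, and how the √0.5 factor keeps the variance of the sum roughly equal to that of the input.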
  • in the depicted embodiment in FIG. 5 , c ( 530 ) denotes the dimensionality of the input 505
  • the convolution output of size 2×c ( 535 ) may be split 540 into equal-sized portions: the gate vector 545 and the input vector 550 .
  • the gated linear unit provides a linear path for the gradient flow, which alleviates the vanishing gradient issue for stacked convolution blocks while retaining non-linearity.
  • a speaker-dependent embedding 555 may be added as a bias to the convolution filter output, after a softsign function.
  • a softsign nonlinearity is used because it limits the range of the output while also avoiding the saturation problem that exponential-based nonlinearities sometimes exhibit.
  • the convolution filter weights are initialized with zero-mean and unit-variance activations throughout the entire network.
  • the convolutions in the architecture may be either non-causal (e.g., in encoder 205 / 305 and converter 250 / 350 ) or causal (e.g., in decoder 230 / 330 ).
  • inputs are padded with k−1 timesteps of zeros on the left for causal convolutions and (k−1)/2 timesteps of zeros on both the left and the right for non-causal convolutions, where k is an odd convolution filter width (in embodiments, odd convolution widths were used to simplify the convolution arithmetic, although even convolution widths and even k values may be used).
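The padding rule above reduces to a one-line helper. The function name is a hypothetical convenience; the arithmetic follows the text's assumption of an odd filter width k.

```python
def pad_amounts(k, causal):
    """(left, right) zero-padding for a conv of odd filter width k.
    Causal convs pad k-1 zeros on the left only, so no output depends
    on future timesteps; non-causal convs pad (k-1)//2 on each side,
    preserving the sequence length."""
    return (k - 1, 0) if causal else ((k - 1) // 2, (k - 1) // 2)
```

For example, a width-5 causal convolution pads (4, 0), while its non-causal counterpart pads (2, 2).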
  • dropout 560 is applied to the inputs prior to the convolution for regularization.
  • the encoder network begins with an embedding layer, which converts characters or phonemes into trainable vector representations, h e .
  • these embeddings h e are first projected via a fully-connected layer from the embedding dimension to a target dimensionality. Then, in one or more embodiments, they are processed through a series of convolution blocks to extract time-dependent text information. Lastly, in one or more embodiments, they are projected back to the embedding dimension to create the attention key vectors h k .
  • the key vectors h k are used by each attention block to compute attention weights, whereas the final context vector is computed as a weighted average over the value vectors h v .
  • the decoder network (e.g., decoder 230 / 330 ) generates audio in an autoregressive manner by predicting a group of r future audio frames conditioned on the past audio frames. Since the decoder is autoregressive, in embodiments, it uses causal convolution blocks. In one or more embodiments, a mel-band log-magnitude spectrogram was chosen as the compact low-dimensional audio frame representation, although other representations may be used. It was empirically observed that decoding multiple frames together (i.e., having r>1) yields better audio quality.
  • the decoder network starts with a plurality of fully-connected layers with rectified linear unit (ReLU) nonlinearities to preprocess input mel-spectrograms (denoted as “PreNet” in FIG. 2 ). Then, in one or more embodiments, it is followed by a series of decoder blocks, in which a decoder block comprises a causal convolution block and an attention block. These convolution blocks generate the queries used to attend over the encoder's hidden states. Lastly, in one or more embodiments, a fully-connected layer outputs the next group of r audio frames and also a binary “final frame” prediction (indicating whether the last frame of the utterance has been synthesized). In one or more embodiments, dropout is applied before each fully-connected layer prior to the attention blocks, except for the first one.
  • ReLU rectified linear unit
  • L1 loss may be computed using the output mel-spectrograms, and a binary cross-entropy loss may be computed using the final-frame prediction.
  • L1 loss was selected since it yielded the best result empirically.
  • Other losses, such as L2, may suffer from outlier spectral features, which may correspond to non-speech noise.
  • FIG. 6 graphically depicts an embodiment of an attention block, according to embodiments of the present disclosure.
  • positional encodings may be added to both keys 620 and query 638 vectors, with rates of ω key 405 and ω query 410 , respectively.
  • Forced monotonicity may be applied at inference by adding a mask of large negative values to the logits.
  • One of two possible attention schemes may be used: softmax or monotonic attention.
  • attention weights are dropped out.
  • a dot-product attention mechanism (depicted in FIG. 6 ) is used.
  • the attention mechanism uses a query vector 638 (the hidden states of the decoder) and the per-timestep key vectors 620 from the encoder to compute attention weights, and then outputs a context vector 615 computed as the weighted average of the value vectors 621 .
  • a positional encoding was added to both the key and the query vectors.
  • the position rate dictates the average slope of the line in the attention distribution, roughly corresponding to speed of speech.
  • ω s may be set to one for the query and may be fixed for the key to the ratio of output timesteps to input timesteps (computed across the entire dataset).
  • ω s may be computed for both the key and the query from the speaker embedding for each speaker (e.g., depicted in FIG. 6 ).
  • since sine and cosine functions form an orthonormal basis, this initialization yields an attention distribution in the form of a diagonal line.
  • the fully connected layer weights used to compute hidden attention vectors are initialized to the same values for the query projection and the key projection.
  • Positional encodings may be used in all attention blocks.
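The rate-scaled sinusoidal positional encoding described above can be sketched as follows. This is an illustrative implementation under common conventions (sin on even channels, cos on odd channels, the usual 10000 base); the function name and an even embedding dimension are assumptions, and `rate` plays the role of the position rate ω in the text.

```python
import numpy as np

def positional_encoding(num_positions, dim, rate=1.0):
    """Sinusoidal positional encoding with a position rate (sketch).
    Channel 2k holds sin(rate * i / 10000^(2k/dim)) and channel 2k+1
    the matching cos; `rate` rescales position, which for the keys can
    be set to the output/input timestep ratio. dim is assumed even."""
    pe = np.zeros((num_positions, dim))
    pos = np.arange(num_positions)[:, None] * rate
    k = np.arange(0, dim, 2)
    angle = pos / (10000.0 ** (k / dim))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

Taking the dot product of query and key encodings built this way peaks where the rate-scaled positions match, which is what produces the initial diagonal attention alignment.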
  • a context normalization was used.
  • a fully connected layer is applied to the context vector to generate the output of the attention block.
  • positional encodings improve the convolutional attention mechanism.
  • the softmax may be computed over a fixed window starting at the last attended-to position and going forward several timesteps. In experiments, a window size of three was used, although other window sizes may be used. In one or more embodiments, the initial position is set to zero and is later computed as the index of the highest attention weight within the current window.
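One inference step of this windowed attention can be sketched as below. The function is an illustrative numpy version under stated assumptions (dot-product logits, a hard −1e9 mask outside the window, and window size 3 by default); it is not the patent's exact implementation.

```python
import numpy as np

def windowed_attention_step(query, keys, pos, window=3):
    """One inference step of windowed attention (sketch). The softmax is
    computed only over `window` keys starting at the last attended
    position `pos`; all other logits get a large negative mask, which
    forces a monotonic, non-skipping alignment. Returns the attention
    weights and the new position (argmax inside the window)."""
    logits = keys @ query
    mask = np.full_like(logits, -1e9)
    mask[pos:pos + window] = 0.0
    logits = logits + mask
    w = np.exp(logits - logits.max())
    w /= w.sum()
    new_pos = pos + int(np.argmax(w[pos:pos + window]))
    return w, new_pos
```

Because `new_pos` is taken from within the current window, the attended position can only stay put or move forward a few steps per frame, which suppresses skipping and repetition errors.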
  • FIG. 7 depicts a non-autoregressive model architecture (i.e., a ParaNet embodiment), according to embodiments of the present disclosure.
  • the model architecture 700 may use the same or similar encoder architecture 705 as an autoregressive model—embodiments of which were presented in the prior section.
  • the decoder 730 of ParaNet, conditioned solely on the hidden representation from the encoder, predicts the entire sequence of log-mel spectrograms in a feed-forward manner. As a result, both its training and synthesis may be performed in parallel.
  • the encoder 705 provides key and value 710 as the textual representation
  • the first attention block 715 in the decoder receives a positional encoding 720 as the query and is followed by a set of decoder blocks 734 , which comprise a non-causal convolution block 725 and an attention block 735 .
  • FIG. 8 graphically depicts a convolution block, such as a convolution block 725 , according to embodiments of the present disclosure.
  • the output of the convolution block comprises a query and an intermediate output, in which the query may be sent to an attention block and the intermediate output may be combined with a context representation from an attention block.
  • FIG. 9 graphically depicts an attention block, such as attention block 735 , according to embodiments of the present disclosure.
  • the convolution block 800 and the attention block 900 are similar to the convolution block 500 in FIG. 5 and the attention block 600 in FIG. 6 , with some exceptions: (1) elements related to the speaker embedding have been removed in both blocks (although embodiments may include them), and (2) the embodiment of the attention block in FIG. 9 depicts a different masking embodiment, i.e., an attention masking, which is described in more detail below.
  • the following major architecture modifications of an autoregressive seq2seq model, such as DV3, may be made to create a non-autoregressive model:
  • Non-Autoregressive Decoder 730 Embodiments:
  • the decoder can use non-causal convolution blocks to take advantage of future context information and to improve model performance. In addition to log-mel spectrograms, it also predicts log-linear spectrograms with an l 1 loss for slightly better performance.
  • the output of the convolution block 725 comprises a query and an intermediate output, which may be split such that the query is sent to an attention block and the intermediate output is combined with a context representation from the attention block 735 to form a decoder block 730 output.
  • the decoder block output is sent to the next decoder block, or if it is the last decoder block, may be sent to a fully connected layer to obtain the final output representation (e.g., a linear spectrogram output, mel spectrogram output, etc.).
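The decoder-block data flow described above can be sketched as follows. This is an illustrative shape-level sketch only: a dense layer stands in for the gated non-causal convolution block, and all weights are hypothetical placeholders, not the reference architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decoder_block(x, keys, values, w_conv, w_out):
    """One decoder block: a stand-in for the non-causal conv block produces
    a query and an intermediate output; dot-product attention over the
    encoder keys/values yields a context representation; the context and
    intermediate output combine into the block output."""
    h = np.tanh(x @ w_conv)          # stand-in for the non-causal conv block
    query, intermediate = h, h       # the real block splits its output; one
                                     # tensor illustratively serves both roles
    attn = softmax(query @ keys.T)   # attention weights over encoder timesteps
    context = attn @ values          # weighted average of the value vectors
    return (context + intermediate) @ w_out
```

The block output would feed the next decoder block, or a final fully connected layer in the last block.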
  • Non-autoregressive model embodiments remove the non-causal converter since they already employ a non-causal decoder. Note that a motivation for introducing the non-causal converter in Deep Voice 3 embodiments was to refine the decoder predictions based on bidirectional context information provided by non-causal convolutions.
  • a non-autoregressive model embodiment may learn the accurate alignment between the input text and output spectrogram.
  • Previous non-autoregressive decoders rely on an external alignment system, or an autoregressive latent variable model.
  • several simple and effective techniques are presented that obtain accurate and stable alignment with multi-step attention.
  • Embodiments of the non-autoregressive decoder herein can iteratively refine the attention alignment between text and mel spectrogram in a layer-by-layer manner as illustrated in FIG. 10 .
  • a non-autoregressive decoder adopts a dot-product attention mechanism and comprises K attention blocks (see FIG. 7 ).
  • each attention block uses the per-time-step query vectors from a convolution block and per-time-step key vectors from the encoder to compute the attention weights.
  • the attention block then computes context vectors as the weighted average of the value vectors from the encoder.
  • the decoder starts with an attention block, in which the query vectors are solely positional encoding (see Section C.3.b for additional details).
  • the first attention block then provides the input for the convolution block at the next attention-based layer.
  • FIG. 10 depicts a ParaNet embodiment iteratively refining the attention alignment in a layer-by-layer way, according to embodiments of the present disclosure.
  • the first-layer attention is mostly dominated by the positional encoding prior; the alignment becomes increasingly confident in the subsequent layers.
  • the attention alignments from a pretrained autoregressive model are used to guide the training of non-autoregressive model.
  • the cross entropy between the attention distributions from the non-autoregressive ParaNet and a pretrained autoregressive model is minimized.
  • the attention loss may be computed as the average cross entropy between the student and teacher's attention distributions:
  • the final loss function is a linear combination of l atten and l 1 losses from spectrogram predictions.
  • the coefficient of l atten may be set to 4, and the other coefficients to 1.
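The attention distillation loss and its combination with the spectrogram losses can be sketched as below; the function names are hypothetical, and the attention inputs are assumed to be row-wise distributions over encoder positions (one row per decoder timestep).

```python
import numpy as np

def attention_loss(student, teacher, eps=1e-8):
    """Average cross entropy between the student (ParaNet) and teacher
    (pretrained autoregressive) attention distributions. eps guards log(0)."""
    return float(-(teacher * np.log(student + eps)).sum(axis=-1).mean())

def total_loss(l_atten, l1_losses, atten_coef=4.0):
    # linear combination: attention coefficient 4, spectrogram l1 coefficients 1
    return atten_coef * l_atten + sum(l1_losses)
```

When student and teacher distributions coincide, the cross entropy reduces to the teacher's entropy, so the loss is minimized (but not zero) at perfect agreement.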
  • a positional encoding such as in Deep Voice 3 embodiments, may be used at every attention block.
  • the positional encoding may be added to both key and query vectors in the attention block, which forms an inductive bias for monotonic attention.
  • the non-autoregressive model relies on its attention mechanism to decode mel spectrograms from the encoded textual features, without any autoregressive input. This makes the positional encoding even more important in guiding the attention to follow a monotonic progression over time at the beginning of training.
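A sinusoidal positional encoding with a position rate can be sketched as follows, in the style of Deep Voice 3; the channel ordering (sine on even channels, cosine on odd) is illustrative and may differ from the reference implementation.

```python
import numpy as np

def positional_encoding(num_steps, dim, position_rate=1.0):
    """Transformer-style sinusoidal encoding whose timestep index is
    multiplied by a position rate (ws), giving the attention a monotonic
    diagonal prior when added to both keys and queries."""
    pe = np.zeros((num_steps, dim))
    pos = np.arange(num_steps)[:, None] * position_rate
    div = 10000.0 ** (np.arange(0, dim, 2) / dim)  # per-channel frequencies
    pe[:, 0::2] = np.sin(pos / div)  # even channels
    pe[:, 1::2] = np.cos(pos / div)  # odd channels
    return pe
```

Setting the key's position rate to the output/input length ratio (as described above) shifts the implied diagonal's slope to match the speed of speech.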
  • ω s may be set in the following ways:
  • the non-autoregressive ParaNet embodiments at synthesis may use a different attention masking than was used in autoregressive DV3 embodiments.
  • the softmax is computed over a fixed window centered around the target position and going forward and backward several timesteps (e.g., 3 timesteps).
  • the target position may be calculated as
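The centered masking at synthesis can be sketched as follows. This is a hypothetical helper: the derivation of the target position itself is not reproduced here, and the window half-width of 3 follows the example in the text above.

```python
import numpy as np

def centered_window_softmax(scores, target_pos, halfwidth=3):
    """ParaNet-style attention masking: softmax restricted to a window
    centered on target_pos, extending `halfwidth` timesteps backward and
    forward; positions outside the window receive weight 0."""
    lo = max(target_pos - halfwidth, 0)
    hi = min(target_pos + halfwidth + 1, len(scores))
    masked = np.full_like(scores, -np.inf)
    masked[lo:hi] = scores[lo:hi]
    w = np.exp(masked - masked[lo:hi].max())  # stable softmax over the window
    return w / w.sum()
```

Unlike the autoregressive masking, which looks only forward from the last attended-to position, this window is symmetric about the target position.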
  • the parallel neural TTS system feeds the predicted mel spectrogram from the non-autoregressive ParaNet model embodiment to the IAF-based parallel vocoder similar to ClariNet embodiments referenced above.
  • the method uses an auto-encoding variational Bayes/variational autoencoder (VAE) framework, thus it may be referred to for convenience as WaveVAE.
  • WaveVAE embodiments may be trained from scratch by jointly optimizing the encoder q ϕ (z|x) and the decoder p θ (x|z).
  • FIG. 11 depicts a simplified block diagram of a variational autoencoder (VAE) framework, according to embodiments of the present disclosure.
  • the encoder q ϕ (z|x) is parameterized by a Gaussian autoregressive WaveNet embodiment that maps the ground truth audio x into a latent representation z of the same length.
  • the Gaussian WaveNet embodiment models x t given the previous samples x &lt;t as x t ∼𝒩(μ(x &lt;t ; ϕ), σ(x &lt;t ; ϕ)), where the mean μ(x &lt;t ; ϕ) and scale σ(x &lt;t ; ϕ) are predicted by the WaveNet.
  • the encoder posterior may be constructed as:
  • the mean μ(x &lt;t ; ϕ) and scale σ(x &lt;t ; ϕ) are applied for “whitening” the posterior distribution.
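The whitening step can be sketched as follows; mu and sigma stand in for the per-timestep mean and scale that the Gaussian autoregressive WaveNet would predict from the previous samples, so this is an illustration of the transform only, not of the network.

```python
import numpy as np

def whiten(x, mu, sigma):
    """'Whitening' in the encoder posterior: standardize each audio sample
    by the autoregressive model's predicted mean and scale, yielding the
    (per-timestep) posterior mean for the latent z."""
    return (x - mu) / sigma
```

Because mu and sigma for all timesteps can be computed in one teacher-forced pass over x, the latents z can be sampled in parallel, consistent with the parallel-sampling property noted below.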
  • the encoder q ϕ (z|x) admits parallel sampling of the latents z.
  • x t given z (0) &lt;t also follows a Gaussian, with scale and mean given by p θ (x t |z (0) &lt;t ) = 𝒩(μ tot , σ tot ).
  • the goal is to maximize the evidence lower bound (ELBO) for observed x in VAE:
  • a stochastic optimization may be performed by drawing a sample z from the encoder q ϕ (z|x).
  • an annealing strategy for the KL divergence was applied, in which its weight is gradually increased from 0 to 1 via a sigmoid function.
  • this allows the encoder to encode sufficient information into the latent representations early in training, and then gradually regularizes the latent representations by increasing the weight of the KL divergence.
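A sigmoid annealing schedule for the KL weight can be sketched as below; the midpoint and steepness values are illustrative assumptions, since the text specifies only that the weight increases from 0 to 1 via a sigmoid.

```python
import numpy as np

def kl_weight(step, midpoint=50_000, steepness=1e-4):
    """Sigmoid annealing of the KL-divergence weight: ~0 early in training,
    0.5 at the midpoint step, approaching 1 thereafter."""
    return float(1.0 / (1.0 + np.exp(-steepness * (step - midpoint))))
```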
  • a short-term Fourier transform (STFT) based loss may be added to improve the quality of synthesized speech.
  • the STFT loss may be defined as the summation of l 2 loss on the magnitudes of STFT and l 1 loss on the log-magnitudes of STFT between the output audio and ground truth audio.
  • FFT size was set to 2048.
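The STFT loss described above can be sketched as follows: an l2 loss on STFT magnitudes plus an l1 loss on log STFT magnitudes. The hop length of 300 samples and the Hann window are assumptions made for this illustration; only the FFT size of 2048 comes from the text.

```python
import numpy as np

def stft_mag(x, n_fft=2048, hop=300):
    # frame the signal with a Hann window and take magnitude spectra
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))

def stft_loss(y, y_ref, eps=1e-5):
    """Sum of l2 loss on STFT magnitudes and l1 loss on log-magnitudes
    between an output waveform and a reference waveform."""
    m, m_ref = stft_mag(y), stft_mag(y_ref)
    return float(np.mean((m - m_ref) ** 2)
                 + np.mean(np.abs(np.log(m + eps) - np.log(m_ref + eps))))
```

Working on magnitudes makes the loss insensitive to phase, which is appropriate for stabilizing waveform training.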
  • two STFT losses were considered in the objective: (i) the STFT loss between the ground truth audio and the reconstructed audio using the encoder q ϕ (z|x); and (ii) the STFT loss between the ground truth audio and the synthesized audio using the prior.
  • the final loss is a linear combination of the terms in Eq. (5) and the STFT losses. The corresponding coefficients are simply set to be one in experiments herein.
  • FIG. 12 depicts a general method for using a ParaNet embodiment for synthesizing a speech representation from input text, according to embodiments of the present disclosure.
  • a computer-implemented method for synthesizing speech from an input text comprises encoding ( 1205 ) the input text into hidden representations comprising a set of key representations and a set of value representations using an encoder, which comprises one or more convolution layers.
  • the hidden representations are used ( 1210 ) by a non-autoregressive decoder to obtain a synthesized representation, which may be a linear spectrogram output, a mel spectrogram output, or a waveform.
  • the non-autoregressive decoder comprises an attention block that uses positional encoding and the set of key representations to generate a context representation for each time step, which context representations are supplied as inputs to a first decoder block in a plurality of decoder blocks.
  • the positional encoding is used by the attention block to affect attention alignment weighting.
  • a decoder block comprises: a non-causal convolution block, which receives as an input the context representation if it is the first decoder block in the plurality of decoder blocks, receives as an input a decoder block output from a prior decoder block if it is the second or a subsequent decoder block in the plurality of decoder blocks, and outputs a query and an intermediary output; and an attention block, which uses the query output from the non-causal convolution block and positional encoding to compute a context representation that is combined with the intermediary output to create a decoder block output for the decoder block.
  • the set of decoder block outputs are used ( 1215 ) to generate a set of audio representation frames representing the input text.
  • the set of audio representation frames may be linear spectrograms, mel spectrograms, or a waveform.
  • obtaining the waveform may involve using a vocoder.
  • the TTS system may comprise a vocoder, such as an IAF-based parallel vocoder, that converts the set of audio representation frames into a signal representing synthesized speech of the input text.
  • the IAF-based parallel vocoder may be a WaveVAE embodiment that is trained without distillation.
  • the vocoder decoder may be trained without distillation by using the encoder of the vocoder to guide training of the vocoder decoder.
  • a benefit of such a methodology is that the encoder can be jointly trained with the vocoder decoder.
  • Two training method embodiments, a ClariNet embodiment and a WaveVAE embodiment, were evaluated for IAF-based waveform synthesis.
  • the IAF is conditioned on log-mel spectrograms with two layers of transposed 2-D convolution as in the ClariNet embodiment.
  • the same teacher-student setup was used for the ClariNet embodiment, and a 20-layer Gaussian autoregressive WaveNet was trained as the teacher model.
  • the crowdMOS toolkit (F. Ribeiro, D. Florêncio, C. Zhang, and M. Seltzer, “CrowdMOS: An approach for crowdsourcing mean opinion score studies,” in ICASSP, 2011) was used to collect subjective Mean Opinion Score (MOS) ratings.
  • The MOS results are presented in Table 2.
  • although the WaveVAE (prior) model performs worse than ClariNet at synthesis, it is trained from scratch and does not require any pre-training.
  • further improvement of WaveVAE may be achieved by introducing a learned prior network, which would minimize the quality gap between speech reconstructed with the encoder and speech synthesized with the prior.
  • a non-autoregressive ParaNet embodiment was compared with an autoregressive DV3 embodiment in terms of inference latency.
  • a custom sentence test set was constructed, and inference was run 50 times on each sentence in the test set (batch size set to 1).
  • the average inference latencies over the 50 runs and the sentence test set were 0.024 and 1.12 seconds for the non-autoregressive and autoregressive model embodiments, respectively, on an NVIDIA GeForce GTX 1080 Ti produced by Nvidia of Santa Clara, Calif.
  • the ParaNet embodiment yielded about 46.7 times speed-up compared to its autoregressive counterpart at synthesis.
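The reported speed-up follows directly from the latency figures above, as this small check confirms:

```python
# Latencies reported above: 1.12 s (autoregressive) vs. 0.024 s (ParaNet)
speedup = 1.12 / 0.024
print(round(speedup, 1))  # ≈ 46.7
```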
  • the non-autoregressive ParaNet embodiment has far fewer attention errors than its autoregressive counterpart at synthesis (12 vs. 37).
  • although the ParaNet embodiment distills the (teacher-forced) attentions from an autoregressive model, it takes only textual inputs at both training and synthesis and thus does not exhibit the discrepancy found in an autoregressive model.
  • attention masking was applied to enforce the monotonic attentions and reduce attention errors, and it was demonstrated to be effective in Deep Voice 3 embodiments. It was found that the tested non-autoregressive ParaNet embodiment still had fewer attention errors than the tested autoregressive DV3 embodiment (6 vs. 8 in Table 4), when both of them were using the attention masking techniques.
  • the MOS evaluation results of the TTS system embodiments are reported in Table 5.
  • the WaveNet vocoders were trained on predicted mel spectrograms from the DV3 and non-autoregressive model embodiments, respectively, for better quality.
  • Both the ClariNet vocoder embodiment and the WaveVAE embodiment were trained on ground-truth mel spectrograms for stable optimization. At synthesis, all of them were conditioned on the predicted mel spectrograms from the text-to-spectrogram model embodiment.
  • the non-autoregressive ParaNet embodiment can provide comparable quality of speech as the autoregressive DV3 with WaveNet vocoder embodiment.
  • the quality of speech degrades partly because of the mismatch between the ground truth mel spectrograms used for training and the predicted mel spectrograms used for synthesis. Further improvement may be achieved by successfully training IAF-based neural vocoders on predicted mel spectrograms.
  • Presented herein are embodiments of a fully parallel neural text-to-speech system comprising a non-autoregressive text-to-spectrogram model and an IAF-based parallel vocoder.
  • Embodiments of the novel non-autoregressive system (which may be generally referred to for convenience as ParaNet) have fewer attention errors.
  • a test embodiment obtained a 46.7 times speed-up over its autoregressive counterpart at synthesis while maintaining comparable speech quality.
  • embodiments of an alternative vocoder (which may be generally referred to as WaveVAE) were developed to train an inverse autoregressive flow (IAF) for parallel waveform synthesis. WaveVAE embodiments avoid the need for distillation from a separately trained autoregressive WaveNet and can be trained from scratch.
  • aspects of the present patent document may be directed to, may include, or may be implemented on one or more information handling systems/computing systems.
  • a computing system may include any instrumentality or aggregate of instrumentalities operable to compute, calculate, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data.
  • a computing system may be or may include a personal computer (e.g., laptop), tablet computer, phablet, personal digital assistant (PDA), smart phone, smart watch, smart package, server (e.g., blade server or rack server), a network storage device, camera, or any other suitable device and may vary in size, shape, performance, functionality, and price.
  • the computing system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, ROM, and/or other types of memory.
  • Additional components of the computing system may include one or more disk drives, one or more network ports for communicating with external devices as well as various input and output (I/O) devices, such as a keyboard, a mouse, touchscreen and/or a video display.
  • the computing system may also include one or more buses operable to transmit communications between the various hardware components.
  • FIG. 13 depicts a simplified block diagram of a computing device/information handling system (or computing system) according to embodiments of the present disclosure. It will be understood that the functionalities shown for system 1300 may operate to support various embodiments of a computing system—although it shall be understood that a computing system may be differently configured and include different components, including having fewer or more components as depicted in FIG. 13 .
  • the computing system 1300 includes one or more central processing units (CPU) 1301 that provides computing resources and controls the computer.
  • CPU 1301 may be implemented with a microprocessor or the like, and may also include one or more graphics processing units (GPU) 1319 and/or a floating-point coprocessor for mathematical computations.
  • System 1300 may also include a system memory 1302 , which may be in the form of random-access memory (RAM), read-only memory (ROM), or both.
  • An input controller 1303 represents an interface to various input device(s) 1304 , such as a keyboard, mouse, touchscreen, and/or stylus.
  • the computing system 1300 may also include a storage controller 1307 for interfacing with one or more storage devices 1308 each of which includes a storage medium such as magnetic tape or disk, or an optical medium that might be used to record programs of instructions for operating systems, utilities, and applications, which may include embodiments of programs that implement various aspects of the present disclosure.
  • Storage device(s) 1308 may also be used to store processed data or data to be processed in accordance with the disclosure.
  • the system 1300 may also include a display controller 1309 for providing an interface to a display device 1311 , which may be a cathode ray tube (CRT), a thin film transistor (TFT) display, organic light-emitting diode, electroluminescent panel, plasma panel, or other type of display.
  • the computing system 1300 may also include one or more peripheral controllers or interfaces 1305 for one or more peripherals 1306 . Examples of peripherals may include one or more printers, scanners, input devices, output devices, sensors, and the like.
  • a communications controller 1314 may interface with one or more communication devices 1315 , which enables the system 1300 to connect to remote devices through any of a variety of networks including the Internet, a cloud resource (e.g., an Ethernet cloud, a Fiber Channel over Ethernet (FCoE)/Data Center Bridging (DCB) cloud, etc.), a local area network (LAN), a wide area network (WAN), a storage area network (SAN) or through any suitable electromagnetic carrier signals including infrared signals.
  • in one or more embodiments, the major system components may connect to a bus 1316 , which may represent more than one physical bus.
  • various system components may or may not be in physical proximity to one another.
  • input data and/or output data may be remotely transmitted from one physical location to another.
  • programs that implement various aspects of the disclosure may be accessed from a remote location (e.g., a server) over a network.
  • Such data and/or programs may be conveyed through any of a variety of machine-readable medium including, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices.
  • aspects of the present disclosure may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed.
  • the one or more non-transitory computer-readable media may include volatile and/or non-volatile memory.
  • alternative implementations are possible, including a hardware implementation or a software/hardware implementation.
  • Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations.
  • computer-readable medium or media includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof.
  • embodiments of the present disclosure may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations.
  • the media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind known or available to those having skill in the relevant arts.
  • Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices.
  • Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter.
  • Embodiments of the present disclosure may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device.
  • Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.


Abstract

Presented herein are embodiments of a non-autoregressive sequence-to-sequence model that converts text to an audio representation. Embodiments are fully convolutional, and a tested embodiment obtained about 46.7 times speed-up over a prior model at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, a tested embodiment also has fewer attention errors than the autoregressive model on challenging test sentences. In one or more embodiments, the first fully parallel neural text-to-speech system was built by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. System embodiments can synthesize speech from text through a single feed-forward pass. Also disclosed herein are embodiments of a novel approach to train the IAF from scratch as a generative model for raw waveform, which avoids the need for distillation from a separately trained WaveNet.

Description

CROSS-REFERENCE TO RELATED APPLICATION
The present application is a continuation-in-part of, and claims the priority benefit of, commonly-owned U.S. patent application Ser. No. 16/277,919, filed on Feb. 15, 2019, entitled “SYSTEMS AND METHODS FOR PARALLEL WAVE GENERATION IN END-TO-END TEXT-TO-SPEECH,” listing Wei Ping, Kainan Peng, and Jitong Chen as inventors, which is a continuation-in-part of, and claims the priority benefit of, commonly-owned U.S. patent application Ser. No. 16/058,265, filed on Aug. 8, 2018, entitled “SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING,” listing Sercan Arik, Wei Ping, Kainan Peng, Sharan Narang, Ajay Kannan, Andrew Gibiansky, Jonathan Raiman, and John Miller as inventors, which claimed priority under 35 U.S.C. § 119(e) to U.S. Provisional Application No. 62/574,382, filed on Oct. 19, 2017, entitled “SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING,” listing Sercan Arik, Wei Ping, Kainan Peng, Sharan Narang, Ajay Kannan, Andrew Gibiansky, Jonathan Raiman, and John Miller as inventors. Each patent document is incorporated in its entirety herein by reference and for all purposes.
BACKGROUND A. Technical Field
The present disclosure relates generally to systems and methods for computer learning that can provide improved computer performance, features, and uses. More particularly, the present disclosure relates to systems and methods for text-to-speech through deep neural networks.
B. Background
Artificial speech synthesis systems, commonly known as text-to-speech (TTS) systems, convert written language into human speech. TTS systems are used in a variety of applications, such as human-technology interfaces, accessibility for the visually-impaired, media, and entertainment. Fundamentally, it allows human-technology interaction without requiring visual interfaces. Traditional TTS systems are based on complex multi-stage hand-engineered pipelines. Typically, these systems first transform text into a compact audio representation, and then convert this representation into audio using an audio waveform synthesis method called a vocoder.
Due to their complexity, developing TTS systems can be very labor intensive and difficult. Recent work on neural TTS has demonstrated impressive results, yielding pipelines with somewhat simpler features, fewer components, and higher quality synthesized speech. There is not yet a consensus on the optimal neural network architecture for TTS.
Accordingly, what is needed are systems and methods for creating, developing, and/or deploying improved speaker text-to-speech systems.
BRIEF DESCRIPTION OF THE DRAWINGS
References will be made to embodiments of the disclosure, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the disclosure is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the disclosure to these particular embodiments. Items in the figures may not be to scale.
FIG. 1A depicts an autoregressive sequence-to-sequence model, according to embodiments of the present disclosure.
FIG. 1B depicts a non-autoregressive model, which distills the attention from a pretrained autoregressive model, according to embodiments of the present disclosure.
FIG. 2 graphically depicts an autoregressive architecture 200, according to embodiments of the present disclosure.
FIG. 3 graphically depicts an alternative autoregressive model architecture, according to embodiments of the present disclosure.
FIG. 4 depicts a general overview methodology for using a text-to-speech architecture, according to embodiments of the present disclosure.
FIG. 5 graphically depicts a convolution block comprising a one-dimensional (1D) convolution with gated linear unit, and residual connection, according to embodiments of the present disclosure.
FIG. 6 graphically depicts an attention block, according to embodiments of the present disclosure.
FIG. 7 depicts a non-autoregressive model architecture (i.e., a ParaNet embodiment), according to embodiments of the present disclosure.
FIG. 8 graphically depicts a convolution block, according to embodiments of the present disclosure.
FIG. 9 graphically depicts an attention block, according to embodiments of the present disclosure.
FIG. 10 depicts a ParaNet embodiment iteratively refining the attention alignment in a layer-by-layer way, according to embodiments of the present disclosure.
FIG. 11 depicts a simplified block diagram of a variational autoencoder (VAE) framework, according to embodiments of the present disclosure.
FIG. 12 depicts a general method for using a ParaNet embodiment for synthesizing a speech representation from input text, according to embodiments of the present disclosure.
FIG. 13 depicts a simplified block diagram of a computing device/information handling system, according to embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the disclosure. It will be apparent, however, to one skilled in the art that the disclosure can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present disclosure, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.
Components, or modules, shown in diagrams are illustrative of exemplary embodiments of the disclosure and are meant to avoid obscuring the disclosure. It shall also be understood throughout this discussion that components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including integrated within a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.
Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.
Reference in the specification to “one embodiment,” “preferred embodiment,” “an embodiment,” or “embodiments” means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the disclosure and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.
The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated.
The terms “include,” “including,” “comprise,” and “comprising” shall be understood to be open terms, and any lists that follow are examples and are not meant to be limited to the listed items. A “layer” may comprise one or more operations.
Any headings used herein are for organizational purposes only and shall not be used to limit the scope of the description or the claims. Each reference/document mentioned in this patent document is incorporated by reference herein in its entirety.
Furthermore, one skilled in the art shall recognize that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be done concurrently.
It shall be noted that any experiments and results provided herein are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of the current patent document.
A. General Introduction
Text-to-speech (TTS), also called speech synthesis, has long been a vital tool in a variety of applications, such as human-computer interaction, virtual assistants, and content creation. Traditional TTS systems are based on multi-stage hand-engineered pipelines. In recent years, autoregressive models based on deep neural networks have attained state-of-the-art results, including high-fidelity audio synthesis, and much simpler sequence-to-sequence (seq2seq) pipelines. In particular, embodiments of one of the most popular neural TTS pipelines comprise two components (embodiments of which are disclosed in U.S. patent application Ser. No. 16/058,265, filed on Aug. 8, 2018, entitled “SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING,” listing Sercan Arιk, Wei Ping, Kainan Peng, Sharan Narang, Ajay Kannan, Andrew Gibiansky, Jonathan Raiman, and John Miller as inventors, which patent document is incorporated in its entirety herein by reference) (which disclosure may be referred to, for convenience, as “Deep Voice 3” or “DV3”): (i) an autoregressive seq2seq model that generates a mel spectrogram from text, and (ii) an autoregressive neural vocoder (e.g., WaveNet) that generates a raw waveform from the mel spectrogram. This pipeline requires much less expert knowledge and uses pairs of audio and transcript as training data.
However, the autoregressive nature of these models makes them quite slow at synthesis, because they operate sequentially at a high temporal resolution of waveform samples or acoustic features (e.g., spectrogram). Most recently, parallel WaveNet and embodiments disclosed in U.S. patent application Ser. No. 16/277,919 (filed on Feb. 15, 2019, entitled “SYSTEMS AND METHODS FOR PARALLEL WAVE GENERATION IN END-TO-END TEXT-TO-SPEECH,” listing Wei Ping, Kainan Peng, and Jitong Chen as inventors, which patent document is incorporated in its entirety herein by reference) (which disclosure may be referred to, for convenience, as “ClariNet”) have been proposed for parallel waveform synthesis, but they still rely on autoregressive or recurrent components to predict the frame-level acoustic features (e.g., 100 frames per second), which can be slow at synthesis on modern hardware optimized for parallel execution.
In this patent document, embodiments of a non-autoregressive text-to-spectrogram model—a fully parallel neural TTS system—are presented. Some of the contributions presented herein include but are not limited to:
1. Embodiments of the first non-autoregressive attention-based architecture for TTS, which is fully convolutional and converts text to mel spectrogram. For convenience, the various embodiments may be referred to generally as “ParaNet.” Embodiments of ParaNet iteratively refine the attention alignment between text and spectrogram in a layer-by-layer manner.
2. The non-autoregressive ParaNet embodiments are compared with their autoregressive counterpart in terms of speech quality, synthesis speed, and attention stability. A ParaNet embodiment achieves ˜46.7 times speed-up over an autoregressive model embodiment at synthesis, while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, the non-autoregressive ParaNet embodiments produce fewer attention errors on a set of challenging test sentences as compared to an autoregressive model embodiment, because they do not have the troublesome discrepancy between teacher-forced training and autoregressive inference.
3. The first fully parallel neural TTS system was built by combining a non-autoregressive ParaNet embodiment with the inverse autoregressive flow (IAF)-based neural vocoder (e.g., ClariNet embodiments). It generates speech from text through a single feed-forward pass.
4. In addition, a novel approach, referred to for convenience as WaveVAE, was developed for training the IAF as a generative model for waveform samples. In contrast to probability density distillation methods, WaveVAE may be trained from scratch by using the IAF as the decoder in the variational autoencoder (VAE) framework.
The remainder of this patent document is as follows. Section B discusses related work. Embodiments of the non-autoregressive ParaNet architecture are described in Section C. WaveVAE embodiments are presented in Section D. Implementation details and experimental results are provided in Section E, and some conclusions are provided in Section F.
B. Related Work
Neural speech synthesis has obtained state-of-the-art results and gained significant attention. Several neural TTS systems were proposed, including: novel architectures disclosed in commonly-assigned U.S. patent application Ser. No. 15/882,926, filed on 29 Jan. 2018, entitled “SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH,” and U.S. Prov. Pat. App. No. 62/463,482, filed on 24 Feb. 2017, entitled “SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH,” each of the aforementioned patent documents is incorporated by reference herein in its entirety (which disclosures may be referred to, for convenience, as “Deep Voice 1” or “DV1”); novel architectures disclosed in commonly-assigned U.S. patent application Ser. No. 15/974,397, filed on 8 May 2018, entitled “SYSTEMS AND METHODS FOR MULTI-SPEAKER NEURAL TEXT-TO-SPEECH,” and U.S. Prov. Pat. App. No. 62/508,579, filed on 19 May 2017, entitled “SYSTEMS AND METHODS FOR MULTI-SPEAKER NEURAL TEXT-TO-SPEECH,” each of the aforementioned patent documents is incorporated by reference herein in its entirety (which disclosures may be referred to, for convenience, as “Deep Voice 2” or “DV2”); novel architectures disclosed in Deep Voice 3 (referenced above); novel architectures disclosed in ClariNet (referenced above); and in Tacotron, Tacotron 2, Char2Wav, and VoiceLoop.
In particular, Tacotron, Char2Wav, and embodiments of Deep Voice 3 employ a seq2seq framework with an attention mechanism, yielding a much simpler pipeline compared to the traditional multi-stage pipeline. Their excellent extensibility leads to promising results for several challenging tasks, such as voice cloning. All of these state-of-the-art TTS systems are based on autoregressive models.
RNN-based autoregressive models, such as Tacotron and WaveRNN, lack parallelism at both training and synthesis. CNN-based autoregressive models, such as WaveNet and embodiments of Deep Voice 3, enable parallel processing at training, but they still operate sequentially at synthesis since each output element must be generated before it can be passed in as input at the next time-step. Recently, several non-autoregressive models have been proposed for neural machine translation. Gu et al. (J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher. Non-autoregressive neural machine translation. In ICLR, 2018) train a feed-forward neural network conditioned on fertility values, which are obtained from an external alignment system. Kaiser et al. (L. Kaiser, A. Roy, A. Vaswani, N. Parmar, S. Bengio, J. Uszkoreit, and N. Shazeer. Fast decoding in sequence models using discrete latent variables. In ICML, 2018) proposed a latent variable model for fast decoding, but it retains autoregressive dependencies between latent variables. Lee et al. (J. Lee, E. Mansimov, and K. Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP, 2018) iteratively refine the output sequence through a denoising autoencoder framework. Arguably, non-autoregressive models play a more important role in text-to-speech, where the output speech spectrogram consists of hundreds of time-steps for a short text with a few words. To the best of our knowledge, this work is the first non-autoregressive seq2seq model for TTS and provides as much as 46.7 times speed-up at synthesis over its autoregressive counterpart.
Normalizing flows are a family of generative models, in which a simple initial distribution is transformed into a more complex one by applying a series of invertible transformations. Inverse autoregressive flow (IAF) is a special type of normalizing flow where each invertible transformation is based on an autoregressive neural network. IAF performs synthesis in parallel and can easily reuse expressive autoregressive architectures, such as WaveNet, which leads to state-of-the-art results for speech synthesis. Likelihood evaluation in IAF is autoregressive and slow; thus, previous training methods rely on probability density distillation from a pretrained autoregressive model. RealNVP and Glow are different types of normalizing flows, where both synthesis and likelihood evaluation can be performed in parallel by enforcing bipartite architecture constraints. Most recently, both methods were applied as parallel neural vocoders. These models are less expressive than autoregressive and IAF models, because half of the variables are unchanged after each transformation. As a result, these bipartite flows usually require deeper layers, a larger hidden size, and a huge number of parameters. For example, WaveGlow has ˜200 M parameters, whereas WaveNet and ClariNet embodiments use only ˜1.7 M parameters, making them preferable for production deployment. In this patent document, one focus is on autoregressive and IAF-based neural vocoders.
Variational autoencoder (VAE) has been applied for representation learning of natural speech for years. It models either the generative process of waveform samples or spectrograms. Autoregressive or recurrent neural networks have been employed as the decoder of VAE, but they can be quite slow at synthesis. In embodiments herein, the feed-forward IAF is employed as the decoder, which enables parallel waveform synthesis.
C. Non-Autoregressive seq2seq Model Embodiments
Embodiments of a parallel TTS system comprise two components: 1) a feed-forward text-to-spectrogram model, and 2) a parallel waveform synthesizer conditioned on spectrogram. In this section, an autoregressive text-to-spectrogram model, such as one derived from Deep Voice 3, is first presented. Then, ParaNet embodiments—non-autoregressive text-to-spectrogram models—are presented.
By way of general comparison, consider the high-level diagrams of FIG. 1A (autoregressive) and FIG. 1B (non-autoregressive). FIG. 1A depicts an autoregressive seq2seq model, according to embodiments of the present disclosure. The dashed line 145 depicts the autoregressive decoding of the mel spectrogram at inference. FIG. 1B depicts a non-autoregressive model, which distills the attention from a pretrained autoregressive model, according to embodiments of the present disclosure.
1. Autoregressive Architecture Embodiments
a) Example Model Architecture Embodiments
Embodiments of the autoregressive model may be based on a Deep Voice 3 embodiment or embodiments—a fully-convolutional text-to-spectrogram model, which comprises three components:
Encoder 115:
A convolutional encoder, which takes text inputs and encodes them into an internal hidden representation.
Decoder 125:
A causal convolutional decoder, which decodes the encoder representation with an attention mechanism 120 to log-mel spectrograms 135 in an autoregressive manner, in which the output of the decoder at a timestep is used as an input for the next timestep for the decoder, with an l1 loss. It starts with 1×1 convolutions to preprocess the input log-mel spectrograms.
Converter 130:
A non-causal convolutional post-processing network, which processes the hidden representation from the decoder using both past and future context information and predicts the log-linear spectrograms with an l1 loss. It enables bidirectional processing.
In one or more embodiments, all these components use the same 1-D convolution with a gated linear unit. A major difference between embodiments of the ParaNet model and the DV3 embodiment is the decoder architecture. The decoder 125 of the DV3 embodiment 100 has multiple attention-based layers, where each layer comprises a causal convolution block followed by an attention block. To simplify the attention distillation described in Section C.3.a, embodiments of the autoregressive decoder herein have one attention block at the first layer. It was found that reducing the number of attention blocks did not hurt the generated speech quality in general.
FIG. 2 graphically depicts an example autoregressive architecture 200, according to embodiments of the present disclosure. In one or more embodiments, the architecture 200 uses residual convolutional layers in an encoder 205 to encode text into per-timestep key and value vectors 220 for an attention-based decoder 230. In one or more embodiments, the decoder 230 uses these to predict the mel-scale log magnitude spectrograms 242 that correspond to the output audio. In FIG. 2, the dotted arrow 246 depicts the autoregressive synthesis process during inference (during training, mel-spectrogram frames from the ground truth audio corresponding to the input text are used). In one or more embodiments, the hidden states of the decoder 230 are then fed to a converter network 250 to predict the vocoder parameters for waveform synthesis to produce an output wave 260.
In one or more embodiments, the overall objective function to be optimized may be a linear combination of the losses from the decoder and the converter. In one or more embodiments, the decoder 210 and converter 215 are separated and multi-task training is applied, because it makes attention learning easier in practice. To be specific, in one or more embodiments, the loss for mel-spectrogram prediction guides training of the attention mechanism, because the attention is trained with the gradients from mel-spectrogram prediction (e.g., using an L1 loss for the mel-spectrograms) besides vocoder parameter prediction.
In a multi-speaker embodiment, trainable speaker embeddings 270 as in Deep Voice 2 embodiments may be used across encoder 205, decoder 230, and converter 250.
FIG. 3 graphically depicts an alternative autoregressive model architecture, according to embodiments of the present disclosure. In one or more embodiments, the model 300 uses a deep residual convolutional network to encode text and/or phonemes into per-timestep key 320 and value 322 vectors for an attentional decoder 330. In one or more embodiments, the decoder 330 uses these to predict the mel-band log magnitude spectrograms 342 that correspond to the output audio. The dotted arrows 346 depict the autoregressive synthesis process during inference. In one or more embodiments, the hidden state of the decoder is fed to a converter network 350 to output linear spectrograms for Griffin-Lim 352A or parameters for WORLD 352B, which can be used to synthesize the final waveform. In one or more embodiments, weight normalization is applied to all convolution filters and fully connected layer weight matrices in the model. As illustrated in the embodiment depicted in FIG. 3, WaveNet 352 does not require a separate converter as it takes as input mel-band log magnitude spectrograms.
Example hyperparameters for a model embodiment are provided in Table 1, below.
TABLE 1
Example Hyperparameters

Parameter                                Single-Speaker
FFT Size                                 4096
FFT Window Size/Shift                    2400/600
Audio Sample Rate                        48000
Reduction Factor r                       4
Mel Bands                                80
Sharpening Factor                        1.4
Character Embedding Dim.                 256
Encoder Layers/Conv. Width/Channels      7/5/64
Decoder Affine Size                      128, 256
Decoder Layers/Conv. Width               4/5
Attention Hidden Size                    128
Position Weight/Initial Rate             1.0/6.3
Converter Layers/Conv. Width/Channels    5/5/256
Dropout Probability                      0.95
Number of Speakers                       1
Speaker Embedding Dim.                   N/A
ADAM Learning Rate                       0.001
Anneal Rate/Anneal Interval              N/A
Batch Size                               16
Max Gradient Norm                        100
Gradient Clipping Max. Value             5
FIG. 4 depicts a general overview methodology for using a text-to-speech architecture, such as depicted in FIG. 1A, FIG. 2, or FIG. 3, according to embodiments of the present disclosure. In one or more embodiments, an input text is converted (405) into trainable embedding representations using an embedding model, such as text embedding model 210. The embedding representations are converted (410) into attention key representations 220 and attention value representations 220 using an encoder network 205, which comprises a series 214 of one or more convolution blocks 216. These attention key representations 220 and attention value representations 220 are used by an attention-based decoder network, which comprises a series of one or more decoder blocks 234, in which a decoder block 234 comprises a convolution block 236 that generates a query 238 and an attention block 240, to generate (415) low-dimensional audio representations (e.g., 242) of the input text. In one or more embodiments, the low-dimensional audio representations of the input text may undergo additional processing by a post-processing network (e.g., 250A/252A, 250B/252B, or 252C) that predicts (420) final audio synthesis of the input text. As noted above, speaker embeddings 270 may be used in the process 105, 200, or 300 to cause the synthesized audio to exhibit one or more audio characteristics (e.g., a male voice, a female voice, a particular accent, etc.) associated with a speaker identifier or speaker embedding.
b) Text Preprocessing Embodiments
Text preprocessing can be important for good performance. Feeding raw text (characters with spacing and punctuation) yields acceptable performance on many utterances. However, some utterances may have mispronunciations of rare words, or may yield skipped words and repeated words. In one or more embodiments, these issues may be alleviated by normalizing the input text as follows:
1. Uppercase all characters in the input text.
2. Remove all intermediate punctuation marks.
3. End every utterance with a period or question mark.
4. Replace spaces between words with special separator characters which indicate the duration of pauses inserted by the speaker between words. In one or more embodiments, four different word separators may be used, indicating (i) slurred-together words, (ii) standard pronunciation and space characters, (iii) a short pause between words, and (iv) a long pause between words. For example, the sentence “Either way, you should shoot very slowly,” with a long pause after “way” and a short pause after “shoot”, would be written as “Either way % you should shoot / very slowly %.” with % representing a long pause and / representing a short pause for encoding convenience. In one or more embodiments, the pause durations may be either obtained through manual labeling or estimated by a text-audio aligner such as Gentle. In one or more embodiments, the single-speaker dataset was labeled by hand, and the multi-speaker datasets were annotated using Gentle.
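The normalization steps above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name is hypothetical, the % and / separator characters follow the example in the text, and the pause locations are supplied externally (e.g., from manual labels or a Gentle alignment) as sets of uppercased words that a pause follows.

```python
import re

def normalize_text(text, long_pause_after=frozenset(), short_pause_after=frozenset()):
    # Steps 1-2: uppercase all characters and strip intermediate punctuation.
    words = re.findall(r"[A-Z']+", text.upper())
    pieces = []
    for w in words:
        pieces.append(w)
        # Step 4: insert pause separators between words ('%' long, '/' short).
        if w in long_pause_after:
            pieces.append("%")
        elif w in short_pause_after:
            pieces.append("/")
    # Step 3: end the utterance with a period.
    return " ".join(pieces) + "."
```

Applied to the example sentence with a long pause after “way” and “slowly” and a short pause after “shoot”, this yields an uppercased variant of the separator-annotated string given above.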
c) Joint Representation of Characters and Phonemes Embodiments
Deployed TTS systems may, in one or more embodiments, preferably include a way to modify pronunciations to correct common mistakes (which typically involve, for example, proper nouns, foreign words, and domain-specific jargon). A conventional way to do this is to maintain a dictionary to map words to their phonetic representations.
In one or more embodiments, the model can directly convert characters (including punctuation and spacing) to acoustic features, and hence learns an implicit grapheme-to-phoneme model. This implicit conversion can be difficult to correct when the model makes mistakes. Thus, in addition to character models, in one or more embodiments, phoneme-only models and/or mixed character-and-phoneme models may be trained by explicitly allowing phoneme inputs. In one or more embodiments, these models may be identical to character-only models, except that the input layer of the encoder sometimes receives phoneme and phoneme stress embeddings instead of character embeddings.
In one or more embodiments, a phoneme-only model requires a preprocessing step to convert words to their phoneme representations (e.g., by using an external phoneme dictionary or a separately trained grapheme-to-phoneme model). For embodiments, Carnegie Mellon University Pronouncing Dictionary, CMUDict 0.6b, was used. In one or more embodiments, a mixed character-and-phoneme model requires a similar preprocessing step, except for words not in the phoneme dictionary. These out-of-vocabulary/out-of-dictionary words may be input as characters, allowing the model to use its implicitly learned grapheme-to-phoneme model. While training a mixed character-and-phoneme model, every word is replaced with its phoneme representation with some fixed probability at each training iteration. It was found that this improves pronunciation accuracy and minimizes attention errors, especially when generalizing to utterances longer than those seen during training. More importantly, models that support phoneme representation allow correcting mispronunciations using a phoneme dictionary, a desirable feature of deployed systems.
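The mixed character-and-phoneme training scheme described above can be sketched as follows. The function name, dictionary entry, and replacement probability are illustrative (not from a reference implementation); in-dictionary words are swapped for their phoneme representation with a fixed probability per training iteration, while out-of-dictionary words fall back to characters.

```python
import random

def mixed_representation(words, phoneme_dict, p_phoneme, rng):
    """Replace each in-dictionary word with its phoneme string with
    probability p_phoneme; out-of-vocabulary words stay as characters."""
    out = []
    for w in words:
        if w in phoneme_dict and rng.random() < p_phoneme:
            out.append(phoneme_dict[w])   # phoneme input (e.g., from CMUDict)
        else:
            out.append(w)                 # character fallback
    return out
```

With p_phoneme fixed across iterations, the model sees both representations of each dictionary word during training, which is what allows a phoneme dictionary to correct mispronunciations at deployment.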
In one or more embodiments, the text embedding model may comprise a phoneme-only model and/or a mixed character-and-phoneme model.
d) Convolution Blocks for Sequential Processing Embodiments
By providing a sufficiently large receptive field, stacked convolutional layers can utilize long-term context information in sequences without introducing any sequential dependency in computation. In one or more embodiments, a convolution block is used as a main sequential processing unit to encode hidden representations of text and audio.
FIG. 5 graphically depicts a convolution block comprising a one-dimensional (1D) convolution with a gated linear unit and a residual connection, according to embodiments of the present disclosure. In one or more embodiments, the convolution block 500 comprises a one-dimensional (1D) convolution filter 510, a gated linear unit 515 as a learnable nonlinearity, a residual connection 520 to the input 505, and a scaling factor 525. In the depicted embodiment, the scaling factor is √0.5, although different values may be used. The scaling factor helps ensure that the input variance is preserved early in training. In the depicted embodiment in FIG. 5, c (530) denotes the dimensionality of the input 505, and the convolution output of size 2·c (535) may be split 540 into equal-sized portions: the gate vector 545 and the input vector 550. The gated linear unit provides a linear path for the gradient flow, which alleviates the vanishing gradient issue for stacked convolution blocks while retaining non-linearity. In one or more embodiments, to introduce speaker-dependent control, a speaker-dependent embedding 555 may be added as a bias to the convolution filter output, after a softsign function. In one or more embodiments, a softsign nonlinearity is used because it limits the range of the output while also avoiding the saturation problem that exponential-based nonlinearities sometimes exhibit. In one or more embodiments, the convolution filter weights are initialized such that activations have zero mean and unit variance throughout the entire network.
The convolutions in the architecture may be either non-causal (e.g., in encoder 205/305 and converter 250/350) or causal (e.g., in decoder 230/330). In one or more embodiments, to preserve the sequence length, inputs are padded with k−1 timesteps of zeros on the left for causal convolutions and (k−1)/2 timesteps of zeros on the left and on the right for non-causal convolutions, where k is an odd convolution filter width (in embodiments, odd convolution widths were used to simplify the convolution arithmetic, although even convolution widths and even k values may be used). In one or more embodiments, dropout 560 is applied to the inputs prior to the convolution for regularization.
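A minimal NumPy sketch of the convolution block of FIG. 5 and the padding scheme above is given below. It is an illustration under stated assumptions, not a reference implementation: dropout and the speaker-embedding bias path are omitted, the function name is hypothetical, and the (timesteps, channels) tensor layout is a choice made for clarity.

```python
import numpy as np

def conv_block(x, w, b, causal=True):
    """Gated conv block sketch: x is (T, c), w is (k, c, 2*c), b is (2*c,).
    The 2*c-channel conv output is split into input and gate halves, gated
    with a sigmoid (gated linear unit), added to the residual input, and
    scaled by sqrt(0.5) to help preserve variance."""
    T, c = x.shape
    k = w.shape[0]
    # Preserve sequence length: k-1 zeros on the left if causal,
    # (k-1)//2 zeros on each side otherwise (k is odd).
    pad = ((k - 1, 0), (0, 0)) if causal else (((k - 1) // 2, (k - 1) // 2), (0, 0))
    xp = np.pad(x, pad)
    y = np.stack([xp[t:t + k].reshape(-1) @ w.reshape(k * c, 2 * c) + b
                  for t in range(T)])            # (T, 2*c) conv output
    v, g = y[:, :c], y[:, c:]                    # input vector, gate vector
    out = v * (1.0 / (1.0 + np.exp(-g)))         # gated linear unit
    return (x + out) * np.sqrt(0.5)              # residual + scaling factor
```

With causal=True, the output at timestep t depends only on inputs at timesteps ≤ t, which is what allows the block to be used in the autoregressive decoder.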
e) Encoder Embodiments
In one or more embodiments, the encoder network (e.g., encoder 205/305) begins with an embedding layer, which converts characters or phonemes into trainable vector representations, he. In one or more embodiments, these embeddings he are first projected via a fully-connected layer from the embedding dimension to a target dimensionality. Then, in one or more embodiments, they are processed through a series of convolution blocks to extract time-dependent text information. Lastly, in one or more embodiments, they are projected back to the embedding dimension to create the attention key vectors hk. The attention value vectors may be computed from attention key vectors and text embeddings, hv=√0.5·(hk+he), to jointly consider the local information in he and the long-term context information in hk. The key vectors hk are used by each attention block to compute attention weights, whereas the final context vector is computed as a weighted average over the value vectors hv.
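The value-vector computation described above amounts to a one-line combination of the keys and embeddings; the function name below is illustrative:

```python
import numpy as np

def attention_values(h_k, h_e):
    """Attention value vectors: h_v = sqrt(0.5) * (h_k + h_e), jointly
    carrying local information (h_e) and long-term context (h_k)."""
    return np.sqrt(0.5) * (np.asarray(h_k) + np.asarray(h_e))
```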
f) Decoder Embodiments
In one or more embodiments, the decoder network (e.g., decoder 230/330) generates audio in an autoregressive manner by predicting a group of r future audio frames conditioned on the past audio frames. Since the decoder is autoregressive, in embodiments, it uses causal convolution blocks. In one or more embodiments, a mel-band log-magnitude spectrogram was chosen as the compact low-dimensional audio frame representation, although other representations may be used. It was empirically observed that decoding multiple frames together (i.e., having r>1) yields better audio quality.
In one or more embodiments, the decoder network starts with a plurality of fully-connected layers with rectified linear unit (ReLU) nonlinearities to preprocess input mel-spectrograms (denoted as “PreNet” in FIG. 2). Then, in one or more embodiments, it is followed by a series of decoder blocks, in which a decoder block comprises a causal convolution block and an attention block. These convolution blocks generate the queries used to attend over the encoder's hidden states. Lastly, in one or more embodiments, a fully-connected layer outputs the next group of r audio frames and also a binary “final frame” prediction (indicating whether the last frame of the utterance has been synthesized). In one or more embodiments, dropout is applied before each fully-connected layer prior to the attention blocks, except for the first one.
An L1 loss may be computed using the output mel-spectrograms, and a binary cross-entropy loss may be computed using the final-frame prediction. L1 loss was selected since it yielded the best result empirically. Other losses, such as L2, may suffer from outlier spectral features, which may correspond to non-speech noise.
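The two decoder losses above can be sketched as follows. This is an illustrative sketch, assuming the “final frame” predictor outputs a logit that is passed through a sigmoid; the function and argument names are hypothetical.

```python
import numpy as np

def decoder_loss(pred_mel, true_mel, pred_done_logit, true_done):
    """L1 loss on predicted mel-spectrogram frames plus binary
    cross-entropy on the 'final frame' prediction, as described above."""
    l1 = np.mean(np.abs(pred_mel - true_mel))
    p = 1.0 / (1.0 + np.exp(-pred_done_logit))           # sigmoid
    bce = -(true_done * np.log(p) + (1 - true_done) * np.log(1 - p))
    return l1 + np.mean(bce)
```

Using L1 rather than L2 on the spectrograms, as noted above, reduces the influence of outlier spectral features such as non-speech noise.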
g) Attention Block Embodiments
FIG. 6 graphically depicts an embodiment of an attention block, according to embodiments of the present disclosure. As shown in FIG. 6, in one or more embodiments, positional encodings may be added to both the key 620 and query 638 vectors, with rates of ω key 405 and ω query 410, respectively. Forced monotonicity may be applied at inference by adding a mask of large negative values to the logits. One of two possible attention schemes may be used: softmax or monotonic attention. In one or more embodiments, during training, attention weights are dropped out.
In one or more embodiments, a dot-product attention mechanism (depicted in FIG. 6) is used. In one or more embodiments, the attention mechanism uses a query vector 638 (the hidden states of the decoder) and the per-timestep key vectors 620 from the encoder to compute attention weights, and then outputs a context vector 615 computed as the weighted average of the value vectors 621.
In one or more embodiments, empirical benefits were observed from introducing an inductive bias where the attention follows a monotonic progression in time. Thus, in one or more embodiments, a positional encoding was added to both the key and the query vectors. These positional encodings hp may be chosen as hp(i)=sin(ωs·i/10000^(k/d)) (for even i) or hp(i)=cos(ωs·i/10000^(k/d)) (for odd i), where i is the timestep index, k is the channel index in the positional encoding, d is the total number of channels in the positional encoding, and ωs is the position rate of the encoding. In one or more embodiments, the position rate dictates the average slope of the line in the attention distribution, roughly corresponding to speed of speech. For a single speaker, ωs may be set to one for the query and may be fixed for the key to the ratio of output timesteps to input timesteps (computed across the entire dataset). For multi-speaker datasets, ωs may be computed for both the key and the query from the speaker embedding for each speaker (e.g., depicted in FIG. 6). As sine and cosine functions form an orthonormal basis, this initialization yields an attention distribution in the form of a diagonal line. In one or more embodiments, the fully connected layer weights used to compute hidden attention vectors are initialized to the same values for the query projection and the key projection. Positional encodings may be used in all attention blocks. In one or more embodiments, a context normalization was used. In one or more embodiments, a fully connected layer is applied to the context vector to generate the output of the attention block. Overall, positional encodings improve the convolutional attention mechanism.
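The positional encoding formula can be sketched directly. Note that, as written above, the sin/cos split follows the parity of the timestep index i; the function name and the (timesteps, channels) output layout are illustrative choices.

```python
import numpy as np

def positional_encoding(num_steps, d, omega_s=1.0):
    """h_p(i) = sin(omega_s * i / 10000**(k/d)) for even timesteps i and
    cos(...) for odd i, with channel index k and d total channels."""
    i = np.arange(num_steps)[:, None]      # (T, 1) timestep index
    k = np.arange(d)[None, :]              # (1, d) channel index
    angles = omega_s * i / np.power(10000.0, k / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
```

Increasing omega_s compresses the encoding along the time axis, which is how the position rate shapes the average slope of the attention distribution.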
Production-quality TTS systems have very low tolerance for attention errors. Hence, besides positional encodings, additional strategies were considered to eliminate the cases of repeating or skipping words. One approach which may be used is to substitute the canonical attention mechanism with the monotonic attention mechanism, which approximates hard-monotonic stochastic decoding with soft-monotonic attention by training in expectation. Hard monotonic attention may also be accomplished by sampling; its aim is to improve inference speed by attending only over states selected via sampling, thus avoiding computation over future states. Embodiments herein do not benefit from such a speedup, and poor attention behavior was observed in some cases, e.g., being stuck on the first or last character. Despite the improved monotonicity, this strategy may yield a more diffused attention distribution. In some cases, several characters are attended to at the same time and high-quality speech could not be obtained. This may be attributed to the unnormalized attention coefficients of the soft alignment, potentially resulting in a weak signal from the encoder. Thus, in one or more embodiments, an alternative strategy was used: constraining the attention weights to be monotonic only at inference, preserving the training procedure without any constraints. Instead of computing the softmax over the entire input, the softmax may be computed over a fixed window starting at the last attended-to position and going forward several timesteps. In experiments, a window size of three was used, although other window sizes may be used. In one or more embodiments, the initial position is set to zero and is later computed as the index of the highest attention weight within the current window.
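The inference-time windowed softmax described above can be sketched as follows. The window size of three follows the experiments reported in the text, while the function name and the specific large negative mask value are illustrative.

```python
import numpy as np

def windowed_attention_weights(logits, start, window=3):
    """Compute attention weights with the softmax restricted to a fixed
    window of `window` timesteps beginning at the last attended-to
    position `start`; positions outside the window get a large negative
    mask. Returns the weights and the next start position (index of the
    highest weight within the current window)."""
    masked = np.full_like(logits, -1e9, dtype=float)
    end = min(start + window, len(logits))
    masked[start:end] = logits[start:end]
    e = np.exp(masked - masked.max())       # numerically stable softmax
    w = e / e.sum()
    next_start = start + int(np.argmax(w[start:end]))
    return w, next_start
```

Because the window can only move forward from the last attended-to position, the resulting alignment is monotonic by construction, without changing anything about training.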
2. Non-Autoregressive Architecture Embodiments
FIG. 7 depicts a non-autoregressive model architecture (i.e., a ParaNet embodiment), according to embodiments of the present disclosure. In one or more embodiments, the model architecture 700 may use the same or similar encoder architecture 705 as an autoregressive model—embodiments of which were presented in the prior section. In one or more embodiments, the decoder 730 of ParaNet, conditioned solely on the hidden representation from the encoder, predicts the entire sequence of log-mel spectrograms in a feed-forward manner. As a result, both its training and synthesis may be done in parallel. In one or more embodiments, the encoder 705 provides the key and value 710 as the textual representation, and the first attention block 715 in the decoder receives a positional encoding 720 as the query and is followed by a set of decoder blocks 734, each of which comprises a non-causal convolution block 725 and an attention block 735. FIG. 8 graphically depicts a convolution block, such as a convolution block 725, according to embodiments of the present disclosure. In embodiments, the output of the convolution block comprises a query and an intermediate output, in which the query may be sent to an attention block and the intermediate output may be combined with a context representation from an attention block. FIG. 9 graphically depicts an attention block, such as attention block 735, according to embodiments of the present disclosure. It shall be noted that the convolution block 800 and the attention block 900 are similar to the convolution block 500 in FIG. 5 and the attention block 600 in FIG. 6, with some exceptions: (1) elements related to the speaker embedding have been removed in both blocks (although embodiments may include them), and (2) the embodiment of the attention block in FIG. 9 depicts a different masking embodiment, i.e., an attention masking, which is described in more detail below.
In one or more embodiments, the following major architecture modifications of an autoregressive seq2seq model, such as DV3, may be made to create a non-autoregressive model:
Non-Autoregressive Decoder 730 Embodiments:
Without the autoregressive generative constraint, the decoder can use non-causal convolution blocks to take advantage of future context information and to improve model performance. In addition to log-mel spectrograms, it also predicts log-linear spectrograms with an l1 loss for slightly better performance. In embodiments, the output of the convolution block 725 comprises a query and an intermediate output, which may be split such that the query is sent to an attention block and the intermediate output is combined with a context representation coming from the attention block 735 in order to form a decoder block output. The decoder block output is sent to the next decoder block, or, if it is the last decoder block, may be sent to a fully connected layer to obtain the final output representation (e.g., a linear spectrogram output, mel spectrogram output, etc.).
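The data flow of one such decoder block may be sketched as follows (a hypothetical sketch: the callables, the query/intermediate split, and the sqrt(0.5) residual scaling are illustrative assumptions in the style of Deep Voice 3-type blocks, not a definitive implementation):

```python
def decoder_block(x, keys, values, conv_block, attention_block):
    """One non-causal decoder block: the convolution block emits a
    query and an intermediate output; the attention block turns the
    query (plus keys/values from the encoder) into a context
    representation, which is combined with the intermediate output to
    form the decoder block output."""
    query, intermediate = conv_block(x)
    context = attention_block(query, keys, values)
    # Scaled residual combine (the sqrt(0.5) factor is an assumption).
    return (intermediate + context) * (0.5 ** 0.5)
```

Stacking such blocks, each fed the previous block's output, gives the layer-by-layer attention refinement described in Section C.3.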
No Converter:
Non-autoregressive model embodiments remove the non-causal converter since they already employ a non-causal decoder. Note that a motivation for introducing a non-causal converter in Deep Voice 3 embodiments was to refine the decoder predictions based on the bidirectional context information provided by non-causal convolutions.
3. Attention Mechanism Embodiments
It may be challenging for a non-autoregressive model embodiment to learn the accurate alignment between the input text and the output spectrogram. Previous non-autoregressive decoders rely on an external alignment system or an autoregressive latent variable model. In one or more embodiments, several simple and effective techniques are presented that obtain accurate and stable alignment with multi-step attention. Embodiments of the non-autoregressive decoder herein can iteratively refine the attention alignment between text and mel spectrogram in a layer-by-layer manner, as illustrated in FIG. 10. In one or more embodiments, a non-autoregressive decoder adopts a dot-product attention mechanism and comprises K attention blocks (see FIG. 7), where each attention block uses the per-time-step query vectors from a convolution block and per-time-step key vectors from the encoder to compute the attention weights. The attention block then computes context vectors as the weighted average of the value vectors from the encoder. In one or more embodiments, the decoder starts with an attention block, in which the query vectors are solely positional encodings (see Section C.3.b for additional details). The first attention block then provides the input for the convolution block at the next attention-based layer.
FIG. 10 depicts a ParaNet embodiment iteratively refining the attention alignment in a layer-by-layer manner, according to embodiments of the present disclosure. One can see that the first-layer attention is mostly dominated by the positional encoding prior; the alignment becomes more and more confident in the subsequent layers.
a) Attention Distillation Embodiments
In one or more embodiments, the attention alignments from a pretrained autoregressive model are used to guide the training of the non-autoregressive model. In one or more embodiments, the cross entropy between the attention distributions from the non-autoregressive ParaNet and a pretrained autoregressive model is minimized. The attention weights from the non-autoregressive ParaNet may be denoted as W_{i,j}^{(k)}, where i and j index the timesteps of the encoder and decoder, respectively, and k refers to the k-th attention block within the decoder. Note that the attention weights {W_{i,j}^{(k)}}_{i=1}^{M} form a valid distribution. The attention loss may be computed as the average cross entropy between the student's and teacher's attention distributions:
l_atten = −(1/(K·N)) ∑_{k=1}^{K} ∑_{j=1}^{N} ∑_{i=1}^{M} W_{i,j}^{T} log W_{i,j}^{(k)},  (1)
where W_{i,j}^{T} are the attention weights from the autoregressive teacher, and M and N are the lengths of the encoder and decoder, respectively. In one or more embodiments, the final loss function is a linear combination of l_atten and the l1 losses from spectrogram predictions. In one or more embodiments, the coefficient of l_atten is set to 4, and the other coefficients to 1.
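The attention loss of Eq. (1) may be sketched as follows (an illustrative NumPy sketch; the array layout and the small numerical floor inside the logarithm are assumptions):

```python
import numpy as np

def attention_distillation_loss(student, teacher):
    """Average cross entropy of Eq. (1).
    student: [K, M, N] attention weights from the K decoder attention blocks.
    teacher: [M, N] attention weights from the autoregressive teacher.
    Each distribution over i (encoder timesteps) sums to one."""
    K, M, N = student.shape
    eps = 1e-12  # numerical floor, an implementation convenience
    return -(teacher[None] * np.log(student + eps)).sum() / (K * N)
```

For identical uniform student and teacher distributions over M encoder positions, the loss reduces to the entropy log M.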
b) Positional Encoding Embodiments
In one or more embodiments, a positional encoding, such as in Deep Voice 3 embodiments, may be used at every attention block. The positional encoding may be added to both key and query vectors in the attention block, which forms an inductive bias for monotonic attention. Note that the non-autoregressive model relies on its attention mechanism to decode mel spectrograms from the encoded textual features, without any autoregressive input. This makes the positional encoding even more important in guiding the attention to follow a monotonic progression over time at the beginning of training. The positional encodings are hp(i) = sin(ωs·i/10000^(k/d)) (for even i) or cos(ωs·i/10000^(k/d)) (for odd i), where i is the timestep index, k is the channel index, d is the total number of channels in the positional encoding, and ωs is the position rate, which indicates the average slope of the line in the attention distribution and roughly corresponds to the speed of speech. In one or more embodiments, ωs may be set in the following ways:
    • For the autoregressive model, ωs is set to one for the positional encoding of query. For the key, it is set to the averaged ratio of the time-steps of spectrograms to the time-steps of textual features, which is around 6.3 across the training dataset used herein. Taking into account that a reduction factor of 4 is used to simplify the learning of the attention mechanism, ωs is simply set as 6.3/4 for the key at both training and synthesis.
    • For non-autoregressive ParaNet model embodiments, ωs may also be set to one for the query, while ωs for the key is calculated differently. At training, ωs is set to the ratio of the lengths of the spectrogram and text for each individual training instance, which is also divided by the reduction factor of 4. At synthesis, the length of the output spectrogram and the corresponding ωs should be specified, which controls the speech rate of the generated audio. For comparison, ωs was set to 6.3/4 as in the autoregressive model, and the length of the output spectrogram was set to 6.3/4 times the length of the input text. Such a setup yields an initial attention in the form of a diagonal line and guides the non-autoregressive decoder to refine its attention layer by layer (see FIG. 10).
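The position-rate bookkeeping above may be sketched as follows (illustrative helper names only; the dataset-wide average of 6.3 and the reduction factor of 4 are the values reported above):

```python
def key_position_rate(num_spec_frames, num_text_steps, reduction=4):
    """omega_s for the key: ratio of spectrogram timesteps to text
    timesteps, divided by the reduction factor. Computed per training
    instance for ParaNet; the dataset-wide average (about 6.3, before
    reduction) is used for the autoregressive model."""
    return num_spec_frames / (num_text_steps * reduction)

def synthesis_spectrogram_length(num_text_steps, avg_ratio=6.3, reduction=4):
    """At synthesis the output length must be specified; following the
    comparison setup above, it is avg_ratio/reduction times the input
    text length."""
    return round(num_text_steps * avg_ratio / reduction)
```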
c) Attention Masking Embodiments
The non-autoregressive ParaNet embodiments at synthesis may use a different attention masking than was used in autoregressive DV3 embodiments. In one or more embodiments, for each query from the decoder, instead of computing the softmax over the entire set of encoder key vectors, the softmax is computed over a fixed window centered around the target position and going forward and backward several timesteps (e.g., 3 timesteps). The target position may be calculated as
└i_query × 4/6.3┐,
where i_query is the timestep index of the query vector, and └ ┐ is the rounding operator. It was observed that this strategy reduces serious attention errors, such as repeating or skipping words, and also yields clearer pronunciations, thanks to its more condensed attention distribution. This attention masking may be shared across all attention blocks once it is generated, and it does not prevent the parallel synthesis of the non-autoregressive model.
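This attention masking may be sketched as follows (illustrative; the window half-width of 3 and the target-position formula follow the description above, while the function and argument names are assumptions):

```python
import numpy as np

def masked_attention_weights(scores, i_query, halfwidth=3, avg_ratio=6.3, reduction=4):
    """Softmax over a window centered at the expected encoder position
    round(i_query * reduction / avg_ratio), extending `halfwidth` steps
    backward and forward; everything outside the window gets zero weight."""
    t_enc = scores.shape[-1]
    target = int(round(i_query * reduction / avg_ratio))
    lo = max(target - halfwidth, 0)
    hi = min(target + halfwidth + 1, t_enc)
    w = np.zeros(t_enc)
    e = np.exp(scores[lo:hi] - scores[lo:hi].max())
    w[lo:hi] = e / e.sum()
    return w
```

Because the target position depends only on i_query, the mask for every decoder timestep can be precomputed at once, preserving fully parallel synthesis.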
D. WaveVAE Embodiments
In one or more embodiments, the parallel neural TTS system feeds the predicted mel spectrogram from the non-autoregressive ParaNet model embodiment to the IAF-based parallel vocoder, similar to the ClariNet embodiments referenced above. In this section, an alternative embodiment for training the IAF as a generative model for raw waveform x is presented. In one or more embodiments, the method uses an auto-encoding variational Bayes/variational autoencoder (VAE) framework; thus, it may be referred to for convenience as WaveVAE. In contrast to probability density distillation methods, WaveVAE embodiments may be trained from scratch by jointly optimizing the encoder qϕ(z|x, c) and decoder pθ(x|z, c), where z denotes the latent variables and c is the mel spectrogram conditioner; c is omitted hereafter for concise notation. FIG. 11 depicts a simplified block diagram of a variational autoencoder (VAE) framework, according to embodiments of the present disclosure.
1. Encoder Embodiments
In one or more embodiments, the encoder of WaveVAE qϕ(z|x) is parameterized by a Gaussian autoregressive WaveNet embodiment that maps the ground truth audio x into a latent representation z of the same length. Specifically, the Gaussian WaveNet embodiment models xt given the previous samples x<t as xt˜𝒩(μ(x<t; ϕ), σ(x<t; ϕ)), where the mean μ(x<t; ϕ) and scale σ(x<t; ϕ) are predicted by the WaveNet. The encoder posterior may be constructed as:
qϕ(z|x) = ∏_t qϕ(z_t|x_≤t), where qϕ(z_t|x_≤t) = 𝒩((x_t − μ(x_<t; ϕ))/σ(x_<t; ϕ), ε).  (2)
Note that the mean μ(x<t; ϕ) and scale σ(x<t; ϕ) are applied for “whitening” the posterior distribution. In one or more embodiments, a trainable scalar ε>0 is used to capture the global variation, which eases the optimization process. Given the observed x, qϕ(z|x) admits parallel sampling of the latents z. One may draw a connection between the encoder of WaveVAE and the teacher model of a ClariNet embodiment, as both use a Gaussian WaveNet to guide the training of the inverse autoregressive flow (IAF) for parallel wave generation.
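Sampling from the whitened posterior of Eq. (2) may be sketched as follows (illustrative; mu and sigma stand for the per-timestep WaveNet predictions, which are assumed to be precomputed arrays):

```python
import numpy as np

def encoder_posterior_sample(x, mu, sigma, eps_scale, rng):
    """Sample z ~ q_phi(z|x) in parallel: whiten x with the Gaussian
    WaveNet's per-step predictions mu(x_<t), sigma(x_<t), then add
    noise with the trainable global scale eps_scale (Eq. (2))."""
    mean = (x - mu) / sigma
    return mean + eps_scale * rng.standard_normal(x.shape)
```

All timesteps are whitened independently given the precomputed mu and sigma, which is what makes posterior sampling parallel.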
2. Decoder Embodiments
In one or more embodiments, the decoder pθ(x|z) is an IAF. Let z(0)=z and apply a stack of IAF transformations z(0)→ . . . →z(i)→ . . . →z(n), where each transformation z(i)=f(z(i-1); θ) is defined as:
z^(i) = z^(i−1)·σ^(i) + μ^(i),  (3)
where μ_t^(i) = μ(z_<t^(i−1); θ) and σ_t^(i) = σ(z_<t^(i−1); θ) are shifting and scaling variables modeled by a Gaussian WaveNet. As a result, given z^(0)˜𝒩(μ^(0), σ^(0)) from the Gaussian prior or encoder, the per-step p(z_t^(n)|z_<t^(0)) also follows a Gaussian with scale and mean given by:
σ_tot = ∏_{i=0}^{n} σ^{(i)},  μ_tot = ∑_{i=0}^{n} μ^{(i)} ∏_{j>i}^{n} σ^{(j)}.  (4)
Lastly, x may be set as x = ϵ·σ_tot + μ_tot, where ϵ˜𝒩(0, I). Thus, pθ(x|z) = 𝒩(μ_tot, σ_tot). For the generative process, in one or more embodiments, the standard Gaussian prior p(z) = 𝒩(0, I) was used.
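The collapse of the IAF stack into a single scale and shift, per Eq. (4), may be sketched as follows (illustrative; mus and sigmas stand for the per-flow shift and scale sequences, assumed precomputed):

```python
import numpy as np

def compose_iaf(mus, sigmas):
    """Collapse a stack of affine IAF transforms z <- z*sigma + mu into
    a single scale/shift (Eq. (4)):
    sigma_tot = prod_i sigma^(i),
    mu_tot    = sum_i mu^(i) * prod_{j>i} sigma^(j)."""
    sigma_tot = np.prod(sigmas, axis=0)
    mu_tot = sum(mu * np.prod(sigmas[i + 1:], axis=0)
                 for i, mu in enumerate(mus))
    return sigma_tot, mu_tot
```

Applying the composed transform to a sample z should match applying each flow in sequence, which is the property the test below checks.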
3. VAE Objective Embodiments
In one or more embodiments, the goal is to maximize the evidence lower bound (ELBO) for observed x in VAE:
max_{ϕ,θ} 𝔼_{qϕ(z|x)}[log pθ(x|z)] − KL(qϕ(z|x) ∥ p(z)),  (5)
where the KL divergence can be calculated in closed-form as both qϕ(z|x) and p(z) are Gaussians:
KL(qϕ(z|x) ∥ p(z)) = ∑_t [ log(1/ε) + ½(ε² − 1 + ((x_t − μ(x_<t))/σ(x_<t))²) ].  (6)
The reconstruction term in Eq. (5) is intractable to compute exactly. In one or more embodiments, stochastic optimization may be performed by drawing a sample z from the encoder qϕ(z|x) through reparameterization and evaluating the likelihood log pθ(x|z). To avoid “posterior collapse,” in which the posterior distribution qϕ(z|x) quickly collapses to the white noise prior p(z) at the early stage of training, in one or more embodiments, an annealing strategy for the KL divergence was applied, in which its weight is gradually increased from 0 to 1 via a sigmoid function. In this way, the encoder can encode sufficient information into the latent representations early in training, and the latent representations are then gradually regularized by increasing the weight of the KL divergence.
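The closed-form KL of Eq. (6) and a sigmoid annealing schedule may be sketched as follows (illustrative; the schedule's midpoint and steepness are arbitrary illustrative choices, not values from the embodiments):

```python
import numpy as np

def kl_posterior_prior(x, mu, sigma, eps_scale):
    """Closed-form KL(q_phi(z|x) || N(0, I)) of Eq. (6), summed over
    timesteps; each posterior factor is N((x_t - mu_t)/sigma_t, eps)."""
    m = (x - mu) / sigma
    return np.sum(np.log(1.0 / eps_scale) + 0.5 * (eps_scale ** 2 - 1.0 + m ** 2))

def kl_anneal_weight(step, midpoint=50000, steepness=1e-4):
    """Sigmoid annealing of the KL weight from ~0 to 1 over training."""
    return 1.0 / (1.0 + np.exp(-steepness * (step - midpoint)))
```

Note that the KL vanishes exactly when ε = 1 and the whitened mean is zero, i.e., when the posterior already equals the standard Gaussian prior.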
4. Short-Term Fourier Transform (STFT) Embodiments
Similar to ClariNet embodiments, a short-term Fourier transform (STFT) based loss may be added to improve the quality of synthesized speech. In one or more embodiments, the STFT loss may be defined as the summation of the l2 loss on the magnitudes of the STFT and the l1 loss on the log-magnitudes of the STFT between the output audio and the ground truth audio. In one or more embodiments, a 12.5 millisecond (ms) frame shift, a 50 ms Hanning window, and an FFT size of 2048 were used for the STFT. Two STFT losses were considered in the objective: (i) the STFT loss between the ground truth audio and the audio reconstructed using the encoder qϕ(z|x); and (ii) the STFT loss between the ground truth audio and the audio synthesized using the prior p(z), with the purpose of reducing the gap between reconstruction and synthesis. In one or more embodiments, the final loss is a linear combination of the terms in Eq. (5) and the STFT losses. The corresponding coefficients were simply set to one in the experiments herein.
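The STFT loss may be sketched as follows (illustrative; a simple frame-based magnitude STFT is used rather than an optimized library routine, and the mean reductions and small log floor are assumptions):

```python
import numpy as np

def stft_mag(x, fft_size=2048, hop=300, win_len=1200):
    """Magnitude STFT with a Hanning window. At 24 kHz audio, a 12.5 ms
    frame shift is 300 samples and a 50 ms window is 1200 samples."""
    window = np.hanning(win_len)
    frames = [x[s:s + win_len] * window
              for s in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), n=fft_size, axis=-1))

def stft_loss(x, y, log_floor=1e-7):
    """l2 loss on STFT magnitudes plus l1 loss on log-magnitudes."""
    mx, my = stft_mag(x), stft_mag(y)
    return (np.mean((mx - my) ** 2)
            + np.mean(np.abs(np.log(mx + log_floor) - np.log(my + log_floor))))
```

The same function can be evaluated on reconstructed audio (encoder latents) and on audio synthesized from the prior, giving the two loss terms above.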
E. Example Implementation Methodology
FIG. 12 depicts a general method for using a ParaNet embodiment for synthesizing a speech representation from input text, according to embodiments of the present disclosure. As illustrated in FIG. 12, a computer-implemented method for synthesizing speech from an input text comprises encoding (1205) the input text into hidden representations comprising a set of key representations and a set of value representations using the encoder, which comprises one or more convolution layers. In one or more embodiments, the hidden representations are used (1210) by a non-autoregressive decoder to obtain a synthesized representation, which may be a linear spectrogram output, a mel spectrogram output, or a waveform. In one or more embodiments, the non-autoregressive decoder comprises an attention block that uses positional encoding and the set of key representations to generate a context representation for each time step, which context representations are supplied as inputs to a first decoder block in a plurality of decoder blocks. In one or more embodiments, the positional encoding is used by the attention block to affect attention alignment weighting.
In one or more embodiments, a decoder block comprises: a non-causal convolution block, which receives as an input the context representation if it is the first decoder block in the plurality of decoder blocks, or receives as an input a decoder block output from a prior decoder block if it is the second or a subsequent decoder block in the plurality of decoder blocks, and outputs a decoder block output comprising a query and an intermediate output; and an attention block, which uses the query output from the non-causal convolution block and positional encoding to compute a context representation that is combined with the intermediate output to create a decoder block output for the decoder block.
In one or more embodiments, the set of decoder block outputs are used (1215) to generate a set of audio representation frames representing the input text. The set of audio representation frames may be linear spectrograms, mel spectrograms, or a waveform. In embodiments in which the output is a waveform, obtaining the waveform may involve using a vocoder. In one or more embodiments, the TTS system may comprise a vocoder, such as an IAF-based parallel vocoder, that converts the set of audio representation frames into a signal representing synthesized speech of the input text. As noted above, the IAF-based parallel vocoder may be a WaveVAE embodiment that is trained without distillation. For example, in one or more embodiments, the vocoder decoder may be trained without distillation by using the encoder of the vocoder to guide training of the vocoder decoder. A benefit of such a methodology is that the encoder can be jointly trained with the vocoder decoder.
F. Experiments
It shall be noted that these experiments and results are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of the current patent document.
In this section, several experiments that evaluate embodiments are presented. In the experiments, an internal English speech dataset containing about 20 hours of speech data from a female speaker, recorded with a sampling rate of 48 kHz, was used. The audio was downsampled to 24 kHz.
1. IAF-Based Waveform Synthesis
First, two training method embodiments for IAF-based waveform synthesis were compared: a ClariNet embodiment and a WaveVAE embodiment. The same IAF architecture as described in the ClariNet patent application, which was incorporated by reference above, was used. It comprises four stacked Gaussian IAF blocks, which are parameterized by [10, 10, 10, 30]-layer WaveNets, respectively, with 64 residual and skip channels and a filter size of 3 in the dilated convolutions. The IAF is conditioned on log-mel spectrograms with two layers of transposed 2-D convolution, as in the ClariNet embodiment. The same teacher-student setup as in ClariNet was used, and a 20-layer Gaussian autoregressive WaveNet was trained as the teacher model. For the encoder in WaveVAE, a 20-layer Gaussian WaveNet conditioned on log-mel spectrograms was used. Note that, in the tested embodiments, both the encoder and decoder of WaveVAE shared the same conditioner network. The Adam optimizer was used for both methods, with 1000K training steps. The learning rate was set to 0.001 in the beginning and annealed by half every 200K steps.
The crowdMOS toolkit (developed by F. Ribeiro, D. Florêncio, C. Zhang, and M. Seltzer in “CrowdMOS: An approach for crowdsourcing mean opinion score studies,” in ICASSP, 2011) was used for subjective Mean Opinion Score (MOS) evaluation, where batches of samples from these models were presented to workers on Mechanical Turk. The MOS results are presented in Table 2. Although the WaveVAE (prior) model performs worse than ClariNet at synthesis, it is trained from scratch and does not require any pre-training. In one or more embodiments, further improvement of WaveVAE may be achieved by introducing a learned prior network, which may reduce the quality gap between the speech reconstructed with the encoder and the speech synthesized from the prior.
TABLE 2
Mean Opinion Score (MOS) ratings with 95% confidence intervals
for waveform synthesis. We use the same Gaussian IAF architecture
for ClariNet and WaveVAE. Note that WaveVAE (recons.) refers to
reconstructed speech by using latents from the encoder.
Neural Vocoder Subjective 5-scale MOS
WaveNet 4.40 ± 0.21
ClariNet 4.21 ± 0.18
WaveVAE (recons.) 4.37 ± 0.23
WaveVAE (prior) 4.02 ± 0.24
Ground-truth (24 kHz) 4.51 ± 0.16
2. Text-to-Speech
An embodiment of the text-to-spectrogram ParaNet model and the parallel neural TTS system with IAF-based vocoders, including ClariNet and WaveVAE, were evaluated. The mixed representation of characters and phonemes introduced in the DV3 patent application was used. All hyperparameters of autoregressive and non-autoregressive ParaNet embodiments are shown in Table 3, below. It was found that larger kernel width and deeper layers generally helped to improve the speech quality. The tested non-autoregressive model was ˜2.57 times larger than the autoregressive model in terms of the number of parameters, but it obtained significant speedup at synthesis.
TABLE 3
Hyperparameters of the autoregressive seq2seq
model and non-autoregressive seq2seq model
embodiments tested in the experiments.
Hyperparameter    Autoregressive Model    Non-autoregressive Model
FFT Size 2048 2048
FFT Window Size/Shift 1200/300  1200/300
Audio Sample Rate 24000 24000
Reduction Factor r 4 4
Mel Bands 80 80
Character Embedding Dim. 256 256
Encoder Layers/Conv. Width/Channels 7/5/64  7/9/64
Decoder PreNet Affine Size 128,256 N/A
Decoder Layers/Conv. Width 4/5 17/7
Attention Hidden Size 128 128
Position Weight/Initial Rate 1.0/6.3  1.0/6.3
PostNet Layers/Conv. Width/Channels 5/5/256 N/A
Dropout Keep Probability 0.95 1.0
ADAM Learning Rate 0.001 0.001
Batch Size 16 16
Max Gradient Norm 100 100
Gradient Clipping Max. Value 5.0 5.0
Total Number of Parameters 6.85M 17.61M
a) Speedup at Synthesis
A non-autoregressive ParaNet embodiment was compared with an autoregressive DV3 embodiment in terms of inference latency. A custom sentence test set was constructed, and inference was run 50 times on each of the sentences in the test set (batch size was set to 1). The average inference latencies, over the 50 runs and the sentence test set, were 0.024 and 1.12 seconds on an NVIDIA GeForce GTX 1080 Ti (produced by Nvidia of Santa Clara, Calif.) for the non-autoregressive and autoregressive model embodiments, respectively. Hence, the ParaNet embodiment yielded about a 46.7 times speed-up compared to its autoregressive counterpart at synthesis.
b) Attention Error Analysis
In autoregressive models, there tends to be a noticeable discrepancy between the teacher-forced training and autoregressive inference, which can yield accumulated errors along the generated sequence at synthesis. In neural TTS systems, this discrepancy leads to miserable attention errors at autoregressive inference, including (i) repeated words, (ii) mispronunciations, and (iii) skipped words, which can be a critical problem for online deployment of attention-based neural TTS systems. An attention error analysis was performed for the non-autoregressive ParaNet model embodiment on a 100-sentence test set, which includes particularly challenging cases from deployed TTS systems (e.g., dates, acronyms, URLs, repeated words, proper nouns, foreign words, etc.).
As illustrated in Table 4, it was found that the non-autoregressive ParaNet embodiment has far fewer attention errors than its autoregressive counterpart at synthesis (12 vs. 37). Although the ParaNet embodiment distills the (teacher-forced) attentions from an autoregressive model, it takes only textual inputs at both training and synthesis and does not suffer from the same discrepancy as an autoregressive model. Previously, attention masking was applied to enforce monotonic attention and reduce attention errors, and it was demonstrated to be effective in Deep Voice 3 embodiments. It was found that the tested non-autoregressive ParaNet embodiment still had fewer attention errors than the tested autoregressive DV3 embodiment (6 vs. 8 in Table 4) when both of them used the attention masking techniques.
TABLE 4
Attention error counts for text-to-spectrogram models on the
100-sentence test set. One or more mispronunciations, skips,
and repeats count as a single mistake per utterance. All models
use Griffin-Lim as vocoder for convenience. The non-autoregressive
ParaNet with attention mask embodiment obtained the fewest
attention errors in total at synthesis.
Model Embodiment    Attention mask at inference    Repeat    Mispronounce    Skip    Total
Deep Voice 3 No 12 10 15 37
Deep Voice 3 Yes 1 4 3 8
ParaNet No 1 4 7 12
ParaNet Yes 2 4 0 6
c) MOS Evaluation
The MOS evaluation results of the TTS system embodiments are reported in Table 5. Experiments were conducted by pairing autoregressive and non-autoregressive text-to-spectrogram models with different neural vocoders. The WaveNet vocoders were trained on predicted mel spectrograms from the DV3 and non-autoregressive model embodiments, respectively, for better quality. Both the ClariNet vocoder embodiment and the WaveVAE embodiment were trained on ground-truth mel spectrograms for stable optimization. At synthesis, all of them were conditioned on the predicted mel spectrograms from the text-to-spectrogram model embodiment. Note that the non-autoregressive ParaNet embodiment can provide comparable quality of speech as the autoregressive DV3 with WaveNet vocoder embodiment. When a parallel neural vocoder was applied, the quality of speech degenerated, partly because of the mismatch between the ground truth mel spectrograms used for training and the predicted mel spectrograms used for synthesis. Further improvement may be achieved by successfully training IAF-based neural vocoders on predicted mel spectrograms.
TABLE 5
Mean Opinion Score (MOS) ratings with 95% confidence intervals
for comparison. The crowdMOS toolkit, as in Table 2, was used.
Neural TTS System Embodiments MOS score
Deep Voice 3 + WaveNet (predicted Mel) 4.09 ± 0.26
Deep Voice 3 + ClariNet (true Mel) 3.93 ± 0.27
Deep Voice 3 + WaveVAE (true Mel) 3.70 ± 0.29
ParaNet + WaveNet (predicted Mel) 4.01 ± 0.24
ParaNet + ClariNet (true Mel) 3.52 ± 0.28
ParaNet + WaveVAE (true Mel) 3.25 ± 0.34
G. Some Conclusions
Presented herein were embodiments of a fully parallel neural text-to-speech system comprising a non-autoregressive text-to-spectrogram model and IAF-based parallel vocoders. Embodiments of the novel non-autoregressive model (which may be generally referred to for convenience as ParaNet) have fewer attention errors. A test embodiment obtained a 46.7 times speed-up over its autoregressive counterpart at synthesis, with only minor degeneration of speech quality. In addition, embodiments of an alternative vocoder (which may be generally referred to as WaveVAE) were developed that train an inverse autoregressive flow (IAF) for parallel waveform synthesis. WaveVAE embodiments avoid the need for distillation from a separately trained autoregressive WaveNet and can be trained from scratch.
H. Computing System Embodiments
In one or more embodiments, aspects of the present patent document may be directed to, may include, or may be implemented on one or more information handling systems/computing systems. A computing system may include any instrumentality or aggregate of instrumentalities operable to compute, calculate, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data. For example, a computing system may be or may include a personal computer (e.g., laptop), tablet computer, phablet, personal digital assistant (PDA), smart phone, smart watch, smart package, server (e.g., blade server or rack server), a network storage device, camera, or any other suitable device and may vary in size, shape, performance, functionality, and price. The computing system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, ROM, and/or other types of memory. Additional components of the computing system may include one or more disk drives, one or more network ports for communicating with external devices as well as various input and output (I/O) devices, such as a keyboard, a mouse, touchscreen and/or a video display. The computing system may also include one or more buses operable to transmit communications between the various hardware components.
FIG. 13 depicts a simplified block diagram of a computing device/information handling system (or computing system) according to embodiments of the present disclosure. It will be understood that the functionalities shown for system 1300 may operate to support various embodiments of a computing system—although it shall be understood that a computing system may be differently configured and include different components, including having fewer or more components as depicted in FIG. 13.
As illustrated in FIG. 13, the computing system 1300 includes one or more central processing units (CPU) 1301 that provides computing resources and controls the computer. CPU 1301 may be implemented with a microprocessor or the like, and may also include one or more graphics processing units (GPU) 1319 and/or a floating-point coprocessor for mathematical computations. System 1300 may also include a system memory 1302, which may be in the form of random-access memory (RAM), read-only memory (ROM), or both.
A number of controllers and peripheral devices may also be provided, as shown in FIG. 13. An input controller 1303 represents an interface to various input device(s) 1304, such as a keyboard, mouse, touchscreen, and/or stylus. The computing system 1300 may also include a storage controller 1307 for interfacing with one or more storage devices 1308 each of which includes a storage medium such as magnetic tape or disk, or an optical medium that might be used to record programs of instructions for operating systems, utilities, and applications, which may include embodiments of programs that implement various aspects of the present disclosure. Storage device(s) 1308 may also be used to store processed data or data to be processed in accordance with the disclosure. The system 1300 may also include a display controller 1309 for providing an interface to a display device 1311, which may be a cathode ray tube (CRT), a thin film transistor (TFT) display, organic light-emitting diode, electroluminescent panel, plasma panel, or other type of display. The computing system 1300 may also include one or more peripheral controllers or interfaces 1305 for one or more peripherals 1306. Examples of peripherals may include one or more printers, scanners, input devices, output devices, sensors, and the like. A communications controller 1314 may interface with one or more communication devices 1315, which enables the system 1300 to connect to remote devices through any of a variety of networks including the Internet, a cloud resource (e.g., an Ethernet cloud, a Fiber Channel over Ethernet (FCoE)/Data Center Bridging (DCB) cloud, etc.), a local area network (LAN), a wide area network (WAN), a storage area network (SAN) or through any suitable electromagnetic carrier signals including infrared signals.
In the illustrated system, all major system components may connect to a bus 1316, which may represent more than one physical bus. However, various system components may or may not be in physical proximity to one another. For example, input data and/or output data may be remotely transmitted from one physical location to another. In addition, programs that implement various aspects of the disclosure may be accessed from a remote location (e.g., a server) over a network. Such data and/or programs may be conveyed through any of a variety of machine-readable medium including, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices.
Aspects of the present disclosure may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media may include volatile and/or non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations. Similarly, the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.
It shall be noted that embodiments of the present disclosure may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. Embodiments of the present disclosure may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.
One skilled in the art will recognize that no computing system or programming language is critical to the practice of the present disclosure. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into sub-modules or combined together.
It will be appreciated by those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present disclosure. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure. It shall also be noted that elements of any claims may be arranged differently, including having multiple dependencies, configurations, and combinations.

Claims (20)

What is claimed is:
1. A computer-implemented method for synthesizing speech from an input text using a text-to-speech (TTS) system comprising an encoder and a non-autoregressive decoder, the method comprising:
encoding the input text into hidden representations comprising a set of key representations and a set of value representations using the encoder, which comprises one or more convolution layers, of the TTS system;
decoding the hidden representations using the non-autoregressive decoder of the TTS system, the non-autoregressive decoder comprising:
an attention block that uses positional encoding and the set of key representations to generate a context representation for each time step, which context representations are supplied as inputs to a first decoder block in a plurality of decoder blocks; and
the plurality of decoder blocks, in which a decoder block comprises:
a non-causal convolution block, which receives as an input the context representation if it is the first decoder block in the plurality of decoder blocks and receives as an input a decoder block output from a prior decoder block if it is the second or a subsequent decoder block in the plurality of decoder blocks, and outputs a decoder block output comprising a query and an intermediary output; and
an attention block, which uses the query output from the non-causal convolution block and positional encoding to compute a context representation that is combined with the intermediary output to create a decoder block output for the decoder block; and
using a set of decoder block outputs to generate a set of audio representation frames representing the input text.
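The non-autoregressive decoder structure recited in claim 1 can be illustrated with a small sketch. This is not the patent's implementation: the convolution is a toy centered (non-causal) average, all representations are scalars, and every name and dimension is invented. What it shows is the claimed dataflow: the first block consumes attention-derived context representations, each later block consumes the previous block's output, each block emits a (query, intermediary) pair, and the attention context is added back onto the intermediary output; every time step is processed at once, so no step conditions on previously generated audio.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    # context = softmax(query . key) weighted average of values (scalars here)
    w = softmax([query * k for k in keys])
    return sum(wi * v for wi, v in zip(w, values))

def conv_block(inputs):
    # stand-in for a non-causal convolution: a 3-tap centered average,
    # emitting a (query, intermediary) pair per time step
    padded = [inputs[0]] + inputs + [inputs[-1]]
    mixed = [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
             for i in range(len(inputs))]
    return [(m, m) for m in mixed]

def decoder(contexts, keys, values, num_blocks=2):
    x = contexts  # first block input: contexts from the initial attention block
    for _ in range(num_blocks):
        pairs = conv_block(x)
        # block output = intermediary output + attention context, per time step
        x = [inter + attend(q, keys, values) for q, inter in pairs]
    return x      # one audio-frame representation per decoding time step

keys = [0.1, 0.4, 0.2]    # toy encoder key representations
values = [1.0, 2.0, 3.0]  # toy encoder value representations
contexts = [attend(q, keys, values) for q in [0.0, 0.5, 1.0]]
frames = decoder(contexts, keys, values)
print(len(frames))        # one output per decoding time step
```

Because the loop over blocks never feeds an output frame back in as an input at a later time step, the whole computation could be batched across time steps, which is the point of the non-autoregressive design.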
2. The computer-implemented method of claim 1 wherein the attention blocks of the plurality of decoder blocks compute a context representation by performing the steps comprising:
using a per-time-step query from the non-causal convolution block of the decoder block and a per-time-step key representation from the encoder to compute attention weights; and
obtaining the context representation as a weighted average of one or more value representations from the encoder.
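The two steps of claim 2 amount to dot-product attention. The following is a hedged sketch, not text from the patent: scores are dot products between the decoder query and each per-time-step encoder key, attention weights are their softmax, and the context representation is the weighted average of the encoder value representations. All variable names are illustrative.

```python
import math

def attention_context(query, keys, values):
    # attention weights from query/key dot products
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # context representation: weighted average of the value representations
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(dim)]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]    # per-time-step encoder keys
values = [[10.0, 0.0], [0.0, 10.0]]
ctx = attention_context(query, keys, values)
print(ctx)  # leans toward the first value, since the query matches the first key
```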
3. The computer-implemented method of claim 1 wherein the attention blocks of the plurality of decoder blocks comprise an attention masking layer that performs the step comprising:
for a query from the non-causal convolution block, computing a softmax of attention weights over a fixed window centered around a target position, in which the target position is calculated as related to a time-step index of the query.
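One way to picture the attention masking of claim 3 is the sketch below. It is an assumption-laden illustration, not the patent's formula: the target position for the query at decoder step `i` is derived here by scaling `i` by the encoder/decoder length ratio, and the softmax is taken only over a fixed window around that position, with everything outside the window forced to zero weight.

```python
import math

def masked_weights(step, scores, num_steps, window=3):
    # target position tied to the query's time-step index (illustrative rule)
    n = len(scores)
    target = round(step * (n - 1) / max(num_steps - 1, 1))
    half = window // 2
    lo, hi = max(0, target - half), min(n, target + half + 1)
    # softmax restricted to the fixed window; masked positions get zero weight
    exps = [math.exp(s) if lo <= j < hi else 0.0
            for j, s in enumerate(scores)]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.2, 0.1, 0.4, 0.3, 0.0]  # raw query-key scores over 5 encoder positions
w = masked_weights(step=0, scores=scores, num_steps=5)
print(w)  # only positions near the start of the encoder sequence are nonzero
```

Such a window encourages the roughly monotonic text-to-audio alignment that speech synthesis needs.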
4. The computer-implemented method of claim 1 wherein the positional encoding is used by the attention block to affect attention alignment weighting.
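For the positional encoding of claim 4, a common construction (offered here as a hedged example; the patent's exact formulation may differ) is the sinusoidal encoding in which even-indexed channels use sine and odd-indexed channels use cosine at geometrically spaced frequencies, giving the attention block a position signal it can use to bias alignment weighting.

```python
import math

def positional_encoding(position, dim):
    # sin on even channels, cos on odd channels, frequencies spaced by 10000^(2i/dim)
    enc = []
    for i in range(dim):
        angle = position / (10000.0 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

pe0 = positional_encoding(0, 4)
pe5 = positional_encoding(5, 4)
print(pe0)  # position 0 encodes as alternating 0.0 / 1.0
```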
5. The computer-implemented method of claim 1 wherein the TTS system further comprises a vocoder and the method further comprises:
using the vocoder to convert the set of audio representation frames into a signal representing synthesized speech of the input text.
6. The computer-implemented method of claim 5 wherein the vocoder comprises a vocoder decoder comprising an inverse autoregressive flow (IAF) that was trained without distillation.
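Why an inverse autoregressive flow (IAF) vocoder, as in claim 6, can generate samples in parallel can be sketched as follows. This toy is not the patent's vocoder: the shift and scale "networks" are invented stand-ins. The structural point is that each output sample is an affine transform of its noise sample, with the shift and scale depending only on *earlier noise*; since all the noise is drawn up front, every time step could be transformed simultaneously.

```python
import math

def iaf_sample(z):
    out = []
    for t, zt in enumerate(z):
        prev = z[:t]  # conditioning on earlier noise only, never on earlier outputs
        shift = sum(prev) / len(prev) if prev else 0.0  # toy stand-in network
        log_scale = -0.1 * len(prev)                    # toy stand-in network
        out.append(zt * math.exp(log_scale) + shift)
    return out

z = [0.5, -0.2, 0.1, 0.3]  # noise drawn once, up front
x = iaf_sample(z)
print(len(x))              # one output sample per noise sample
```

The loop here is sequential only for clarity; because no `out[t]` feeds back into a later step, the steps are independent given `z`.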
7. The computer-implemented method of claim 6 wherein the step of training the vocoder decoder without distillation comprises:
using an encoder of the vocoder to guide training of the vocoder decoder, wherein the encoder is jointly trained with the vocoder decoder.
8. The computer-implemented method of claim 5 further comprising:
implementing the TTS system fully in parallel.
9. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes steps to be performed comprising:
encoding an input text into hidden representations comprising a set of key representations and a set of value representations using an encoder, which comprises one or more convolution layers, of a text-to-speech (TTS) system;
decoding the hidden representations using a non-autoregressive decoder of the TTS system, the non-autoregressive decoder comprising:
an attention block that uses positional encoding and the set of key representations to generate a context representation for each time step, which context representations are supplied as inputs to a first decoder block in a plurality of decoder blocks; and
the plurality of decoder blocks, in which a decoder block comprises:
a non-causal convolution block, which receives as an input the context representation if it is the first decoder block in the plurality of decoder blocks and receives as an input a decoder block output from a prior decoder block if it is the second or a subsequent decoder block in the plurality of decoder blocks, and outputs a decoder block output comprising a query and an intermediary output; and
an attention block, which uses the query output from the non-causal convolution block and positional encoding to compute a context representation that is combined with the intermediary output to create a decoder block output for the decoder block; and
using a set of decoder block outputs to generate a set of audio representation frames representing the input text.
10. The non-transitory computer-readable medium or media of claim 9 wherein the attention blocks of the plurality of decoder blocks compute a context representation by performing the steps comprising:
using a per-time-step query from the non-causal convolution block of the decoder block and a per-time-step key representation from the encoder to compute attention weights; and
obtaining the context representation as a weighted average of one or more value representations from the encoder.
11. The non-transitory computer-readable medium or media of claim 9 wherein the attention blocks of the plurality of decoder blocks comprise an attention masking layer that performs the step comprising:
for a query from the non-causal convolution block, computing a softmax of attention weights over a fixed window centered around a target position, in which the target position is calculated as related to a time-step index of the query.
12. The non-transitory computer-readable medium or media of claim 9 further comprising one or more sequences of instructions which, when executed by one or more processors, causes steps to be performed comprising:
using a vocoder to convert the set of audio representation frames into a signal representing synthesized speech of the input text.
13. The non-transitory computer-readable medium or media of claim 12 wherein the vocoder comprises a vocoder decoder comprising an inverse autoregressive flow (IAF) that was trained without distillation.
14. The non-transitory computer-readable medium or media of claim 13 wherein the step of training the vocoder decoder without distillation comprises:
using an encoder of the vocoder to guide training of the vocoder decoder, wherein the encoder is jointly trained with the vocoder decoder.
15. A system comprising:
one or more processors; and
a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by one or more processors, causes steps to be performed comprising:
encoding an input text into hidden representations comprising a set of key representations and a set of value representations using an encoder, which comprises one or more convolution layers, of a text-to-speech (TTS) system;
decoding the hidden representations using a non-autoregressive decoder of the TTS system, the non-autoregressive decoder comprising:
an attention block that uses positional encoding and the set of key representations to generate a context representation for each time step, which context representations are supplied as inputs to a first decoder block in a plurality of decoder blocks; and
the plurality of decoder blocks, in which a decoder block comprises:
a non-causal convolution block, which receives as an input the context representation if it is the first decoder block in the plurality of decoder blocks and receives as an input a decoder block output from a prior decoder block if it is the second or a subsequent decoder block in the plurality of decoder blocks, and outputs a decoder block output comprising a query and an intermediary output; and
an attention block, which uses the query output from the non-causal convolution block and positional encoding to compute a context representation that is combined with the intermediary output to create a decoder block output for the decoder block; and
using a set of decoder block outputs to generate a set of audio representation frames representing the input text.
16. The system of claim 15 wherein the attention blocks of the plurality of decoder blocks compute a context representation by performing the steps comprising:
using a per-time-step query from the non-causal convolution block of the decoder block and a per-time-step key representation from the encoder to compute attention weights; and
obtaining the context representation as a weighted average of one or more value representations from the encoder.
17. The system of claim 15 wherein the attention blocks of the plurality of decoder blocks comprise an attention masking layer that performs the step comprising:
for a query from the non-causal convolution block, computing a softmax of attention weights over a fixed window centered around a target position, in which the target position is calculated as related to a time-step index of the query.
18. The system of claim 15 wherein the TTS system further comprises a vocoder and wherein the non-transitory computer-readable medium or media further comprises one or more sequences of instructions which, when executed by one or more processors, causes steps to be performed comprising:
using a vocoder to convert the set of audio representation frames into a signal representing synthesized speech of the input text.
19. The system of claim 18 wherein the vocoder comprises a vocoder decoder comprising an inverse autoregressive flow (IAF) that was trained without distillation by using an encoder of the vocoder to guide training of the vocoder decoder, wherein the encoder is jointly trained with the vocoder decoder.
20. The system of claim 18 further comprising:
executing the TTS system fully in parallel.

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/654,955 US11017761B2 (en) 2017-10-19 2019-10-16 Parallel neural text-to-speech
CN202010518795.0A CN112669809A (en) 2019-10-16 2020-06-09 Parallel neural text to speech conversion

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201762574382P 2017-10-19 2017-10-19
US16/058,265 US10796686B2 (en) 2017-10-19 2018-08-08 Systems and methods for neural text-to-speech using convolutional sequence learning
US16/277,919 US10872596B2 (en) 2017-10-19 2019-02-15 Systems and methods for parallel wave generation in end-to-end text-to-speech
US16/654,955 US11017761B2 (en) 2017-10-19 2019-10-16 Parallel neural text-to-speech

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/277,919 Continuation-In-Part US10872596B2 (en) 2017-10-19 2019-02-15 Systems and methods for parallel wave generation in end-to-end text-to-speech

Publications (2)

Publication Number Publication Date
US20200066253A1 US20200066253A1 (en) 2020-02-27
US11017761B2 true US11017761B2 (en) 2021-05-25

Family

ID=69586394

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/654,955 Active 2038-08-29 US11017761B2 (en) 2017-10-19 2019-10-16 Parallel neural text-to-speech

Country Status (1)

Country Link
US (1) US11017761B2 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210089924A1 (en) * 2019-09-24 2021-03-25 Nec Laboratories America, Inc Learning weighted-average neighbor embeddings
US20210225362A1 (en) * 2020-01-22 2021-07-22 Google Llc Attention-Based Joint Acoustic and Text On-Device End-to-End Model
US20210312326A1 (en) * 2020-04-02 2021-10-07 South University Of Science And Technology Of China Travel prediction method and apparatus, device, and storage medium
US20210383789A1 (en) * 2020-06-05 2021-12-09 Deepmind Technologies Limited Generating audio data using unaligned text inputs with an adversarial network
US11222620B2 (en) * 2020-05-07 2022-01-11 Google Llc Speech recognition using unspoken text and speech synthesis
US11295721B2 (en) * 2019-11-15 2022-04-05 Electronic Arts Inc. Generating expressive speech audio from text data
US20220108680A1 (en) * 2020-10-02 2022-04-07 Google Llc Text-to-speech using duration prediction
US20230121683A1 (en) * 2021-06-15 2023-04-20 Nanjing Silicon Intelligence Technology Co., Ltd. Text output method and system, storage medium, and electronic device
US20230215420A1 (en) * 2020-07-21 2023-07-06 Ai Speech Co., Ltd. Speech synthesis method and system
RU2803488C2 (en) * 2021-06-03 2023-09-14 Общество С Ограниченной Ответственностью «Яндекс» Method and server for waveform generation
US20230317059A1 (en) * 2022-03-20 2023-10-05 Google Llc Alignment Prediction to Inject Text into Automatic Speech Recognition Training
US12020726B2 (en) * 2022-08-31 2024-06-25 Kabushiki Kaisha Toshiba Magnetic reproduction processing device, magnetic recording/reproducing device, and magnetic reproducing method
US12079230B1 (en) * 2024-01-31 2024-09-03 Clarify Health Solutions, Inc. Computer network architecture and method for predictive analysis using lookup tables as prediction models
US12175995B2 (en) 2021-06-03 2024-12-24 Y.E. Hub Armenia LLC Method and a server for generating a waveform

Families Citing this family (39)

Publication number Priority date Publication date Assignee Title
CN109754778B (en) * 2019-01-17 2023-05-30 平安科技(深圳)有限公司 Text speech synthesis method and device and computer equipment
CA3154698A1 (en) * 2019-09-25 2021-04-01 Deepmind Technologies Limited High fidelity speech synthesis with adversarial networks
WO2021127978A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, computer device and storage medium
US20210366461A1 (en) * 2020-05-20 2021-11-25 Resemble.ai Generating speech signals using both neural network-based vocoding and generative adversarial training
CN112837669B (en) * 2020-05-21 2023-10-24 腾讯科技(深圳)有限公司 Speech synthesis method, device and server
CN111724809A (en) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 Vocoder implementation method and device based on variational self-encoder
CN113903324A (en) * 2020-06-18 2022-01-07 新加坡依图有限责任公司(私有) Method, device, equipment and machine readable medium for text-to-speech
CN112037758A (en) * 2020-06-19 2020-12-04 四川长虹电器股份有限公司 Voice synthesis method and device
US11295725B2 (en) * 2020-07-09 2022-04-05 Google Llc Self-training WaveNet for text-to-speech
CN112036231B (en) * 2020-07-10 2022-10-21 武汉大学 A detection and recognition method of lane lines and road signs based on vehicle video
CN111916049B (en) * 2020-07-15 2021-02-09 北京声智科技有限公司 Voice synthesis method and device
JP7708793B2 (en) 2020-09-02 2025-07-15 グーグル エルエルシー End-to-end speech waveform generation by estimating the gradient of data density
CN112071325B (en) * 2020-09-04 2023-09-05 中山大学 A Many-to-Many Speech Conversion Method Based on Dual Voiceprint Feature Vectors and Sequence-to-Sequence Modeling
EP4229629B1 (en) 2020-10-15 2024-11-27 Dolby International AB Real-time packet loss concealment using deep generative networks
CN112270917B (en) * 2020-10-20 2024-06-04 网易(杭州)网络有限公司 Speech synthesis method, device, electronic equipment and readable storage medium
KR102804442B1 (en) * 2020-10-21 2025-05-09 구글 엘엘씨 Parallel Tacotron: Non-autoregressive and controllable TTS
CN112562655A (en) * 2020-12-03 2021-03-26 北京猎户星空科技有限公司 Residual error network training and speech synthesis method, device, equipment and medium
CN112652293A (en) * 2020-12-24 2021-04-13 上海优扬新媒信息技术有限公司 Speech synthesis model training and speech synthesis method, device and speech synthesizer
CN112992129B (en) * 2021-03-08 2022-09-30 中国科学技术大学 A method for preserving the monotonicity of the attention mechanism in speech recognition tasks
JP7709545B2 (en) * 2021-03-22 2025-07-16 グーグル エルエルシー Unsupervised Parallel Tacotron: Non-autoregressive and Controllable Text-to-Speech
EP4298631A1 (en) * 2021-03-26 2024-01-03 Google LLC Conformer-based speech conversion model
EP4586246A1 (en) * 2021-04-27 2025-07-16 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Decoder
WO2022253999A1 (en) * 2021-06-04 2022-12-08 Widex A/S Method of operating a hearing aid system and a hearing aid system
CN113436607B (en) * 2021-06-12 2024-04-09 西安工业大学 A fast voice cloning method
CN113450761B (en) * 2021-06-17 2023-09-22 清华大学深圳国际研究生院 A parallel speech synthesis method and device based on variational autoencoders
CN113488029B (en) * 2021-06-23 2024-06-11 中科极限元(杭州)智能科技股份有限公司 Non-autoregressive speech recognition training decoding method and system based on parameter sharing
CN113611309B (en) * 2021-07-13 2024-05-10 北京捷通华声科技股份有限公司 Tone conversion method and device, electronic equipment and readable storage medium
CN113299267B (en) * 2021-07-26 2021-10-15 北京语言大学 Voice stimulation continuum synthesis method and device based on variational self-encoder
CN114220456B (en) * 2021-11-29 2025-03-14 北京捷通华声科技股份有限公司 Method, device and electronic device for generating speech synthesis model
CN114007075B (en) * 2021-11-30 2024-11-29 沈阳雅译网络技术有限公司 Gradual compression method for acoustic coding
CN113920989B (en) * 2021-12-13 2022-04-01 中国科学院自动化研究所 End-to-end system and equipment for voice recognition and voice translation
CN114242034B (en) * 2021-12-28 2025-03-18 深圳市优必选科技股份有限公司 A speech synthesis method, device, terminal equipment and storage medium
CN115346543B (en) * 2022-08-17 2024-09-24 广州市百果园信息技术有限公司 Audio processing method, model training method, device, equipment, medium and product
CN117011620A (en) * 2023-06-29 2023-11-07 华为技术有限公司 A target detection method and related equipment
CN117376634B (en) * 2023-12-08 2024-03-08 湖南快乐阳光互动娱乐传媒有限公司 Short video music distribution method and device, electronic equipment and storage medium
CN118471195B (en) * 2024-07-10 2024-10-18 厦门蝉羽网络科技有限公司 Voice synthesis method and system based on discrete Diffusion
CN120164454B (en) * 2025-02-26 2025-09-05 北京宇信科技集团股份有限公司 A low-delay speech synthesis method, device, equipment and medium
CN120877703B (en) * 2025-09-25 2025-12-05 上海稀宇科技有限公司 Audio generation method and device based on audio processing model
CN121393840A (en) * 2025-12-24 2026-01-23 浪潮软件科技有限公司 A state assessment system based on a lightweight Chinese speech rehabilitation big data model

Citations (38)

Publication number Priority date Publication date Assignee Title
US5970453A (en) 1995-01-07 1999-10-19 International Business Machines Corporation Method and system for synthesizing speech
US6078885A (en) 1998-05-08 2000-06-20 At&T Corp Verbal, fully automatic dictionary updates by end-users of speech synthesis and recognition systems
US20010012999A1 (en) 1998-12-16 2001-08-09 Compaq Computer Corp., Computer apparatus for text-to-speech synthesizer dictionary reduction
US20020026315A1 (en) * 2000-06-02 2002-02-28 Miranda Eduardo Reck Expressivity of voice synthesis
US20030212555A1 (en) 2002-05-09 2003-11-13 Oregon Health & Science System and method for compressing concatenative acoustic inventories for speech synthesis
US20040039570A1 (en) 2000-11-28 2004-02-26 Steffen Harengel Method and system for multilingual voice recognition
US20040193398A1 (en) 2003-03-24 2004-09-30 Microsoft Corporation Front-end architecture for a multi-lingual text-to-speech system
US20050033575A1 (en) 2002-01-17 2005-02-10 Tobias Schneider Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer
US20050119890A1 (en) 2003-11-28 2005-06-02 Yoshifumi Hirose Speech synthesis apparatus and speech synthesis method
US20050137870A1 (en) 2003-11-28 2005-06-23 Tatsuya Mizutani Speech synthesis method, speech synthesis system, and speech synthesis program
US20050182629A1 (en) 2004-01-16 2005-08-18 Geert Coorman Corpus-based speech synthesis based on segment recombination
US20050192807A1 (en) 2004-02-26 2005-09-01 Ossama Emam Hierarchical approach for the statistical vowelization of Arabic text
US20060149543A1 (en) 2004-12-08 2006-07-06 France Telecom Construction of an automaton compiling grapheme/phoneme transcription rules for a phoneticizer
US20070005337A1 (en) 2005-02-03 2007-01-04 John Mount Systems and methods for using automated translation and other statistical methods to convert a classifier in one language to another language
US20070094030A1 (en) 2005-10-20 2007-04-26 Kabushiki Kaisha Toshiba Prosodic control rule generation method and apparatus, and speech synthesis method and apparatus
US20070118377A1 (en) 2003-12-16 2007-05-24 Leonardo Badino Text-to-speech method and system, computer program product therefor
US20070168189A1 (en) 2006-01-19 2007-07-19 Kabushiki Kaisha Toshiba Apparatus and method of processing speech
US20080114598A1 (en) 2006-11-09 2008-05-15 Volkswagen Of America, Inc. Motor vehicle with a speech interface
US20080167862A1 (en) 2007-01-09 2008-07-10 Melodis Corporation Pitch Dependent Speech Recognition Engine
US20090157383A1 (en) 2007-12-18 2009-06-18 Samsung Electronics Co., Ltd. Voice query extension method and system
US20100004934A1 (en) * 2007-08-10 2010-01-07 Yoshifumi Hirose Speech separating apparatus, speech synthesizing apparatus, and voice quality conversion apparatus
US20100312562A1 (en) 2009-06-04 2010-12-09 Microsoft Corporation Hidden markov model based text to speech systems employing rope-jumping algorithm
US20110087488A1 (en) 2009-03-25 2011-04-14 Kabushiki Kaisha Toshiba Speech synthesis apparatus and method
US20110202355A1 (en) * 2008-07-17 2011-08-18 Bernhard Grill Audio Encoding/Decoding Scheme Having a Switchable Bypass
US20120035933A1 (en) 2010-08-06 2012-02-09 At&T Intellectual Property I, L.P. System and method for synthetic voice generation and modification
US20120143611A1 (en) 2010-12-07 2012-06-07 Microsoft Corporation Trajectory Tiling Approach for Text-to-Speech
US20120265533A1 (en) 2011-04-18 2012-10-18 Apple Inc. Voice assignment for text-to-speech output
US20130325477A1 (en) 2011-02-22 2013-12-05 Nec Corporation Speech synthesis system, speech synthesis method and speech synthesis program
US20140046662A1 (en) 2012-08-07 2014-02-13 Interactive Intelligence, Inc. Method and system for acoustic data selection for training the parameters of an acoustic model
US8898062B2 (en) 2007-02-19 2014-11-25 Panasonic Intellectual Property Corporation Of America Strained-rough-voice conversion device, voice conversion device, voice synthesis device, voice conversion method, voice synthesis method, and program
US20150243275A1 (en) 2014-02-26 2015-08-27 Microsoft Corporation Voice font speaker and prosody interpolation
US20150279358A1 (en) 2014-03-31 2015-10-01 International Business Machines Corporation Method and system for efficient spoken term detection using confusion networks
US20160078859A1 (en) 2014-09-11 2016-03-17 Microsoft Corporation Text-to-speech with emotional content
US9508341B1 (en) 2014-09-03 2016-11-29 Amazon Technologies, Inc. Active learning for lexical annotations
US20170148433A1 (en) 2015-11-25 2017-05-25 Baidu Usa Llc Deployed end-to-end speech recognition
US10134388B1 (en) 2015-12-23 2018-11-20 Amazon Technologies, Inc. Word generation for speech recognition
US20190122651A1 (en) * 2017-10-19 2019-04-25 Baidu Usa Llc Systems and methods for neural text-to-speech using convolutional sequence learning
US10319364B2 (en) 2017-05-18 2019-06-11 Telepathy Labs, Inc. Artificial intelligence-based text-to-speech system and method

Non-Patent Citations (168)

* Cited by examiner, † Cited by third party
Title
A.van den Oord et al.,"Conditional image generation with PixelCNN decoders," In NIPS, 2016. (9pgs).
A.van den Oord et al.,"Parallel WaveNet: Fast high-fidelity speech synthesis," In ICML, 2018. (9pgs).
A.van den Oord et al.,"WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016. (15pgs).
Aaron et al. Nov. 28, 2017, Parallel WaveNet: Fast High-Fidelity Speech Synthesis, 2017 (Year: 2017). *
Abadi et al.,"TensorFlow: Large-scale machine learning on heterogeneous systems," Retrieved from Internet <URL: http://download.tensorflow.org/paper/whitepaper2015.pdf>, 2015. (19pgs).
Abdel-Hamid et al.,"Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code," In ICASSP, 2013. (5pgs).
Amodei et al.,"Deep speech 2: End-to-End speech recognition in English and Mandarin," arXiv preprint arXiv:1512.02595, 2015. (28pgs).
Arik et al.,"Deep Voice 2: Multi-speaker neural text-to-speech," arXiv preprint arXiv:1705.08947, 2017. (15pgs).
Arik et al.,"Deep Voice 2: Multi-speaker neural text-to-speech," arXiv preprint arXiv:1705.08947v1, 2017. (15 pgs).
Arik et al.,"Deep Voice 2: Multi-speaker neural text-to-speech," In NIPS, 2017. (15 pgs).
Arik et al.,"Deep Voice: Real-time neural text-to-speech," arXiv preprint arXiv:1702.07825, 2017. (17 pgs).
Arik et al.,"Deep Voice: Real-time neural text-to-speech," arXiv preprint arXiv:1702.07825v2, 2017. (17 pgs).
Arik et al.,"Deep Voice: Real-time neural text-to-speech," arXiv preprint arXiv:1702.07825, 2017. (17pgs).
Arik et al.,"Deep Voice: Real-time neural text-to-speech," In ICML, 2017. (17pgs).
Arik et al.,"Fast spectrogram inversion using multi-head convolutional neural networks," arXiv preprint arXiv:1808.06719, 2018. (6pgs).
Arik et al.,"Neural voice cloning with a few samples," arXiv preprint arXiv:1802.06006, 2018. (18pgs).
Bahdanau et al.,"Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473v1, 2014. (15pgs).
Bahdanau et al.,"Neural machine translation by jointly learning to align and translate," In ICLR, 2015. (15 pgs).
Bahdanau et al.,"Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2016. (15 pgs).
Bengio et al.,"Scheduled sampling for sequence prediction with recurrent neural networks," arXiv preprint arXiv:1506.03099, 2015. (9pgs).
Boersma et al.,"PRAAT, a system for doing phonetics by computer," Glot international, vol. 5, No. 9/10, Nov./Dec. 2001 (341-347). (7pgs).
Bowman et al.,"Generating sentences from a continuous space," In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, 2016. (12pgs).
Bradbury et al.,"Quasi-recurrent neural networks," arXiv preprint arXiv:1611.01576, 2016. (11pgs).
Bradbury et al.,"Quasi-Recurrent Neural Networks," In ICLR, 2017. (12pgs).
Bucilua et al.,"Model Compression," In ACM SIGKDD, 2006. (7 pgs).
C. Bagwell,"SoX—Sound eXchange," [online], [Retrieved Jul. 22, 2019]. Retrieved from Internet <URL:https://sourceforge.net/p/sox/code/ci/master/tree/> (3 pgs).
Capes et al.,"Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System," In Interspeech, 2017. (5pgs).
Chen et al.,"Sample efficient adaptive text-to-speech," arXiv preprint arXiv:1809.10460, 2019. (16pgs).
Cho et al.,"Learning Phrase Representations using RNN Encoder-Decoder for statistical machine translation," arXiv:1406.1078, 2014. (14 pgs).
Cho et al.,"Learning phrase representations using RNN encoder-decoder for statistical machine translation," In EMNLP, 2014. (11pgs).
Chorowski et al., "Attention-based models for speech recognition," In NIPS, 2015. (9pgs).
Chung et al.,"A recurrent latent variable model for sequential data," arXiv preprint arXiv:1506.02216, 2016. (9pgs).
Chung et al.,"A recurrent latent variable model for sequential data," In NIPS, 2015. (9pgs).
Chung et al.,"Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014. (9 pgs).
Corrected Notice of Allowance and Fee Due dated Oct. 6, 2020, in related U.S. Appl. No. 15/974,397. (4 pgs).
Dauphin et al.,"Language modeling with gated convolutional networks," arXiv preprint arXiv:1612.08083v1, 2016. (8pgs).
Denton et al.,"Stochastic video generation with a learned prior," arXiv preprint arXiv:1802.07687, 2018. (12pgs).
Diamos et al.,"Persistent RNNS: Stashing recurrent weights On-Chip," In Proceedings of The 33rd International Conference on Machine Learning, 2016. (10pgs).
Dinh et al.,"Density estimation using real NVP," In ICLR, 2017.(32pgs).
Dinh et al.,"NICE: Non-linear independent components estimation," arXiv preprint arXiv:1410.8516, 2015. (13pgs).
Divay et al., "Algorithms for Grapheme-Phoneme Translation for English and French: Applications for Database Searches and Speech Synthesis," Association for Computational Linguistics,1997. (29 pgs).
Dukhan et al.,"PeachPy meets Opcodes: direct machine code generation from Python," In Proceedings of the 5th Workshop on Python for High-Performance and Scientific Computing, 2015. (2 pgs).
Fan et al.,"Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis," In IEEE ICASSP, 2015. (2 pgs).
Final Office Action dated Jun. 15, 2020, in related U.S. Appl. No. 15/882,926. (10 pgs).
Final Office Action dated Jun. 24, 2020, in related U.S. Appl. No. 15/974,397. (11 pgs).
Gehring et al.,"Convolutional sequence to sequence learning," arXiv preprint arXiv:1705.03122v1, 2017. (15 pgs).
Gehring et al.,"Convolutional sequence to sequence learning," In ICML, 2017. (15pgs).
Gehring,"Convolutional sequence to sequence learning," In ICML, 2017. (10 pgs).
Gonzalvo et al.,"Recent advances in Google real-time HMM-driven unit selection synthesizer," In Interspeech, 2016. (5 pgs).
Graves et al.,"Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006. (8 pgs).
Graves et al.,"Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006. (8 pgs).
Griffin et al.,"Signal estimation from modified short-time fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984. (8pgs).
Gu et al.,"Non-autoregressive neural machine translation," In ICLR, 2018. (13pgs).
Hinton et al.,"Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015. (9 pgs).
Hsu et al.,"Hierarchical generative modeling for controllable speech synthesis," In ICLR, 2019. (27pgs).
Hsu et al.,"Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks," arXiv:1704.00849, 2017. (5 pgs).
Ioffe et al.,"Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015. (10 pgs).
Jia et al.,"Transfer learning from speaker verification to multispeaker text-to-speech synthesis," arXiv preprint arXiv:1806.04558, 2019. (15pgs).
K. Murphy,"Machine learning, A probabilistic perspective," 2012, [online], [Retrieved Sep. 3, 2019]. Retrieved from Internet <URL: <https://doc.lagout.org/science/Artificial%20Intelligence/Machine%20learning/Machine%20Learning_%20A%20Probabilistic%20Perspective%20%5BMurphy%202012-08-24%5D.pdf> (24 pgs).
Kaiser et al.,"Fast decoding in sequence models using discrete latent variables," arXiv preprint arXiv:1803.03382, 2018. (10pgs).
Kalchbrenner et al.,"Efficient neural audio synthesis," arXiv preprint arXiv:1802.08435, 2018. (10pgs).
Kawahara et al.,"Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech communication, 1999. (21pgs).
Kim et al.,"FloWaveNet: A generative flow for raw audio," arXiv preprint arXiv:1811.02155, 2019. (9pgs).
Kim et al.,"Sequence-level knowledge distillation," In EMNLP, 2016. (11pgs).
Kingma et al.,"ADAM: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014. (9 pgs).
Kingma et al.,"ADAM: A method for stochastic optimization," In ICLR, 2015. (15 pgs).
Kingma et al.,"Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2014. (14pgs).
Kingma et al.,"Auto-Encoding variational Bayes," In ICLR, 2014. (14 pgs).
Kingma et al.,"Glow: Generative flow with invertible 1x1 convolutions," arXiv preprint arXiv:1807.03039, 2018. (15pgs).
Kingma et al.,"Improving variational inference with inverse autoregressive flow," In NIPS, 2016. (16pgs).
Kingma et al.,"Improving variational inference with inverse autoregressive flow," In NIPS, 2016. (9 pgs).
Lample et al.,"Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016. (10 pgs).
Lee et al.,"Deterministic non-autoregressive neural sequence modeling by iterative refinement," arXiv preprint arXiv:1802.06901, 2018. (11 pgs).
Lee et al.,"Deterministic non-autoregressive neural sequence modeling by iterative refinement," arXiv preprint arXiv:1802.06901, 2018. (11 pgs).
Li et al.,"Deep speaker: an End-to-End neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017. (8 pgs).
Mehri et al.,"SampleRNN: An unconditional End-to-End neural audio generation model," arXiv preprint arXiv:1612.07837, 2016. (11 pgs).
Mehri et al.,"SampleRNN:An unconditional end-to-end neural audio generation model," In ICLR, 2017. (11pgs).
Morise et al.,"WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information & Systems, 2016. (8 pgs).
Morise et al.,"WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, 2016. (8 pgs).
Nachmani et al.,"Fitting new speakers based on a short untranscribed sample," arXiv preprint arXiv:1802.06984, 2018. (9pgs).
Non-Final Office Action dated Aug. 30, 2019, in U.S. Appl. No. 15/882,926 (8 pgs).
Non-Final Office Action dated Feb. 3, 2020, in U.S. Appl. No. 15/882,926 (10 pgs).
Non-Final Office Action dated Feb. 7, 2020, in U.S. Appl. No. 15/974,397 (10 pgs).
Notice of Allowance and Fee Due dated Aug. 11, 2020, in related U.S. Appl. No. 16/277,919. (10 pgs).
Notice of Allowance and Fee Due dated Aug. 26, 2020, in related U.S. Appl. No. 15/882,926. (6 pgs).
Notice of Allowance and Fee Due dated Feb. 27, 2020, in related U.S. Appl. No. 16/058,265. (9 pgs).
Notice of Allowance and Fee Due dated May 8, 2020, in related U.S. Appl. No. 16/058,265. (9 pgs).
Notice of Allowance and Fee Due dated Oct. 2, 2020, in related U.S. Appl. No. 15/974,397. (10 pgs).
Ochshorn et al., "Gentle," Retrieved from Internet <URL: https://github.com/lowerquality/gentle> 2017. (2 pgs).
Odena et al.,"Deconvolution and checkerboard artifacts," 2016, [Retrieved Sep. 3, 2019]. Retrieved from Internet <URL:<https://distill.pub/2016/deconv-checkerboard/>.(10pgs).
Oord et al.,"Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016. (10 pgs).
Oord et al.,"Wavenet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016. (15 pgs).
P. Taylor,"Text-to-Speech Synthesis," Cambridge University Press, 2009. [online], [Retrieved Sep. 3, 2019]. Retrieved from Internet <URL: <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.5905&rep=rep1&type=pdf>. (19 pgs).
P. Taylor,"Text-to-Speech Synthesis," Cambridge University Press, 2009. (17pgs).
Paine et al.,"Fast wavenet generation algorithm," arXiv preprint arXiv:1611.09482, 2016. (6 pgs).
Panayotov et al.,"Librispeech: an ASR corpus based on public domain audio books," In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE. (5 pgs).
Pascanu et al.,"On the difficulty of training recurrent neural networks," In ICML, 2013. (9pgs).
Pascual et al.,"Multi-output RNN-LSTM for multiple speaker speech synthesis with interpolation model," 9th ISCA Speech Synthesis Workshop, 2016. (6 pgs).
Paul Taylor,"Text-to-Speech Synthesis," [online], [Retrieved Aug. 1, 2019]. Retrieved from Internet <URL: <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.5905&rep=rep1&type=pdf> Cambridge University Press, 2009 (22 pgs).
Peng et al.,"Parallel Neural Text-to-Speech," arXiv preprint arXiv:1905.08459, 2019. (14pgs).
Ping et al.,"Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2018. (16pgs).
Ping et al.,"ClariNet: Parallel wave generation in end-to-end text-to-speech," arXiv preprint arXiv:1807.07281, 2019. (15pgs).
Ping et al.,"ClariNet: ParallelWave Generation in End-to-End Text-to-Speech," arXiv preprint arXiv:1807.07281, 2018. (12 pgs).
Ping et al.,"Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," In ICLR, 2018. (16pgs).
Prahallad et al.,"The blizzard challenge 2013—Indian language task," Retrieved from Internet <URL: <http://festvox.org/blizzard/bc2013/blizzard_2013_summary_indian.pdf>, 2013. (11pgs).
Prenger et al.,"WaveGlow: A flow-based generative network for speech synthesis," [online], [Retrieved Mar. 3, 2020]. Retrieved from Internet <URL: https://ieeexplore.ieee.org/abstract/document/8683143>In ICASSP, 2019. (2pgs).
R. Yamamoto,"WaveNet vocoder," 2018 [online], [Retrieved Sep. 4, 2019]. Retrieved from Internet <URL: <https://github.com/r9y9/wavenet_vocoder>. (6pgs).
Raffel et al.,"Online and linear-time attention by enforcing monotonic alignments," arXiv:1704.00784v1, 2017. (19 pgs).
Rao et al.,"Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks," In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference, 2015. (5 pgs).
Response filed May 4, 2020, in U.S. Appl. No. 15/882,926 (11pgs).
Response filed May 7, 2020, in U.S. Appl. No. 15/974,397 (15 pgs).
Response filed Nov. 26, 2019, in U.S. Appl. No. 15/882,926 (11 pgs).
Response to Final Office Action filed Aug. 17, 2020, in related U.S. Appl. No. 15/882,926. (10 pgs).
Response to Final Office Action filed Aug. 23, 2020, in related U.S. Appl. No. 15/974,397. (15 pgs).
Response to Final Office Action filed Sep. 24, 2020, in related U.S. Appl. No. 15/974,397. (12 pgs).
Reynolds et al.,"Speaker verification using adapted gaussian mixture models," Digital signal processing, 10(1-3):19-41, 2000. (23 pgs).
Rezende et al.,"Stochastic backpropagation and approximate inference in deep generative models," arXiv preprint arXiv:1401.4082, 2014. (14pgs).
Rezende et al.,"Variational inference with normalizing flows," arXiv preprint arXiv:1505.05770, 2016. (10pgs).
Rezende et al.,"Variational inference with normalizing flows," In ICML, 2015. (10 pgs).
Ribeiro et al.,"Crowdmos: An approach for crowdsourcing mean opinion score studies," In Acoustics, Speech & Signal Processing (ICASSP) IEEE Intr Conference, 2011. (4 pgs).
Ribeiro et al.,"CrowdMOS: An approach for crowdsourcing mean opinion score studies," In ICASSP, 2011. (4pgs).
Ribeiro et al.,"Crowdmos: An approach for crowdsourcing mean opinion score studies," In IEEE ICASSP, 2011. (4 pgs).
Ronanki et al.,"A template-based approach for speech synthesis intonation generation using LSTMs," Interspeech 2016, pp. 2463-2467, 2016. (5pgs).
Ronanki et al.,"Median-based generation of synthetic speech durations using a non-parametric approach," arXiv preprint arXiv:1608.06134, 2016. (7 pgs).
Roy et al.,"Theory and experiments on vector quantized autoencoders," arXiv preprint arXiv:1805.11063, 2018. (11pgs).
Rush et al.,"A neural attention model for abstractive sentence summarization," In EMNLP, 2015. (11 pgs).
Salimans et al.,"Improved techniques for training GANs," In NIPS, 2016. (9 pgs).
Salimans et al.,"PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications," In ICLR, 2017. (10pgs).
Salimans et al.,"Weight normalization: A simple reparameterization to accelerate training of deep neural networks," In NIPS, arXiv:1602.07868v3, 2016. (11 pgs).
Shen et al.,"Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," In ICASSP, 2018. (5pgs).
Sotelo et al.,"Char2wav: End-to-end speech synthesis," In ICLR workshop, 2017. (6 pgs).
Sotelo et al.,"CHAR2WAV: End-to-End speech synthesis," In ICLR2017 workshop submission, 2017. (6pgs).
Sotelo et al.,"Char2wav: End-to-End speech synthesis," Retrieved from Internet <URL:<https://openreview.net/pdf?id=B1VWyySKx>, 2017. (6pgs).
Sotelo et al.,"Char2wav:End-to-end speech synthesis," ICLR workshop, 2017. (6pgs).
Stephenson et al.,"Production Rendering, Design and Implementation," Springer, 2005. (5pgs).
Sutskever et al., "Sequence to Sequence Learning with Neural Networks", In NIPS, 2014. (9 pgs).
Taigman et al.,"VoiceLoop: Voice fitting and synthesis via a phonological loop," arXiv preprint arXiv:1707.06588, 2017. (12pgs).
Taigman et al.,"VoiceLoop: Voice fitting and synthesis via a phonological loop," In ICLR, 2018. (14pgs).
Taylor et al.,"Text-to-Speech Synthesis," Cambridge University Press, New York, NY, USA, 1st edition, 2009. ISBN 0521899273, 9780521899277. (17 pgs).
Theis et al.,"A note on the evaluation of generative models," arXiv preprint arXiv:1511.01844, 2015. (9 pgs).
Uria et al.,"RNADE: The real-valued neural autoregressive density-estimator," In Advances in Neural Information Processing Systems, pp. 2175-2183, 2013. (10pgs).
Van den Oord et al.,"Neural discrete representation learning," arXiv preprint arXiv:1711.00937, 2018. (11pgs).
Van den Oord et al.,"Parallel WaveNet: Fast high-fidelity speech synthesis," arXiv preprint arXiv:1711.10433, 2017. (11 pgs).
Van den Oord et al.,"Parallel WaveNet: Fast high-fidelity speech synthesis," In ICML, 2018. (9pgs).
Van den Oord et al.,"WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016. (15pgs).
Van den Oord et al.,"WaveNet: A generative model for raw audio," arXiv:1609.03499, 2016. (15 pgs).
Vaswani et al., "Attention Is All You Need", arXiv preprint arXiv:1706.03762, 2017.(15 pgs).
Wang et al.,"Neural source-filter-based waveform model for statistical parametric speech synthesis," arXiv preprint arXiv:1904.12088, 2019. (14pgs).
Wang et al.,"Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017. (10pgs).
Wang et al.,"Tacotron: Towards end-to-end speech synthesis," In Interspeech, 2017. (3 pgs).
Wang et al.,"Tacotron:Towards End-to-End speech synthesis", In Interspeech, 2017.(5 pgs).
Weide et al.,"The CMU pronunciation dictionary," Retrieved from Internet <URL: <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>, 2008. (2pgs).
Wu et al.,"A study of speaker adaptation for DNN-based speech synthesis," In Interspeech, 2015. (5 pgs).
Y. Agiomyrgiannakis,"Vocaine the vocoder and applications in speech synthesis," In ICASSP, 2015. (5 pgs).
Yamagishi et al.,"Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis," In IEEE Transactions on Audio, Speech, and Language Processing, 2009. (23pgs).
Yamagishi et al., "Thousands of Voices for HMM-Based Speech Synthesis-Analysis and Application of TTS Systems Built on Various ASR Corpora", In IEEE Transactions on Audio, Speech, and Language Processing, 2010. (21 pgs).
Yamagishi et al.,"Robust speaker-adaptive HMM-based text-to-speech synthesis," IEEE Transactions on Audio, Speech, and Language Processing, 2009. (23 pgs).
Yamagishi et al.,"Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, No. 6, Aug. 2009, [online], [Retrieved Jul. 8, 2018]. Retrieved from Internet <URL: <https://www.researchgate.net/publication/224558048> (24 pgs).
Yang et al.,"On the training of DNN-based average voice model for speech synthesis," In Signal & Info. Processing Association Annual Summit & Conference (APSIPA), Retrieved from Internet <URL: <http://www.nwpu-aslp.org/lxie/papers/2016APSIPA-YS.pdf>, 2016. (6 pgs).
Yao et al.,"Sequence-to-sequence neural net models for grapheme-to-phoneme conversion," arXiv preprint arXiv:1506.00196, 2015. (5 pgs).
Zen et al.,"Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis," Retrieved from Internet <URL: <https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43266.pdf>, 2015. (5pgs).
Zen et al.,"Fast, Compact, and High quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices," arXiv:1606.06061, 2016. (14 pgs).
Zen et al.,"Statistical parametric speech synthesis using deep neural networks," Retrieved from Internet <URL: <https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40837.pdf>, 2013. (5pgs).
Zen et al.,"Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis," In IEEE ICASSP, 2015. (5 pgs).
Zhao et al.,"Wasserstein GAN & Waveform Loss-based acoustic model training for multi-speaker text-to-speech synthesis systems using a WaveNet vocoder," IEEE Access,2018.(10pgs).

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210089924A1 (en) * 2019-09-24 2021-03-25 Nec Laboratories America, Inc Learning weighted-average neighbor embeddings
US12340788B2 (en) 2019-11-15 2025-06-24 Electronic Arts Inc. Generating expressive speech audio from text data
US12033611B2 (en) 2019-11-15 2024-07-09 Electronic Arts Inc. Generating expressive speech audio from text data
US11295721B2 (en) * 2019-11-15 2022-04-05 Electronic Arts Inc. Generating expressive speech audio from text data
US20210225362A1 (en) * 2020-01-22 2021-07-22 Google Llc Attention-Based Joint Acoustic and Text On-Device End-to-End Model
US11594212B2 (en) * 2020-01-22 2023-02-28 Google Llc Attention-based joint acoustic and text on-device end-to-end model
US20210312326A1 (en) * 2020-04-02 2021-10-07 South University Of Science And Technology Of China Travel prediction method and apparatus, device, and storage medium
US11809970B2 (en) * 2020-04-02 2023-11-07 South University Of Science And Technology Of China Travel prediction method and apparatus, device, and storage medium
US11605368B2 (en) 2020-05-07 2023-03-14 Google Llc Speech recognition using unspoken text and speech synthesis
US11837216B2 (en) 2020-05-07 2023-12-05 Google Llc Speech recognition using unspoken text and speech synthesis
US11222620B2 (en) * 2020-05-07 2022-01-11 Google Llc Speech recognition using unspoken text and speech synthesis
US12288547B2 (en) * 2020-06-05 2025-04-29 Deepmind Technologies Limited Generating audio data using unaligned text inputs with an adversarial network
US20210383789A1 (en) * 2020-06-05 2021-12-09 Deepmind Technologies Limited Generating audio data using unaligned text inputs with an adversarial network
US20230215420A1 (en) * 2020-07-21 2023-07-06 Ai Speech Co., Ltd. Speech synthesis method and system
US11842722B2 (en) * 2020-07-21 2023-12-12 Ai Speech Co., Ltd. Speech synthesis method and system
US20220108680A1 (en) * 2020-10-02 2022-04-07 Google Llc Text-to-speech using duration prediction
US12100382B2 (en) * 2020-10-02 2024-09-24 Google Llc Text-to-speech using duration prediction
RU2803488C2 (en) * 2021-06-03 2023-09-14 Общество С Ограниченной Ответственностью «Яндекс» Method and server for waveform generation
US12175995B2 (en) 2021-06-03 2024-12-24 Y.E. Hub Armenia LLC Method and a server for generating a waveform
US11651139B2 (en) * 2021-06-15 2023-05-16 Nanjing Silicon Intelligence Technology Co., Ltd. Text output method and system, storage medium, and electronic device
US20230121683A1 (en) * 2021-06-15 2023-04-20 Nanjing Silicon Intelligence Technology Co., Ltd. Text output method and system, storage medium, and electronic device
US20230317059A1 (en) * 2022-03-20 2023-10-05 Google Llc Alignment Prediction to Inject Text into Automatic Speech Recognition Training
US12518741B2 (en) * 2022-03-20 2026-01-06 Google Llc Alignment prediction to inject text into automatic speech recognition training
US12020726B2 (en) * 2022-08-31 2024-06-25 Kabushiki Kaisha Toshiba Magnetic reproduction processing device, magnetic recording/reproducing device, and magnetic reproducing method
US12079230B1 (en) * 2024-01-31 2024-09-03 Clarify Health Solutions, Inc. Computer network architecture and method for predictive analysis using lookup tables as prediction models
US12271387B1 (en) 2024-01-31 2025-04-08 Clarify Health Solutions, Inc. Computer network architecture and method for predictive analysis using lookup tables as prediction models

Also Published As

Publication number Publication date
US20200066253A1 (en) 2020-02-27

Similar Documents

Publication Publication Date Title
US11017761B2 (en) Parallel neural text-to-speech
US11238843B2 (en) Systems and methods for neural voice cloning with a few samples
US11482207B2 (en) Waveform generation using end-to-end text-to-waveform system
US11705107B2 (en) Real-time neural text-to-speech
Ping et al. Deep voice 3: Scaling text-to-speech with convolutional sequence learning
CN112669809A (en) Parallel neural text to speech conversion
US10796686B2 (en) Systems and methods for neural text-to-speech using convolutional sequence learning
CN113574595B (en) Speech recognition system, method and non-transitory computer-readable storage medium
Zhang et al. Joint training framework for text-to-speech and voice conversion using multi-source tacotron and wavenet
Ping et al. Clarinet: Parallel wave generation in end-to-end text-to-speech
Van Den Oord et al. Wavenet: A generative model for raw audio
Oord et al. Wavenet: A generative model for raw audio
US11934935B2 (en) Feedforward generative neural networks
CN110246488B (en) Speech conversion method and device for semi-optimized CycleGAN model
Lim et al. Jdi-t: Jointly trained duration informed transformer for text-to-speech without explicit alignment
Kameoka et al. Many-to-many voice transformer network
CN110476206A (en) End-to-end text-to-speech
CN113205792A (en) Mongolian speech synthesis method based on Transformer and WaveNet
Fahmy et al. A transfer learning end-to-end arabic text-to-speech (tts) deep architecture
Masumura et al. Sequence-level consistency training for semi-supervised end-to-end automatic speech recognition
Chung et al. Reinforce-aligner: Reinforcement alignment search for robust end-to-end text-to-speech
CN116092475B (en) A stuttering speech editing method and system based on context-aware diffusion model
Ramos Voice conversion with deep learning
Zhang et al. Discriminatively trained sparse inverse covariance matrices for speech recognition
Gorodetskii et al. Zero-shot long-form voice cloning with dynamic convolution attention

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

AS Assignment

Owner name: BAIDU USA LLC, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PENG, KAINAN;PING, WEI;SONG, ZHAO;AND OTHERS;REEL/FRAME:050884/0814

Effective date: 20191015

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4