EP3776531A1 - Clockwork hierarchical variational encoder - Google Patents
- Publication number
- EP3776531A1 (application EP19720289.8A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- syllable
- level
- phoneme
- utterance
- fixed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G10L2013/105—Duration
Definitions
- This disclosure relates to a clockwork hierarchical variational encoder for predicting prosody.
- Speech synthesis systems use text-to-speech (TTS) models to generate speech from textual input.
- the generated/synthesized speech should accurately convey the message (intelligibility) while sounding like human speech (naturalness) with an intended prosody (expressiveness).
- traditional concatenative and parametric synthesis models are capable of providing intelligible speech and recent advances in neural modeling of speech have significantly improved the naturalness of synthesized speech
- most existing TTS models are ineffective at modeling prosody, thereby causing synthesized speech used by important applications to lack expressiveness.
- it is desirable for applications such as conversational assistants and long-form readers to produce realistic speech by imputing prosody features not conveyed in textual input, such as intonation, stress, and rhythm and style.
- a simple statement can be spoken in many different ways depending on whether the statement is a question, an answer to a question, there is uncertainty in the statement, or to convey any other meaning about the environment or context which is unspec
- One aspect of the disclosure provides a method of representing an intended prosody in synthesized speech.
- the method includes receiving, at data processing hardware, a text utterance having at least one word, and selecting, by the data processing hardware, an utterance embedding for the text utterance.
- Each word in the text utterance has at least one syllable and each syllable has at least one phoneme.
- the utterance embedding represents an intended prosody.
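- For each syllable, using the selected utterance embedding, the method also includes: predicting, by the data processing hardware, a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting, by the data processing hardware, a pitch contour of the syllable based on the predicted duration for the syllable; and generating, by the data processing hardware, a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable.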
- Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
- a network representing a hierarchical linguistic structure of the text utterance includes a first level including each syllable of the text utterance, a second level including each phoneme of the text utterance, and a third level including each fixed-length predicted pitch frame for each syllable of the text utterance.
- the first level of the network may include a long short-term memory (LSTM) processing cell representing each syllable of the text utterance
- the second level of the network may include an LSTM processing cell representing each phoneme of the text utterance
- the third level of the network may include an LSTM processing cell representing each fixed-length predicted pitch frame.
- the LSTM processing cells of the second level clock relative to and faster than the LSTM processing cells of the first level
- the LSTM processing cells of the third level clock relative to and faster than the LSTM processing cells of the second level.
- predicting the duration of the syllable includes: for each phoneme associated with the syllable, predicting a duration of the corresponding phoneme by encoding the linguistic features of the corresponding phoneme with the corresponding prosodic syllable embedding for the syllable; and determining the duration of the syllable by summing the predicted durations for each phoneme associated with the syllable.
- predicting the pitch contour of the syllable based on the predicted duration for the syllable may include combining the corresponding prosodic syllable embedding for the syllable with each encoding of the corresponding prosodic syllable embedding and the phone-level linguistic features of each corresponding phoneme associated with the syllable.
- the method also includes, for each syllable, using the selected utterance embedding: predicting, by the data processing hardware, an energy contour of each phoneme in the syllable based on a predicted duration for the phoneme; and for each phoneme associated with the syllable, generating, by the data processing hardware, a plurality of fixed-length predicted energy frames based on the predicted duration for the phoneme.
- each fixed-length energy frame represents the predicted energy contour of the corresponding phoneme.
- a hierarchical linguistic structure represents the text utterance and the hierarchical linguistic structure includes a first level including each syllable of the text utterance, a second level including each phoneme of the text utterance, a third level including each fixed-length predicted pitch frame for each syllable of the text utterance, and a fourth level parallel to the third level and including each fixed-length predicted energy frame for each phoneme of the text utterance.
- the first level may include a long short-term memory (LSTM) processing cell representing each syllable of the text utterance
- the second level may include an LSTM processing cell representing each phoneme of the text utterance
- the third level may include an LSTM processing cell representing each fixed-length predicted pitch frame
- the fourth level may include an LSTM processing cell representing each fixed-length predicted energy frame.
- the LSTM processing cells of the second level clock relative to and faster than the LSTM processing cells of the first level
- the LSTM processing cells of the third level clock relative to and faster than the LSTM processing cells of the second level
- the LSTM processing cells of the fourth level clock at the same speed as the LSTM processing cells of the third level and clock relative to and faster than the LSTM processing cells of the second level.
- the third level of the hierarchical linguistic structure includes a feed-forward layer that predicts the predicted pitch frames for each syllable in a single pass and/or the fourth level of the hierarchical linguistic structure includes a feed-forward layer that predicts the predicted energy frames for each phoneme in a single pass.
- the lengths of the fixed-length predicted energy frames and the fixed-length predicted pitch frames may be the same.
- a total number of fixed-length predicted energy frames generated for each phoneme of the received text utterance may be equal to a total number of the fixed-length predicted pitch frames generated for each syllable of the received text utterance.
- the method also includes: receiving, by the data processing hardware, training data including a plurality of reference audio signals, each reference audio signal including a spoken utterance of human speech and having a corresponding prosody; and training, by the data processing hardware, a deep neural network for a prosody model by encoding each reference audio signal into a corresponding fixed-length utterance embedding representing the corresponding prosody of the reference audio signal.
- the utterance embedding may include a fixed-length numerical vector.
- Another aspect of the disclosure provides a system for representing an intended prosody in synthesized speech.
- the system includes data processing hardware and memory hardware in communication with the data processing hardware.
- the memory hardware stores instructions that when executed on the data processing hardware cause the data processing hardware to perform operations.
- the operations include receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance.
- Each word in the text utterance has at least one syllable and each syllable has at least one phoneme.
- the utterance embedding represents an intended prosody.
- the operations also include: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable.
- Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
- a network representing a hierarchical linguistic structure of the text utterance includes a first level including each syllable of the text utterance, a second level including each phoneme of the text utterance, and a third level including each fixed-length predicted pitch frame for each syllable of the text utterance.
- the first level of the network may include a long short-term memory (LSTM) processing cell representing each syllable of the text utterance
- the second level of the network may include an LSTM processing cell representing each phoneme of the text utterance
- the third level of the network may include an LSTM processing cell representing each fixed-length predicted pitch frame.
- the LSTM processing cells of the second level clock relative to and faster than the LSTM processing cells of the first level
- the LSTM processing cells of the third level clock relative to and faster than the LSTM processing cells of the second level.
- predicting the duration of the syllable includes: for each phoneme associated with the syllable, predicting a duration of the corresponding phoneme by encoding the linguistic features of the corresponding phoneme with the corresponding prosodic syllable embedding for the syllable; and determining the duration of the syllable by summing the predicted durations for each phoneme associated with the syllable.
- predicting the pitch contour of the syllable based on the predicted duration for the syllable may include combining the corresponding prosodic syllable embedding for the syllable with each encoding of the corresponding prosodic syllable embedding and the phone-level linguistic features of each corresponding phoneme associated with the syllable.
- the operations also include, for each syllable, using the selected utterance embedding: predicting an energy contour of each phoneme in the syllable based on a predicted duration for the phoneme; and for each phoneme associated with the syllable, generating a plurality of fixed-length predicted energy frames based on the predicted duration for the phoneme.
- each fixed-length energy frame represents the predicted energy contour of the corresponding phoneme.
- a hierarchical linguistic structure represents the text utterance and the hierarchical linguistic structure includes a first level including each syllable of the text utterance, a second level including each phoneme of the text utterance, a third level including each fixed-length predicted pitch frame for each syllable of the text utterance, and a fourth level parallel to the third level and including each fixed-length predicted energy frame for each phoneme of the text utterance.
- the first level may include a long short-term memory (LSTM) processing cell representing each syllable of the text utterance
- the second level may include an LSTM processing cell representing each phoneme of the text utterance
- the third level may include an LSTM processing cell representing each fixed-length predicted pitch frame
- the fourth level may include an LSTM processing cell representing each fixed-length predicted energy frame.
- the LSTM processing cells of the second level clock relative to and faster than the LSTM processing cells of the first level
- the LSTM processing cells of the third level clock relative to and faster than the LSTM processing cells of the second level
- the LSTM processing cells of the fourth level clock at the same speed as the LSTM processing cells of the third level and clock relative to and faster than the LSTM processing cells of the second level.
- the third level of the hierarchical linguistic structure includes a feed-forward layer that predicts the predicted pitch frames for each syllable in a single pass and/or the fourth level of the hierarchical linguistic structure includes a feed-forward layer that predicts the predicted energy frames for each phoneme in a single pass.
- the lengths of the fixed-length predicted energy frames and the fixed-length predicted pitch frames may be the same.
- a total number of fixed-length predicted energy frames generated for each phoneme of the received text utterance may be equal to a total number of the fixed-length predicted pitch frames generated for each syllable of the received text utterance.
- the operations also include: receiving training data including a plurality of reference audio signals, each reference audio signal including a spoken utterance of human speech and having a corresponding prosody; and training a deep neural network for a prosody model by encoding each reference audio signal into a corresponding fixed-length utterance embedding representing the corresponding prosody of the reference audio signal.
- the utterance embedding may include a fixed-length numerical vector.
- FIG. 1 is a schematic view of an example system for training a deep neural network to provide a controllable prosody model for use in predicting a prosodic representation for a text utterance.
- FIG. 2A is a schematic view of a hierarchical linguistic structure for encoding prosody of a reference audio signal into a fixed-length utterance embedding.
- FIG. 2B is a schematic view of a hierarchical linguistic structure using a fixed-length utterance embedding to predict a prosodic representation of a text utterance.
- FIG. 2C is a schematic view of an encoder portion of a hierarchical linguistic structure configured to encode fixed-length reference frames directly into a fixed-length utterance embedding.
- FIGS. 3A and 3B are schematic views of an example autoencoder for predicting duration and pitch contours for each syllable of a text utterance.
- FIG. 3C is a schematic view of an example autoencoder for predicting duration and energy contours for each phoneme of a text utterance.
- FIG. 4 is a flowchart of an example arrangement of operations for a method of predicting a prosodic representation of a received text utterance.
- FIG. 5 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.
- Text-to-speech (TTS) models, often used by speech synthesis systems, are generally given only text inputs without any reference acoustic representation at runtime, and must impute many linguistic factors that are not provided by the text inputs in order to produce realistic-sounding synthesized speech.
- a subset of these linguistic factors are collectively referred to as prosody and may include intonation (pitch variation), stress (stressed syllables vs. non-stressed syllables), duration of sounds, loudness, tone, rhythm, and style of the speech.
- Prosody may indicate the emotional state of the speech, the form of the speech (e.g., statement, question, command, etc.), the presence of irony or sarcasm of the speech, uncertainty in the knowledge of the speech, or other linguistic elements incapable of being encoded by grammar or vocabulary choice of the input text.
- a given text input that is associated with a high degree of prosodic variation can produce synthesized speech with local changes in pitch and speaking duration to convey different semantic meanings, and also with global changes in the overall pitch trajectory to convey different moods and emotions.
- Neural network models provide potential for robustly synthesizing speech by predicting linguistic factors corresponding to prosody that are not provided by text inputs.
- applications such as audiobook narration, news readers, voice design software, and conversational assistants can produce realistically sounding synthesized speech that is not monotonous-sounding.
- Implementations herein are directed toward a neural network model that includes a variational autoencoder (VAE) having an encoder portion for encoding a reference audio signal corresponding to a spoken utterance into an utterance embedding that represents the prosody of the spoken utterance, and a decoder portion that decodes the utterance embedding to predict durations of phonemes and pitch and energy contours for each syllable.
- the encoder portion may train utterance embeddings representing prosody by encoding numerous reference audio signals conditioned on linguistic features
- the linguistic features may include, without limitation, individual sounds for each phoneme, whether each syllable is stressed or un-stressed, the type of each word (e.g., noun/adjective/verb) and/or the position of the word in the utterance, and whether the utterance is a question or phrase.
- Each utterance embedding is represented by a fixed-length numerical vector.
- the fixed-length numerical vector has a length equal to 256. However, other implementations may use fixed-length numerical vectors having lengths greater than or less than 256.
- the decoder portion may decode a fixed-length utterance embedding into a sequence of phoneme durations via a first decoder and into a sequence of fixed-length frames (e.g., five millisecond) of pitch and energy using the phoneme durations.
- the phoneme durations and fixed-length frames of pitch and energy predicted by the decoder portion closely match the phoneme durations and fixed-length frames of pitch and energy sampled from the reference audio signal associated with the fixed-length utterance embedding.
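The relationship between predicted phoneme durations and fixed-length frames can be illustrated with a minimal sketch. It assumes the five-millisecond frame length given as an example in the text; the function names, the rounding rule, and the duration values are hypothetical and are not taken from the disclosure.

```python
FRAME_MS = 5  # frame length from the text's "five millisecond" example


def num_frames(duration_ms: float, frame_ms: int = FRAME_MS) -> int:
    """Number of fixed-length frames for a duration.

    Assumption: round to the nearest frame and emit at least one frame.
    """
    return max(1, round(duration_ms / frame_ms))


def frame_start_times(durations_ms, frame_ms: int = FRAME_MS):
    """Start time (ms) of every frame, laid out phoneme after phoneme."""
    starts, t = [], 0
    for duration in durations_ms:
        for _ in range(num_frames(duration, frame_ms)):
            starts.append(t)
            t += frame_ms
    return starts


# Hypothetical predicted phoneme durations in milliseconds.
durations_ms = [20, 15, 20]
counts = [num_frames(d) for d in durations_ms]
starts = frame_start_times(durations_ms)
```

A decoder following this pattern would emit one pitch value and one energy value per entry in `starts`, so the number of predicted frames is fixed entirely by the predicted durations.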
- the VAE of the present disclosure includes a Clockwork Hierarchical Variational Autoencoder (CHiVE) that incorporates hierarchical stacked layers of long short-term memory (LSTM) cells, with each layer of LSTM cells incorporating structure of the utterance such that one layer represents phonemes, a next layer represents syllables, and another layer represents words.
- the hierarchy of stacked layers of LSTM cells is variably clocked to a length of hierarchical input data.
- For instance, if the input data contains a word of three syllables followed by a word of four syllables, the syllable layer of the CHiVE would clock three times relative to a single clock of the word layer for the first word, and then the syllable layer would clock four more times relative to a subsequent single clock of the word layer for the second word.
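The variable clocking in this example (three syllable clocks for the first word, then four for the second) can be sketched as a simple schedule computed from the input structure. The nested-list representation, the function names, and the syllable labels are hypothetical; the disclosure does not specify a data format.

```python
from typing import List, Tuple


def syllable_clocks_per_word(words: List[List[str]]) -> List[int]:
    """For each single clock of the word layer, count the syllable-layer clocks."""
    return [len(syllables) for syllables in words]


def clock_trace(words: List[List[str]]) -> List[Tuple[int, int]]:
    """(word index, syllable index) pairs in the order the syllable layer clocks."""
    trace = []
    for word_index, syllables in enumerate(words):
        for syllable_index in range(len(syllables)):
            trace.append((word_index, syllable_index))
    return trace


# Placeholder labels: a three-syllable word followed by a four-syllable word.
words = [["s1", "s2", "s3"], ["s1", "s2", "s3", "s4"]]
clocks = syllable_clocks_per_word(words)
trace = clock_trace(words)
```

Here the word layer clocks twice in total while the syllable layer clocks seven times, which is what "variably clocked to a length of hierarchical input data" amounts to in this toy case.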
- the CHiVE is configured to receive a text utterance and select an utterance embedding for the text utterance.
- the received text utterance has at least one word, each word has at least one syllable, and each syllable has at least one phoneme. Since the text utterance is missing context, semantic information, and pragmatic information to guide the appropriate prosody for producing synthesized speech from the utterance, the CHiVE uses that selected utterance embedding as the latent variable to represent an intended prosody.
- the CHiVE uses the selected utterance embedding to predict a duration of each syllable by encoding linguistic features of each phoneme contained in the syllable with a corresponding prosodic syllable embedding for the syllable, and predict a pitch of each syllable based on the predicted duration for the syllable.
- the CHiVE is configured to generate a plurality of fixed-length pitch frames based on the predicted duration for each syllable such that each fixed-length pitch frame represents the predicted pitch of the syllable.
- the CHiVE may similarly predict energy (e.g., loudness) of each syllable based on the predicted duration for the syllable and generate a plurality of fixed-length energy frames each representing the predicted energy of the syllable.
- the fixed-length pitch and/or energy frames may be provided to a unit-selection model or wave-net model of a TTS system to produce the synthesized speech with the intended prosody provided by the input fixed-length utterance embedding.
- FIG. 1 shows an example system 100 for training a deep neural network 200 to provide a controllable prosody model 300, and for predicting a prosodic representation 322 for a text utterance 320 using the prosody model 300.
- the system 100 includes a computing system 120 having data processing hardware 122 and memory hardware 124 in communication with the data processing hardware 122 and storing instructions that cause the data processing hardware 122 to perform operations.
- the computing system 120 e.g., the data processing hardware 122
- the prosody model 300 may predict a prosodic representation 322 for the input text utterance 320 by conditioning the model 300 on linguistic features extracted from the text utterance 320 and using a fixed-length utterance embedding 260 as a latent variable representing an intended prosody for the text utterance 320.
- the computing system 120 implements the TTS system 150.
- the computing system 120 and the TTS system 150 are distinct and physically separate from one another.
- the computing system may include a distributed system (e.g., cloud computing environment).
- the deep neural network 200 is trained on a large set of reference audio signals 222.
- Each reference audio signal 222 may include a spoken utterance of human speech recorded by a microphone and having a prosodic
- the deep neural network 200 may receive multiple reference audio signals 222 for a same spoken utterance, but with varying prosodies (i.e., the same utterance can be spoken in multiple different ways).
- the reference audio signals 222 are of variable-length such that the duration of the spoken utterances varies even though the content is the same.
- the deep neural network 200 is configured to encode/compress the prosodic representation associated with each reference audio signal 222 into a corresponding fixed-length utterance embedding 260.
- the deep neural network 200 may store each fixed-length utterance embedding 260 in an utterance embedding storage 180 (e.g., on the memory hardware 124 of the computing system 120) along with a corresponding transcript 261 of the reference audio signal 222 associated with the utterance embedding 260.
- the deep neural network 200 may be further trained by back-propagating the fixed-length utterance embeddings 260 conditioned upon linguistic features extracted from the transcripts 261 to generate fixed-length frames of pitch, energy, and duration of each syllable.
- the computing system 120 may use the prosody model 300 to predict a prosodic representation 322 for a text utterance 320.
- the prosody model 300 may select an utterance embedding 260 for the text utterance 320.
- the utterance embedding 260 represents an intended prosody of the text utterance 320. As described in greater detail below with reference to FIGS. 2A-2C and 3A-3C, the prosody model 300 may predict the prosodic representation 322 for the text utterance 320 using the selected utterance embedding 260.
- the prosodic representation 322 may include predicted pitch, predicted timing, and predicted loudness (e.g., energy) for the text utterance 320.
- the TTS system 150 uses the prosodic representation 322 to produce synthesized speech 152 from the text utterance 320 and having the intended prosody.
- FIGS. 2A and 2B show a hierarchical linguistic structure (e.g., deep neural network of FIG. 1) 200 for a clockwork hierarchical variational autoencoder (CHiVE) 300 (‘autoencoder 300’) that provides a controllable model of prosody that jointly predicts, for each syllable of given input text, a duration of all phonemes in the syllable and pitch (F0) and energy (C0) contours for the syllable without relying on any unique mappings from the given input text or other linguistic specification to produce synthesized speech 152 having an intended/selected prosody.
- the autoencoder 300 an encoder portion 302 (FIG.
- the autoencoder 300 is trained so that the number of predicted frames 280 output from the decoder portion 310 is equal to the number of reference frames 220 input to the encoder portion 302. Moreover, the autoencoder 300 is trained so that data associated with the reference and predicted frames 220, 280 substantially match one another.
- the encoder portion 302 receives the sequence of fixed-length reference frames 220 from the input reference audio signal 222.
- the input reference audio signal 222 may include a spoken utterance of human speech recorded by a microphone that includes a target prosody.
- the encoder portion 302 may receive multiple reference audio signals 222 for a same spoken utterance, but with varying prosodies (i.e., the same utterance can be spoken in multiple different ways). For example, the same spoken utterance may vary in prosody when the spoken reference is an answer to a question compared to when the spoken utterance is a question.
- the reference frames 220 may each include a duration of 5 milliseconds (ms) and represent one of a contour of pitch (F0) or a contour of energy (C0) for the reference audio signal 222.
- the encoder portion 302 may also receive a second sequence of reference frames 220 each including a duration of 5 ms and representing the other one of the contour of pitch (F0) or the contour of energy (C0) for the reference audio signal 222.
- the sequence of reference frames 220 sampled from the reference audio signal 222 provides a duration, pitch contour, and/or energy contour to represent prosody for the reference audio signal 222.
- the length or duration of the reference audio signal 222 correlates to a sum of the total number of reference frames 220.
- the encoder portion 302 includes hierarchical levels of reference frames 220, phonemes 230, 230a, syllables 240, 240a, and words 250, 250a for the reference audio signal 222 that clock relative to one another. For instance, the level associated with the sequence of reference frames 220 clocks faster than the next level associated with the sequence of phonemes 230. Similarly, the level associated with the sequence of syllables 240 clocks slower than the level associated with the sequence of phonemes 230 and faster than the level associated with the sequence of words 250.
- the slower clocking layers receive, as input, an output from faster clocking layers so that the output after the final clock (i.e., state) of a faster layer is taken as the input to the corresponding slower layer to essentially provide a sequence-to-sequence encoder.
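The hand-off between layers described here, where the state after the final clock of a faster layer becomes the input to the slower layer, can be sketched with a toy recurrent update. This is an illustrative stand-in only: the scalar `tanh` update replaces a real LSTM cell, the weights are arbitrary, and resetting the faster layer's state for each group is an assumption not stated in the text.

```python
import math
from typing import List


def step(state: float, x: float, w_state: float = 0.5, w_in: float = 0.5) -> float:
    """Toy recurrent update standing in for an LSTM cell (assumption)."""
    return math.tanh(w_state * state + w_in * x)


def encode_fast(inputs: List[float]) -> float:
    """Run the faster layer over its inputs; return the state after its final clock."""
    state = 0.0  # assumption: the faster layer restarts for each group
    for x in inputs:
        state = step(state, x)
    return state


def encode_hierarchy(groups: List[List[float]]) -> List[float]:
    """groups[i] holds the faster-layer inputs (e.g., frames) of slower unit i (e.g., a phoneme)."""
    slow_state = 0.0
    slow_states = []
    for group in groups:
        summary = encode_fast(group)            # faster layer clocks len(group) times
        slow_state = step(slow_state, summary)  # slower layer clocks once
        slow_states.append(slow_state)
    return slow_states


# Hypothetical frame values grouped into two slower-layer units.
states = encode_hierarchy([[0.1, 0.2, 0.3], [0.4, 0.5]])
```

Stacking the same pattern (frames into phonemes, phonemes into syllables, syllables into words) yields one state per unit at each level, with each level clocking less often than the one below it.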
- the hierarchical levels include Long Short-Term Memory (LSTM) levels.
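The clocking scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a plain tanh recurrent cell stands in for the LSTM levels, state is reset at every group boundary, and the helper names (`make_cell`, `run_level`, `encode`), the two-value frame features, and the layer sizes are all hypothetical. Each level clocks once per input from the faster level below it and passes only its final state upward.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cell(in_dim, hid_dim):
    """Weights for a plain tanh recurrent cell (a stand-in for an LSTM level)."""
    return (rng.normal(scale=0.1, size=(in_dim, hid_dim)),
            rng.normal(scale=0.1, size=(hid_dim, hid_dim)))

def run_level(cell, inputs):
    """Clock the cell once per input and return only its final state."""
    w_in, w_rec = cell
    state = np.zeros(w_rec.shape[0])
    for x in inputs:
        state = np.tanh(x @ w_in + state @ w_rec)
    return state

def encode(node, cells, depth=0):
    """Recursively encode nested lists; the leaves are frame feature vectors."""
    if depth == len(cells):
        return node  # a single frame vector
    children = [encode(child, cells, depth + 1) for child in node]
    return run_level(cells[depth], children)

# cells[0] consumes word states, cells[1] syllable states,
# cells[2] phoneme states, and cells[3] raw frames (assumed sizes).
cells = [make_cell(8, 16), make_cell(8, 8), make_cell(8, 8), make_cell(2, 8)]

# Frames per phoneme, grouped by syllable and word, following the example in the text.
counts = [[[4, 3], [4]], [[2, 5, 4]], [[3], [4, 2]]]
utterance = [[[[rng.normal(size=2) for _ in range(n)] for n in syllable]
              for syllable in word] for word in counts]

embedding = encode(utterance, cells)  # fixed length regardless of frame count
```

Because only final states move upward, the output size depends on the top cell alone, so utterances of any length map to a vector of the same length.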
- the encoder portion 302 first encodes the sequence of reference frames 220 into the sequence of phonemes 230.
- Each phoneme 230 receives, as input, a corresponding encoding of a subset of reference frames 220 and includes a duration equal to the number of reference frames 220 in the encoded subset.
- the first four fixed-length reference frames 220 are encoded into phoneme 230Aa1; the next three fixed-length reference frames 220 are encoded into phoneme 230Aa2; the next four fixed-length reference frames 220 are encoded into phoneme 230Ab1; the next two fixed-length reference frames 220 are encoded into phoneme 230Ba1; the next five fixed-length reference frames 220 are encoded into phoneme 230Ba2; the next four fixed-length reference frames 220 are encoded into phoneme 230Ba3; the next three fixed-length reference frames 220 are encoded into phoneme 230Ca1; the next four fixed-length reference frames 220 are encoded into phoneme 230Cb1; and the final two fixed-length reference frames 220 are encoded into phoneme 230Cb2.
- each phoneme 230 in the sequence of phonemes 230 includes a corresponding duration based on the number of reference frames 220 encoded into the phoneme 230 and corresponding pitch and/or energy contours.
- phoneme 230Aa1 includes a duration equal to 20ms (i.e., four reference frames 220 each having the fixed length of five milliseconds) and phoneme 230Aa2 includes a duration equal to 15ms (i.e., three reference frames 220 each having the fixed length of five milliseconds).
- the level of reference frames 220 clocks a total of seven times for a single clocking between the phoneme 230Aa1 and the next phoneme 230Aa2 for the level of phonemes 230.
- the encoder portion 302 is further configured to encode the sequence of phonemes 230 into the sequence of syllables 240 for the reference audio signal 222.
- each syllable 240 receives, as input, a corresponding encoding of one or more phonemes 230 and includes a duration equal to a sum of the durations for the one or more phonemes 230 of the corresponding encoding.
- the duration of the syllables 240 may indicate timing of the syllables 240 and pauses in between adjacent syllables 240.
- the first two phonemes 230Aa1, 230Aa2 are encoded into syllable 240Aa; the next phoneme 230Ab1 is encoded into syllable 240Ab; each of phonemes 230Ba1, 230Ba2, 230Ba3 is encoded into syllable 240Ba; phoneme 230Ca1 is encoded into syllable 240Ca; and phonemes 230Cb1, 230Cb2 are encoded into syllable 240Cb.
- Each syllable 240Aa-240Cb in the level of syllables 240 may correspond to a respective syllable embedding (e.g., a numerical vector) that indicates a duration, pitch (F0), and/or energy (C0) associated with the corresponding syllable 240. Moreover, each syllable is indicative of a corresponding state for the level of syllables 240.
- syllable 240Aa includes a duration equal to 35ms (i.e., the sum of the 20ms duration for phoneme 230Aa1 and the 15ms duration for phoneme 230Aa2) and syllable 240Ab includes a duration equal to 20ms (i.e., the 20ms duration for phoneme 230Ab1).
- the level of reference frames 220 clocks a total of eleven times and the level of phonemes 230 clocks a total of three times for a single clocking between the syllable 240Aa and the next syllable 240Ab for the level of syllables 240.
- the encoder portion 302 further encodes the sequence of syllables 240 into the sequence of words 250 for the reference audio signal 222.
- syllables 240Aa, 240Ab are encoded into word 250A
- syllable 240Ba is encoded into word 250B
- syllables 240Ca, 240Cb are encoded into word 250C.
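The duration bookkeeping in this example can be checked with a short sketch. The frame counts and groupings come from the text (reference numerals written with digit suffixes, e.g. 230Aa1); the nested-dictionary layout and the variable names are only an illustrative choice.

```python
FRAME_MS = 5  # fixed reference-frame length stated in the text

# word -> syllable -> phoneme -> number of 5 ms reference frames
hierarchy = {
    "250A": {"240Aa": {"230Aa1": 4, "230Aa2": 3}, "240Ab": {"230Ab1": 4}},
    "250B": {"240Ba": {"230Ba1": 2, "230Ba2": 5, "230Ba3": 4}},
    "250C": {"240Ca": {"230Ca1": 3}, "240Cb": {"230Cb1": 4, "230Cb2": 2}},
}

# A syllable lasts as long as the frames of all phonemes it contains.
syllable_ms = {
    syllable: sum(phonemes.values()) * FRAME_MS
    for syllables in hierarchy.values()
    for syllable, phonemes in syllables.items()
}

# A word lasts as long as its syllables combined.
word_ms = {
    word: sum(syllable_ms[s] for s in syllables)
    for word, syllables in hierarchy.items()
}

utterance_ms = sum(word_ms.values())
```

With these counts, syllable 240Aa comes to 35 ms and syllable 240Ab to 20 ms, matching the durations given above.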
- the encoder portion 302 encodes the sequence of words 250 into the fixed-length utterance embedding 260.
- the fixed-length utterance embedding 260 includes a numerical vector representing a prosody of the reference audio signal 222.
- the fixed-length utterance embedding 260 includes a numerical vector having a value equal to “256”.
- the encoder portion 302 may repeat this process for each reference audio signal 222.
- the encoder portion 302 encodes a plurality of reference audio signals 222 each corresponding to a same spoken utterance/phrase but with varying prosodies, i.e., each reference audio signal 222 conveys the same utterance but is spoken differently.
- the fixed-length utterance embedding 260 may be stored in the data storage 180 (FIG. 1) along with a respective transcript 261 (e.g., textual representation) of the reference audio signal 222.
- linguistic features may be extracted and stored for use in conditioning the training of the hierarchical linguistic structure 200.
- the linguistic features may include, without limitation, individual sounds for each phoneme, whether each syllable is stressed or unstressed, the type of each word (e.g., noun/adjective/verb) and/or the position of the word in the utterance, and whether the utterance is a question or phrase.
- the hierarchical linguistic structure 200 omits the level associated with the sequence of phonemes 230 and allows the encoder portion 302 to simply encode a corresponding subset of reference frames 220 into each syllable 240 of the syllable level 240 during training.
- the first seven reference frames 220 may be encoded directly into syllable 240Aa without having to encode into corresponding phonemes 230Aa1, 230Aa2 (FIG. 2A) as an intermediary step.
- the hierarchical linguistic structure 200 may optionally omit the level associated with the sequence of words 250 and allow the encoder portion 302 to encode the sequence of syllables 240 directly into the fixed-length utterance embedding 260.
- training may instead optionally include the level associated with the sequence of phonemes 230 and allow the encoder portion 302 to simply encode a corresponding subset of reference frames 220 into each phoneme 230 of the level of phonemes 230 and then encode a corresponding subset of phonemes 230 directly into the fixed-length utterance embedding 260 without having to encode corresponding syllables 240 and/or words 250.
- the decoder portion 310 of the variational autoencoder 300 is configured to produce a plurality of fixed-length syllable embeddings 245 by initially decoding a fixed-length utterance embedding 260 that represents a prosody for an utterance.
- the utterance embedding 260 may include the utterance embedding 260 output from the encoder portion 302 of FIGS. 2A and 2C by encoding the plurality of fixed-length reference frames 220 sampled from the reference audio signal 222.
- the decoder portion 310 is configured to back-propagate the utterance embedding 260 during training to generate the plurality of fixed-length predicted frames 280 that closely match the plurality of fixed-length reference frames 220.
- fixed-length predicted frames 280 for both pitch (F0) and energy (C0) may be generated in parallel to represent a target prosody (e.g., predicted prosody) that substantially matches the reference prosody of the reference audio signal 222 input to the encoder portion 302 as training data.
- a TTS system 150 uses the fixed-length predicted frames 280 to produce synthesized speech 152 with a selected prosody based on the fixed-length utterance embedding 260.
- a unit selection module or a WaveNet module of the TTS system 150 may use the frames 280 to produce the synthesized speech 152 having the intended prosody.
- the decoder portion 310 decodes the utterance embedding 260 (e.g., numerical value of “256”) received from the encoder portion 302 (FIGS. 2A or 2C) into hierarchical levels of words 250, 250b, syllables 240, 240b, phonemes 230, 230b, and the fixed-length predicted frames 280.
- the fixed-length utterance embedding 260 corresponds to a variational layer of hierarchical input data for the decoder portion 310 and each of the stacked hierarchical levels includes Long Short-Term Memory (LSTM) processing cells variably clocked to a length of the hierarchical input data.
- the syllable level 240 clocks faster than the word level 250 and slower than the phoneme level 230.
- the rectangular blocks in each level correspond to LSTM processing cells for respective words, syllables, phonemes, or frames.
- the autoencoder 300 gives the LSTM processing cells of the word level 250 memory over the last 100 words, gives the LSTM cells of the syllable level 240 memory over the last 100 syllables, gives the LSTM cells of the phoneme level 230 memory over the last 100 phonemes, and gives the LSTM cells of the fixed-length pitch and/or energy frames 280 memory over the last 100 fixed-length frames 280.
- the fixed-length frames 280 include a duration (e.g., frame rate) of five milliseconds each
- the corresponding LSTM processing cells provide memory over the last 500 milliseconds (e.g., a half second).
- the decoder portion 310 of the hierarchical linguistic structure 200 simply back-propagates the fixed-length utterance embedding 260 encoded by the encoder portion 302 into the sequence of three words 250A-250C, the sequence of five syllables 240Aa-240Cb, and the sequence of nine phonemes 230Aa1-230Cb2 to generate the sequence of predicted fixed-length frames 280.
- the decoder portion 310 is conditioned upon linguistic features of the input text. By contrast to the encoder portion 302 of FIGS.
- the decoder portion 310 includes outputs from slower clocking layers feeding faster clocking layers such that the output of a slower clocking layer is distributed to the input of the faster clocking layer at each clock cycle with a timing signal appended thereto.
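The distribution step in the decoder can be sketched as below. The text does not specify the form of the timing signal, so this sketch assumes it is the relative position of each faster-level clock within the slower-level unit; the function name `distribute` is hypothetical.

```python
import numpy as np

def distribute(embedding, n_clocks):
    """Copy a slower-level output to every faster-level clock and append a timing signal."""
    # Assumed timing signal: relative position of each clock in [0, 1).
    timing = np.arange(n_clocks) / n_clocks
    tiled = np.tile(embedding, (n_clocks, 1))
    return np.concatenate([tiled, timing[:, None]], axis=1)

# Example: one syllable output handed to the two phonemes it contains.
phoneme_inputs = distribute(np.ones(4), n_clocks=2)
```

Each row is what one faster-level cell would receive: the unchanged slower-level output plus a value telling it where it sits inside that unit.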
- the autoencoder 300 uses the hierarchical linguistic structure 200 to predict a prosodic representation for a given text utterance 320 during inference by jointly predicting durations of phonemes 230 and pitch and/or energy contours for each syllable 240 of the given text utterance 320. Since the text utterance 320 does not provide any context, semantic information, or pragmatic information to indicate an appropriate prosody for the text utterance, the autoencoder 300 selects an utterance embedding 260 as a latent variable to represent an intended prosody for the text utterance 320.
- the utterance embedding 260 may be selected from the utterance embedding data storage 180 (FIG. 1). Each utterance embedding 260 in the storage 180 may be encoded by the encoder portion 302 (FIGS. 2A and 2C) from a corresponding variable-length reference audio signal 222 (FIGS. 2A and 2C) during training. Specifically, the encoder portion 302 compresses prosody of variable-length reference audio signals 222 into fixed-length utterance embeddings 260 during training and stores each utterance embedding 260 together with a transcript 261 of the corresponding reference audio signal 222 in the utterance embedding data storage 180 for use by the decoder portion 310 at inference.
- the autoencoder 300 may first locate utterance embeddings 260 having transcripts 261 that closely match the text utterance 320 and then select one of the utterance embeddings 260 to predict the prosodic representation 322 (FIG. 1) for the given text utterance 320.
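A transcript lookup of this kind might look like the following sketch. The text does not say how "closely match" is measured; this example assumes character-level similarity from Python's standard `difflib`, and the store layout (transcript mapped to embedding) and the function name are hypothetical.

```python
import difflib

def closest_embeddings(text, store, n=3, cutoff=0.6):
    """Return stored embeddings whose transcripts most resemble `text`, best first."""
    matches = difflib.get_close_matches(text, list(store), n=n, cutoff=cutoff)
    return [store[m] for m in matches]

# Toy storage: transcript 261 -> utterance embedding 260 (values are made up).
store = {
    "is it raining": [0.1, 0.9],
    "it is raining": [0.8, 0.2],
    "buy some milk": [0.5, 0.5],
}

candidates = closest_embeddings("is it raining", store, n=1)
```

One of the returned candidates can then be chosen, or several can be combined, to represent the intended prosody.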
- the fixed-length utterance embedding 260 is selected by picking a specific point in a latent space of embeddings 260 that likely represents particular semantics and pragmatics for a target prosody.
- the latent space is sampled to choose a random utterance embedding 260 for representing the intended prosody for the text utterance 320.
- the autoencoder 300 models the latent space as multidimensional unit Gaussian by choosing a mean of the utterance embeddings 260 having closely matching transcripts
- FIGS. 3A and 3C show the text utterance 320 having three words 250A, 250B, 250C represented in the word level 250 of the hierarchical linguistic structure 200.
- the first word 250A contains syllables 240Aa, 240Ab
- the second word 250B contains one syllable 240Ba
- the third word 250C contains syllables 240Ca, 240Cb.
- the syllable level 240 of the hierarchical linguistic structure 200 includes a sequence of five syllables 240Aa-240Cb of the text utterance 320.
- the autoencoder 300 is configured to produce/output a corresponding syllable embedding 245Aa, 245Ab, 245Ba, 245Ca, 245Cb for each syllable 240 from the following inputs: the fixed-length utterance embedding 260;
- the utterance-level linguistic features 262 may include, without limitation, whether or not the text utterance 320 is a question, an answer to a question, a phrase, a sentence, etc.
- the word-level linguistic features 252 may include, without limitation, a word type (e.g., noun, pronoun, verb, adjective, adverb, etc.) and a position of the word in the text utterance 320.
- the syllable-level linguistic features 242 may include, without limitation, whether the syllable 240 is stressed or unstressed.
- each syllable 240Aa, 240Ab, 240Ba, 240Ca, 240Cb in the syllable level 240 may be associated with a corresponding LSTM processing cell that outputs a corresponding syllable embedding 245Aa, 245Ab, 245Ba, 245Ca, 245Cb to the faster clocking phoneme level 230 for decoding the individual fixed-length predicted pitch (F0) frames 280, 280F0 (FIG. 3A) and for decoding the individual fixed-length predicted energy (C0) frames 280, 280C0 (FIG. 3C) in parallel.
- each syllable in the syllable level 240 including a plurality of fixed-length predicted pitch (F0) frames 280F0 that indicate a duration (timing and pauses) and a pitch contour for the syllable 240.
- the duration and pitch contour correspond to a prosodic
- FIG. 3C shows each phoneme in the phoneme level 230 including a plurality of fixed-length predicted energy (C0) frames 280C0 that indicate a duration and an energy contour for the phoneme.
- the first syllable 240Aa (i.e., LSTM processing cell Aa) in the syllable level 240 receives the fixed-length utterance embedding 260, utterance-level linguistic features 262 associated with the text utterance 320, word-level linguistic features 252A associated with the first word 250A, and the syllable-level linguistic features 242Aa for the syllable 240Aa as inputs for producing the corresponding syllable embedding 245Aa.
- the second syllable 240Ab in the syllable level 240 receives the fixed-length utterance embedding 260, the utterance-level linguistic features 262 associated with the text utterance 320, the word-level linguistic features 252A associated with the first word 250A, and corresponding syllable-level linguistic features 242 (not shown) for the syllable 240Ab as inputs for producing the corresponding syllable embedding 245Ab. While the example only shows syllable-level linguistic features 242 associated with the first syllable 240Aa, the corresponding syllable-level linguistic features 242 associated with each other syllable 240Ab-240Cb in the syllable level 240 are only omitted from the views of FIGS. 3A and 3B for the sake of clarity.
- the corresponding syllable-level linguistic features 242 input to the processing block for syllable 240Ab are not shown.
- the remaining sequence of syllables 240Ba, 240Ca, 240Cb in the syllable level 240 each produce corresponding syllable embeddings 245Ba, 245Ca, 245Cb in a similar manner.
- each LSTM processing cell of the syllable level 240 receives the state of the immediately preceding LSTM processing cell of the syllable level 240.
- the phoneme level 230 of the hierarchical linguistic structure 200 includes the sequence of nine phonemes 230Aa1-230Cb2 each associated with a corresponding predicted phoneme duration 234.
- the autoencoder 300 encodes the phoneme-level linguistic features 232 associated with each phoneme 230Aa1-230Cb2 with the corresponding syllable embedding 245 for predicting the corresponding predicted phoneme duration 234 and for predicting the corresponding pitch (F0) contour for the syllable containing the phoneme.
- the phoneme-level linguistic features 232 may include, without limitation, an identity of sound for the corresponding phoneme 230.
- the first syllable 240Aa contains phonemes 230Aa1, 230Aa2 and includes a predicted syllable duration equal to the sum of the predicted phoneme durations 234 for the phonemes 230Aa1, 230Aa2.
- the predicted syllable duration for the first syllable 240Aa determines the number of fixed-length predicted pitch (F0) frames 280F0 to decode for the first syllable 240Aa.
- the autoencoder 300 decodes a total of seven fixed-length predicted pitch (F0) frames 280F0 for the first syllable 240Aa based on the sum of the predicted phoneme durations 234 for the phonemes 230Aa1, 230Aa2. Accordingly, the slower clocking syllable layer 240 distributes the first syllable embedding 245Aa as an input to each phoneme 230Aa1, 230Aa2 included in the first syllable 240Aa. A timing signal may also be appended to the first syllable embedding 245Aa. The syllable level 240 also passes the state of the first syllable 240Aa to the second syllable 240Ab.
- the second syllable 240Ab contains a single phoneme 230Ab1 and therefore includes a predicted syllable duration equal to the predicted phoneme duration 234 for the phoneme 230Ab1.
- the autoencoder 300 decodes a total of four fixed-length predicted pitch (F0) frames 280F0 for the second syllable 240Ab.
- the slower clocking syllable layer 240 distributes the second syllable embedding 245Ab as an input to the phoneme 230Ab1.
- a timing signal may also be appended to the second syllable embedding 245Ab.
- the syllable level 240 also passes the state of the second syllable 240Ab to the third syllable 240Ba.
- the third syllable 240Ba contains phonemes 230Ba1, 230Ba2, 230Ba3 and includes a predicted syllable duration equal to the sum of the predicted phoneme durations 234 for the phonemes 230Ba1, 230Ba2, 230Ba3.
- the autoencoder 300 decodes a total of eleven fixed-length predicted pitch (F0) frames 280F0 for the third syllable 240Ba based on the sum of the predicted phoneme durations 234 for the phonemes 230Ba1, 230Ba2, 230Ba3.
- the slower clocking syllable layer 240 distributes the third syllable embedding 245Ba as an input to each phoneme 230Ba1, 230Ba2, 230Ba3 included in the third syllable 240Ba.
- a timing signal may also be appended to the third syllable embedding 245Ba.
- the syllable level 240 also passes the state of the third syllable 240Ba to the fourth syllable 240Ca.
- the fourth syllable 240Ca contains a single phoneme 230Ca1 and therefore includes a predicted syllable duration equal to the predicted phoneme duration 234 for the phoneme 230Ca1.
- the autoencoder 300 decodes a total of three fixed-length predicted pitch (F0) frames 280F0 for the fourth syllable 240Ca. Accordingly, the slower clocking syllable layer 240 distributes the fourth syllable embedding 245Ca as an input to the phoneme 230Ca1.
- a timing signal may also be appended to the fourth syllable embedding 245Ca.
- the syllable level 240 also passes the state of the fourth syllable 240Ca to the fifth syllable 240Cb.
- the fifth syllable 240Cb contains phonemes 230Cb1, 230Cb2 and includes a predicted syllable duration equal to the sum of the predicted phoneme durations 234 for the phonemes 230Cb1, 230Cb2.
- the autoencoder 300 decodes a total of six fixed-length predicted pitch (F0) frames 280F0 for the fifth syllable 240Cb based on the sum of the predicted phoneme durations 234 for the phonemes 230Cb1, 230Cb2. Accordingly, the slower clocking syllable layer 240 distributes the fifth syllable embedding 245Cb as an input to each phoneme 230Cb1, 230Cb2 included in the fifth syllable 240Cb. A timing signal may also be appended to the fifth syllable embedding 245Cb.
- FIG. 3B provides a detailed view within dashed box 350 of FIG. 3A to show the decoding of the first syllable embedding 245Aa into individual fixed-length predicted pitch (F0) frames 280F0 for the first syllable 240Aa.
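The frame counts quoted for the five syllables follow directly from the phoneme durations. In the sketch below the millisecond values are hypothetical stand-ins for predicted phoneme durations 234, chosen to agree with the example's frame counts at 5 ms per frame; the helper name is also an assumption.

```python
FRAME_MS = 5  # fixed frame length from the text

# Hypothetical predicted phoneme durations (ms), grouped by syllable.
predicted_ms = {
    "240Aa": [20, 15],
    "240Ab": [20],
    "240Ba": [10, 25, 20],
    "240Ca": [15],
    "240Cb": [20, 10],
}

def frames_for_syllable(durations_ms, frame_ms=FRAME_MS):
    """Number of fixed-length pitch frames to decode for one syllable."""
    return round(sum(durations_ms) / frame_ms)

pitch_frames = {s: frames_for_syllable(d) for s, d in predicted_ms.items()}
```

This yields seven, four, eleven, three, and six frames for the five syllables, in the order given in the text.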
- the autoencoder 300 determines the number of fixed-length predicted pitch (F0) frames 280 to decode based on the predicted syllable duration for the first syllable 240Aa.
- the first syllable 240Aa generates the corresponding first syllable embedding 245Aa for distribution as an input to each of the first and second phonemes 230Aa1, 230Aa2 of the faster clocking phoneme level 230.
- the autoencoder 300 predicts the phoneme duration 234 for the first phoneme 230Aa1 by encoding the phoneme-level linguistic features 232 associated with the first phoneme 230Aa1 with the first syllable embedding 245Aa. Likewise, the autoencoder 300 predicts the phoneme duration 234 for the second phoneme 230Aa2 by encoding the phoneme-level linguistic features (not shown) associated with the second phoneme 230Aa2 with the first syllable embedding 245Aa. The second phoneme 230Aa2 also receives the previous state from the first phoneme 230Aa1.
- the predicted syllable duration for the first syllable 240Aa is equal to the sum of the predicted phoneme durations 234 for the first and second phonemes 230Aa1, 230Aa2.
- the encodings of the first syllable embedding 245Aa with the corresponding phoneme-level linguistic features 232 associated with each of the phonemes 230Aa1, 230Aa2 are further combined with the first syllable embedding 245Aa at the output of the phoneme level 230 to predict the pitch (F0) for the first syllable 240Aa and generate the fixed-length predicted pitch (F0) frames 280F0 for the first syllable 240Aa.
- the autoencoder 300 determines the total number (e.g., seven) of fixed-length predicted pitch (F0) frames 280F0 to
- the fixed-length predicted pitch (F0) frames 280 decoded from the first syllable embedding 245Aa collectively indicate a corresponding duration and pitch contour for the first syllable 240Aa of the text utterance 320.
- the autoencoder 300 similarly decodes each of the remaining syllable embeddings 245Ab, 245Ba, 245Ca, 245Cb output from the syllable level 240 into individual fixed-length predicted pitch (F0) frames 280 for each corresponding syllable 240Ab, 240Ba, 240Ca, 240Cb.
- the second syllable embedding 245Ab is further combined at the output of the phoneme level 230 with the encoding of the second syllable embedding 245Ab and the corresponding phoneme-level linguistic features 232 associated with the phoneme 230Ab1.
- the third syllable embedding 245Ba is further combined at the output of the phoneme level 230 with the encodings of the third syllable embedding 245Ba and the corresponding phoneme-level linguistic features 232 associated with each of the phonemes 230Ba1, 230Ba2, 230Ba3.
- the fourth syllable embedding 245Ca is further combined at the output of the phoneme level 230 with the encodings of the fourth syllable embedding 245Ca and the corresponding phoneme-level linguistic features 232 associated with the phoneme
- the fifth syllable embedding 245Cb is further combined at the output of the phoneme level 230 with the encodings of the fifth syllable embedding 245Cb and the corresponding phoneme-level linguistic features 232 associated with each of the phonemes 230Cb1, 230Cb2.
- the fixed-length predicted pitch (F0) frames 280F0 generated by the autoencoder 300 include frame-level LSTM
- other configurations may replace the frame-level LSTM of pitch (F0) frames 280F0 with a feed-forward layer so that the pitch (F0) of every frame in a corresponding syllable is predicted in one pass.
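A feed-forward alternative of this kind could be sketched as below. The text gives no layer details, so everything here is an assumption: a single linear layer, an input made of the syllable-level vector plus each frame's relative position, and the function name `predict_pitch_one_pass`. The point is only that all frames of a syllable come out of one matrix product instead of one recurrent step per frame.

```python
import numpy as np

def predict_pitch_one_pass(syllable_vec, n_frames, weights, bias):
    """Predict one pitch value per frame with a single feed-forward layer."""
    # Assumed input per frame: the syllable vector plus the frame's relative position.
    position = (np.arange(n_frames) / n_frames)[:, None]
    inputs = np.hstack([np.tile(syllable_vec, (n_frames, 1)), position])
    return inputs @ weights + bias  # shape: (n_frames,)

rng = np.random.default_rng(0)
syllable_vec = rng.normal(size=8)   # stand-in for a syllable-level encoding
weights = rng.normal(size=9)        # untrained, illustrative weights
pitch = predict_pitch_one_pass(syllable_vec, n_frames=7, weights=weights, bias=0.0)
```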
- the autoencoder 300 is further configured to encode the phoneme-level linguistic features 232 associated with each phoneme 230Aa1-230Cb2 with the corresponding syllable embedding 245 for predicting the corresponding energy (C0) contour for each phoneme 230.
- the phoneme-level linguistic features 232 associated with phonemes 230Aa2-230Cb2 in the phoneme level 230 are only omitted from the view of FIG. 3C for the sake of clarity.
- the autoencoder 300 determines the number of fixed-length predicted energy (C0) frames 280, 280C0 to decode for each phoneme 230 based on the corresponding predicted phoneme duration 234.
- the autoencoder 300 decodes/generates four (4) predicted energy (C0) frames 280C0 for the first phoneme 230Aa1, three (3) predicted energy (C0) frames 280C0 for the second phoneme 230Aa2, four (4) predicted energy (C0) frames 280C0 for the third phoneme 230Ab1, two (2) predicted energy (C0) frames 280C0 for the fourth phoneme 230Ba1, five (5) predicted energy (C0) frames 280C0 for the fifth phoneme 230Ba2, four (4) predicted energy (C0) frames 280C0 for the sixth phoneme 230Ba3, three (3) predicted energy (C0) frames 280C0 for the seventh phoneme 230Ca1, four (4) predicted energy (C0) frames 280C0 for the eighth phoneme 230Cb1, and two (2) predicted energy (C0) frames 280C0 for the ninth phoneme 230Cb2.
- the predicted energy contour for each phoneme in the phoneme level 230 is based on an encoding between the syllable embedding 245 input from the corresponding syllable in the slower clocking syllable level 240 that contains the phoneme and the linguistic features 232 associated with the phoneme.
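The mapping from phonemes to energy frames can be illustrated with a frame-level alignment. The frame counts are those listed in the text; the per-phoneme energy values are made up, and the flat contour within each phoneme is a simplification, since a trained decoder would vary energy from frame to frame.

```python
import numpy as np

# Energy frames decoded per phoneme, in utterance order (counts from the text).
frames_per_phoneme = np.array([4, 3, 4, 2, 5, 4, 3, 4, 2])

# frame_to_phoneme[i] is the index of the phoneme that owns frame i.
frame_to_phoneme = np.repeat(np.arange(len(frames_per_phoneme)), frames_per_phoneme)

# Hypothetical energy level for each phoneme (illustrative values only).
phoneme_energy = np.linspace(0.2, 1.0, len(frames_per_phoneme))

# Piecewise-constant contour: every frame takes its phoneme's energy level.
energy_frames = phoneme_energy[frame_to_phoneme]
```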
- FIG. 4 is a flow chart of an example arrangement of operations for a method 400 of predicting a prosodic representation 322 for a text utterance 320.
- the method 400 may be described with reference to FIGS. 1-3C.
- the memory hardware 124 residing on the computer system 120 of FIG. 1 may store instructions that when executed by the data processing hardware 122 cause the data processing hardware 122 to execute the operations for the method 400.
- the method 400 includes receiving the text utterance 320.
- the text utterance 320 has at least one word, each word having at least one syllable, each syllable having at least one phoneme.
- the method 400 includes selecting an utterance embedding 260 for the text utterance 320.
- the utterance embedding 260 represents an intended prosody.
- the selected utterance embedding 260 is used to predict the prosodic representation 322 of the text utterance 320 for use by a TTS system 150 to produce synthesized speech 152 from the text utterance 320 and having the intended prosody.
- the utterance embedding 260 may be represented by a fixed-length numerical vector.
- the numerical vector may include a value equal to “256”.
- the data processing hardware 122 may first query the data storage 180 to locate utterance embeddings 260 having transcripts 261 that closely match the text utterance 320 and then select one of the utterance embeddings 260 to predict the prosodic representation 322 for the given text utterance 320.
- the fixed-length utterance embedding 260 is selected by picking a specific point in a latent space of embeddings 260 that likely represents particular semantics and pragmatics for a target prosody.
- the latent space is sampled to choose a random utterance embedding 260 for representing the intended prosody for the text utterance 320.
- the data processing hardware 122 models the latent space as multidimensional unit Gaussian by choosing a mean of the utterance embeddings 260 having closely matching transcripts 261 for representing a most likely prosody for the linguistic features of the text utterance 320. If the prosody variation of the training data is reasonably neutral, the last example of choosing the mean of utterance embeddings 260 is a reasonable choice.
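Two of the selection strategies above, random sampling and taking a mean, can be sketched as follows. The embedding size of 256 and both function names are assumptions made for illustration; the text only states that the latent space is modeled as a multidimensional unit Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 256  # assumed embedding size for this sketch

def sample_prior(dim=EMBED_DIM):
    """Draw a random utterance embedding from a unit Gaussian latent space."""
    return rng.standard_normal(dim)

def mean_of_matches(embeddings):
    """Average the embeddings whose transcripts closely match the text."""
    return np.mean(np.stack(embeddings), axis=0)

random_embedding = sample_prior()
neutral_embedding = mean_of_matches([sample_prior(), sample_prior(), sample_prior()])
```

Averaging pulls the result toward the center of the matching embeddings, which fits the observation that the mean is a reasonable choice when the training prosody is fairly neutral.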
- the method 400 includes predicting a duration of the syllable by encoding linguistic features 232 of each phoneme 230 of the syllable with a
- the method 400 may predict a duration 234 of the corresponding phoneme 230 by encoding the linguistic features 232 of the corresponding phoneme 230 with the corresponding prosodic syllable embedding 245 for the syllable 240. Thereafter, the method 400 may predict the duration of the syllable 240 by summing the predicted durations 234 for each phoneme 230 associated with the syllable 240.
- the method 400 includes predicting a pitch contour of the syllable based on the predicted duration for the syllable.
- the method 400 also includes generating a plurality of fixed-length predicted pitch frames 280, 280F0 based on the predicted duration for the syllable 240.
- Each fixed-length predicted pitch frame 280F0 represents part of the predicted contour of the syllable 240.
- Additional operations for the method 400 may further include, for each syllable 240, using the selected utterance embedding 260, predicting an energy contour of each phoneme 230 in the syllable 240 based on a predicted duration 234 for the corresponding phoneme 230.
- the method 400 may generate a plurality of fixed-length predicted energy frames 280, 280C0 based on the predicted duration 234 for the corresponding phoneme 230.
- each fixed-length energy frame 280C0 represents the predicted energy contour of the corresponding phoneme 230.
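The per-syllable operations of method 400 can be strung together in a short sketch. The trained model is replaced by two caller-supplied functions, and the toy predictors, their signatures, and the data layout are all hypothetical; only the order of steps (predict phoneme durations, sum them, derive the frame count, generate pitch frames) follows the text.

```python
FRAME_MS = 5  # fixed frame length from the text

def predict_prosody(syllables, embedding, predict_duration_ms, predict_pitch):
    """Return, for each syllable, its predicted duration and fixed-length pitch frames."""
    result = []
    for syllable in syllables:
        # Predict each phoneme's duration, then sum to get the syllable duration.
        durations = [predict_duration_ms(p, embedding) for p in syllable]
        syllable_ms = sum(durations)
        n_frames = round(syllable_ms / FRAME_MS)
        # One pitch value per fixed-length frame of the syllable.
        pitch = [predict_pitch(syllable, embedding, i, n_frames) for i in range(n_frames)]
        result.append({"duration_ms": syllable_ms, "pitch_frames": pitch})
    return result

# Toy stand-ins for the trained predictors (not from the source).
def toy_duration(phoneme, embedding):
    return 5 * len(phoneme)

def toy_pitch(syllable, embedding, frame_index, n_frames):
    return 120.0 + 10.0 * frame_index / n_frames

prosody = predict_prosody([["aaaa", "bbb"], ["cccc"]], None, toy_duration, toy_pitch)
```

Energy frames per phoneme could be produced the same way, using each phoneme's own predicted duration instead of the syllable total.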
- a software application may refer to computer software that causes a computing device to perform a task.
- a software application may be referred to as an “application,” an “app,” or a “program.”
- Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing
- the non-transitory memory may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by a computing device.
- the non-transitory memory may be volatile and/or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM) / programmable read-only memory (PROM) / erasable programmable read-only memory (EPROM) / electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs).
- examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM) as well as disks or tapes.
- FIG. 5 is a schematic view of an example computing device 500 (e.g., computing system 120 of FIG. 1) that may be used to implement the systems and methods described in this document.
- the computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
- the components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
- the computing device 500 includes a processor 510, memory 520, a storage device 530, a high-speed interface/controller 540 connecting to the memory 520 and high-speed expansion ports 550, and a low speed interface/controller 560 connecting to a low speed bus 570 and a storage device 530.
- Each of the components 510, 520, 530, 540, 550, and 560 is interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate.
- the processor 510 (e.g., data processing hardware 122 of FIG. 1)
- Multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory.
- Multiple computing devices 500 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
- The memory 520 stores information non-transitorily within the computing device 500.
- The memory 520 may be a computer-readable medium, a volatile memory unit(s), or non-volatile memory unit(s).
- The non-transitory memory 520 may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 500.
- Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM) / programmable read-only memory (PROM) / erasable programmable read-only memory (EPROM) / electrically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs).
- Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), and phase change memory (PCM), as well as disks or tapes.
- The storage device 530 is capable of providing mass storage for the computing device 500.
- The storage device 530 is a computer-readable medium.
- The storage device 530 may be a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations.
- A computer program product is tangibly embodied in an information carrier.
- The computer program product contains instructions that, when executed, perform one or more methods, such as those described above.
- The information carrier is a computer- or machine-readable medium, such as the memory 520, the storage device 530, or memory on processor 510.
- The high-speed controller 540 manages bandwidth-intensive operations for the computing device 500, while the low-speed controller 560 manages less bandwidth-intensive operations. Such allocation of duties is exemplary only. In some
- the high-speed controller 540 is coupled to the memory 520, the display 580 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 550, which may accept various expansion cards (not shown).
- The low-speed controller 560 is coupled to the storage device 530 and a low-speed expansion port 590.
- The low-speed expansion port 590, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
- The computing device 500 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 500a or multiple times in a group of such servers 500a, as a laptop computer 500b, or as part of a rack server system 500c.
- Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or
- a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
- The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output.
- The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- A processor will receive instructions and data from a read-only memory or a random access memory or both.
- The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
- A computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks.
- A computer need not have such devices.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- One or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen, for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862670384P | 2018-05-11 | 2018-05-11 | |
PCT/US2019/027279 WO2019217035A1 (en) | 2018-05-11 | 2019-04-12 | Clockwork hierarchical variational encoder |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3776531A1 (en) | 2021-02-17 |
Family
ID=66323968
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP19720289.8A Pending EP3776531A1 (en) | 2018-05-11 | 2019-04-12 | Clockwork hierarchical variational encoder |
Country Status (6)
Country | Link |
---|---|
US (2) | US10923107B2 (en) |
EP (1) | EP3776531A1 (en) |
JP (2) | JP7035225B2 (en) |
KR (2) | KR102327614B1 (en) |
CN (2) | CN112005298B (en) |
WO (1) | WO2019217035A1 (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3739572A4 (en) * | 2018-01-11 | 2021-09-08 | Neosapience, Inc. | Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium |
US10923107B2 (en) * | 2018-05-11 | 2021-02-16 | Google Llc | Clockwork hierarchical variational encoder |
US11264010B2 (en) * | 2018-05-11 | 2022-03-01 | Google Llc | Clockwork hierarchical variational encoder |
EP3576019A1 (en) * | 2018-05-29 | 2019-12-04 | Nokia Technologies Oy | Artificial neural networks |
KR20200015418A (en) * | 2018-08-02 | 2020-02-12 | 네오사피엔스 주식회사 | Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature |
WO2021040490A1 (en) * | 2019-08-30 | 2021-03-04 | Samsung Electronics Co., Ltd. | Speech synthesis method and apparatus |
EP4073786A1 (en) * | 2019-12-10 | 2022-10-19 | Google LLC | Attention-based clockwork hierarchical variational encoder |
US11562744B1 (en) * | 2020-02-13 | 2023-01-24 | Meta Platforms Technologies, Llc | Stylizing text-to-speech (TTS) voice response for assistant systems |
US11881210B2 (en) * | 2020-05-05 | 2024-01-23 | Google Llc | Speech synthesis prosody using a BERT model |
CN111724809A (en) * | 2020-06-15 | 2020-09-29 | 苏州意能通信息技术有限公司 | Vocoder implementation method and device based on variational self-encoder |
US11514888B2 (en) * | 2020-08-13 | 2022-11-29 | Google Llc | Two-level speech prosody transfer |
US11232780B1 (en) | 2020-08-24 | 2022-01-25 | Google Llc | Predicting parametric vocoder parameters from prosodic features |
CN112542153A (en) * | 2020-12-02 | 2021-03-23 | 北京沃东天骏信息技术有限公司 | Duration prediction model training method and device, and speech synthesis method and device |
KR20240030714A (en) * | 2022-08-31 | 2024-03-07 | 삼성전자주식회사 | Electronic apparatus and method for controlling thereof |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1995030193A1 (en) * | 1994-04-28 | 1995-11-09 | Motorola Inc. | A method and apparatus for converting text into audible signals using a neural network |
WO2003019528A1 (en) * | 2001-08-22 | 2003-03-06 | International Business Machines Corporation | Intonation generating method, speech synthesizing device by the method, and voice server |
CN101064103B (en) * | 2006-04-24 | 2011-05-04 | 中国科学院自动化研究所 | Chinese voice synthetic method and system based on syllable rhythm restricting relationship |
CN102254554B (en) * | 2011-07-18 | 2012-08-08 | 中国科学院自动化研究所 | Method for carrying out hierarchical modeling and predicating on mandarin accent |
CN102270449A (en) | 2011-08-10 | 2011-12-07 | 歌尔声学股份有限公司 | Method and system for synthesising parameter speech |
CN105185373B (en) * | 2015-08-06 | 2017-04-05 | 百度在线网络技术(北京)有限公司 | The generation of prosody hierarchy forecast model and prosody hierarchy Forecasting Methodology and device |
CN105244020B (en) * | 2015-09-24 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device |
EP3438972B1 (en) * | 2016-03-28 | 2022-01-26 | Sony Group Corporation | Information processing system and method for generating speech |
US10366165B2 (en) * | 2016-04-15 | 2019-07-30 | Tata Consultancy Services Limited | Apparatus and method for printing steganography to assist visually impaired |
TWI595478B (en) * | 2016-04-21 | 2017-08-11 | 國立臺北大學 | Speaking-rate normalized prosodic parameter builder, speaking-rate dependent prosodic model builder, speaking-rate controlled prosodic-information generating device and method for being able to learn different languages and mimic various speakers' speaki |
JP2018004977A (en) * | 2016-07-04 | 2018-01-11 | 日本電信電話株式会社 | Voice synthesis method, system, and program |
US11069335B2 (en) | 2016-10-04 | 2021-07-20 | Cerence Operating Company | Speech synthesis using one or more recurrent neural networks |
CN107464559B (en) * | 2017-07-11 | 2020-12-15 | 中国科学院自动化研究所 | Combined prediction model construction method and system based on Chinese prosody structure and accents |
US11264010B2 (en) * | 2018-05-11 | 2022-03-01 | Google Llc | Clockwork hierarchical variational encoder |
US10923107B2 (en) * | 2018-05-11 | 2021-02-16 | Google Llc | Clockwork hierarchical variational encoder |
JP7108147B2 (en) * | 2019-05-23 | 2022-07-27 | グーグル エルエルシー | Variational embedding capacity in end-to-end speech synthesis for expressions |
US11222620B2 (en) * | 2020-05-07 | 2022-01-11 | Google Llc | Speech recognition using unspoken text and speech synthesis |
US11232780B1 (en) * | 2020-08-24 | 2022-01-25 | Google Llc | Predicting parametric vocoder parameters from prosodic features |
2019
- 2019-04-12: US application US16/382,722, publication US10923107B2 (en), status: Active
- 2019-04-12: KR application KR1020207032596A, publication KR102327614B1 (en), status: IP Right Grant
- 2019-04-12: KR application KR1020217036742A, publication KR102464338B1 (en), status: IP Right Grant
- 2019-04-12: JP application JP2020563611A, publication JP7035225B2 (en), status: Active
- 2019-04-12: CN application CN201980027064.1A, publication CN112005298B (en), status: Active
- 2019-04-12: CN application CN202311432566.7A, publication CN117524188A (en), status: Pending
- 2019-04-12: EP application EP19720289.8A, publication EP3776531A1 (en), status: Pending
- 2019-04-12: WO application PCT/US2019/027279, publication WO2019217035A1 (en), status: Unknown
2021
- 2021-01-13: US application US17/147,548, publication US11393453B2 (en), status: Active
2022
- 2022-03-01: JP application JP2022030966A, publication JP7376629B2 (en), status: Active
Also Published As
Publication number | Publication date |
---|---|
KR102464338B1 (en) | 2022-11-07 |
CN112005298A (en) | 2020-11-27 |
US20210134266A1 (en) | 2021-05-06 |
US10923107B2 (en) | 2021-02-16 |
WO2019217035A1 (en) | 2019-11-14 |
US20190348020A1 (en) | 2019-11-14 |
US11393453B2 (en) | 2022-07-19 |
JP7035225B2 (en) | 2022-03-14 |
KR20200141497A (en) | 2020-12-18 |
CN117524188A (en) | 2024-02-06 |
KR20210138155A (en) | 2021-11-18 |
JP7376629B2 (en) | 2023-11-08 |
JP2022071074A (en) | 2022-05-13 |
JP2021521492A (en) | 2021-08-26 |
KR102327614B1 (en) | 2021-11-17 |
CN112005298B (en) | 2023-11-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11393453B2 (en) | | Clockwork hierarchical variational encoder |
US11664011B2 (en) | | Clockwork hierarchal variational encoder |
US11514888B2 (en) | | Two-level speech prosody transfer |
US11881210B2 (en) | | Speech synthesis prosody using a BERT model |
US20240038214A1 (en) | | Attention-Based Clockwork Hierarchical Variational Encoder |
US11232780B1 (en) | | Predicting parametric vocoder parameters from prosodic features |
WO2023288169A1 (en) | | Two-level text-to-speech systems using synthetic training data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: UNKNOWN |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| 17P | Request for examination filed | Effective date: 20201111 |
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| AX | Request for extension of the european patent | Extension state: BA ME |
| DAV | Request for validation of the european patent (deleted) | |
| DAX | Request for extension of the european patent (deleted) | |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
| 17Q | First examination report despatched | Effective date: 20230310 |