CN117597729A - Advancing the use of text and speech in ASR pre-training with consistency and contrastive losses - Google Patents

Advancing the use of text and speech in ASR pre-training with consistency and contrastive losses

Info

Publication number
CN117597729A
Authority
CN
China
Prior art keywords
speech
utterance
synthesized speech
training
representation
Legal status
Pending
Application number
CN202280046159.XA
Other languages
Chinese (zh)
Inventor
Andrew Rosenberg
Zhehuai Chen
Bhuvana Ramabhadran
Pedro J. Moreno Mengibar
Gary Wang
Yu Zhang
Current Assignee
Google LLC
Original Assignee
Google LLC
Priority date
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Priority claimed from PCT/US2022/025139 (WO2023277993A1)
Publication of CN117597729A


Landscapes

  • Machine Translation (AREA)

Abstract

A method (600) includes receiving training data including non-spoken text utterances (320), untranscribed non-synthesized speech utterances (306), and transcribed non-synthesized speech utterances (304). Each non-spoken text utterance is not paired with any corresponding spoken utterance of non-synthesized speech. Each untranscribed non-synthesized speech utterance is not paired with a corresponding transcription. Each transcribed non-synthesized speech utterance is paired with a corresponding transcription (302). The method also includes generating a corresponding synthesized speech representation (332) for each non-spoken text utterance of the received training data using a text-to-speech model (330). The method also includes pre-training an audio encoder (210) on the synthesized speech representations generated for the non-spoken text utterances, the untranscribed non-synthesized speech utterances, and the transcribed non-synthesized speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

Description

Advancing the use of text and speech in ASR pre-training with consistency and contrastive losses
Technical Field
The present disclosure relates to advancing the use of text and speech in Automatic Speech Recognition (ASR) pre-training with consistency and contrastive losses.
Background
Automatic Speech Recognition (ASR), the process of taking an audio input and transcribing it into text, has become an important technology for use in mobile devices and other devices. In general, automatic speech recognition attempts to provide an accurate transcription of what a person has said by taking an audio input (e.g., a speech utterance) and transcribing the audio input into text. Modern ASR models continue to improve based on deep neural networks, with gains in both accuracy (e.g., a low Word Error Rate (WER)) and latency (e.g., the delay between the user speaking and the transcription appearing). However, one challenge in developing deep learning-based ASR models is that the parameters of the ASR model tend to overfit the training data, making it difficult for the ASR model to generalize to unseen data when the training data is not extensive enough. Thus, training the ASR model on a larger training data set improves the accuracy of the ASR model. Synthesized speech and/or data-augmented speech can be incorporated to increase the volume of training data used to train the ASR model.
Disclosure of Invention
One aspect of the present disclosure provides a computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations for pre-training an audio encoder to jointly learn shared representations of speech and text. The operations include receiving training data that includes non-spoken text utterances, untranscribed non-synthesized speech utterances, and transcribed non-synthesized speech utterances. Each non-spoken text utterance is not paired with any corresponding spoken utterance of non-synthesized speech. Each untranscribed non-synthesized speech utterance is not paired with a corresponding transcription. Each transcribed non-synthesized speech utterance is paired with a corresponding transcription. The operations also include generating a corresponding synthesized speech representation for each non-spoken text utterance of the received training data using a text-to-speech model. The operations also include pre-training an audio encoder on the synthesized speech representations generated for the non-spoken text utterances, the untranscribed non-synthesized speech utterances, and the transcribed non-synthesized speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
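For illustration only, the following Python sketch outlines the claimed flow at a high level; the function and dictionary key names are hypothetical, and the individual loss computations (described with reference to FIGS. 3A-3C) are abstracted behind a `pretrain_step` callable supplied by the caller.

```python
# A minimal sketch of the claimed pre-training flow; names are illustrative, not
# taken from the disclosure, and the losses are computed inside `pretrain_step`.
def pretrain_audio_encoder(audio_encoder, tts_model, training_batches, pretrain_step):
    for batch in training_batches:
        unspoken_text = batch["unspoken_text"]                # text with no paired audio
        untranscribed_speech = batch["untranscribed_speech"]  # audio with no transcription
        transcribed_speech = batch["transcribed_speech"]      # (audio, transcription) pairs

        # Generate a synthesized speech representation for each non-spoken text utterance.
        synthesized_speech = [tts_model(text) for text in unspoken_text]

        # Pre-train the encoder jointly on synthesized, untranscribed, and transcribed
        # speech so that it learns a shared speech/text representation.
        pretrain_step(audio_encoder,
                      synthesized=synthesized_speech,
                      untranscribed=untranscribed_speech,
                      transcribed=transcribed_speech)
    return audio_encoder
```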
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the audio encoder includes a stack of self-attention layers, each self-attention layer including a multi-headed self-attention mechanism. In some examples, pre-training the audio encoder includes: for each untranscribed non-synthesized speech utterance, generating a corresponding encoded representation of the untranscribed non-synthesized speech utterance and pre-training the audio encoder on a contrastive loss applied on the corresponding encoded representation of the untranscribed non-synthesized speech utterance; for each synthesized speech representation, generating a corresponding encoded representation of the synthesized speech representation and pre-training the audio encoder on a contrastive loss applied on the corresponding encoded representation of the synthesized speech representation; and for each transcribed non-synthesized speech utterance, generating a corresponding encoded representation of the transcribed non-synthesized speech utterance and pre-training the audio encoder on a contrastive loss applied on the corresponding encoded representation of the transcribed non-synthesized speech utterance.
In some implementations, pre-training the audio encoder includes: for each synthesized speech representation, at each of a plurality of time steps: generating, using an auxiliary decoder, a first probability distribution over possible synthetic speech recognition hypotheses for the corresponding synthesized speech representation; determining a synthesized speech loss term based on the first probability distribution over possible synthetic speech recognition hypotheses and the non-spoken text utterance corresponding to the synthesized speech representation; and pre-training the audio encoder based on the synthesized speech loss term; and for each transcribed non-synthesized speech utterance, at each of the plurality of time steps: generating, using the auxiliary decoder, a second probability distribution over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthesized speech utterance; determining a non-synthesized speech loss term based on the second probability distribution over possible non-synthetic speech recognition hypotheses and the corresponding transcription paired with the transcribed non-synthesized speech utterance; and pre-training the audio encoder based on the non-synthesized speech loss term. Here, the first probability distribution over possible synthetic speech recognition hypotheses includes one of possible phoneme labels or possible wordpiece labels, and the second probability distribution over possible non-synthetic speech recognition hypotheses includes one of possible phoneme labels or possible wordpiece labels.
In these implementations, pre-training the audio encoder may further include: for each synthesized speech representation, at each of the plurality of time steps: generating, using another auxiliary decoder, a third probability distribution over possible synthetic speech recognition hypotheses for the corresponding synthesized speech representation, the third probability distribution over possible synthetic speech recognition hypotheses including the other one of possible phoneme labels or possible wordpiece labels; determining another synthesized speech loss term based on the third probability distribution over possible synthetic speech recognition hypotheses and the non-spoken text utterance corresponding to the synthesized speech representation; and pre-training the audio encoder based on the other synthesized speech loss term; and for each transcribed non-synthesized speech utterance, at each of the plurality of time steps: generating, using the other auxiliary decoder, a fourth probability distribution over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthesized speech utterance, the fourth probability distribution over possible non-synthetic speech recognition hypotheses including the other one of possible phoneme labels or possible wordpiece labels; determining another non-synthesized speech loss term based on the fourth probability distribution over possible non-synthetic speech recognition hypotheses and the corresponding transcription paired with the transcribed non-synthesized speech utterance; and pre-training the audio encoder based on the other non-synthesized speech loss term. The auxiliary decoder includes one of a Connectionist Temporal Classification (CTC) decoder, a Listen, Attend and Spell (LAS) decoder, or a recurrent neural network-transducer (RNN-T) decoder.
In some examples, the operations further include obtaining a set of training utterance pairs, each training utterance pair including: a corresponding one of the transcribed non-synthesized speech utterances of the received training data; and a paired synthesized speech representation of the corresponding transcribed non-synthesized speech utterance, the paired synthesized speech representation generated by the text-to-speech model performing text-to-speech conversion on the corresponding transcription paired with the transcribed non-synthesized speech utterance. In these examples, pre-training the audio encoder includes, for each training utterance pair in the set of training utterance pairs, at each of a plurality of output steps: generating, using an auxiliary decoder, a first probability distribution over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthesized speech utterance; generating, using the auxiliary decoder, a second probability distribution over possible synthetic speech recognition hypotheses for the corresponding paired synthesized speech representation; determining a consistency loss term for the corresponding training utterance pair based on the first probability distribution over possible non-synthetic speech recognition hypotheses and the second probability distribution over possible synthetic speech recognition hypotheses; and pre-training the audio encoder based on the consistency loss term. One or more of the synthesized speech representations may be augmented prior to pre-training the audio encoder on the synthesized speech representations.
In some implementations, the non-spoken text utterances are generated and/or selected using one or more language models. In some examples, the non-spoken text utterances are generated using a background language model and an in-domain language model trained on transcribed speech utterances associated with a target domain. After pre-training the audio encoder, the pre-trained audio encoder may be fine-tuned on transcribed speech utterances.
Another aspect of the present disclosure provides a system including data processing hardware and memory hardware storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations. The operations include receiving training data that includes non-spoken text utterances, untranscribed non-synthesized speech utterances, and transcribed non-synthesized speech utterances. Each non-spoken text utterance is not paired with any corresponding spoken utterance of non-synthesized speech. Each untranscribed non-synthesized speech utterance is not paired with a corresponding transcription. Each transcribed non-synthesized speech utterance is paired with a corresponding transcription. The operations also include generating a corresponding synthesized speech representation for each non-spoken text utterance of the received training data using a text-to-speech model. The operations also include pre-training an audio encoder on the synthesized speech representations generated for the non-spoken text utterances, the untranscribed non-synthesized speech utterances, and the transcribed non-synthesized speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the audio encoder includes a stack of self-attention layers, each self-attention layer including a multi-headed self-attention mechanism. In some examples, pre-training the audio encoder includes: for each untranscribed non-synthesized speech utterance, generating a corresponding encoded representation of the untranscribed non-synthesized speech utterance and pre-training the audio encoder on a contrastive loss applied on the corresponding encoded representation of the untranscribed non-synthesized speech utterance; for each synthesized speech representation, generating a corresponding encoded representation of the synthesized speech representation and pre-training the audio encoder on a contrastive loss applied on the corresponding encoded representation of the synthesized speech representation; and for each transcribed non-synthesized speech utterance, generating a corresponding encoded representation of the transcribed non-synthesized speech utterance and pre-training the audio encoder on a contrastive loss applied on the corresponding encoded representation of the transcribed non-synthesized speech utterance.
In some implementations, pre-training the audio encoder includes: for each synthesized speech representation, at each of a plurality of time steps: generating, using an auxiliary decoder, a first probability distribution over possible synthetic speech recognition hypotheses for the corresponding synthesized speech representation; determining a synthesized speech loss term based on the first probability distribution over possible synthetic speech recognition hypotheses and the non-spoken text utterance corresponding to the synthesized speech representation; and pre-training the audio encoder based on the synthesized speech loss term; and for each transcribed non-synthesized speech utterance, at each of the plurality of time steps: generating, using the auxiliary decoder, a second probability distribution over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthesized speech utterance; determining a non-synthesized speech loss term based on the second probability distribution over possible non-synthetic speech recognition hypotheses and the corresponding transcription paired with the transcribed non-synthesized speech utterance; and pre-training the audio encoder based on the non-synthesized speech loss term. Here, the first probability distribution over possible synthetic speech recognition hypotheses includes one of possible phoneme labels or possible wordpiece labels, and the second probability distribution over possible non-synthetic speech recognition hypotheses includes one of possible phoneme labels or possible wordpiece labels.
In these implementations, pre-training the audio encoder may further include: for each synthesized speech representation, at each of the plurality of time steps: generating, using another auxiliary decoder, a third probability distribution over possible synthetic speech recognition hypotheses for the corresponding synthesized speech representation, the third probability distribution over possible synthetic speech recognition hypotheses including the other one of possible phoneme labels or possible wordpiece labels; determining another synthesized speech loss term based on the third probability distribution over possible synthetic speech recognition hypotheses and the non-spoken text utterance corresponding to the synthesized speech representation; and pre-training the audio encoder based on the other synthesized speech loss term; and for each transcribed non-synthesized speech utterance, at each of the plurality of time steps: generating, using the other auxiliary decoder, a fourth probability distribution over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthesized speech utterance, the fourth probability distribution over possible non-synthetic speech recognition hypotheses including the other one of possible phoneme labels or possible wordpiece labels; determining another non-synthesized speech loss term based on the fourth probability distribution over possible non-synthetic speech recognition hypotheses and the corresponding transcription paired with the transcribed non-synthesized speech utterance; and pre-training the audio encoder based on the other non-synthesized speech loss term. The auxiliary decoder includes one of a Connectionist Temporal Classification (CTC) decoder, a Listen, Attend and Spell (LAS) decoder, or a recurrent neural network-transducer (RNN-T) decoder.
In some examples, the operations further include obtaining a set of training utterance pairs, each training utterance pair including: a corresponding one of the transcribed non-synthesized speech utterances of the received training data; and a paired synthesized speech representation of the corresponding transcribed non-synthesized speech utterance, the paired synthesized speech representation generated by the text-to-speech model performing text-to-speech conversion on the corresponding transcription paired with the transcribed non-synthesized speech utterance. In these examples, pre-training the audio encoder includes, for each training utterance pair in the set of training utterance pairs, at each of a plurality of output steps: generating, using an auxiliary decoder, a first probability distribution over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthesized speech utterance; generating, using the auxiliary decoder, a second probability distribution over possible synthetic speech recognition hypotheses for the corresponding paired synthesized speech representation; determining a consistency loss term for the corresponding training utterance pair based on the first probability distribution over possible non-synthetic speech recognition hypotheses and the second probability distribution over possible synthetic speech recognition hypotheses; and pre-training the audio encoder based on the consistency loss term. One or more of the synthesized speech representations may be augmented prior to pre-training the audio encoder on the synthesized speech representations.
In some implementations, the non-spoken text utterances are generated and/or selected using one or more language models. In some examples, the non-spoken text utterances are generated using a background language model and an in-domain language model trained on transcribed speech utterances associated with a target domain. After pre-training the audio encoder, the pre-trained audio encoder may be fine-tuned on transcribed speech utterances.
The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1 is a schematic view of an example speech recognition system.
FIG. 2 is a schematic view of a recurrent neural network-Transducer (RNN-T) model architecture.
FIGS. 3A-3C are schematic views of an example training process for pre-training an audio encoder of a speech recognition model.
FIG. 4 is a schematic view of an example non-spoken text selection process for selecting a non-spoken text utterance related to a particular field.
FIG. 5 is an example projected spatial encoder representation of non-synthesized speech and synthesized speech.
FIG. 6 is a flow chart of an example arrangement of the operation of a method of pre-training an audio encoder to jointly learn a shared representation of speech and text.
FIG. 7 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.
Like reference symbols in the various drawings indicate like elements.
Detailed Description
Automatic speech recognition has made great progress with the introduction of sequence-to-sequence (Seq2Seq) models that map audio to character sequences. Meanwhile, text-to-speech (TTS) or speech synthesis systems have successfully applied Seq2Seq models to obtain state-of-the-art natural, realistic-sounding synthesized speech that the human ear cannot distinguish from human speech.
One challenge in developing deep learning-based ASR models is that the parameters of the ASR model tend to overfit the training data, making it difficult for the ASR model to generalize to unseen data when the training data is not extensive enough. Thus, training the ASR model on a larger training data set improves the accuracy of the ASR model. For example, machine learning or other statistical methods can train an ASR model on a training data set that includes more than 10,000 hours of transcribed speech. However, when the domain associated with the training data differs from the domain in which the ASR model will be deployed during inference, the performance of the ASR model suffers. For example, training an ASR model on transcribed speech in a domain associated with video conferences would be less effective at recognizing speech related to voice search queries, and vice versa.
Synthesized speech has the potential to drastically limit the amount of labeled human speech required to train an ASR model, while also providing flexibility to move the ASR model across different domains. Generally, although state-of-the-art synthesized speech is indistinguishable from human speech, the use of synthesized speech has been shown to affect ASR training differently than human speech. This gap between synthesized speech and human speech is attributable to the mismatch between synthesized speech data and human speech data that arises from the one-to-many mapping problem TTS systems attempt to solve. That is, although the aggregate quality of available synthesized speech is very high, synthesized speech exhibits much less variation than human speech and contains minimal speech disfluencies. Consequently, training an ASR model exclusively on synthesized speech data makes it difficult to generalize to real speech utterances during inference.
Implementations herein are directed toward using synthesized speech to train an ASR model to recognize speech in order to maintain the accuracy of the ASR model when large amounts of transcribed speech (e.g., non-synthesized speech) in a target domain and/or target language for training the ASR model are unavailable or scarce. More specifically, implementations are directed toward pre-training an ASR model on training data that includes untranscribed non-synthesized speech utterances, non-spoken text utterances used to generate corresponding synthesized speech representations, and transcribed non-synthesized speech utterances, to jointly learn speech and text representations, and then fine-tuning (e.g., warm-start training) the pre-trained ASR model on the available transcribed non-synthesized speech utterances. As will become apparent, pre-training the audio encoder includes updating parameters of the audio encoder based on a combination of contrastive self-supervised, supervised, and consistency losses derived from the training data.
The contrastive self-supervised losses may be derived from latent speech representations generated by the audio encoder from corresponding ones of the untranscribed non-synthesized speech utterances, the synthesized speech representations, and the transcribed non-synthesized speech utterances to promote language learning. In turn, the supervised losses may be derived from speech recognition labels predicted by one or more auxiliary decoders based on the latent speech representations generated by the audio encoder from corresponding ones of the synthesized speech representations and the transcribed non-synthesized speech utterances. Here, the corresponding transcriptions paired with the transcribed non-synthesized speech utterances and the corresponding non-spoken text utterances used to generate the synthesized speech representations serve as ground-truth labels for deriving the supervised losses. Finally, consistency losses may be derived from each transcribed non-synthesized speech utterance and the corresponding synthesized speech representation of the same utterance to promote consistent predictions (e.g., latent speech representations) by the audio encoder on both the non-synthesized (e.g., real/human) speech representation and the synthesized speech representation of the same utterance. In short, by encouraging the audio encoder to behave consistently on both the human speech and the synthesized speech of the same training utterance, the consistency loss between the non-synthesized (human) and synthesized representations of the same utterance provides an unsupervised training aspect. Notably, a text-to-speech (TTS) model can convert the corresponding transcription paired with each transcribed non-synthesized speech utterance into a corresponding synthesized speech representation of the same utterance.
Additional implementations include applying data augmentation techniques, such as diversifying the synthesized training utterances by varying the synthesized speaker characteristics, in order to promote robustness to speaker differences. The techniques described herein are especially useful when relatively little transcribed human speech is available in the target domain and/or target language.
FIG. 1 illustrates an Automatic Speech Recognition (ASR) system 100 implementing an ASR model 200, the ASR model 200 residing on a user device 102 of a user 104 and/or a remote computing device 201 in communication with the user device 102 (e.g., one or more servers of a distributed system operating in a cloud computing environment). Although user device 102 is depicted as a mobile computing device (e.g., a smart phone), user device 102 may correspond to any type of computing device, such as, but not limited to, a tablet device, a laptop/desktop computer, a wearable device, a digital assistant device, a smart speaker/display, a smart appliance, an automotive infotainment system, or an internet of things (IoT) device, and is equipped with data processing hardware 111 and memory hardware 113.
The user device 102 includes an audio subsystem 108 configured to receive the utterance 106 spoken by the user 104 (e.g., the user device 102 may include one or more microphones for recording the spoken utterance 106) and to convert the utterance 106 into a corresponding digital format associated with input acoustic frames 110 capable of being processed by the ASR system 100. In the example shown, the user speaks the phrase "What is the weather in New York City?" Thereafter, the ASR model 200 receives, as input, the acoustic frames 110 corresponding to the utterance 106 and generates/predicts, as output, a corresponding transcription 120 (e.g., recognition result/hypothesis) of the utterance 106. In the illustrated example, the user device 102 and/or the remote computing device 201 also execute a user interface generator 107 configured to present a representation of the transcription 120 of the utterance 106 to the user 104 of the user device 102. In some configurations, the transcription 120 output from the ASR system 100 is processed, for example, by a Natural Language Understanding (NLU) module executing on the user device 102 or the remote computing device 201, to execute a user command. Additionally or alternatively, a text-to-speech system (e.g., executing on any combination of the user device 102 or the remote computing device 201) may convert the transcription into synthesized speech for audible output by another device. For example, the original utterance 106 may correspond to a message the user 104 is sending to a friend, in which the transcription 120 is converted into synthesized speech for audible output to the friend to listen to the message conveyed in the original utterance 106.
Referring to FIG. 2, an example frame-alignment-based transducer model 200a includes a recurrent neural network-Transducer (RNN-T) model architecture that adheres to the latency constraints associated with interactive applications. The use of the RNN-T model architecture is exemplary, and the frame-alignment-based transducer model 200 may include other architectures, such as a Transformer-Transducer model architecture or a Conformer-Transducer model architecture. The RNN-T model 200 provides a small computational footprint and utilizes lower memory requirements than conventional ASR architectures, making the RNN-T model architecture suitable for performing speech recognition entirely on the user device 102 (e.g., without requiring communication with a remote server). The RNN-T model 200 includes an encoder network 210, a prediction network 220, and a joint network 230. The encoder network 210, which is roughly analogous to an acoustic model (AM) in a conventional ASR system, includes a stack of self-attention layers (e.g., Conformer or Transformer layers) or a recurrent network of stacked Long Short-Term Memory (LSTM) layers. For example, the encoder reads a sequence of d-dimensional feature vectors (e.g., acoustic frames 110 (FIG. 1)) x = (x_1, x_2, ..., x_T), where x_t ∈ R^d, and produces a higher-order feature representation at each output step. This higher-order feature representation is denoted h_1^enc, ..., h_T^enc.
Similarly, the prediction network 220 is an LSTM network that, like a language model (LM), processes the sequence of non-blank symbols output so far by the final Softmax layer 240, y_0, ..., y_{ui-1}, into a dense representation p_u. Finally, with the RNN-T model architecture, the representations produced by the encoder 210 and the prediction/decoder network 220 are combined by the joint network 230. The prediction network 220 may be replaced by an embedding look-up table to improve latency by outputting a looked-up sparse embedding in place of a processing-intensive dense representation. The joint network then predicts P(y_i | x_{t_i}, y_0, ..., y_{ui-1}), which is the distribution over the next output symbol. In other words, the joint network 230 generates, at each output step (e.g., time step), a probability distribution over possible speech recognition hypotheses. Here, the "possible speech recognition hypotheses" correspond to a set of output labels, each representing a symbol/character in a specified natural language. For example, when the natural language is English, the set of output labels may include twenty-seven (27) symbols, e.g., one label for each of the 26 letters in the English alphabet and one label designating a space. Accordingly, the joint network 230 may output a set of values indicating the likelihood of occurrence of each of a predetermined set of output labels. This set of values can be a vector and can indicate a probability distribution over the set of output labels. In some cases, the output labels are graphemes (e.g., individual characters, and potentially punctuation and other symbols), but the set of output labels is not so limited. For example, the set of output labels can include wordpieces and/or entire words in addition to or instead of graphemes. The output distribution of the joint network 230 can include a posterior probability value for each of the different output labels. Thus, if there are 100 different output labels representing different graphemes or other symbols, the output y_i of the joint network 230 can include 100 different probability values, one for each output label. The probability distribution can then be used (e.g., by the Softmax layer 240) to select and assign scores to candidate orthographic elements (e.g., graphemes, wordpieces, and/or words) in a beam search process for determining the transcription 120.
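For illustration only, the sketch below (PyTorch, with purely illustrative dimensions) shows how an encoder, a prediction network, and a joint network of the kind described above can be combined to produce a per-step distribution over output labels; it is not the disclosed model, whose example sizes appear in the next paragraph.

```python
import torch
import torch.nn as nn

# A toy RNN-T-style composition: encoder over acoustic frames, prediction network
# over previously emitted labels, and a joint network scoring every (t, u) pair.
class TinyRNNT(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=28, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)      # acoustic-model analogue
        self.embed = nn.Embedding(vocab_size, hidden)                   # previous non-blank labels
        self.prediction = nn.LSTM(hidden, hidden, batch_first=True)     # language-model analogue
        self.joint = nn.Linear(2 * hidden, vocab_size)                  # joint network + softmax

    def forward(self, frames, prev_labels):
        h_enc, _ = self.encoder(frames)                        # (B, T, H) higher-order features
        h_pred, _ = self.prediction(self.embed(prev_labels))   # (B, U, H) dense label representation
        # Combine every encoder step with every prediction step, then score output labels.
        joint = torch.cat([
            h_enc.unsqueeze(2).expand(-1, -1, h_pred.size(1), -1),
            h_pred.unsqueeze(1).expand(-1, h_enc.size(1), -1, -1)], dim=-1)
        return torch.log_softmax(self.joint(joint), dim=-1)    # distribution over next symbol per (t, u)

x = torch.randn(1, 50, 80)                 # 50 acoustic frames of 80-d features
y = torch.randint(1, 28, (1, 10))          # 10 previously emitted labels
log_probs = TinyRNNT()(x, y)               # shape (1, 50, 10, 28)
```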
The Softmax layer 240 may employ any technique to select the output label/symbol with the highest probability in the distribution as the next output symbol predicted by the RNN-T model 200 at the corresponding output step. In this manner, the RNN-T model 200 does not make a conditional independence assumption; rather, the prediction of each symbol is conditioned not only on the acoustics but also on the sequence of labels output so far. The RNN-T model 200 does assume that an output symbol is independent of future acoustic frames 110, which allows the RNN-T model to be employed in a streaming fashion.
In some examples, the encoder network (i.e., audio encoder) 210 of the RNN-T model 200 includes a stack of self-attention layers/blocks, such as Conformer blocks. Here, each Conformer block includes a series of multi-headed self-attention, depth-wise convolution, and feed-forward layers. The prediction network 220 may have two 2048-dimensional LSTM layers, each of which is also followed by a 640-dimensional projection layer. Alternatively, the prediction network 220 may include a stack of Transformer or Conformer blocks, or an embedding look-up table, in lieu of LSTM layers. Finally, the joint network 230 may also have 640 hidden units. The Softmax layer 240 may be composed of a unified wordpiece or grapheme set that is generated using all unique wordpieces or graphemes in a plurality of training data sets.
FIGS. 3A-3C illustrate an example training process 300 for pre-training the audio encoder 210 of the ASR model 200 (FIG. 2). The training process 300 may pre-train the audio encoder 210 using available training data that includes non-spoken text utterances (X_text) 320, transcribed non-synthesized speech utterances (X_sup) 304, and untranscribed non-synthesized speech utterances (X_unsup) 306. Each non-spoken training text utterance 320 includes text-only data (i.e., unpaired data), such that each non-spoken training text utterance 320 is not paired with any corresponding spoken audio representation (speech) of the utterance. Each untranscribed non-synthesized speech utterance 306 (also referred to simply as an "untranscribed speech utterance 306") includes audio-only data (i.e., unpaired data), such that the untranscribed speech utterance 306 is not paired with any corresponding transcription. In contrast, each transcribed non-synthesized speech utterance 304 (also referred to simply as a "transcribed speech utterance 304") includes a corresponding transcription 302 paired with a corresponding non-synthesized speech representation of the corresponding transcribed non-synthesized speech utterance 304.
For simplicity, the training process 300 includes a contrastive self-supervised loss part 300a (FIG. 3A), a supervised loss part 300b (FIG. 3B), and a consistency regularization part 300c (FIG. 3C). The training process 300 pre-trains the audio encoder 210 on a total loss (L_tts4pretrain2) based on: a contrastive loss (L_w2v) 316 derived, using the contrastive self-supervised loss part 300a, from the non-spoken text utterances (X_text) 320, the transcribed non-synthesized speech utterances (X_sup) 304, and the untranscribed non-synthesized speech utterances (X_unsup) 306; a supervised loss (L_aux) 344 derived, using the supervised loss part 300b, from the non-spoken text utterances (X_text) 320 and the transcribed non-synthesized speech utterances (X_sup) 304; and a consistency loss (L_cons) 352 derived using the consistency regularization part 300c.
Referring to FIG. 3A, the contrastive self-supervised loss part 300a of the training process 300 may employ a text-to-speech (TTS) system 330 configured to generate, at each of a plurality of output steps, a synthesized speech representation (e.g., synthesized speech) 332 for each of a plurality of non-spoken training text utterances 320. The non-spoken training text utterances 320 (also referred to simply as "non-spoken text utterances 320") include non-spoken text that is text-only data (i.e., unpaired data), such that each non-spoken text utterance (e.g., X_text) 320 is not paired with any synthesized or non-synthesized speech. Accordingly, the TTS system 330 generates a corresponding synthesized speech representation 332 for each non-spoken text utterance 320. Notably, the synthesized speech representations 332 may include mel-frequency spectrogram frames for training the audio encoder 210, thereby eliminating the need for the training process 300 to include a vocoder and/or synthesizer to synthesize the mel-frequency spectrogram frames into synthesized speech.
The TTS system 330 may apply a speaker embedding, z, when converting the non-spoken text utterance 320, to produce a synthesized speech representation 332 with a specific speaking style and prosody associated with the speaker embedding. The TTS system 330 may apply a multitude of different speaker embeddings z, each associated with different speaker characteristics of the resulting synthesized speech representation 332. Similarly, the TTS system 330 may vary the prosody and other production qualities of the synthesized utterances.
In some examples, the training process 300 applies data augmentation to at least one sample utterance of the synthesized speech representations 332. The data augmentation may include, but is not limited to, adding noise, manipulating timing (e.g., stretching), or adding reverberation to the corresponding speech representation. Data augmentation may add different synthesized recording conditions to the synthesized speech representations 332.
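A hedged sketch of the augmentation options named above (added noise, timing manipulation, reverberation), applied to a waveform tensor; the disclosure does not specify implementations, so these are common, simple realizations with illustrative parameters.

```python
import torch
import torch.nn.functional as F

def add_noise(wave, snr_db=20.0):
    # Mix in Gaussian noise scaled to an illustrative signal-to-noise ratio.
    noise = torch.randn_like(wave)
    scale = wave.norm() / (noise.norm() * 10 ** (snr_db / 20))
    return wave + scale * noise

def time_stretch(wave, rate=1.1):
    # Manipulate timing by resampling the waveform with linear interpolation.
    new_len = int(wave.numel() / rate)
    return F.interpolate(wave.view(1, 1, -1), size=new_len, mode="linear",
                         align_corners=False).view(-1)

def add_reverb(wave, decay=0.5, ir_len=2000):
    # Convolve with a toy exponentially decaying impulse response (flipped so
    # conv1d's cross-correlation acts as a true convolution with the impulse).
    impulse = decay ** torch.arange(ir_len, dtype=wave.dtype)
    return F.conv1d(wave.view(1, 1, -1), impulse.flip(0).view(1, 1, -1),
                    padding=ir_len - 1).view(-1)[: wave.numel()]

augmented = add_reverb(time_stretch(add_noise(torch.randn(16000))))
```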
This pre-training batch-generation process for producing sample utterances of the synthesized speech representations 332 advantageously samples new speaker and prosody conditioning values each time a non-spoken text utterance 320 is observed during training, yielding diverse synthesized utterances on subsequent observations. Each batch therefore contains both synthesized utterances and real (non-synthesized) utterances. The loss contributions may be masked using a loss mask, σ (see Equation 4 below), so that the losses are computed for the appropriate batch elements.
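The batch-mixing and loss-masking idea can be pictured as in the sketch below; the mask construction and field names are assumptions for illustration and are not the σ of Equation 4 itself.

```python
import torch

# Sketch: synthesized and real utterances share a batch; a boolean mask selects
# which loss terms apply to which batch elements (supervised losses need a target
# text, contrastive losses apply everywhere). Inputs are (B,)-shaped tensors.
def masked_batch_losses(per_utt_contrastive, per_utt_supervised,
                        is_synthesized, has_transcript):
    sigma_aux = has_transcript | is_synthesized        # elements with a text target
    contrastive = per_utt_contrastive.mean()           # applies to every batch element
    supervised = (per_utt_supervised * sigma_aux).sum() / sigma_aux.sum().clamp(min=1)
    return contrastive + supervised

loss = masked_batch_losses(torch.rand(8), torch.rand(8),
                           torch.tensor([1, 1, 0, 0, 0, 0, 1, 1], dtype=torch.bool),
                           torch.tensor([0, 0, 1, 1, 0, 0, 0, 0], dtype=torch.bool))
```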
In some examples, the audio encoder 210 includes a stack of self-attention layers, each having a multi-headed self-attention mechanism. For example, the stack of self-attention layers may include a stack of Conformer layers or Transformer layers. In the illustrated example, the audio encoder 210 includes a Conformer encoder having a stack of Conformer blocks, each of which includes a series of multi-headed self-attention, depth-wise convolution, and feed-forward layers. The Conformer encoder 210 can naturally be split into a feature encoder, including a convolutional subsampling block 212, and a context network, including a linear layer 214 and a stack of Conformer blocks 216. In some implementations, the convolutional subsampling block 212 has two two-dimensional convolution layers, each with stride (2, 2), resulting in a 4x reduction in the feature sequence length. The convolutional subsampling block 212 receives, as input, a sequence of input features/vectors (e.g., mel-frequency spectrograms such as the acoustic frames 110 of FIG. 1) associated with each synthesized speech representation 332, each transcribed non-synthesized speech utterance 304, and each untranscribed non-synthesized speech utterance 306, and generates, as output, for each of a plurality of output steps, encoded features 211 corresponding to a respective one of the synthesized speech representations 332, the transcribed non-synthesized speech utterances 304, or the untranscribed non-synthesized speech utterances 306.
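A sketch of the feature-encoder front end described above: two 2-D convolution layers with stride (2, 2), giving a 4x reduction along the time dimension before the linear layer and Conformer blocks. Channel counts and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvSubsampling(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        # Two 2-D convolutions, each with stride (2, 2): time and feature axes
        # are each reduced by 4x overall.
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2, padding=1), nn.ReLU())

    def forward(self, mel):                     # mel: (batch, time, features)
        x = self.conv(mel.unsqueeze(1))         # (batch, channels, time/4, features/4)
        b, c, t, f = x.shape
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # flatten to (batch, time/4, dim)

frames = torch.randn(2, 100, 80)                # 100 mel-spectrogram frames of dim 80
encoded = ConvSubsampling()(frames)             # time length reduced from 100 to 25
```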
The encoded features 211 output from the convolutional subsampling block 212 may be fed to a masking module 218, in which some of the encoded features 211 are randomly chosen and replaced with a trained feature vector shared between all masked time steps, to provide corresponding masked encoded features 211m. In some examples, the masking module 218 masks the randomly chosen encoded features 211 by randomly sampling, without replacement, a proportion p of all time steps as starting indices and then masking the subsequent M consecutive time steps from every sampled index, whereby some spans may overlap. After masking is applied, the linear layer 214 and the Conformer blocks 216 of the context network receive the masked encoded features 211m and output corresponding contrastive context vectors (i.e., corresponding encoded representations) 215 from the masked encoded features 211m. Moreover, a quantizer 217 receives the encoded features 211 as input and generates quantized vectors (i.e., target context vectors) 219 as output. Thereafter, a contrastive loss module 315 derives a contrastive loss (L_w2v) 316 between the contrastive context vectors 215 at the masked positions and the target context vectors 219 as follows:

L_w2v = -log [ exp(sim(c_t, q_t)/κ) / Σ_{q̃∈Q_t} exp(sim(c_t, q̃)/κ) ]    (1)
where c_t is the contrastive context vector 215 centered at masked time step t, and q_t represents the target context vector 219 at time step t among a set Q_t of K+1 candidate target context vectors 219 that includes q_t and K distractors. The distractors may be uniformly sampled from other masked time steps of the same utterance.
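A sketch of the contrastive loss of Equation (1) for a single masked time step, assuming cosine similarity and a temperature κ (the common wav2vec 2.0-style formulation); the dimensionality and number of distractors are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    """c_t: (D,) context vector at a masked step; q_t: (D,) target quantized vector;
    distractors: (K, D) quantized vectors sampled from other masked steps."""
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)        # K+1 candidates
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1) / kappa
    return -torch.log_softmax(sims, dim=0)[0]                             # -log p(q_t | c_t)

loss = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))
```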
The contrastive loss 316 is optimized between the contrastive context vectors 215 at the masked positions and the target context vectors 219. After the pre-trained audio encoder 210 converges on the untranscribed non-synthesized speech utterances 306, the pre-training procedure is repeated on both the synthesized speech representations 332 and the transcribed non-synthesized speech utterances 304. Thus, the contrastive loss 316 is optimized for both real/human (non-synthesized) features and synthesized (TTS audio) features, with additional auxiliary losses on the transcribed non-synthesized speech utterances 304 and the synthesized speech representations 332, as described in greater detail below with reference to FIG. 3B. Accordingly, the training process 300 pre-trains the audio encoder 210 on the derived contrastive loss 316 applied on the corresponding encoded features 211 associated with each synthesized speech representation 332, each transcribed non-synthesized speech utterance 304, and each untranscribed non-synthesized speech utterance 306 provided as input to the audio encoder 210. Pre-training the audio encoder 210 may include updating parameters of the audio encoder based on the contrastive loss.
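The span masking performed by the masking module 218 described above (sampling a proportion p of time steps as span starts and masking the following M steps) might be realized as in the sketch below; p, M, and the zero mask vector are illustrative stand-ins for the trained shared mask vector.

```python
import torch

def mask_spans(encoded, p=0.065, M=10, mask_vector=None):
    """encoded: (T, D) encoded features; returns the masked copy and a boolean mask."""
    T, D = encoded.shape
    num_starts = max(1, int(p * T))
    starts = torch.randperm(T)[:num_starts]        # start indices, sampled without replacement
    mask = torch.zeros(T, dtype=torch.bool)
    for s in starts:
        mask[s:s + M] = True                       # spans of M steps; spans may overlap
    masked = encoded.clone()
    masked[mask] = mask_vector if mask_vector is not None else torch.zeros(D)
    return masked, mask

features = torch.randn(200, 256)
masked_features, mask = mask_spans(features)
```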
Referring to FIG. 3B, the supervised loss part 300b of the training process 300 is configured to inject lexical information into the audio encoder 210 during pre-training based on supervised loss terms 342, 344 derived from the transcribed non-synthesized speech utterances 304 and the synthesized speech representations 332 generated by the TTS system 330 from the non-spoken text utterances 320. Notably, the supervised loss part 300b leverages one or more auxiliary decoders 390 to generate the supervised loss terms 342, 344. The auxiliary decoders 390 may include Connectionist Temporal Classification (CTC) decoders, Listen, Attend and Spell (LAS) decoders, or RNN-T decoders. These auxiliary decoders 390 may include at least one of a phoneme decoder configured to decode a sequence of phonemes or a wordpiece decoder configured to decode a sequence of wordpieces. The auxiliary decoders 390 could also include a grapheme decoder configured to decode a sequence of graphemes. In some examples, the training process 300 applies data augmentation to at least one sample utterance of the synthesized speech representations 332 to provide one or more lexically diverse synthesized speech representations 332 for a given non-spoken training text utterance 320. The data augmentation may include, but is not limited to, adding noise, manipulating timing (e.g., stretching), or adding reverberation to the corresponding speech representation. Data augmentation may add different synthesized recording conditions to the synthesized speech representations 332.
During the supervised loss part 300b, the audio encoder 210 receives, as input, each synthesized speech representation 332 generated from the non-spoken text utterances 320 as a sequence of features/vectors (e.g., mel-frequency spectrograms such as the acoustic frames 110 of FIG. 1) and, for each of a plurality of time steps, generates, as output, a first encoded representation (e_text) 312 that corresponds to the synthesized speech representation 332 at the corresponding time step. The auxiliary decoder 390, including the phoneme decoder or the wordpiece decoder, receives, as input, each first encoded representation 312 output from the audio encoder 210 and generates, as output, a first probability distribution 392 over possible synthetic speech recognition hypotheses for the corresponding synthesized speech representation 332 at the corresponding time step. In some examples, the first probability distribution 392 over possible synthetic speech recognition hypotheses includes one of possible phoneme labels or possible wordpiece labels. Thereafter, a supervised loss module 340 may determine a synthesized speech loss term 342 based on the first probability distribution 392 over possible synthetic speech recognition hypotheses and the corresponding non-spoken text utterance 320. Here, the corresponding non-spoken text utterance 320 from which the synthesized speech representation 332 is generated also serves as the ground-truth transcription. The supervised loss part 300b may pre-train the audio encoder 210 on the synthesized speech loss term 342 by updating parameters of the audio encoder 210.
Similarly, during the supervised loss part 300b, the audio encoder 210 receives, as input, each transcribed non-synthesized speech utterance 304 as a sequence of features/vectors (e.g., mel-frequency spectrograms such as the acoustic frames 110 of FIG. 1) and, for each of a plurality of time steps, generates, as output, a second encoded representation (e_sup) 314 that corresponds to the transcribed non-synthesized speech utterance 304 at the corresponding time step. The auxiliary decoder 390, including the phoneme decoder or the wordpiece decoder, receives, as input, each second encoded representation 314 output from the audio encoder 210 and generates, as output, a second probability distribution 394 over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthesized speech utterance 304 at the corresponding time step. In some examples, the second probability distribution 394 over possible non-synthetic speech recognition hypotheses includes one of possible phoneme labels or possible wordpiece labels. Thereafter, the supervised loss module 340 may determine a non-synthesized speech loss term 344 based on the second probability distribution 394 over possible non-synthetic speech recognition hypotheses and the corresponding transcription 302 paired with the transcribed non-synthesized speech utterance 304. Here, the corresponding transcription 302 serves as the ground-truth transcription and may include a sequence of target phonemes and/or target wordpieces. The supervised loss part 300b may pre-train the audio encoder 210 on the non-synthesized speech loss term 344 by updating parameters of the audio encoder 210.
In some implementations, the supervised loss part 300b of the training process 300 uses another auxiliary decoder 390 to generate, from the first encoded representation (e_text) 312 of each synthesized speech representation 332 at the corresponding time step, a third probability distribution 393 over possible synthetic speech recognition hypotheses, whereby the supervised loss module 340 may determine another synthesized speech loss term 342 based on the third probability distribution and the non-spoken text utterance 320 corresponding to the synthesized speech representation. Here, the other auxiliary decoder 390 includes the other of the phoneme decoder or the wordpiece decoder, and the third probability distribution 393 over possible synthetic speech recognition hypotheses includes the other of possible phoneme labels or possible wordpiece labels. In these implementations, the other auxiliary decoder 390 also generates a fourth probability distribution 395 over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthesized speech utterance 304 at the corresponding time step, whereby the supervised loss module 340 may determine another non-synthesized speech loss term 344 based on the fourth probability distribution 395 and the corresponding transcription 302 paired with the transcribed non-synthesized speech utterance 304. Here, the fourth probability distribution 395 over possible non-synthetic speech recognition hypotheses includes the other of possible phoneme labels or possible wordpiece labels. The supervised loss part 300b of the training process 300 may similarly pre-train the audio encoder 210 on the other synthesized speech loss term 342 and the other non-synthesized speech loss term 344.
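A sketch of one auxiliary decoder realized as a CTC head over the encoder output; the vocabulary size, encoder dimension, and sequence lengths are illustrative. The same pattern yields either the synthesized-speech or non-synthesized-speech loss term, depending on whether the encoder input was a synthesized speech representation or a transcribed non-synthesized speech utterance.

```python
import torch
import torch.nn as nn

class AuxiliaryCTCDecoder(nn.Module):
    def __init__(self, enc_dim=256, vocab_size=64):
        super().__init__()
        self.proj = nn.Linear(enc_dim, vocab_size)      # per-frame label distribution
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def loss(self, encoder_out, targets, in_lens, tgt_lens):
        # encoder_out: (B, T, enc_dim); targets: (B, U) ground-truth label ids.
        log_probs = self.proj(encoder_out).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        return self.ctc(log_probs, targets, in_lens, tgt_lens)

enc = torch.randn(2, 120, 256)                          # encoded speech representations
phone_targets = torch.randint(1, 64, (2, 30))           # e.g., phoneme label sequences
loss = AuxiliaryCTCDecoder().loss(enc, phone_targets,
                                  torch.tensor([120, 120]), torch.tensor([30, 30]))
```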
The untranscribed non-synthesized speech utterances 306 and the non-spoken text utterances 320 each correspond to "unpaired" training data, whereby the contrastive loss (L_w2v) 316 (FIG. 3A) derived from the non-spoken text utterances (X_text) 320 may be combined with the supervised loss (L_aux) associated with the synthesized speech loss term 342 to obtain a non-spoken text loss function, L_text, as follows:

L_text = L_w2v(X_text) + L_aux(X_text)    (2)
Likewise, the contrastive loss (L_w2v) 316 (FIG. 3A) derived from the untranscribed non-synthesized speech utterances (X_unsup) 306 may be used to express an unsupervised speech loss function, L_unsup_speech, as follows:

L_unsup_speech = L_w2v(X_unsup)    (3)
During pre-training of the audio encoder 210, the synthesized speech representations 332 and the untranscribed non-synthesized speech utterances 306 are mixed within each batch. To force the audio encoder 210 to learn representations that are effective for both synthesized speech and non-synthesized (human/real) speech, the loss function of Equation 2 and the loss function of Equation 3 are combined to obtain an unpaired data loss function, L_unpaired, as follows:

L_unpaired = L_text + L_unsup_speech    (4)
The transcribed non-synthesized speech utterances 304 correspond to "paired" and "supervised" training data, whereby the derived contrastive loss (L_w2v) 316 (FIG. 3A) and the derived supervised loss (L_aux) associated with the non-synthesized speech loss term 344 may be combined to obtain a paired data loss function, L_paired, as follows:

L_paired = L_w2v(X_sup) + L_aux(X_sup)    (5)
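The bookkeeping of Equations (2)-(5) can be summarized as in the sketch below; all arguments are assumed to be scalar loss tensors already computed for the current batch, and the function and argument names are illustrative.

```python
def unpaired_and_paired_losses(w2v_text, aux_text,   # from synthesized speech (text source)
                               w2v_unsup,            # from untranscribed speech
                               w2v_sup, aux_sup):    # from transcribed speech
    loss_text = w2v_text + aux_text                  # Eq. (2): non-spoken text loss
    loss_unsup_speech = w2v_unsup                    # Eq. (3): untranscribed speech loss
    loss_unpaired = loss_text + loss_unsup_speech    # Eq. (4): unpaired data loss
    loss_paired = w2v_sup + aux_sup                  # Eq. (5): paired data loss
    return loss_unpaired, loss_paired
```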
Referring to FIG. 3C, the consistency regularization part 300c of the training process 300 is configured to promote the audio encoder 210 learning consistent predictions between non-synthesized speech (e.g., real/human speech) and synthesized speech by generating a consistency loss term (L_cons) 352 for each training utterance pair 301 that includes a corresponding one of the transcribed non-synthesized speech utterances (X_sup) 304 and a paired synthesized speech representation 334 of the same utterance as the corresponding transcribed non-synthesized speech utterance 304. Accordingly, the transcribed non-synthesized speech utterance 304 and the paired synthesized speech representation 334 of each training utterance pair 301 are associated with the same ground-truth transcription. In short, by encouraging the audio encoder 210 to behave consistently regardless of whether the training utterance belongs to non-synthesized speech or synthesized speech, and independently of the supervised losses between the ground-truth transcription 302 and each of the non-synthesized speech recognition hypotheses output by the auxiliary decoder 390 and the synthesized speech recognition hypotheses output by the auxiliary decoder 390, the consistency loss term between the non-synthesized and synthesized speech representations of the same training utterance provides an unsupervised training aspect.
Similar to the synthesized speech representations 332 generated from the non-spoken text utterances 320 in FIG. 3B, the TTS system 330 may generate each paired synthesized speech representation 334 by performing text-to-speech conversion on the corresponding transcription 302 paired with the transcribed non-synthesized speech utterance 304. Here, each transcribed non-synthesized speech utterance 304 is associated with synthesized speech generated by the TTS system 330 converting the text of the associated ground-truth transcription 302 into synthesized audio. The TTS system 330 may apply a speaker embedding, z, when converting the ground-truth transcription 302, to obtain synthesized speech with a specific speaking style and prosody associated with the speaker embedding. Here, the ground-truth transcription 302 serves as a source of supervised data augmentation, in which the TTS system 330 generates a paired synthesized speech representation 334 that is expected to be consistent with the transcribed non-synthesized speech utterance (X_sup) 304 associated with the same ground-truth transcription 302. In some examples, the training process 300 applies data augmentation to at least one of the transcribed non-synthesized speech utterance 304 or the paired synthesized speech representation 334 of at least one of the training utterance pairs 301. The data augmentation may include, but is not limited to, adding noise, manipulating timing (e.g., stretching), or adding reverberation to the corresponding speech representation.
During the consistency regularization part 300c, the audio encoder 210 receives, as input, each paired synthesized speech representation 334 as a sequence of features/vectors (e.g., mel-frequency spectrograms such as the acoustic frames 110 of FIG. 1) and, for each of a plurality of time steps, generates, as output, an augmented encoded representation 313 that corresponds to the paired synthesized speech representation 334 at the corresponding time step. The auxiliary decoder 390, including the phoneme decoder or the wordpiece decoder, receives, as input, each augmented encoded representation 313 output from the audio encoder 210 and generates, as output, a first probability distribution 311 over possible synthetic speech recognition hypotheses for the corresponding paired synthesized speech representation 334 at the corresponding time step. In some examples, the first probability distribution 311 over possible synthetic speech recognition hypotheses includes one of possible phoneme labels or possible wordpiece labels.
Similarly, the audio encoder 210 receives, as input, each transcribed non-synthesized speech utterance 304 as a sequence of features/vectors (e.g., mel-frequency spectrograms such as the acoustic frames 110 of FIG. 1) and, for each of a plurality of time steps, generates, as output, a non-augmented encoded representation (e_sup) 314 that corresponds to the transcribed non-synthesized speech utterance 304 at the corresponding time step. The auxiliary decoder 390, including the phoneme decoder or the wordpiece decoder, receives, as input, each non-augmented encoded representation 314 output from the audio encoder 210 and generates, as output, a second probability distribution 394 over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthesized speech utterance 304 at the corresponding time step. In some examples, the second probability distribution 394 over possible non-synthetic speech recognition hypotheses includes one of possible phoneme labels or possible wordpiece labels.
With continued reference to FIG. 3C, the consistency regularization part 300c of the training process 300 further determines, at each of the plurality of time steps, a consistency loss term (L_cons) 352 for each training utterance pair 301 based on the first probability distribution 311 over possible synthetic speech recognition hypotheses and the second probability distribution 394 over possible non-synthetic speech recognition hypotheses. For example, the training process 300 may employ a consistency loss term module 350 configured to receive, at each time step, the corresponding synthetic speech recognition results 311 and non-synthetic speech recognition results 394 output by the auxiliary decoder 390, and to determine the consistency loss term 352 for the corresponding training utterance pair 301 at that time step.
In some examples, the consistency regularization portion 300c of the training process 300 determines the consistency loss term 352 based on the Kullback-Leibler divergence (D_KL) between the first probability distribution 311 over possible synthetic speech recognition hypotheses and the second probability distribution 394 over possible non-synthetic speech recognition hypotheses. The consistency loss term 352 based on D_KL may be expressed by the following equation:

L_cons(θ) = Σ_t D_KL( p_θ(y_t | e*_t) ‖ p_θ(y_t | e_t) ),

where e*_t and e_t denote the augmented and non-augmented encoded representations 313, 314 at time step t.
Here, the consistency loss term 352 determined for the training utterance pair 301 at each time step provides an "unsupervised" loss term that is independent of the accuracy of the auxiliary decoder 390 (e.g., independent of the supervised loss terms 342, 344 of FIG. 3B), and thus may be employed to update parameters of the audio encoder 210 in order to promote consistency between the non-synthesized and synthesized speech representations of the same utterance. In batch training, the consistency loss term 352 may correspond to an average loss term obtained over the batch. In other words, the consistency loss term 352 permits the audio encoder 210 to learn to behave the same, e.g., to make consistent encoded representation predictions on both non-synthesized speech (e.g., real/human speech) and synthesized speech (e.g., TTS speech) of the same training utterance, regardless of whether the training utterance belongs to non-synthesized speech or synthesized speech.
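By way of a minimal, non-authoritative sketch of the consistency computation described above, the per-time-step Kullback-Leibler term and its batch average could be computed as follows; the array shapes, function names, and the use of NumPy are assumptions of this example rather than details taken from the disclosure.

import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """KL(p || q) taken over the last (label) axis, returned per batch element and time step."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def consistency_loss(p_synth: np.ndarray, p_real: np.ndarray) -> float:
    """Average KL consistency term over time steps and batch.

    p_synth: (batch, time, labels) distributions for the paired synthesized speech.
    p_real:  (batch, time, labels) distributions for the transcribed non-synthesized speech.
    """
    per_step = kl_divergence(p_synth, p_real)   # shape (batch, time)
    return float(per_step.mean())

# Toy example with random, softmax-normalized distributions.
rng = np.random.default_rng(0)
softmax = lambda x: np.exp(x) / np.exp(x).sum(-1, keepdims=True)  # simple softmax for small logits
p_a = softmax(rng.normal(size=(2, 5, 10)))
p_b = softmax(rng.normal(size=(2, 5, 10)))
print(consistency_loss(p_a, p_b))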
Finally, the training process 300 may combine the unpaired data loss function (L_unpaired), the paired data loss function (L_paired), and the consistency loss term (L_cons) to obtain a total loss term (L_total), which may be expressed as:

L_total = L_unpaired + λ1 · L_paired + λ2 · L_cons,    (7)

where λ1 may be equal to 1.0 and λ2 may be equal to 0.1. The training process 300 may pre-train the audio encoder 210 by updating parameters of the audio encoder 210 using the total loss term L_total, effectively teaching the audio encoder 210 to learn a shared representation between speech and text. After pre-training the audio encoder 210, the training process 300 may fine-tune the pre-trained audio encoder 210 on transcribed speech utterances, which may include supervised training samples of both synthesized speech (e.g., TTS speech) and non-synthesized speech (e.g., human speech).
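As a simple illustration of the weighted combination just described, the total pre-training loss could be assembled as follows. The function and argument names are invented for this sketch, and the assignment of λ1 to the paired term and λ2 to the consistency term mirrors Equation 7 as reconstructed above; it should be read as an assumption rather than a definitive statement of the disclosed formula.

def total_pretraining_loss(unpaired_loss: float,
                           paired_loss: float,
                           consistency_loss: float,
                           lambda1: float = 1.0,
                           lambda2: float = 0.1) -> float:
    """Combine the unpaired, paired, and consistency terms into one scalar loss."""
    return unpaired_loss + lambda1 * paired_loss + lambda2 * consistency_loss

# Example with placeholder loss values.
print(total_pretraining_loss(unpaired_loss=2.3, paired_loss=1.7, consistency_loss=0.4))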
In some implementations, the training process 300 for pre-training the audio encoder 210 applies encoder consistency regularization. Unlike decoder consistency regularization, which is applied to the auxiliary decoder(s) during the consistency regularization portion 300c and requires hypothesized labels (e.g., the transcriptions 302 and the non-spoken text utterances 320), encoder consistency regularization does not require hypothesized labels and therefore has the advantage of being applicable to all of the training data 304, 306, 320. Encoder consistency regularization may be applied via a Hierarchical Contrastive Consistency Regularization (HCCR) technique, in which encoder activations e, e* from original/non-augmented speech and augmented speech are projected through an auxiliary network to yield z and z*. Thereafter, positive and negative pairs are constructed and a contrastive loss is calculated between z and z*.
Specific to HCCR, a Convolutional Neural Network (CNN) projection network may calculate projections over segments of increasing length (30, 50, and 120 ms) of the encoder activations e to produce three views (V), with negative examples drawn from short segments of the same utterance as well as, for the 120 ms segments, from other utterances in the batch. Accordingly, the HCCR loss may be calculated over the transcribed non-synthesized speech utterances 304 (paired speech), the untranscribed non-synthesized speech utterances 306 (unpaired speech), and the synthesized speech representations (synthetic speech) generated from the non-spoken text utterances 320 as follows:

L_HCCR = L_HCCR(paired speech) + L_HCCR(unpaired speech) + L_HCCR(synthetic speech).    (9)
The HCCR loss calculated from Equation 9 can be added to the total loss term of Equation 7 with a coefficient of 1e-3 for use in pre-training the audio encoder 210.
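The following rough sketch illustrates the hierarchical contrastive idea described above. The projection (a simple segment average standing in for the CNN projection network), the cosine-similarity/temperature form of the contrastive loss, and the segment lengths expressed in frames are all assumptions made for this example; the disclosure does not specify these details here.

import numpy as np

def contrastive_loss(z: np.ndarray, z_aug: np.ndarray, temperature: float = 0.1) -> float:
    """Each z[i] should match z_aug[i]; the other rows serve as negatives.

    z, z_aug: (num_segments, dim) projections of non-augmented and augmented encoder activations.
    """
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    z_aug = z_aug / np.linalg.norm(z_aug, axis=-1, keepdims=True)
    sim = z @ z_aug.T / temperature              # (num_segments, num_segments) similarity matrix
    sim = sim - sim.max(axis=-1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=-1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))    # negative log-likelihood of the positive pairs

def segment_mean(activations: np.ndarray, seg_frames: int) -> np.ndarray:
    """Average encoder activations over non-overlapping segments of seg_frames frames."""
    usable = (activations.shape[0] // seg_frames) * seg_frames
    segs = activations[:usable].reshape(-1, seg_frames, activations.shape[-1])
    return segs.mean(axis=1)

# Toy example: three segment lengths ("views") over a 120-frame, 256-dim activation sequence.
rng = np.random.default_rng(1)
e = rng.normal(size=(120, 256))
e_aug = e + 0.01 * rng.normal(size=e.shape)      # stand-in for activations of augmented speech
views = (3, 5, 12)                               # frame counts standing in for 30/50/120 ms segments
loss = sum(contrastive_loss(segment_mean(e, s), segment_mean(e_aug, s)) for s in views)
print(loss)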
Referring to FIG. 4, a contrastive non-spoken text selection process 400 may select the non-spoken text utterances 320 used to pre-train the audio encoder 210 from a large non-spoken text corpus 402, whereby the selected non-spoken text utterances 320 are most similar to the particular domain the audio encoder 210 is being pre-trained to learn. That is, the text selection process 400 is able to identify in-domain and near-domain non-spoken text from the non-spoken text corpus 402 for inclusion in the non-spoken text utterances 320 used to pre-train the audio encoder 210. Notably, the non-spoken text utterances 320 selected by the text selection process 400 enable on-the-fly synthesis of different utterances during batch construction, such that a new speaker embedding z and a new latent variable Z may be sampled each time a non-spoken text utterance 320 appears in a batch.
The non-spoken text corpus 402 includes a large number of non-spoken training text utterances 320, 320a-n that span a wide range of domains and include a language diversity much greater than that of the particular domain the audio encoder 210 is being trained to learn. As previously mentioned, the set of transcribed non-synthesized speech utterances 304 may be domain-specific in that they relate to a particular domain, and each transcribed non-synthesized speech utterance 304 is paired with a corresponding transcription 302. The non-spoken text corpus 402 may be stored in the same data store 401 as the spoken transcribed non-synthesized speech utterances (i.e., training utterances) 304 or in a different data store 401. The non-spoken text corpus 402 may dynamically change to incorporate new non-spoken text utterances 320. Simply using all of the non-spoken text utterances 320 in the non-spoken text corpus 402 is not feasible for the following reasons: i) for each sentence, the speech modality requires far more memory to encode than text, making it impractical to convert all of the text in the non-spoken text corpus 402; and ii) the vast mismatch between the transcriptions 302 paired with the transcribed non-synthesized speech utterances 304 and the non-spoken text utterances 320 in the non-spoken text corpus 402 requires intelligent strategies to balance their contributions.
The text selection process 400 aims to select a subset of the available non-spoken text utterances 320 from the non-spoken text corpus 402 as the data for TTS synthesis, which produces the synthesized speech representations 332 used for pre-training the audio encoder 210 during the contrast loss portion 300a and the supervised loss portion 300b of the training process 300 described above with reference to FIGS. 3A and 3B. In other words, the text selection process 400 aims to improve the match between the selected subset of the available non-spoken text utterances 320 and the target domain, which in turn reduces the computational resources required to exploit large amounts of non-domain-specific data. Thus, the text selection process 400 reduces computational and memory costs by selecting the non-spoken text utterances 320 that best match the particular domain the audio encoder 210 is being trained to learn.
In some examples, the text selection process 400 selects the subset of available non-spoken text utterances 320 from the non-spoken text corpus 402 that best matches the particular domain by simply providing a domain identifier (not shown) associated with the particular domain as input to a background LM 406 previously trained on the entire non-spoken text corpus 402. As previously mentioned, the non-spoken text corpus 402 spans many different domains. In these examples, the background LM 406 may include a maximum entropy (MaxEnt) LM capable of optionally accepting a domain identifier as input, as described in U.S. Patent No. 9,842,592, filed on 2/12/2014, the contents of which are incorporated herein by reference in their entirety. Here, the domain identifier associated with the particular domain may allow the MaxEnt LM to output, from the non-spoken text corpus 402, a subset of available non-spoken text utterances 320 that are likely to include words and/or phrases related to the particular domain. In some configurations, rather than evaluating the likelihood of words, the statistical language model operates in reverse mode to randomly generate text phrases that match a statistical distribution of words associated with the particular domain.
In additional examples, and as depicted in FIG. 4, the text selection process 400 uses the transcriptions 302 paired with the transcribed non-synthesized speech utterances 304 spoken by human speakers to select the subset of available non-spoken text utterances 320 from the non-spoken text corpus 402 that best matches the particular domain. Here, the transcribed non-synthesized speech utterances 304 include words, phrases, and/or other terms related to the particular domain. Alternatively, in addition to or instead of the transcriptions 302 paired with the transcribed non-synthesized speech utterances 304, a different set of transcribed utterances related to the particular domain can be used to select the non-spoken text utterances 320. This would provide the advantage of not requiring all of the transcribed non-synthesized speech utterances 304 to belong to the particular domain.
During a first stage (Stage A), the non-spoken text selection process 400 builds two language models 404, 406 to enable contrastive selection of the non-spoken text utterances 320. Here, a domain-specific LM 404 is trained on each transcription 302 in the set of transcribed non-synthesized speech utterances 304. The set of transcribed non-synthesized speech utterances 304 is assumed to belong to the particular domain the audio encoder 210 is being trained to learn. On the other hand, the background LM 406 is trained on each non-spoken text utterance 320 in the entire non-spoken text corpus 402. As previously mentioned, the non-spoken text corpus 402 spans many different domains. In some examples, the first stage uses n-gram language model training to build the two language models 404, 406. In other examples, the first stage uses neural network language model training to build the two language models 404, 406.
During a second stage (Stage B), the non-spoken text selection process 400 uses the two contrastive LMs 404, 406 to evaluate each non-spoken text utterance 320 in the non-spoken text corpus 402 by determining a first probability associated with each word in the non-spoken text utterance 320 appearing in the domain-specific LM 404, and determining a second probability associated with each word in the non-spoken text utterance 320 appearing in the background LM 406. Thereafter, for each non-spoken text utterance 320 in the non-spoken text corpus 402, the text selection process 400 determines, at a scorer 408, a score S based on the first probability, the second probability, and the number of words #(w) appearing in the corresponding non-spoken text utterance 320. For example, the score S for each non-spoken text utterance 320 may be calculated as follows:

S = ( Σ_w log P(w | LM 404) − Σ_w log P(w | LM 406) ) / #(w),

where the sums run over the words w of the corresponding non-spoken text utterance 320.
After determining the scores, the non-spoken text selection process 400 selects the non-spoken text utterances 320 having the N best scores S, as these non-spoken text utterances 320 best match the particular domain. The non-spoken text corpus 402 may include billions of non-spoken text utterances 320. The non-spoken text utterances 320 selected by the text selection process 400 can number in the millions and thus far exceed the number of untranscribed non-synthesized speech utterances 306 spoken by human speakers. As discussed above, the content of the non-spoken text utterances 320 increases the linguistic diversity for the particular domain the audio encoder 210 is being trained to learn, while the corresponding synthesized speech representations 332 generated from the non-spoken text utterances 320 increase the acoustic/lexical diversity of the speech that the audio encoder 210 encodes as part of the speech recognition process when the audio encoder 210 is integrated within the ASR model 200.
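A minimal sketch of the contrastive scoring and N-best selection described above follows. The toy unigram language models, whitespace tokenization, and function names are assumptions used only to make the example self-contained; they stand in for the n-gram, neural, or MaxEnt LMs 404, 406 discussed above.

import math
from typing import Callable, Dict, List

def score_utterance(text: str,
                    domain_logprob: Callable[[str], float],
                    background_logprob: Callable[[str], float]) -> float:
    """Per-word log-likelihood-ratio score: higher means closer to the target domain."""
    num_words = max(1, len(text.split()))
    return (domain_logprob(text) - background_logprob(text)) / num_words

def select_top_n(corpus: List[str],
                 domain_logprob: Callable[[str], float],
                 background_logprob: Callable[[str], float],
                 n: int) -> List[str]:
    """Return the n utterances with the best contrastive scores."""
    ranked = sorted(corpus,
                    key=lambda t: score_utterance(t, domain_logprob, background_logprob),
                    reverse=True)
    return ranked[:n]

def make_unigram_lm(counts: Dict[str, int]) -> Callable[[str], float]:
    """Toy add-one-smoothed unigram LM used purely to make the example runnable."""
    total = sum(counts.values())
    vocab = len(counts) + 1
    def logprob(text: str) -> float:
        return sum(math.log((counts.get(w, 0) + 1) / (total + vocab)) for w in text.split())
    return logprob

domain_lm = make_unigram_lm({"play": 5, "music": 5, "song": 3})
background_lm = make_unigram_lm({"play": 1, "music": 1, "weather": 5, "news": 5, "song": 1})
corpus = ["play the next song", "what is the weather", "news headlines today"]
print(select_top_n(corpus, domain_lm, background_lm, n=1))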
FIG. 5 illustrates an example projection space 500 of encoder representations of synthesized (TTS) speech utterances and non-synthesized (real/human) speech utterances. After introducing consistency regularization for pre-training the audio encoder via the consistency regularization portion 300c of FIG. 3C, the resulting learned speech and text encoder representations are closer to one another than they are when consistency regularization is not applied. Thus, the projection space 500 illustrates that using the supervised training data (i.e., the transcribed non-synthesized speech utterances) for pre-training the audio encoder 210 effectively yields improved shared speech and text representations.
FIG. 6 is a flow chart of an example arrangement of the operation of a method 600 for pre-training an audio encoder 210 to jointly learn a shared representation of speech and text. The method 600 may run on the data processing hardware 710 (fig. 7) using instructions stored on the memory hardware 720 (fig. 7). The data processing hardware 710 and memory hardware 720 may reside on a remote computer/server 201 of fig. 1 corresponding to computing device 700 (fig. 7).
At operation 602, the method 600 includes receiving training data including non-spoken text utterances 320, untranscribed non-synthesized speech utterances 306, and transcribed non-synthesized speech utterances 304. Each non-spoken text utterance 320 is not paired with any corresponding spoken utterance of non-synthesized speech. Each untranscribed non-synthesized speech utterance 306 is not paired with a corresponding transcription. Each transcribed non-synthesized speech utterance 304 is paired with a corresponding transcription 302.
At operation 604, the method 600 further includes generating, for each non-spoken text utterance 320 of the received training data, a corresponding synthesized speech representation 332 using the text-to-speech (TTS) system 330. At operation 606, the method further includes pre-training the audio encoder 210 on the synthesized speech representations 332 generated for the non-spoken text utterances 320, the untranscribed non-synthesized speech utterances 306, and the transcribed non-synthesized speech utterances 304 to teach the audio encoder 210 to jointly learn shared speech and text representations. The pre-training may include pre-training the audio encoder 210 based on contrast losses 315 derived from each of the synthesized speech representations 332, the untranscribed non-synthesized speech utterances 306, and the transcribed non-synthesized speech utterances 304. The pre-training may also include pre-training the audio encoder 210 based on supervised losses 342, 344 (e.g., auxiliary decoder losses) derived from the synthesized speech representations 332 and the transcribed non-synthesized speech utterances 304. Finally, the pre-training may additionally include pre-training the audio encoder 210 based on consistency losses 352 derived from the transcribed non-synthesized speech utterances 304.
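To summarize the flow of operations 602 through 606 in code form, the following highly simplified sketch wires together placeholder components. The TTS, encoder-update callable, and data structures here are stand-ins invented for illustration and do not reflect the actual model architectures or training loop of the disclosure.

from typing import Callable, List, Optional, Sequence, Tuple

def pretrain_audio_encoder(
    unspoken_texts: Sequence[str],
    untranscribed_speech: Sequence[List[float]],
    transcribed_speech: Sequence[Tuple[List[float], str]],
    tts: Callable[[str], List[float]],
    encoder_step: Callable[[List[float], Optional[str]], None],
) -> None:
    """Operations 602-606: synthesize speech for unspoken text, then pre-train on all three sources."""
    # Operation 604: generate a synthesized speech representation for each non-spoken text utterance.
    synthesized = [(tts(text), text) for text in unspoken_texts]
    # Operation 606: pre-train the encoder on synthesized, untranscribed, and transcribed speech.
    for features, text in synthesized:
        encoder_step(features, text)
    for features in untranscribed_speech:
        encoder_step(features, None)          # no transcription available (contrastive-style update)
    for features, transcription in transcribed_speech:
        encoder_step(features, transcription)

# Toy invocation with trivial stand-ins, just to show the call structure.
pretrain_audio_encoder(
    unspoken_texts=["turn on the lights"],
    untranscribed_speech=[[0.1, 0.2, 0.3]],
    transcribed_speech=[([0.4, 0.5], "hello world")],
    tts=lambda text: [float(len(w)) for w in text.split()],
    encoder_step=lambda feats, target: None,
)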
A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform tasks. In some examples, a software application may be referred to as an "application," app, "or" program. Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.
The non-transitory memory may be a physical device for storing programs (e.g., sequences of instructions) or data (e.g., program state information) for use by the computing device on a temporary or permanent basis. The non-transitory memory may be a volatile addressable semiconductor memory and/or a non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electrically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware such as a boot program). Examples of volatile memory include, but are not limited to, random Access Memory (RAM), dynamic Random Access Memory (DRAM), static Random Access Memory (SRAM), phase Change Memory (PCM), and magnetic disk or tape.
FIG. 7 is a schematic view of an example computing device 700 that may be used to implement the systems and methods described in this document. Computing device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
Computing device 700 includes a processor 710, memory 720, storage device 730, high-speed interface/controller 740 coupled to memory 720 and high-speed expansion ports 750, and low-speed interface/controller 760 coupled to low-speed bus 770 and storage device 730. The components 710, 720, 730, 740, 750, and 760 are all interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 710 is capable of processing instructions for execution within the computing device 700, including instructions stored in the memory 720 or on the storage device 730 to display graphical information for a Graphical User Interface (GUI) on an external input/output device, such as the display 780 coupled to the high speed interface 740. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. In addition, multiple computing devices 700 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multiprocessor system).
Memory 720 stores information non-transitory within computing device 700. Memory 720 may be a computer-readable medium, volatile memory unit(s), or non-volatile memory unit(s). Non-transitory memory 720 may be a physical device for storing programs (e.g., sequences of instructions) or data (e.g., program state information) for use by computing device 700 on a temporary or permanent basis. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electrically erasable programmable read-only memory (EEPROM) (e.g., commonly used for firmware such as a boot strap). Examples of volatile memory include, but are not limited to, random Access Memory (RAM), dynamic Random Access Memory (DRAM), static Random Access Memory (SRAM), phase Change Memory (PCM), and magnetic disk or tape.
Storage device 730 is capable of providing mass storage for computing device 700. In some implementations, the storage device 730 is a computer-readable medium. In various different implementations, storage device 730 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional embodiments, the computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer-or machine-readable medium, such as memory 720, storage device 730, or memory on processor 710.
The high speed controller 740 manages bandwidth-intensive operations for the computing device 700, while the low speed controller 760 manages lower bandwidth-intensive operations. This allocation of responsibilities is merely exemplary. In some implementations, high-speed controller 740 is coupled to memory 720, display 780 (e.g., through a graphics processor or accelerator), and to high-speed expansion port 750, which may accept various expansion cards (not shown). In some implementations, a low-speed controller 760 is coupled to the storage device 730 and the low-speed expansion port 790. The low-speed expansion port 790, which may include various communication ports (e.g., USB, bluetooth, ethernet, wireless ethernet), may be coupled to one or more input/output devices, such as a keyboard, pointing device, scanner, or networking device, such as a switch or router, for example, through a network adapter.
As shown, computing device 700 may be implemented in a number of different forms. For example, it may be implemented as a standard server 700a or multiple times in a group of such servers 700a, as a laptop 700b, or as part of a rack server system 700 c.
Various implementations of the systems and techniques described here can be realized in digital electronic and/or optical circuits, integrated circuits, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments can include embodiments in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose processor, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, non-transitory computer-readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
The processes and logic flows described in this specification can be performed by one or more programmable processors (also referred to as data processing hardware) executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data (e.g., magnetic, magneto-optical, or optical disks). However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube), LCD (liquid crystal display) or touch screen) for displaying information to the user and optionally a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form including acoustic, speech, or tactile input. In addition, the computer is capable of interacting with the user by sending and receiving documents to and from the device used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other embodiments are within the scope of the following claims.

Claims (24)

1. A computer-implemented method (600) which, when run on data processing hardware (710), causes the data processing hardware (710) to perform operations comprising:
receiving training data (304, 306, 320), the training data (304, 306, 320) comprising:
non-spoken text utterances (320), each non-spoken text utterance (320) not paired with any corresponding spoken utterance of non-synthesized speech;
an untranscribed non-synthetic speech utterance (306), each untranscribed non-synthetic speech utterance (306) not paired with a corresponding transcription; and
transcribing non-synthesized speech utterances (304), each transcribed non-synthesized speech utterance (304) paired with a corresponding transcription (302);
generating a corresponding synthesized speech representation (332) for each non-spoken text utterance (320) of the received training data (304, 306, 320) using a text-to-speech model (330); and
an audio encoder (210) is pre-trained on the synthetic speech representation (332), the non-transcribed non-synthetic speech utterance (306), and the transcribed non-synthetic speech utterance (304) generated for the non-spoken text utterance (320) to teach the audio encoder (210) to learn a shared speech and text representation jointly.
2. The computer-implemented method (600) of claim 1, wherein the audio encoder (210) comprises a stack of self-attention layers, each self-attention layer comprising a multi-headed self-attention mechanism.
3. The computer-implemented method (600) of claim 1 or 2, wherein pre-training the audio encoder (210) comprises:
for each untranscribed non-synthesized speech utterance (306):
generating a corresponding encoded representation (215) of the non-transcribed non-synthetic speech utterance (306); and
pre-training the audio encoder (210) on a contrast loss (316) applied on the corresponding encoded representation (215) of the non-transcribed non-synthetic speech utterance (306);
for each synthesized speech representation (332):
generating a corresponding encoded representation (215) of the synthesized speech representation (332); and
pre-training the audio encoder (210) on a contrast loss (316) applied on the corresponding encoded representation (215) of the synthesized speech representation (332); and
For each transcribed non-synthesized speech utterance (304):
generating a corresponding encoded representation (215) of the transcribed non-synthesized speech utterance (304); and
the audio encoder (210) is pre-trained on a contrast loss (316) applied on the corresponding encoded representation (215) of the transcribed non-synthetic speech utterance (304).
4. The computer-implemented method (600) of any of claims 1-3, wherein pre-training the audio encoder (210) comprises:
for each synthesized speech representation (332) at each of a plurality of time steps:
generating a first probability distribution (392) over possible synthetic speech recognition hypotheses for the corresponding synthetic speech representation (332) using an auxiliary decoder (390);
determining a synthesized speech loss term (342) based on the first probability distribution (392) over the possible synthesized speech recognition hypotheses and the non-spoken text utterance (320) corresponding to the corresponding synthesized speech representation (332); and
pre-training the audio encoder (210) based on the synthesized speech loss term (342); and
For each transcribed non-synthesized speech utterance (304) at each of a plurality of time steps:
generating a second probability distribution (394) over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthetic speech utterance (304) using the auxiliary decoder (390);
determining a non-synthetic speech loss term (344) based on the second probability distribution (394) over the possible non-synthetic speech recognition hypotheses and the corresponding transcription (302) paired with the transcribed non-synthetic speech utterance (304); and
The audio encoder (210) is pre-trained based on the non-synthesized speech loss term (344).
5. The computer-implemented method (600) of claim 4, wherein:
the first probability distribution (392) over the possible synthetic speech recognition hypotheses includes one of possible phoneme labels or possible word segment labels; and
The second probability distribution (394) over the possible non-synthetic speech recognition hypotheses includes the one of the possible phoneme labels or the possible word segment labels.
6. The computer-implemented method (600) of claim 5, wherein pre-training the audio encoder (210) further comprises:
for each synthesized speech representation (332) at each of the plurality of time steps:
generating a third probability distribution (393) over possible synthetic speech recognition hypotheses for the corresponding synthetic speech representation (332) using a further auxiliary decoder (390), the third probability distribution (393) over possible synthetic speech recognition hypotheses including the other of the possible phoneme labels or the possible word segment labels;
determining another synthesized speech loss term (342) based on the third probability distribution (393) over the possible synthesized speech recognition hypotheses and the non-spoken text utterance (320) corresponding to the corresponding synthesized speech representation (332); and
pre-training the audio encoder (210) based on the further synthesized speech loss term (342); and
For each transcribed non-synthesized speech utterance (304) at each of the plurality of time steps:
generating a fourth probability distribution (395) over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthetic speech utterance (304) using the further auxiliary decoder (390), the fourth probability distribution (395) over possible non-synthetic speech recognition hypotheses including the other of the possible phoneme labels or the possible word segment labels;
determining another non-synthetic speech loss term (344) based on the fourth probability distribution (395) over the possible non-synthetic speech recognition hypotheses and the corresponding transcription (302) paired with the transcribed non-synthetic speech utterance (304); and
the audio encoder (210) is pre-trained based on the other non-synthetic speech loss term (344).
7. The computer-implemented method (600) of any of claims 4 to 6, wherein the auxiliary decoder (390) comprises one of a Connectionist Temporal Classification (CTC) decoder, a Listen Attend Spell (LAS) decoder, or a Recurrent Neural Network-Transducer (RNN-T) decoder.
8. The computer-implemented method (600) of any of claims 1-7, wherein the operations further comprise:
obtaining a set of training speech pairs (301), each training speech pair (301) comprising:
-a corresponding one of the transcribed non-synthetic speech utterances (304) in the received training data (304, 306, 320); and
a paired synthesized speech representation (334) of the corresponding transcribed non-synthesized speech utterance (304), the paired synthesized speech representation (334) generated by the text-to-speech model (330) performing a text-to-speech conversion on the corresponding transcription (302) paired with the transcribed non-synthesized speech utterance (304),
wherein pre-training the audio encoder (210) comprises, for each training speech pair (301) in the set of training speech pairs (301), at each of a plurality of output steps:
generating a first probability distribution (311) over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthetic speech utterance (304) using an auxiliary decoder (390);
generating a second probability distribution (394) over possible synthetic speech recognition hypotheses for the corresponding paired synthesized speech representation (334) using the auxiliary decoder (390);
Determining a consistency loss term (352) for the corresponding training speech pair (301) based on the first probability distribution (311) over the possible non-synthetic speech recognition hypotheses and the second probability distribution (394) over the possible synthetic speech recognition hypotheses; and
the audio encoder (210) is pre-trained based on the consistency loss term (352).
9. The computer-implemented method (600) of any of claims 1 to 8, wherein the operations further comprise augmenting one or more of the synthesized speech representations (332) prior to pre-training the audio encoder (210) on the synthesized speech representations (332).
10. The computer-implemented method (600) of any of claims 1-9, wherein the non-spoken text utterance (320) is generated and/or selected using one or more language models (404, 406).
11. The computer-implemented method (600) of any of claims 1-10, wherein the non-spoken text utterance (320) is generated using a background language model (406) and an in-domain language model (404) trained on transcribed speech utterances (304) associated with a target domain.
12. The computer-implemented method (600) of any of claims 1 to 11, wherein the operations further comprise, after pre-training the audio encoder (210), fine-tuning the pre-trained audio encoder (210) on transcribed speech utterances (304).
13. A system (100) comprising:
data processing hardware (710); and
-memory hardware (720), the memory hardware (720) being in communication with the data processing hardware (710), the memory hardware (720) storing instructions that, when run on the data processing hardware (710), cause the data processing hardware (710) to perform operations comprising:
receiving training data (304, 306, 320), the training data (304, 306, 320) comprising:
non-spoken text utterances (320), each non-spoken text utterance (320) not paired with any corresponding spoken utterance of non-synthesized speech;
an untranscribed non-synthetic speech utterance (306), each untranscribed non-synthetic speech utterance (306) not paired with a corresponding transcription; and
transcribing non-synthesized speech utterances (304), each transcribed non-synthesized speech utterance (304) paired with a corresponding transcription (302);
generating a corresponding synthesized speech representation (332) for each non-spoken text utterance (320) of the received training data (304, 306, 320) using a text-to-speech model (330); and
an audio encoder (210) is pre-trained on the synthetic speech representation (332), the non-transcribed non-synthetic speech utterance (306), and the transcribed non-synthetic speech utterance (304) generated for the non-spoken text utterance (320) to teach the audio encoder (210) to learn a shared speech and text representation jointly.
14. The system (100) of claim 13, wherein the audio encoder (210) includes a stack of self-attention layers, each self-attention layer including a multi-headed self-attention mechanism.
15. The system (100) of claim 13 or 14, wherein pre-training the audio encoder (210) comprises:
for each untranscribed non-synthesized speech utterance (306):
generating a corresponding encoded representation (215) of the non-transcribed non-synthetic speech utterance (306); and
pre-training the audio encoder (210) on a contrast loss (316) applied on the corresponding encoded representation (215) of the non-transcribed non-synthetic speech utterance (306);
for each synthesized speech representation (332):
generating a corresponding encoded representation (215) of the synthesized speech representation (332); and
pre-training the audio encoder (210) on a contrast loss (316) applied on the corresponding encoded representation (215) of the synthesized speech representation (332); and
For each transcribed non-synthesized speech utterance (304):
generating a corresponding encoded representation (215) of the transcribed non-synthesized speech utterance (304); and
the audio encoder (210) is pre-trained on a contrast loss (316) applied on the corresponding encoded representation (215) of the transcribed non-synthetic speech utterance (304).
16. The system (100) of any of claims 13 to 15, wherein pre-training the audio encoder (210) comprises:
for each synthesized speech representation (332) at each of a plurality of time steps:
generating a first probability distribution (392) over possible synthetic speech recognition hypotheses for the corresponding synthetic speech representation (332) using an auxiliary decoder (390);
determining a synthesized speech loss term (342) based on the first probability distribution (392) over the possible synthesized speech recognition hypotheses and the non-spoken text utterance (320) corresponding to the corresponding synthesized speech representation (332); and
pre-training the audio encoder (210) based on the synthesized speech loss term (342); and
For each transcribed non-synthesized speech utterance (304) at each of a plurality of time steps:
generating a second probability distribution (394) over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthetic speech utterance (304) using the auxiliary decoder (390);
determining a non-synthetic speech loss term (344) based on the second probability distribution (394) over the possible non-synthetic speech recognition hypotheses and the corresponding transcription (302) paired with the transcribed non-synthetic speech utterance (304); and
The audio encoder (210) is pre-trained based on the non-synthesized speech loss term (344).
17. The system (100) of claim 16, wherein:
the first probability distribution (392) over the possible synthetic speech recognition hypotheses includes one of possible phoneme labels or possible word segment labels; and
The second probability distribution (394) over the possible non-synthetic speech recognition hypotheses includes the one of the possible phoneme labels or the possible word segment labels.
18. The system (100) of claim 17, wherein pre-training the audio encoder (210) further comprises:
for each synthesized speech representation (332) at each of the plurality of time steps:
generating a third probability distribution (393) over possible synthetic speech recognition hypotheses for the corresponding synthetic speech representation (332) using a further auxiliary decoder (390), the third probability distribution (393) over possible synthetic speech recognition hypotheses including the other of the possible phoneme labels or the possible word segment labels;
determining another synthesized speech loss term (342) based on the third probability distribution (393) over the possible synthesized speech recognition hypotheses and the non-spoken text utterance (320) corresponding to the corresponding synthesized speech representation (332); and
pre-training the audio encoder (210) based on the further synthesized speech loss term (342); and
For each transcribed non-synthesized speech utterance (304) at each of the plurality of time steps:
generating a fourth probability distribution (395) over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthetic speech utterance (304) using the further auxiliary decoder (390), the fourth probability distribution (395) over possible non-synthetic speech recognition hypotheses including the other of the possible phoneme labels or the possible word segment labels;
determining another non-synthetic speech loss term (344) based on the fourth probability distribution (395) over the possible non-synthetic speech recognition hypotheses and the corresponding transcription (302) paired with the transcribed non-synthetic speech utterance (304); and
the audio encoder (210) is pre-trained based on the other non-synthetic speech loss term (344).
19. The system (100) of any one of claims 16 to 18, wherein the auxiliary decoder (390) comprises one of a Connectionist Temporal Classification (CTC) decoder, a Listen Attend Spell (LAS) decoder, or a Recurrent Neural Network-Transducer (RNN-T) decoder.
20. The system (100) of any one of claims 13 to 19, wherein the operations further comprise:
obtaining a set of training speech pairs (301), each training speech pair (301) comprising:
-a corresponding one of the transcribed non-synthetic speech utterances (304) in the received training data (304, 306, 320); and
a paired synthesized speech representation (334) of the corresponding transcribed non-synthesized speech utterance (304), the paired synthesized speech representation (334) generated by the text-to-speech model (330) performing a text-to-speech conversion on the corresponding transcription (302) paired with the transcribed non-synthesized speech utterance (304),
wherein pre-training the audio encoder (210) comprises, for each training speech pair (301) in the set of training speech pairs (301), at each of a plurality of output steps:
generating a first probability distribution (311) over possible non-synthetic speech recognition hypotheses for the corresponding transcribed non-synthetic speech utterance (304) using an auxiliary decoder (390);
generating a second probability distribution (394) over possible synthetic speech recognition hypotheses for the corresponding paired synthesized speech representation (334) using the auxiliary decoder (390);
Determining a consistency loss term (352) for the corresponding training speech pair (301) based on the first probability distribution (311) over the possible non-synthetic speech recognition hypotheses and the second probability distribution (394) over the possible synthetic speech recognition hypotheses; and
the audio encoder (210) is pre-trained based on the consistency loss term (352).
21. The system (100) of any of claims 13 to 20, wherein the operations further comprise augmenting one or more of the synthesized speech representations (332) prior to pre-training the audio encoder (210) on the synthesized speech representations (332).
22. The system (100) according to any one of claims 13 to 21, wherein the non-spoken text utterance (320) is generated and/or selected using one or more language models (404, 406).
23. The system (100) of any of claims 13 to 22, wherein the non-spoken text utterance (320) is generated using a background language model (406) and an in-domain language model (404) trained on transcribed speech utterances (304) associated with a target domain.
24. The system (100) of any of claims 13 to 23, wherein the operations further comprise, after pre-training the audio encoder (210), fine-tuning the pre-trained audio encoder (210) on transcribed speech utterances (304).
CN202280046159.XA 2021-06-30 2022-04-15 Use of advanced text and speech in ASR pre-training with consistency and contrast loss Pending CN117597729A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/202,950 2021-06-30
US202263267142P 2022-01-25 2022-01-25
US63/267,142 2022-01-25
PCT/US2022/025139 WO2023277993A1 (en) 2021-06-30 2022-04-15 Advancing the use of text and speech in asr pretraining with consistency and contrastive losses

Publications (1)

Publication Number Publication Date
CN117597729A true CN117597729A (en) 2024-02-23

Family

ID=89922436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280046159.XA Pending CN117597729A (en) 2021-06-30 2022-04-15 Use of advanced text and speech in ASR pre-training with consistency and contrast loss

Country Status (1)

Country Link
CN (1) CN117597729A (en)

Similar Documents

Publication Publication Date Title
US11605368B2 (en) Speech recognition using unspoken text and speech synthesis
KR20220148245A (en) Consistency Prediction for Streaming Sequence Models
CN117859173A (en) Speech recognition with speech synthesis based model adaptation
US20220310065A1 (en) Supervised and Unsupervised Training with Contrastive Loss Over Sequences
US20230317059A1 (en) Alignment Prediction to Inject Text into Automatic Speech Recognition Training
US20230298565A1 (en) Using Non-Parallel Voice Conversion for Speech Conversion Models
CN117597729A (en) Use of advanced text and speech in ASR pre-training with consistency and contrast loss
US20230017892A1 (en) Injecting Text in Self-Supervised Speech Pre-training
US20230013587A1 (en) Advancing the Use of Text and Speech in ASR Pretraining With Consistency and Contrastive Losses
US20230103722A1 (en) Guided Data Selection for Masked Speech Modeling
US20240153484A1 (en) Massive multilingual speech-text joint semi-supervised learning for text-to-speech
US20240013777A1 (en) Unsupervised Data Selection via Discrete Speech Representation for Automatic Speech Recognition
CN113811946B (en) End-to-end automatic speech recognition of digital sequences
US20240029715A1 (en) Using Aligned Text and Speech Representations to Train Automatic Speech Recognition Models without Transcribed Speech Data
JP2024525220A (en) Inserting text in self-supervised speech pre-training
CN118339608A (en) Fusion of acoustic and text representations in an automatic speech recognition system implemented as RNN-T
CN113811946A (en) End-to-end automatic speech recognition of digital sequences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination