US20200035219A1 - Augmented generalized deep learning with special vocabulary - Google Patents


Info

Publication number
US20200035219A1
Authority
US
United States
Prior art keywords
neural network
training
word
speech recognition
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US16/232,652
Other versions
US10540959B1 (en)
Inventor
Jeff Ward
Adam Sypniewski
Scott Stephenson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deepgram Inc
Original Assignee
Deepgram Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deepgram Inc filed Critical Deepgram Inc
Priority to US16/232,652 priority Critical patent/US10540959B1/en
Assigned to Deepgram, Inc. reassignment Deepgram, Inc. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: STEPHENSON, SCOTT, SYPNIEWSKI, ADAM, WARD, JEFF
Application granted granted Critical
Publication of US10540959B1 publication Critical patent/US10540959B1/en
Publication of US20200035219A1 publication Critical patent/US20200035219A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06K9/6256
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • G06N3/0481
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/30Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/183Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197Probabilistic grammars, e.g. word n-grams
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0635Training updating or merging of old and new templates; Mean values; Weighting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L2015/081Search algorithms, e.g. Baum-Welch or Viterbi

Definitions

  • Neural networks are machine learning models that may be trained to produce outputs based on an input.
  • Neural networks may include an output layer where one or more nodes of an output layer correspond to candidate outputs, and the value of output nodes is a probability that the candidate output is the correct output for the input.
  • Neural networks are often trained on general training sets. However, training the neural network on a general training set may not produce high quality outputs when the neural network is used on a more specific dataset. Therefore, it would be desirable to provide a mechanism for customizing a neural network that has been trained on a general training set for a specific dataset.
  • Systems and methods are disclosed for customizing the output of a neural network for a custom dataset, when the neural network has been trained on a general training set.
  • One embodiment comprises providing a trained neural network, where the trained neural network includes a plurality of layers each having a plurality of nodes.
  • the trained neural network may include an output layer with nodes corresponding to candidate outputs, wherein the values of the nodes in the output layer correspond to a probability of a candidate output being a correct output corresponding to an input.
  • the values of a plurality of nodes in the output layer may be adjusted for a custom model, wherein the custom model is different from a general training set used to generate the trained neural network.
  • One embodiment comprises providing a trained speech recognition neural network, where the trained speech recognition neural network includes a plurality of layers each having a plurality of nodes.
  • the trained speech recognition neural network may include an output layer with nodes corresponding to words of a vocabulary, wherein the values of the nodes in the output layer correspond to a probability of a word in the vocabulary being a correct transcription of an input. For a plurality of words in the vocabulary, the frequency of occurrence of the word in a general training set and the frequency of occurrence of the word in a custom dataset are determined.
  • the probability output by the output node for the word is multiplied by the frequency of occurrence of the word in the custom dataset, and the resulting product is divided by the frequency of occurrence of the word in the general training set to obtain a custom model probability.
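The multiply-and-divide adjustment above can be sketched in Python. The word list and frequency values are hypothetical, and the final renormalization step is our addition, not stated in the text:

```python
# Hypothetical frequency tables -- illustrative values, not from the patent.
general_freq = {"the": 0.05, "cardiomyopathy": 1e-7}
custom_freq = {"the": 0.04, "cardiomyopathy": 1e-4}

def customize(output_probs):
    """Scale each word's output probability by its frequency in the custom
    dataset divided by its frequency in the general training set, then
    renormalize so the adjusted values again sum to 1."""
    adjusted = {w: p * custom_freq[w] / general_freq[w]
                for w, p in output_probs.items()}
    total = sum(adjusted.values())
    return {w: v / total for w, v in adjusted.items()}
```

A rare domain word such as "cardiomyopathy" is boosted by its large custom-to-general frequency ratio, while common words are left nearly unchanged.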
  • a customization layer is provided in a neural network.
  • the customization layer may customize the output of the neural network for a custom vocabulary by adjusting the probabilities of each output of the neural network based on characteristics of the custom vocabulary and a general vocabulary.
  • the customization may be performed based on the observed frequency of each output in the custom vocabulary as compared to the observed frequency of each output in the general vocabulary.
  • the neural network is an end-to-end speech recognition system, end-to-end speech classification system, or end-to-end phoneme recognition system. In other embodiments, the neural network may perform tasks unrelated to speech recognition.
  • FIG. 1 illustrates an exemplary network environment where some embodiments of the invention may operate
  • FIG. 2 illustrates an end-to-end speech recognition system according to an embodiment
  • FIG. 3 illustrates an example of audio features produced by a front-end module according to an embodiment
  • FIG. 4 illustrates an example CNN stack architecture according to an embodiment
  • FIG. 5 illustrates an example RNN stack architecture according to an embodiment
  • FIG. 6 illustrates an example transcription output of an end-to-end speech recognition system according to an embodiment
  • FIG. 7 illustrates an end-to-end speech recognition system according to an embodiment.
  • FIG. 8 illustrates an end-to-end phoneme recognition system according to an embodiment.
  • FIG. 9A illustrates an iterative beam search according to an embodiment.
  • FIG. 9B illustrates exemplary radial basis functions used in an iterative beam search according to an embodiment.
  • FIG. 9C illustrates an example use of iterative beam search according to an embodiment.
  • FIG. 10 illustrates an example of looping training samples in a training batch that are shorter than a longest training sample.
  • FIGS. 11A-B illustrate an example attention mechanism for a neural network.
  • FIG. 12 illustrates an example of a general domain and a custom domain.
  • FIG. 13 illustrates an example system for predicting the weights of neural network nodes.
  • FIG. 14 illustrates an example customization layer of a neural network.
  • FIG. 15 illustrates an example method of training a neural network for a custom domain by selecting portions of a general training dataset to train on.
  • FIG. 16 illustrates an example training data augmentation and streaming system.
  • FIG. 17 illustrates an example process for parallelizing an inference task.
  • FIG. 18 illustrates an example method of generating an internal state representation of a neural network.
  • Embodiments described herein relate to end-to-end neural network speech recognition systems. Some disclosed embodiments form a single neural network from input to output. Because of this unitary architecture, the disclosed speech recognition systems are able to be trained solely by data driven techniques, eschewing laborious hand-tuning and increasing accuracy.
  • Components, or modules, shown in diagrams are illustrative of embodiments of the invention. It shall also be understood throughout this disclosure that components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including integrated within a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.
  • connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.
  • a service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated.
  • memory, database, information base, data store, tables, hardware, and the like may be used herein to refer to system component or components into which information may be entered or otherwise recorded.
  • (1) steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) steps may be performed in different orders; and (4) steps may be done concurrently.
  • FIG. 1 illustrates an exemplary network environment 100 where some embodiments of the invention may operate.
  • the network environment 100 may include multiple clients 110 , 111 connected to one or more servers 120 , 121 via a network 140 .
  • Network 140 may include a local area network (LAN), a wide area network (WAN), a telephone network, such as the Public Switched Telephone Network (PSTN), an intranet, the Internet, or a combination of networks.
  • Two clients 110 , 111 and two servers 120 , 121 have been illustrated for simplicity, though in practice there may be more or fewer clients and servers.
  • Clients and servers may be computer systems of any type. In some cases, clients may act as servers and servers may act as clients.
  • Clients and servers may be implemented as a number of networked computer devices, though they are illustrated as a single entity.
  • Clients may operate web browsers 130 , 131 , respectively, to display web pages, websites, and other content on the World Wide Web (WWW).
  • Clients 110 , 111 may also access content from the network 140 using applications, or apps, rather than web browsers 130 , 131 .
  • Servers may operate web servers 150 , 151 , respectively for serving content over the network 140 , such as the web.
  • the apparatuses and methods described in this application may be partially or fully implemented by one or more computer programs executed by one or more processors.
  • the computer programs include processor-executable instructions that are stored on at least one non-transitory tangible computer readable medium.
  • the computer programs may also include and/or rely on stored data.
  • FIG. 2 illustrates an end-to-end speech recognition system 200 according to an embodiment.
  • the example end-to-end speech recognition system 200 illustrated in FIG. 2 is configured to transcribe spoken word into written text.
  • Speech recognition system 200 comprises front-end module 201 , convolutional neural network (CNN) stack 202 , first fully-connected layer 203 , recurrent neural network (RNN) stack 204 , second fully-connected layer 205 , output neural network stack 206 , and optional customization layer 207 .
  • Neural networks comprise a plurality of neural network nodes organized in one or more layers. Each node has one or more inputs, an activation function, and an output. The inputs and output may generally be real number values. The inputs to the node are combined through a linear combination with weights and the activation function is applied to the result to produce the output.
  • the output may be transmitted as an input to one or more other nodes in subsequent layers.
  • Neural network nodes may be organized in one or more layers.
  • An input layer may comprise input nodes whose values may correspond to inputs to the neural network, without use of an activation function.
  • An output layer may comprise one or more output nodes corresponding to output from the neural network.
  • Neural network layers other than the input layer and output layer may be hidden layers, and the nodes in those layers may be referred to as hidden nodes.
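A single node's computation as described above, a linear combination of inputs and weights followed by an activation function, might be sketched as follows. The ReLU activation is our choice; the text does not specify one:

```python
import numpy as np

def node_output(inputs, weights, bias):
    # Combine the inputs through a linear combination with the weights,
    # then apply the activation function (ReLU here) to produce the output.
    z = float(np.dot(inputs, weights)) + bias
    return max(z, 0.0)
```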
  • end-to-end speech recognition system 200 may be roughly analogized to components of a traditional ASR system, though the components of end-to-end speech recognition system 200 are not so rigidly defined as in a traditional ASR system.
  • CNN stack 202 detects features of the input audio stream and RNN stack 204 classifies groups of features as words, roughly similar to an acoustic model and a pronunciation dictionary.
  • CNN stack 202 does not produce a discrete phoneme stream output, and RNN stack 204 does not expressly use a language model or hand-coded dictionary. Instead, the features produced by CNN stack 202 are entirely learned in the training process, and RNN stack 204 learns relationships between sounds and words through training as well. No hand-coded dictionaries or manual interventions are used throughout.
  • Each layer or stack of end-to-end speech recognition system 200 is described in further detail below.
  • Front-end module 201 produces acoustic features from audio input.
  • Front-end module 201 receives raw audio data and applies a series of transformations and filters to generate acoustic features suitable for speech recognition by the following neural networks.
  • the input audio is a recording of an utterance that may be segmented on relative silence such that the input audio comprises an entire utterance.
  • An utterance may be one or more words.
  • the input audio may be a 7-10 second long recording of a speaker speaking a word, phrase, or series of words and/or phrases.
  • the input audio may be an entire sentence.
  • the input audio is segmented based on time intervals rather than relative silence.
  • the input audio is segmented based on a combination of features, such as relative silence, time, and other features.
  • Front-end module 201 may filter the input audio to isolate or emphasize frequency bands relevant to speech recognition. For example, front-end module 201 may low-pass filter the input audio at a predetermined frequency to remove high frequency information beyond the range of speech. Similarly, front-end module may filter the input audio with high-pass filters, band-pass filters, dynamic range compressors, dynamic range expanders, or similar audio filtering techniques suitable for processing audio for speech recognition.
  • Front-end module 201 may then segment the input recording of an utterance into a series of frames.
  • the input utterance recording may be split into a series of frames of audio data 10 milliseconds long, such that one second of input audio may be split into 100 frames.
  • the frames may overlap.
  • one second of input audio may be divided into 100 frames that are 25 milliseconds in length, spaced at 10 millisecond intervals. Any frame duration, spacing, and overlap may be used as appropriate for any given implementation as determined by one skilled in the art.
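The 25-millisecond frames at 10-millisecond intervals described above can be sketched as follows; the 16 kHz sample rate in the usage below is illustrative, not from the text:

```python
import numpy as np

def frame_audio(samples, rate, frame_ms=25, hop_ms=10):
    # Split a 1-D signal into overlapping frames: 25 ms windows spaced at
    # 10 ms intervals, matching the example above.
    frame_len = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    n_frames = 1 + (len(samples) - frame_len) // hop
    return np.stack([samples[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```

At 16 kHz, one second of audio yields 400-sample frames every 160 samples, for 98 full frames.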
  • front-end module 201 may output raw audio information for consumption by subsequent layers. In other embodiments, front-end module 201 may further process the audio frames before outputting. For example, in some embodiments, front-end module 201 generates spectrograms of audio frames. The spectrograms for each frame may then be arranged sequentially, producing a two-dimensional representation of the input audio that reflects the frequency content over time. In this way, the front-end module may generate a visual, two-dimensional representation of the input audio for the following neural networks.
  • front-end module 201 generates other features of the input audio frames.
  • feature representations include: log-mel filterbanks, Mel-Frequency Cepstral Coefficients (MFCC), and perceptual linear prediction coefficients, among other similar acoustic feature representations.
  • an MFCC representation of each frame may be visualized as a linear vector similar to the spectrogram example above, and similarly rotated and stacked side-by-side to produce a 2-dimensional visual representation of the audio input over time.
  • the relevant parameters of front-end module 201 include the number of frames, the width and overlap of frames, the type of features determined, and the number of features per frame. Each of these parameters may be chosen by one skilled in the art for any given implementation.
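A minimal sketch of turning framed audio into the two-dimensional time-frequency representation described above, using a plain magnitude spectrum rather than MFCCs (the Hann window is our choice):

```python
import numpy as np

def spectrogram(frames):
    # Window each frame and take the magnitude of its FFT; stacking the
    # per-frame spectra produces a 2-D image of frequency content over time.
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))
```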
  • FIG. 3 illustrates an example of audio features produced by a front-end module such as front-end module 201 .
  • audio input 301 is divided into windows 302 a - n .
  • audio windows 302 a - n are illustrated in FIG. 3 .
  • In practice, audio windows would either abut or overlap such that the entire audio input is processed.
  • Each window of audio data is then processed by a filter 303 .
  • filter 303 produces an MFCC representation 304 of each window of audio data.
  • MFCC representations 304 a - n comprise 12 coefficients, but any number of coefficients may be used.
  • each coefficient in MFCC representations 304 a - n represents an intensity corresponding to some feature or quality of the audio stream.
  • a plurality of feature representations are joined together to form a single representation 305 of the entire audio input.
  • This representation 305 may be illustrated as a 2-dimensional image as shown in FIG. 3 .
  • Representations of more or fewer than 1 or 2 dimensions may also be used to represent frames, and frames may be represented in the system as tensors.
  • the term tensor is used to refer to a vector or matrix of any number of dimensions.
  • a tensor may have dimension 0 (scalar), dimension 1 (vector), dimension 2 (2-dimensional matrix), or any higher number of dimensions such as 3, 4, 5, and so on.
  • the multi-dimensional property of some tensors makes them a useful tool for representing neural networks and also the data representations between neural network layers.
  • CNN stack 202 receives the representation of the audio input from front-end module 201 .
  • CNN stack 202 processes the audio features to determine a first set of features. Specifically, CNN stack 202 generates a number of feature maps corresponding to a number of convolutional filters, where each convolutional filter represents some characteristic or feature of the audio input. This step may be regarded as roughly analogous to determining a phoneme representation of input audio; however, CNN stack 202 does not discretize the output to a set number of acoustic representations.
  • the features determined by CNN stack 202 are not limited to a predetermined set of phonemes. Because it is not so limited, CNN stack 202 can encode a wide range of information.
  • CNN stack 202 may include any number of convolutional layers, each including various size convolutional kernels.
  • the relevant hyperparameters of CNN stack 202 include the dimension and number of CNN layers, the dimension and number of convolutional kernels at each layer, the stride of the convolutional kernels, and the number and function of any pooling layers.
  • Convolutional kernels may be square, such as of size 5×5, or rectangular, such as of size 3×9, for example. Rectangular convolutional kernels that are ‘narrow’ along the time-axis may be more sensitive to features that are spread out over a wide range of frequencies but local to a short time period.
  • convolution kernels that are ‘wider’ along the time-axis may detect acoustic features that are confined to a relatively narrow audio band but are of longer duration in time.
  • Convolution kernels may also be referred to as windows, filters, or feature detectors.
  • the size of the convolutional kernel also determines the number of connections between the input layer and at least the first hidden layer of neural network nodes of the CNN.
  • Each node in the first hidden layer of the CNN has an input edge from each of the input values in the convolutional kernel centered on that node. For example, if the convolutional kernel has size 5×5, then a hidden neural network node in the first hidden layer has 25 inbound edges, one from each of the input values in a 5×5 square in the vicinity of the neural network node, and the hidden neural network node does not have inbound edges from other input values outside of the convolutional kernel.
  • the subsequent hidden layers of the same CNN stack or later CNN stacks operate in the same manner, but the inbound edges come not from the input values but from the preceding CNN layer.
  • Each subsequent neural network node in the CNN stack has inbound connections from preceding CNN nodes in only a local area defined around the subsequent neural network node, where the local area may be defined by the size of the convolutional kernel.
  • This property also implies that a given hidden layer node of a CNN also only has outbound edges to hidden layer nodes of the next layer that are in the vicinity of the given hidden layer node.
  • the outbound connections of a hidden layer node may also correspond to the size of the convolutional kernel.
  • a CNN is one type of locally connected neural network because the neural network nodes of each layer are connected only to nodes of the preceding layer of the neural network that are in the local vicinity of the neural network nodes.
  • a CNN may also be referred to as one type of sparsely connected neural network because the edges are sparse, meaning that most neural network nodes in a layer are not connected to the majority of neural network nodes in the following layer.
  • the aforementioned definitions may exclude the output or input layer as necessary given that the input layer has no preceding layer and the output layer has no subsequent layer.
  • a CNN is only one type of locally connected or sparsely connected neural network, and there are other types of locally connected or sparsely connected neural networks.
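The local connectivity described above falls out of a direct 'valid' convolution: each output value depends only on the input values under the kernel. A naive sketch (computing cross-correlation without kernel flipping, as CNN libraries do):

```python
import numpy as np

def conv2d_valid(x, kernel):
    # Each output node sees only the kernel-sized patch of inputs around it,
    # so a 5x5 kernel gives each output node exactly 25 inbound edges.
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i : i + kh, j : j + kw] * kernel)
    return out
```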
  • Convolutional layers may produce an output activation map that is approximately the same dimensionality as the input to the layer.
  • the convolutional kernel may operate on all or nearly all input values to a convolutional layer.
  • Convolutional layers may also incorporate a stride factor wherein the convolutional kernel may be shifted by 2 or more pixels per iteration and produce an activation map of a correspondingly reduced dimensionality. Stride factors for each layer of CNN stack 202 may be determined by one of skill in the art for each implementation.
  • CNN stack 202 may include pooling layers in between convolutional layers. Pooling layers are another mechanism to reduce dimensionality. For example, a pooling layer may operate on a 2×2 window of an activation map with a stride of 2 and select the maximum value within the window, referred to as a max pooling operation. This example pooling layer reduces the dimensionality of an activation map by a factor of 4. Other pooling dimensions may be used between convolutional layers to reduce dimensionality, for example 1×2, 1×3, or other pooling dimensions.
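The 2×2 max pooling operation with stride 2 can be sketched as:

```python
import numpy as np

def max_pool_2x2(activation_map):
    # Keep the maximum of each non-overlapping 2x2 block, reducing the
    # activation map's size by a factor of 4.
    h = activation_map.shape[0] // 2 * 2
    w = activation_map.shape[1] // 2 * 2
    a = activation_map[:h, :w]
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```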
  • the input to CNN stack 202 is all frames of audio features produced by front-end module 201 and no segmenting or windowing is involved.
  • convolutional kernel dimension, stride, and pooling dimensions may be selected so as to retain temporal information. In an embodiment, this is accomplished by reducing only the frequency dimension, such that the output of CNN stack 202 has a time dimension equal to its input.
  • CNN stack 202 produces a set of features corresponding to sounds in the audio input.
  • the input to CNN stack 202 is a segment of frames of audio features produced by front-end module 201 .
  • a context of frames before and/or after the output frame may be included in the segment.
  • CNN stack 202 may operate on a ‘window’ of the 5 previous frames and the following 5 frames, for a total of 11 frames. In this example, if there are 40 audio features per frame, CNN stack 202 would then operate on an input having dimensions of 11×40.
  • the output for a segment may be dimensioned smaller in the time dimension than its input.
  • CNN stack 202 may resize in the temporal dimension so as to produce a differently dimensioned output for each input segment of frames.
  • an embodiment of CNN stack 202 may have an input of dimension 11×40 and an output for each feature of width 1 in the time dimension.
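The segmenting described above, in which a window of context frames surrounds each central frame, might be sketched as follows. The function name and the edge-handling choice (skipping frames that lack a full context window) are illustrative assumptions:

```python
def frame_window(frames, context=5):
    """Collect `context` frames before and after each central frame,
    giving 2*context + 1 frames per segment; frames too close to the
    edges to have full context are skipped in this sketch."""
    segments = []
    for i in range(context, len(frames) - context):
        segments.append(frames[i - context:i + context + 1])
    return segments

frames = [[0.0] * 40 for _ in range(20)]   # 20 frames of 40 audio features
segments = frame_window(frames)
print(len(segments), len(segments[0]), len(segments[0][0]))  # 10 11 40
```

Each segment is an 11×40 input matching the example dimensions above.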
  • FIG. 4 illustrates an example CNN stack architecture according to an embodiment.
  • Acoustic feature representation 401 may be a representation such as an MFCC representation as illustrated in FIG. 3 .
  • Each horizontal division is a frame, and each vertical division indicates a different MFCC coefficient value.
  • a highlighted window 403 of 7 frames centered around a central frame 402 .
  • This segment of frames is then processed by one or more convolutional and pooling neural network layers that make up a convolutional neural network stack such as CNN stack 202 discussed above.
  • a single convolutional kernel 404 is illustrated, and a number of network layers as illustrated by network layers 403 a - c .
  • a final dataset 404 is produced corresponding to a number of features that describe input frame 402 .
  • the final dataset 404 may be a volume with a first dimension corresponding to time, a second dimension corresponding to features of the audio at a point in time, such as frequencies or coefficients, and a third dimension corresponding to various filters.
  • the illustrated number and arrangement of datasets and layers is for illustrative purposes only; it is to be understood that any combination of convolutional and/or pooling layers may be used in an implementation as determined by one of skill in the art.
  • first fully-connected layer 203 receives features from CNN stack 202 and produces a second set of features.
  • a fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network.
  • a fully-connected layer 203 comprises one or more fully-connected neural networks placed end-to-end. The term fully-connected comes from the fact that each layer is fully-connected to the subsequent layer.
  • a fully-connected neural network is one kind of densely connected neural network, where a densely connected neural network is one where most of the nodes in each layer of the neural network have edge connections to most of the nodes in the subsequent layer. The aforementioned definitions may exclude the output layer which has no outbound connections.
  • the first fully-connected layer 203 is implemented as a fully-connected neural network that is repeated across the entire segment that is output by the CNN stack 202 , and each copy of the fully-connected neural network accepts as input a single strided frame.
  • Strided frame refers to the frames output by the CNN stack 202 , which may be obtained by slicing the final dataset 404 in the time dimension so that each strided frame refers to a single point in time.
  • Each strided frame retains features of the audio at the point in time and features in the depth dimension created by the various convolutional filters.
  • Each copy of the fully-connected neural network shares the same parameters, in particular each of the weights of all of the nodes of the fully-connected neural network. This sharing allows for computational and memory efficiency because the size of the fully-connected neural network corresponds to a single strided frame rather than the segment, and one copy of the fully-connected neural network may be stored and reused. It should be understood that the repetition of the fully-connected neural network across the segment is a reuse of the neural network per strided frame and would not require actually creating a separate copy of the neural network in memory per strided frame.
  • the output of each fully-connected neural network is a tensor comprising features of the strided frame, which is input into the following layer.
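The reuse of a single fully-connected layer across strided frames described above can be illustrated with a minimal sketch. The helper names are assumptions, and a real implementation would express this as one batched matrix multiplication:

```python
def fully_connected(frame, weights, biases):
    """One fully-connected layer: every output node is connected to
    every input value of the strided frame."""
    return [sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weights, biases)]

def apply_shared_layer(strided_frames, weights, biases):
    """Repeat the SAME fully-connected layer across every strided frame
    in the segment: one set of weights is stored and reused, so no
    per-frame copy of the network is needed."""
    return [fully_connected(f, weights, biases) for f in strided_frames]

# Two strided frames of 4 features each, mapped to 2 features each:
weights = [[1, 0, 0, 0], [0, 1, 0, 0]]
out = apply_shared_layer([[1, 2, 3, 4], [5, 6, 7, 8]], weights, [0, 0])
print(out)  # [[1, 2], [5, 6]]
```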
  • First fully-connected layer 203 serves several functions. First, the dimensionality of the first fully-connected layer 203 may be selected so as to resize the output of CNN stack 202 . Second, the fully-connected stack may learn additional features that CNN stack 202 is not able to detect.
  • First fully-connected layer 203 may resize the output of CNN stack 202 for consumption by the subsequent stack.
  • CNN stack 202 may produce a high dimensioned output based on the number of feature maps used and the frequency context of the output.
  • the first fully-connected layer 203 may reduce the dimension of this output to reduce the number of parameters that subsequent stacks need to process. Further, this flexibility allows various implementations to optimize the hyperparameters of various stacks independently of one another while retaining compatibility between stacks.
  • First fully-connected layer 203 may also learn additional features.
  • first fully-connected layer 203 may learn features that CNN stack 202 is not sensitive to.
  • the first fully-connected layer 203 is not limited to local connections between nodes so concepts that require considering tensor values that are distant may be learned.
  • the first fully-connected layer 203 may combine information collected from multiple different feature maps generated by different convolutional kernels.
  • the output of the CNN stack 202 and first fully-connected layer 203 may be thought of as roughly analogous to a phoneme representation of the input audio sequence, even though no hardcoded phoneme model is used. The similarity is that these network layers produce an output that describes the acoustic features of the input audio in sequence.
  • the output is a series of short temporal axis slices corresponding to acoustic features in each audio segment or window.
  • the output of first fully-connected layer 203 is a representation of the activation of acoustic features over the entire time of the input.
  • the output from CNN stack 202 and first fully-connected layer 203 is a set of features that describe acoustic features of the audio input.
  • Recurrent Neural Network (RNN) stack 204 receives these features from first fully-connected stack 203 and produces a third set of features.
  • the input features comprise a set of tensors 501 a - n , one tensor corresponding to each strided frame, where each tensor produced by the first fully-connected layer represents features of the associated strided frame.
  • Each of the tensors 501 a - n is generated from the fully-connected neural network that operates per strided frame produced by the CNN stack 202 . All of the tensors may be iterated over by the RNN stack 204 in order to process the information in a sequential, temporal manner.
  • RNN stack 204 may be regarded as roughly analogous to a language model in that it receives acoustic features and outputs features related to words that correspond to acoustic features.
  • RNN stack 204 may include various types of recurrent neural network layers, such as Long Short-Term Memory (LSTM) neural network layers and/or Gated Recurrent Unit (GRU) neural network layers.
  • LSTM and GRU type recurrent neural network cells and layers include mechanisms for retaining or discarding information from previous frames when updating their hidden states.
  • LSTM and GRU type RNNs include at least one back loop where the output activation of a neural network enters as an input to the neural network at the next time step.
  • the output activation of at least one neural network node is an input to at least one neural network node of the same or a prior layer in a successive time step.
  • the LSTM or GRU compute a hidden state, comprising a vector, through a series of mathematical operations, which is produced as an output of the neural network at each time step.
  • the hidden state is passed as an input to the next time step of the LSTM or GRU.
  • an LSTM has three inputs at a particular time step: the hidden state passed from the previous time step, the output tensor value of the previous time step, and the input frame or tensor representation of the frame of the current time step. At each time step, the LSTM produces both a hidden state and output tensor value.
  • a GRU has two inputs at a particular time step: the hidden state passed from the previous time step and the input frame or tensor representation of the frame of the current time step. In a GRU, the hidden state and output tensor value are the same tensor and thus only a single tensor value is output.
  • the LSTM may comprise a forget gate layer comprising a neural network layer with a sigmoid activation function and a pointwise multiplication gate for determining which elements of the input hidden state to preserve.
  • the LSTM may comprise an update gate layer comprising a neural network layer with a sigmoid activation function and a neural network layer with a tanh activation function that are both input to a pointwise multiplication gate.
  • the product may be input to a pointwise addition gate with the hidden state to add data to the hidden state.
  • the LSTM may comprise an output gate layer comprising a neural network layer with a sigmoid activation function input to a pointwise multiplication gate with the other input being the hidden state after being passed through the tanh function. The result of this operation may be output as the tensor output of the LSTM at the current time step.
  • Other implementations and variations of an LSTM may also be used, and the LSTM is not limited to this embodiment.
  • the GRU may comprise an update gate layer for determining how much information from the prior hidden state to pass on to the future.
  • the update gate layer may comprise a pointwise addition gate and a neural network layer with a sigmoid activation function.
  • the GRU may comprise a reset gate layer for deciding how much prior hidden state information to forget.
  • the reset gate layer may comprise a pointwise addition gate and a neural network layer with a sigmoid activation function.
  • Other implementations and variations of a GRU may also be used, and the GRU is not limited to this embodiment.
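As a rough illustration of the GRU interface described above (one hidden-state input plus the current input, and a single output that doubles as the hidden state), here is a scalar toy version. The gate equations follow the standard GRU formulation, and all parameter names are illustrative, not taken from the disclosure:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, params):
    """One GRU time step on scalars. Two inputs: the prior hidden state
    `h` and the current input `x`. One output: the new hidden state,
    which is also the GRU's output tensor for this time step."""
    wz, uz, wr, ur, wh, uh = params
    z = sigmoid(wz * x + uz * h)               # update gate: how much to renew
    r = sigmoid(wr * x + ur * h)               # reset gate: how much history to forget
    h_cand = math.tanh(wh * x + uh * (r * h))  # candidate hidden state
    return (1.0 - z) * h + z * h_cand          # new hidden state == output

h = gru_step(x=0.8, h=0.0, params=(0.5, 0.4, 0.3, 0.2, 0.6, 0.5))
```

With all weights zero, the update gate sits at 0.5 and the candidate at 0, so the hidden state simply halves each step, which makes the gating behavior easy to verify by hand.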
  • RNN stack 204 processes the tensors representing the strided frames in sequence, and its output for each strided frame is dependent on previously processed frames.
  • RNN stack 204 may include either unidirectional or bidirectional RNN layers.
  • Unidirectional RNN layers operate in one direction in time, such that current-frame predictions are based only on previously observed inputs.
  • Bidirectional RNN layers are trained both forward in time and backward in time. Bidirectional RNNs may therefore make current-frame predictions based on both preceding frames and following frames.
  • the tensors corresponding to frames are processed sequentially by the RNN in a single direction such as front to back or back to front.
  • the tensors corresponding to frames may be processed in both directions, front to back and back to front, with the information produced from the forward and backward runs combined at the end of processing, such as by concatenation, addition, or other operations.
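A minimal sketch of the bidirectional processing just described, assuming a generic step function and combining the forward and backward outputs per time step by pairing them (concatenation being one of the combining options named above):

```python
def run_rnn(frames, step, h0=0.0):
    """Run a recurrent step over the frames in order, keeping each
    time step's output; the hidden state carries information forward."""
    outputs, h = [], h0
    for f in frames:
        h = step(f, h)
        outputs.append(h)
    return outputs

def bidirectional(frames, step):
    """Process the frames front-to-back and back-to-front, then pair
    the forward and backward outputs for each time step."""
    forward = run_rnn(frames, step)
    backward = list(reversed(run_rnn(list(reversed(frames)), step)))
    return list(zip(forward, backward))

# A toy step that just accumulates its inputs:
print(bidirectional([1, 2, 3], lambda f, h: h + f))  # [(1, 6), (3, 5), (6, 3)]
```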
  • FIG. 5 illustrates an example RNN stack architecture according to an embodiment.
  • Features 501 a - n are received from first fully-connected layer 203 .
  • each of features 501 a - n corresponds to a single strided frame.
  • These features are input into recurrent neural network 502 .
  • Recurrent neural network 502 is illustrated as ‘unrolled’ network elements 502 a - n , each corresponding to the input from one of features 501 a - n , to show the temporal operation of RNN 502 at each time step.
  • Recurrent neural network 502 is a bidirectional recurrent neural network, as illustrated by the bidirectional arrows connecting elements 502 a - n .
  • the diagram shows that data is passed from the RNN at the prior time step to the next time step.
  • data is passed from the RNN at the successive time step to the prior time step in a backward pass through the features 501 a - n .
  • Other embodiments may utilize unidirectional RNN architectures.
  • While recurrent neural network 502 is illustrated as a single layer, it is to be understood that the recurrent network may include any number of layers. For each time step, recurrent neural network 502 produces a set of features related to a word prediction 503 a - n at that time step. This set of features is expressed as a tensor or vector output and is directly input to subsequent layers.
  • a second fully-connected stack 205 receives the output features from RNN stack 204 and produces a word embedding. Similar to first fully-connected stack 203 , second fully-connected stack 205 serves several functions. In an embodiment, second fully-connected stack 205 reduces the dimensionality of the output of RNN stack 204 to something more concise. In an embodiment, second fully-connected stack 205 produces a word embedding of significantly reduced dimension compared to the output of RNN stack 204 . This word embedding contains information related to the word predicted for a given time frame, and also information regarding words around the predicted word.
  • Output stack 206 has an output node for each word of a vocabulary and a blank or null output. For each frame of input audio data, output stack 206 produces a probability distribution over its output nodes for a word transcription or a null output. For each spoken word in the input audio, one frame of the output sequence is desired to have a high-probability prediction for a word of the vocabulary, while all other frames of audio data that correspond to the word are desired to contain the null or blank output. The alignment of a word prediction with the audio of the word is dependent on the hyperparameters of the various stacks and the data used for training.
  • If the recurrent stack is unidirectional, the word prediction must come after a sufficient number of audio frames corresponding to the word have been processed, likely near or around the end of the spoken word. If the recurrent stack is bidirectional, the alignment of the word prediction may be more towards the middle of the spoken word, for example.
  • the learned alignments are dependent on the training data used. If the training data have word transcriptions aligned to the beginning of words, the RNN stack will learn a similar alignment.
  • FIG. 6 illustrates an example output of a transcription from an example output stack of an example end-to-end speech recognition system.
  • the output stack will produce a prediction of which word corresponds to the audio for each time frame.
  • the output 600 for an example time frame is illustrated as a table with words in the first column and corresponding probabilities in the second column.
  • the word “Carrot” has the highest prediction for this time frame with a weighted prediction of 0.90, or 90% likelihood.
  • a complete transcription output may be determined from the output of end-to-end speech recognition system 200 by choosing the highest probability predicted word at each frame.
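Choosing the highest-probability prediction at each frame and discarding blanks, as described above, might be sketched as follows. The `<blank>` token name and the example vocabulary are illustrative:

```python
def greedy_transcribe(frame_distributions, blank="<blank>"):
    """Pick the highest-probability output node at each frame and keep
    only the non-blank word predictions, yielding a transcription."""
    words = []
    for dist in frame_distributions:
        best = max(dist, key=dist.get)
        if best != blank:
            words.append(best)
    return words

# Per-frame probability distributions over the vocabulary plus blank:
frames = [
    {"<blank>": 0.7, "carrot": 0.3},
    {"carrot": 0.9, "karat": 0.05, "<blank>": 0.05},
    {"<blank>": 0.8, "cake": 0.2},
    {"cake": 0.6, "<blank>": 0.4},
]
print(greedy_transcribe(frames))  # ['carrot', 'cake']
```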
  • the output probabilities of end-to-end speech recognition system 200 may be modified by a customization layer 207 based on a set of custom prior probabilities to tailor the transcription behavior for certain applications. In this way, a single, general training set may be used for a number of different applications that have varying prior probabilities.
  • Customization layer 207 may be useful, for example, to resolve ambiguities between homophones, to increase priors for words that rarely occur in the training data but are expected to occur frequently in a particular application, or to emphasize particular proper nouns that are expected to occur frequently.
  • the custom priors applied may be determined from a statistical analysis of a corpus of data. For example, if end-to-end speech recognition system 200 is employed by a particular company, documents from that company may be analyzed to determine relative frequency of words. The output of end-to-end speech recognition system 200 may then be modified by these custom priors to reflect the language usage of the company. In this way, end-to-end speech recognition system 200 may be trained once on a general training dataset and customized for a number of particular use cases while using the same trained model.
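One way a customization layer could apply custom priors, sketched under the assumption that the priors act as multiplicative weights on the output distribution followed by renormalization (the exact operation is not specified above):

```python
def apply_custom_priors(probabilities, priors):
    """Reweight the model's output distribution by application-specific
    prior weights, then renormalize so probabilities sum to 1."""
    reweighted = {w: p * priors.get(w, 1.0) for w, p in probabilities.items()}
    total = sum(reweighted.values())
    return {w: p / total for w, p in reweighted.items()}

# "karat" is rare in general speech but common for, say, a jeweler,
# so a custom prior boosts it past its homophone:
out = apply_custom_priors({"carrot": 0.90, "karat": 0.10},
                          {"karat": 20.0})
```

The same generally trained model can thus be tailored per application by swapping in a different prior table, with no retraining.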
  • FIG. 7 illustrates an end-to-end speech classification system 700 according to an embodiment.
  • the example end-to-end speech classification system 700 illustrated in FIG. 7 is configured to classify spoken words into a set of classifications rather than generate a transcription.
  • end-to-end speech classification system 700 may classify a spoken word or set of words into classes such as semantic topic (e.g., sports, politics, news), gender (e.g., male/female), emotion or sentiment (e.g., angry, sad, happy, etc.), speaker identification (i.e., which user is speaking), speaker age, speaker stress or strain, or other such classifications.
  • An advantage of the disclosed neural network architecture over traditional ASR systems using discrete components is that the same neural network architecture described above may be repurposed to learn classifications, instead of speech recognition.
  • the neural network architecture learns the appropriate features automatically instead of requiring hand tuning.
  • the architecture of end-to-end speech classification system 700 is identical to that of end-to-end speech recognition system 200 as illustrated in FIG. 2 except for the output neural network stack 706 .
  • Front-end module 701 may be identical to front-end module 201
  • CNN stack 702 may be identical to CNN stack 202
  • first fully-connected layer 703 may be identical to first fully-connected layer 203
  • RNN stack 704 may be identical to RNN stack 204 .
  • the hyperparameters and number and order of hidden nodes of each particular layer or stack may be separately tuned for the classification task.
  • the configuration of each implementation will depend on the particular categorization goal and various implementation concerns such as efficacy, efficiency, computing platform, and other such factors.
  • the trained hidden nodes of any layer or component are learned through the training process and may differ between various implementations. For example, the convolutional kernels used by a gender classification implementation may be very different than those used by a transcription implementation.
  • end-to-end speech recognition system 200 may also be used for end-to-end classification system 700 , aside from a change to the output neural network stack 706 .
  • end-to-end speech recognition system 200 may be used for speech classification by simply changing the output layer, removing output network 206 and replacing it with output network 706 .
  • the output neural network stack 706 of end-to-end speech classification system 700 contains categories related to the classification scheme being used rather than words in a vocabulary.
  • an output neural network stack 706 of an example end-to-end speech classification system 700 may have two output nodes, one for male and one for female. Alternatively, a single output node may be used for the binary classification of male or female. The output of this example would be to classify spoken words as either male or female. Any number of classifications may be used to classify speech by output neural network stack 706 .
  • a single output node may be provided in output layer 706 for each potential classification, where the value of each output node is the probability that the spoken word or words corresponds to the associated classification.
  • a customization layer may alter predicted outputs based on some external guidance, similar to customization layer 207 .
  • FIG. 8 illustrates an end-to-end phoneme recognition system 800 according to an embodiment.
  • the example end-to-end phoneme recognition system 800 illustrated in FIG. 8 is configured to generate a set of phonemes from audio rather than generate a transcription.
  • end-to-end phoneme recognition system 800 may generate a sequence of phonemes corresponding to spoken words rather than a transcription of the words.
  • a useful application of the end-to-end phoneme recognition system 800 is for addressing the text alignment problem, in other words, aligning an audio file with a set of text that is known to correspond to the audio. Text alignment may be used to split training examples that comprise lengthy audio files with lengthy corresponding text transcripts into shorter training examples that are easier to fit into computer memory. By performing text alignment, portions of the audio file may be associated with their corresponding portions of the text transcript. These portions may then be extracted or used as points of division and used as shorter training examples.
  • end-to-end phoneme recognition system 800 is identical to that of end-to-end speech recognition system 200 as illustrated in FIG. 2 and end-to-end speech classification system 700 as illustrated in FIG. 7 except for the output neural network stack 806 .
  • Front-end module 801 may be identical to front-end module 201
  • CNN stack 802 may be identical to CNN stack 202
  • first fully-connected layer 803 may be identical to first fully-connected layer 203
  • RNN stack 804 may be identical to RNN stack 204 .
  • the hyperparameters and number and order of hidden nodes of each particular layer or stack may be separately tuned for the phoneme recognition task.
  • the configuration of the implementation will depend on implementation concerns such as efficacy, efficiency, computing platform, and other such factors.
  • the trained hidden nodes of any layer or component are learned through the training process and may differ between various implementations. For example, the convolutional kernels used by a phoneme recognition implementation may be very different than those used by a transcription implementation.
  • end-to-end speech recognition system 200 and end-to-end speech classification system 700 as shown in FIGS. 2-7 and as described in the related sections of the description may also be used for end-to-end phoneme recognition system 800 , aside from a change to the output neural network stack 806 .
  • end-to-end speech recognition system 200 may be used for phoneme recognition by simply changing the output layer, removing output network 206 and replacing it with output network 806 .
  • the output neural network stack 806 of end-to-end phoneme recognition system 800 contains phonemes rather than words in a vocabulary.
  • one output node may be provided in the output layer 806 per phoneme, where the value of each output node is the probability that the audio input corresponds to the associated phoneme.
  • 40 phonemes may be provided via a total of 40 nodes in the output layer 806 .
  • other numbers of phonemes may be provided such as 26, 36, 42, or 44.
  • a customization layer may alter predicted outputs based on some external guidance, similar to customization layer 207 .
  • the phoneme recognition system 800 may be used to perform text alignment.
  • An audio file and a corresponding text transcript are provided, and it is desired to match the corresponding audio features to the appropriate text.
  • the audio file may be processed through phoneme recognition system 800 to produce a predicted sequence of audio phonemes.
  • the text file may also be processed to translate the textual words to text phonemes.
  • the text file may be converted to phonemes by iterating over the text and using known mappings of words to the corresponding phonemes. Alternatively, mappings from syllables to phonemes or from sequences of characters to phonemes may be used and may be applied to the text iteratively.
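The word-to-phoneme translation just described might look like the following sketch, iterating over the text and applying a known word-to-phoneme mapping. The lexicon entries are illustrative ARPAbet-style examples, not from the disclosure:

```python
def text_to_phonemes(text, lexicon):
    """Convert a transcript into a phoneme sequence by iterating over
    the words and looking each up in a word-to-phoneme mapping; unknown
    words fall back to a placeholder token."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(lexicon.get(word, ["<unk>"]))
    return phonemes

lexicon = {"the": ["DH", "AH"], "sick": ["S", "IH", "K"]}
print(text_to_phonemes("the sick", lexicon))  # ['DH', 'AH', 'S', 'IH', 'K']
```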
  • FIG. 9A illustrates an iterative beam search that is used in some embodiments.
  • the mapping of the audio phonemes and text phonemes may be set in a few possible ways.
  • the text phonemes could be assumed to be evenly spaced in time and mapped to the audio phoneme at the corresponding time stamp of the audio file.
  • an estimated distribution of text phonemes over time may be determined based on the rate of speech in the audio file and regions of dead silence or high density talking.
  • An estimated time stamp for each text phoneme may be derived from this distribution, and each text phoneme may then be mapped to the audio phoneme at the corresponding time stamp of the audio file.
  • the audio phonemes and text phonemes could be matched one-to-one starting from the beginning of the audio phonemes and beginning of the text phonemes until the number of phonemes is exhausted.
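The first initialization strategy above (text phonemes assumed evenly spaced in time and mapped to the audio phoneme at the corresponding timestamp) can be sketched as follows; the function name and the nearest-timestamp matching rule are assumptions:

```python
def initial_alignment(num_text_phonemes, audio_timestamps):
    """Assume the text phonemes are evenly spaced over the audio's
    duration and map each one to the index of the audio phoneme whose
    timestamp is closest."""
    duration = audio_timestamps[-1]
    mapping = []
    for i in range(num_text_phonemes):
        t = duration * (i + 0.5) / num_text_phonemes  # even spacing in time
        nearest = min(range(len(audio_timestamps)),
                      key=lambda j: abs(audio_timestamps[j] - t))
        mapping.append(nearest)
    return mapping

# Four text phonemes against audio phonemes detected at 0s, 1s, 2s, 3s:
print(initial_alignment(4, [0.0, 1.0, 2.0, 3.0]))  # [0, 1, 2, 3]
```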
  • the first iteration of the iterative beam search is represented by the starting node of the search at layer 901 .
  • mappings or alignments of audio phonemes to text phonemes from the prior iteration are used as a starting point and then changed to create multiple new mappings or alignments, which are known as candidates.
  • the candidates are scored and the n best-scoring candidates are selected for expansion at the next level of the iterative beam search, where n is the branching factor of the iterative beam search.
  • Layer 902 is the next layer following starting layer 901 of the iterative beam search.
  • Each of the nodes at layer 902 is generated by adjusting the alignment provided at the starting node in layer 901 .
  • the best n nodes in layer 902 are selected according to a heuristic scoring function, as shown by the nodes highlighted by the rectangles in FIG. 9A .
  • Candidates at layer 903 are created by using the selected best n nodes at layer 902 as a starting point and adjusting the alignments provided at those nodes. Nodes at layer 902 that were not selected for the set of best n are not expanded and not used as the starting point for adjustments. Therefore, iterative beam search is not guaranteed to find the optimal solution because it prunes parts of the tree during the search. However, the iterative beam search performs well in practice and is computationally efficient.
  • the candidates are again scored and the n best scoring are again expanded for the next level.
  • the process may continue until a stopping condition is reached.
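The expand-score-prune loop of the iterative beam search described above can be sketched generically; the toy expansion and scoring functions in the usage example are illustrative only:

```python
def beam_search(start, expand, score, width, iterations):
    """Skeleton of an iterative beam search: expand each kept candidate,
    score all children, and keep only the `width` best for the next
    iteration. Pruned branches are never revisited, so the search is
    efficient but not guaranteed to find the optimal solution."""
    beam = [start]
    for _ in range(iterations):
        children = [child for parent in beam for child in expand(parent)]
        if not children:
            break
        beam = sorted(children, key=score, reverse=True)[:width]
    return max(beam, key=score)

# Toy search over integers: each node expands to n+1 and n+2,
# and the score is the value itself.
best = beam_search(0, lambda n: [n + 1, n + 2], score=lambda n: n,
                   width=2, iterations=3)
print(best)  # 6
```

For text alignment, `start` would be the initial phoneme mapping, `expand` would perturb alignments, and `score` would be the heuristic described below.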
  • the process stops when the number of matching phonemes between the audio phonemes and text phonemes does not change at the next iteration.
  • a novel feature of the iterative beam search is the use of the parent alignment from the prior iteration as a hint to the nodes at the next level.
  • the hint increases the score of candidates that are closer to the alignment of the prior mapping and decreases the score of candidates that are farther from the alignment of the prior mapping.
  • the hint is implemented by increasing the value of the scoring function when a candidate alignment changes little from its parent alignment but decreasing the value of the scoring function when a candidate alignment changes a lot from its parent alignment.
  • the scoring function for evaluating candidate alignments produces a score based on the number of matching phonemes, that is, the number of audio phonemes and text phonemes that are mapped to each other and are the same phoneme; the number of missed phonemes, meaning the number of audio phonemes or text phonemes that are not mapped to any phoneme in the other set; and the distance from the hint, where the hint is the alignment at the parent iteration of the beam search.
  • the distance from the hint is evaluated by iterating over the audio phonemes or text phonemes and producing a score for each of the phonemes.
  • the score is higher when the audio phoneme or text phoneme has stayed in the same position or changed position only a little and lower when the audio phoneme or text phoneme has moved to a significantly farther position, where the distance may be measured by, for example, time or number of phoneme positions moved.
  • the per-phoneme scores are then combined, such as by summation, to produce a score for the distance from the hint.
  • the hint may act as a weight keeping the children alignments closer to the parent alignment.
  • the distance score for phonemes may be implemented with a radial basis function (RBF).
  • the RBF accepts as input the distance between the phoneme's location in the parent alignment and its location in the new candidate alignment. When the distance is zero, the RBF is at its peak value. The RBF is symmetric around the origin, and the value may drop steeply for input values farther from the origin.
  • the parameters of the RBF may be adjusted between iterations of the iterative beam search to make the curve steeper at later iterations of the beam search. As a result, the penalty in the scoring function for the phoneme's current location not matching its location in the parent alignment increases in later iterations.
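A Gaussian radial basis function is one plausible concrete form for the distance score described above; the exact functional form and width values here are illustrative assumptions, not taken from the disclosure:

```python
import math

def rbf_hint_score(distance, width):
    """Gaussian RBF: peaks at distance 0 (candidate keeps its parent
    position), symmetric about the origin, falling off as a phoneme
    moves farther from its position in the parent alignment. A smaller
    `width` makes the curve steeper."""
    return math.exp(-(distance ** 2) / (2.0 * width ** 2))

early_width, late_width = 4.0, 1.0  # broader RBF early, steeper RBF later
# Moving a phoneme 3 positions is tolerated early but penalized late:
print(rbf_hint_score(3, early_width))  # ~0.755
print(rbf_hint_score(3, late_width))   # ~0.011
```

Shrinking the width across iterations reproduces the behavior illustrated in FIG. 9B: late-stage candidates are increasingly held close to their parent alignment.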
  • FIG. 9B illustrates two RBFs, a broader RBF on the left that may be used in earlier iterations of the iterative beam search and a steeper RBF on the right that may be used in later iterations of the iterative beam search.
  • the illustrated RBFs are exemplary and other RBFs and non-RBF functions may be used for scoring distance between a phoneme's prior alignment and the current alignment.
  • FIG. 9C illustrates an embodiment of the text alignment algorithm using iterative beam search using a well-known tongue twister.
  • a mapping between audio phonemes and text phonemes is created. The initial mapping is close but not exactly correct.
  • the alignments of the phonemes are adjusted from the initial mapping and the new candidate alignments are rescored.
  • a candidate alignment 1A is created, which matches the phonemes for “the” and “sixth” but misses several other phonemes and has unmatched phonemes for “sixth,” “sheep's”, and “sick.”
  • the candidate alignment 1A moves the phonemes two words to the right from the parent alignment, which is lower scoring than if the phonemes were moved a smaller distance.
  • candidate alignment 1B has a higher score, according to the heuristic scoring function, than candidate alignment 1A. It matches a higher number of phonemes and has no missing phonemes. Moreover, the phonemes were moved a smaller distance from the location of the phonemes in the parent alignment (only moved one word to the left).
  • the example shown in FIG. 9C is illustrative only and other embodiments may operate in a different manner and use different scoring functions.
  • Iterative beam search may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • all layers and stacks of an end-to-end speech recognition system 200 , end-to-end speech classification system 700 , or end-to-end phoneme recognition system 800 are jointly trained as a single neural network.
  • end-to-end speech recognition system 200 , end-to-end speech classification system 700 , or end-to-end phoneme recognition system 800 may be trained as a whole, based on training data that contains audio and an associated ground-truth output, such as a transcription.
  • training may use stochastic gradient descent with initial weights randomly initialized.
  • training may use back propagation to adjust the weights of the neural network nodes in the neural network layers by using the partial derivative of a loss function.
  • the loss function may be represented by J(W) = (1/m) Σᵢ L(ŷᵢ, yᵢ), where W denotes the weights of the neural network, ŷᵢ denotes the output of the system for the i-th training example, and yᵢ denotes the known ground-truth value for that example
  • the value of the loss function depends on the training examples used and the difference between the output of the system 200 , system 700 , or system 800 and the known ground-truth value for each training example.
  • An optional regularization expression may be added to the loss function in which case the value of the loss function may also depend on the magnitude of the weights of the neural network.
  • Backpropagation may be used to compute the partial derivative of the loss function with respect to each weight of each node of each layer of the neural network, starting from the final layer and iteratively processing the layers from back to front.
  • Each of the weights may then be updated according to the computed partial derivative by using, for example, gradient descent. For example, a percentage of the weight's partial derivative, or gradient, may be subtracted from the weight, where the percentage is determined by a configurable learning rate.
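The weight update described above, subtracting a percentage of each weight's gradient as determined by the learning rate, can be sketched as follows; the function name and numbers are illustrative:

```python
def sgd_update(weights, gradients, learning_rate=0.01):
    """Update each weight by subtracting a percentage of its partial
    derivative (gradient), where the percentage is set by the
    configurable learning rate."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Each weight moves opposite the sign of its gradient.
updated = sgd_update([0.5, -0.3], [2.0, -1.0], learning_rate=0.1)
```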
  • training is performed on a batch of utterances at a time.
  • the utterances in a training batch must be of the same length. Having samples of the same length may simplify tensor operations performed in the forward propagation and backward propagation stages, which may be implemented in part through matrix multiplications with matrices of fixed dimension. For the matrix operations to be performed, it may be necessary that each of the training samples have the same length.
  • the batch of training samples may be created by splitting an audio file into utterances, such as 7-10 second long portions which may correspond to a word, phrase, or series of words and/or phrases. In an audio file, naturally some utterances may be longer or shorter than others. In an embodiment where training samples must be the same length, techniques may be used to adjust the length of some of the samples.
  • the length of training samples has been adjusted by padding shorter samples with zeros or other special characters indicating no data. While this allows creating training samples of the same size, the zeros or special characters may lead to artifacts in the model and cause slower training.
  • FIG. 10 illustrates an example of looping each of the shorter training samples in a training batch so that the shorter training samples are repeated until they are the same length as the longest training sample.
  • a set of training samples is created by splitting an audio file.
  • the training samples are processed by front-end module 201 to create a sequence of frames comprising each training sample, where the frames may be of any of the types described above such as log-mel filterbanks, MFCC, perceptual linear prediction coefficients, or spectrograms.
  • Each of the training samples may be stored as a row of tensor 1000 to create a training batch.
  • the length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed.
  • each of the shorter samples 1001 , 1003 , 1004 , 1005 , 1006 in the batch are repeated until they are the same length as the longest sample 1002 , so that every row of the tensor has the same length.
  • the shorter samples are repeated exactly in all of their elements starting from the first element through the last element of the sample.
  • the last repetition of the sample may only be a partial repetition until the desired length is reached.
  • the partial repetition is a repetition of the shorter sample starting from the first element and iteratively repeating through subsequent elements of the sample until the desired length is reached.
  • shorter sample 1001 is repeated k times, where k is the number of frames of the longest sample 1002 divided by the number of frames of shorter sample 1001, rounded up to the nearest integer, with the final repetition truncated to the desired length
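The looping of FIG. 10 can be sketched as follows, with lists of frame indices standing in for frame tensors; `loop_pad` is a hypothetical helper name:

```python
def loop_pad(sample, target_len):
    """Repeat a shorter sample starting from its first element until it
    reaches target_len; the last repetition may be partial."""
    reps = -(-target_len // len(sample))       # ceiling division = k
    return (sample * reps)[:target_len]

# Toy batch: each list stands in for one training sample's frames.
batch = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10], [11, 12]]
longest = max(len(s) for s in batch)           # longest sample is unchanged
padded = [loop_pad(s, longest) for s in batch]
# [[1, 2, 3, 1, 2, 3, 1], [4, 5, 6, 7, 8, 9, 10], [11, 12, 11, 12, 11, 12, 11]]
```

Every row now has the same length, so the batch can be stacked into a single tensor like tensor 1000.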
  • each row may be a multi-dimensional tensor, such as when the frames in the rows are multi-dimensional tensors.
  • the training samples of a training batch are stored as rows in a single tensor. In other embodiments, the training samples are not stored in a single tensor.
  • the training samples may be stored as a list or set and input into the neural network one by one.
  • the CNN layer (such as CNN layer 202 , CNN layer 702 , or CNN layer 802 ) is of a fixed size. In an embodiment, the CNN layer accepts input tensor representations up to a fixed length, and the longest sample in a training batch is selected to be less than the fixed length of the CNN layer.
  • a ground-truth output value may be provided in tensor 1000 attached to each of the frames of the training samples in tensor 1000 .
  • the ground-truth output values may also be repeated for the shorter samples, when the frames of the shorter samples are repeated in tensor 1000 .
  • a second tensor separate from tensor 1000 , is provided with the ground-truth output values, instead of storing the ground-truth values in tensor 1000 .
  • the ground-truth output values in the second tensor may be repeated for shorter samples just as with tensor 1000 .
  • the ground-truth output values in the second tensor are not repeated, even though the corresponding training samples in tensor 1000 are repeated.
  • Padding the shorter training samples by repetition has several advantages over padding with zeros or special characters indicating no data.
  • When zeros or other meaningless data are used, no information is encoded, and computation time is wasted in processing that data, leading to slower learning or model convergence.
  • By repeating the input sequence, the neural network can learn from all elements of the input, and there is no meaningless or throw-away padding present. The result is faster convergence and learning, better computational utilization, and better-behaved, regularized models.
  • inference is performed on a tensor similar to tensor 1000 with multiple samples obtained by splitting an audio file. Each sample may be stored in a row of the tensor. The same process described above for training may be applied during inference. A longest sample may be unchanged, and each of the shorter samples may be repeated until they are the same length as the longest sample so that every row of the tensor is the same length. The tensor, with the repetitions of shorter samples, may be input to the neural network for inference.
  • the technique of looping shorter training samples in a training batch may be used in a variety of machine learning applications and is not limited to use for the application of speech recognition.
  • FIGS. 11A-B illustrate an example attention mechanism for a neural network, called “Neural Network Memory,” that may be used in end-to-end speech recognition system 200 , end-to-end speech classification system 700 , end-to-end phoneme recognition system 800 , or other neural networks.
  • One problem with neural networks and other machine learning techniques is that the size of the machine learning model constrains the amount of knowledge that can be learned. It is one version of the mathematical pigeon hole principle, which states that if n items are put into m containers, with n>m, then one container must contain more than one item.
  • a machine learning model that is trying to learn a complex decision boundary on a large amount of data cannot, in general, learn the complex decision boundary exactly if the machine learning model is significantly smaller in size than the amount of data being trained on.
  • various components of the neural network, such as weights and hidden nodes, become overloaded and must try to learn more than one function, causing the learning rate of the neural network to slow down significantly over time as more training examples are seen.
  • the quality of the machine learning model that is learned by the neural network may plateau or even become worse.
  • Neural Network Memory addresses this problem by creating an expert knowledge store, which is a data store in memory that stores expert neural network layer portions that may be inserted into the neural network at the right time.
  • the expert knowledge store is a database.
  • the expert neural network layer portions may be a portion of a neural network layer or an entire neural network layer.
  • the expert neural network layer portions may learn specialized functions that apply in specific conditions and be swapped in and out of the neural network automatically when those conditions are detected.
  • Example neural network 1100 is a fully-connected neural network with multiple layers of hidden states.
  • Neural network layer portion 1110 is a selector, and neural network layer portion 1120 is a gap with no hidden nodes that is filled by swapping expert neural network layer portions in and out.
  • forward propagation occurs as normal.
  • When the gap 1120 is reached, forward propagation cannot continue until an expert layer is inserted.
  • forward propagation occurs through selector neural network layer portion 1110 as normal.
  • the activation outputs of the nodes of the selector neural network layer portion 1110 are used as a query to find the expert neural network layer to insert into gap 1120 .
  • Expert knowledge store 1130 stores selectors 1115 that each serve as an index for one expert neural network layer portion 1125 that corresponds to the selector.
  • Each expert neural network layer may comprise the weights for the inbound edges to the nodes of the expert neural network layer and the activation function of the nodes.
  • the activation outputs of the nodes of the selector neural network layer portion 1110 are stored in a tensor.
  • the activation outputs are output from the activation function of each node.
  • Each element of the tensor may correspond to one node output.
  • In selector neural network layer portion 1110, there are three nodes, which means that there are three output values stored in the tensor.
  • the tensor of activation outputs is compared with all of the selectors 1115 in the expert knowledge store 1130 . In an embodiment, the comparison is performed by using a distance metric. In an embodiment, the distance metric is the cosine similarity between the tensor of activation outputs and a selector 1115 .
  • the distance metric is the dot product between the tensor of activation outputs and a selector 1115 .
  • the closest selector 1115 according to the distance metric is chosen as the correct row of the expert knowledge store.
  • the expert neural network layer associated with the closest selector 1115 is then inserted into the neural network 1100 in the gap 1120 .
  • forward propagation continues through the neural network 1100 just as if the expert neural network layer were a permanent layer of the neural network 1100 . If the neural network 1100 is performing inference, then after neural network 1100 produces its output, the expert neural network layer may be deleted from portion 1120 so that portion 1120 is once again empty and ready to be filled in at the next iteration.
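The lookup step described above can be sketched as follows, using cosine similarity as the distance metric; the toy store, selector vectors, and expert labels are illustrative assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def lookup_expert(activations, store):
    """Query the expert knowledge store: the selector closest to the
    tensor of activation outputs (by cosine similarity) is chosen, and
    its expert layer is returned for insertion into the gap."""
    best = max(store, key=lambda selector: cosine(activations, selector))
    return store[best]

# Toy store: each selector row indexes one expert layer (a label here
# standing in for the layer's weights and activation functions).
store = {
    (1.0, 0.0, 0.0): "expert_A",
    (0.0, 1.0, 0.0): "expert_B",
}
expert = lookup_expert((0.9, 0.1, 0.0), store)   # nearest selector wins
```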
  • training of the expert neural network layer and the selector may be performed.
  • the output of the neural network may be compared with the ground-truth output associated with the input.
  • Backpropagation is performed based on the difference between those two values, the ground-truth output and the actual output of the neural network.
  • the backpropagation is performed through the expert neural network layer inserted into gap 1120 just as if the expert neural network layer was a permanent part of neural network 1100 and adjusts the weights of each of the nodes of the expert neural network layer through training.
  • the updated expert neural network layer is stored back in the expert knowledge store, overwriting the prior version.
  • the backpropagation trains the expert neural network layer to become more accurate, for those conditions where it is inserted in the network, and allows it to become specialized for particular use cases.
  • the selector associated with the expert neural network layer is trained to become more similar to the tensor of activation outputs from selector neural network layer portion 1110 . This process allows the selectors to become specialized to the correct conditions.
  • the selector is adjusted pointwise to become more similar to the values of the tensor of activation outputs from selector neural network layer portion 1110 , such as by reducing the distance between the selector and tensor in vector space.
  • a selector learning rate may be set to control the rate at which selectors are adjusted and may be a scalar value.
  • the values of the selector are changed by a percentage of the distance between the selector and the tensor of activation outputs multiplied by the selector learning rate. In an embodiment, the values of the selector are changed by a fixed value in the direction of the tensor of activation outputs multiplied by the selector learning rate.
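The pointwise selector adjustment can be sketched as follows, moving each selector value a fraction of the way toward the corresponding activation output, scaled by the selector learning rate; the names and values are illustrative:

```python
def update_selector(selector, activations, selector_lr=0.1):
    """Adjust the selector pointwise toward the tensor of activation
    outputs, reducing the distance between them in vector space."""
    return [s + selector_lr * (a - s) for s, a in zip(selector, activations)]

selector = [0.0, 1.0, 0.0]
activations = [0.9, 0.1, 0.0]
updated = update_selector(selector, activations)   # closer to activations
```

Over repeated training iterations, this pulls each selector toward the activation patterns under which its expert layer is chosen, specializing the selector to those conditions.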
  • the selector neural network layer portion 1110 and gap 1120 for inserting the expert neural network layer are two halves of the same neural network layer. In other embodiments, the relative location of these portions may be different. They can be of different sizes and do not need to be exactly half of a neural network layer. Moreover, the selector neural network layer portion 1110 and the gap 1120 are not required to be in the same layer.
  • Neural Network Memory may be used in neural network 1150, where the selector neural network layer 1160 is a full neural network layer and a gap 1170 for insertion of an expert neural network layer is a full neural network layer.
  • the process described with respect to neural network 1100 is the same, except that the expert knowledge store 1180 stores selectors corresponding to activation outputs for an entire layer and the expert neural network layer portions are entire neural network layers.
  • the selector neural network layer 1160 directly precedes the portion 1170 for inserting the expert neural network layer.
  • the selector neural network layer 1160 and the gap 1170 for inserting the expert neural network layer may be in different relative locations.
  • Neural Network Memory is used in the first fully-connected layer 203 , 703 , 803 . In an embodiment, Neural Network Memory is used in the second fully-connected layer 205 , 705 , 805 . Although Neural Network Memory has been illustrated in fully-connected neural networks 1100 , 1150 it may be used in any other form of neural network, such as CNN layers 202 , 702 , 802 or RNN layers 204 , 704 , 804 . Moreover, multiple selector neural network layers and gaps for inserting expert neural network layers may exist in the same neural network.
  • the size of expert knowledge store 1130 , 1180 increases over time as more training examples are seen by the neural network. As more training is performed, more expert neural network layers are expected to be needed to address the pigeon hole principle.
  • a counter stores the number of training examples that have been run through the neural network. The counter is incremented with each new training example.
  • a threshold, which may be a threshold value or threshold function, defines the points at which the expert knowledge store increases in size. When the counter of training examples exceeds the threshold, one or more new rows are added to the expert knowledge store. Each row includes a selector and an associated expert neural network layer.
  • New selectors and expert neural network layers may be initialized to random values, may be initialized as an average of the rows above it, or may be initialized with values from existing neural network layer portions of the neural network.
  • the growth rate at which new rows are added to the expert knowledge store 1130 , 1180 decreases over time.
  • the growth rate is, for example, the rate at which new expert neural network layers are added to the store. As more training examples are seen, the rate at which new information is learned is expected to decrease because more and more of the variations in the training data will have already been seen.
  • the growth rate at which rows are added to the expert knowledge store 1130 , 1180 is inversely proportional to the total number of training examples ever processed by the neural network.
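A growth schedule inversely proportional to the number of training examples processed can be sketched as follows; the constant `c`, the function name, and the minimum of one row are illustrative assumptions:

```python
def rows_to_add(total_examples, c=1000.0):
    """Growth rate inversely proportional to the total number of
    training examples ever processed: many rows are added early in
    training, and few are added late in training."""
    return max(1, round(c / total_examples))

early_growth = rows_to_add(100)      # early training: many new rows
late_growth = rows_to_add(100000)    # late training: rarely more than one
```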
  • Neural Network Memory may be used in a variety of machine learning applications and is not limited to use for the application of speech recognition.
  • FIG. 12 illustrates an example of a general domain 1210 and a custom domain 1220 .
  • Neural networks such as end-to-end speech recognition system 200 , end-to-end speech classification system 700 , and end-to-end phoneme recognition system 800 , may be trained on a general dataset, which trains them to perform in a general domain 1210 for multiple possible applications or situations.
  • the general domain 1210 is the domain learned by learning across a set of training examples that come from a plurality of different datasets. The different datasets may be aggregated into a general training set.
  • Advantages of training a neural network for a general domain 1210 include the ability to use more training data and to build a model that may work well in multiple situations.
  • a custom domain 1220 may differ from the general domain 1210 in numerous aspects, such as frequencies of words, classifications, and phonemes, audio features (such as background noise, accents, and so on), pronunciations, new words that are present in the custom domain 1220 but unseen in the general domain 1210 , and other aspects.
  • the statistical distribution of audio examples in general domain 1210 may differ from the distribution in custom domain 1220 . It may be desirable to customize the neural network for the custom domain 1220 , which can potentially improve performance significantly in the custom domain 1220 .
  • the custom domain 1220 may include a set of training examples from the custom domain 1220 .
  • a training set may not be available for custom domain 1220 and only some information about the distribution in custom domain 1220 may be known, such as a list of frequent words and their frequencies.
  • the neural network trained on the general training set may be referred to as the general model and the neural network customized for the custom domain may be referred to as the custom model.
  • An example of a custom domain 1220 for speech recognition is performing speech recognition on the phone calls of a particular company. Some words in the custom domain 1220 are likely to have a higher frequency in the domain of phone calls for the company than for general speech recordings. It is likely that the name of the company and names of employees will occur with higher frequency in the custom domain 1220 than in general. Moreover, some words in the custom domain may not exist in a general training set, such as the names of the company's products or brands.
  • customization for a custom domain 1220 has been performed by first training a neural network with a general training set to build a general model and then training the neural network on a set of training examples from the custom domain 1220 to customize it.
  • Significant downsides of this approach are that there may not be sufficient data from the custom domain 1220 to customize the neural network by training and that the process of re-training may be slow.
  • Techniques herein address this problem and allow more effective customization of a neural network for a custom domain 1220, more quickly and even when only limited custom training data is available.
  • FIG. 13 illustrates an example supervised learning approach for predicting the weights of neural network nodes to improve performance in a custom domain.
  • the predicted weights may be used to replace weights in a neural network that has been trained on a general training set in order to customize the neural network for a custom domain.
  • a machine learning model, separate from the neural network, is trained to predict weights of nodes in the neural network based on phonemes and the frequency of a word.
  • the approach may be used for words that are unseen in the general domain or for words that are seen in the general domain but are more frequent in the custom domain.
  • a neural network layer is selected for which new weights will be predicted.
  • the output layer such as output layers 206 , 706 , 806 , is selected.
  • the predicted weights will be the weights of the node, which are the weights applied to input values to the node prior to application of the activation function.
  • a weights predictor 1320, which is a machine learning model, is provided. The weights predictor 1320 is trained to predict neural network node weights for a particular word in the vocabulary based on the phonetic representation of the word and its frequency in the general domain.
  • the weights predictor 1320 is trained by iterating over all of the words of the vocabulary and inputting tensor 1310 comprising the concatenation of a one-hot encoding 1302 of the phonetic representation of the word and the frequency 1304 of the word in the general training set, which may be normalized such as by log normalization, into predictor 1320 .
  • the one-hot encoding has zeroes in all positions except for one location having a one representing the phonetic representation of the word.
  • the resulting sparse input vector has two non-zero values, the one-hot encoded location representing the phonetic representation and a value representing the frequency of the word in the general domain.
  • Based on the input vector 1310, the weights predictor 1320 generates output vector 1330 representing the weights for this word in the selected neural network layer.
  • the predicted weights are the weights for the output node for the word.
  • In one embodiment, the weights predictor 1320 is a linear regression model. When using linear regression, the predictor 1320 may be trained using a least squares fit. The target value for training examples is the neural network node weights in the general model. Generated values of the predictor 1320 may be compared to the true neural network node weights in the general model and the differences reduced using the least squares method. In one embodiment, the weights predictor 1320 is a neural network, which may have one or more layers. The weights predictor 1320 may be trained using backpropagation. Generated values of the predictor 1320 may be compared to the true neural network node weights in the general model, and the weights of the predictor 1320 may be adjusted by backpropagation and gradient descent. The weights predictor 1320 may also be another regression model, such as polynomial regression, logistic regression, or nonlinear regression.
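A least-squares version of the weights predictor can be sketched as follows, assuming NumPy is available; the toy vocabulary size, frequencies, and target weight vectors are illustrative assumptions:

```python
import numpy as np

# Toy vocabulary: 4 words, each with a one-hot phonetic encoding (4 dims)
# concatenated with a log-normalized frequency -> 5-dim input per word.
phonetic_onehot = np.eye(4)
log_freq = np.log1p([1000.0, 500.0, 20.0, 5.0]).reshape(-1, 1)
X = np.hstack([phonetic_onehot, log_freq])        # shape (4 words, 5 features)

# Targets: each word's weight vector for its node in the general model's
# selected layer (here, 2 weights per node, chosen arbitrarily).
Y = np.array([[0.2, -0.1], [0.4, 0.3], [-0.2, 0.5], [0.1, 0.1]])

# Train the predictor with a least squares fit over the vocabulary.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Predict weights for a word given its phonetics and a custom-domain
# frequency; the output initializes or replaces that word's node weights.
new_input = np.hstack([phonetic_onehot[2], np.log1p(800.0)])
predicted_weights = new_input @ W
```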
  • a training set is provided for a custom domain.
  • the training set comprises audio files and corresponding text transcripts. Frequent words in the custom dataset that are unseen or have low frequencies in the general training set are identified. In other embodiments, no training set of custom data is provided, but a list of frequent words and their frequencies is provided for the custom domain. For each of the frequent words that are unseen or have low frequencies in the general model, a set of weights is predicted. A one-hot encoding is created for the phonetic representation of the word, and the frequency of the word in the custom domain, optionally with normalization such as log normalization, is concatenated to the one-hot encoding. The resulting vector is input into the weights predictor 1320. The output vector provides the predicted weights.
  • the predicted weights are used to replace the weights of the corresponding layer of the neural network in order to customize the neural network for the custom domain. If a word was unseen in the general training set, then a new node is added to the output layer and the weights of the node are initialized to be the predicted weights. In some embodiments, customized weights are predicted for all words in the vocabulary and not just words that occur with high frequency.
  • the neural network may be further trained on training examples that come from the custom domain.
  • the input tensor 1310 to weights predictor 1320 also includes bigram information.
  • the bigram information characterizes information about words frequently occurring immediately adjacent to the left or right of the word.
  • the bigram information may be a vector with one entry per word of the vocabulary and the value at each location represents the probability that the word appears adjacent to the current word.
  • the bigram vector may be concatenated to input tensor 1310 .
  • the weights predictor 1320 may be trained by computing the bigram information in the general training set for each word of the vocabulary, concatenating that to the input tensors 1310 for each word, and training on all of the words of the vocabulary as described above.
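Building the bigram feature described above can be sketched as follows; the helper name, vocabulary, and counts are illustrative:

```python
def bigram_vector(word, bigram_counts, vocab):
    """Build a vector with one entry per vocabulary word, where each
    value is the probability that that word appears adjacent to the
    current word, based on adjacency (co-occurrence) counts."""
    total = sum(bigram_counts.get((word, w), 0) for w in vocab)
    total = total or 1                        # avoid division by zero
    return [bigram_counts.get((word, w), 0) / total for w in vocab]

vocab = ["acme", "meeting", "call"]
counts = {("acme", "call"): 3, ("acme", "meeting"): 1}
vec = bigram_vector("acme", counts, vocab)    # [0.0, 0.25, 0.75]
# This vector would then be concatenated to input tensor 1310.
```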
  • bigram information may be collected based on the rate of co-occurrence as adjacent words in the custom domain, which may either be provided or be computed from a custom training set.
  • the bigram information is attached to the input tensor 1310 during inference.
  • the predicted output weights are used in the same way as described above.
  • the technique of predicting neural network node weights may be used in a variety of machine learning applications and is not limited to use for the application of speech recognition.
  • FIG. 14 illustrates an example unsupervised learning approach for customizing a neural network for a custom domain by using a customization layer, such as customization layer 207 .
  • Customization layer 207 may change the probability that words are produced according to the word frequencies of the custom domain.
  • the concept of prior probability, also called a prior, refers to the probability of an occurrence before any observations are made. Statistically, the prior probability should be taken into account in the probabilities of words generated by the neural network.
  • frequent words in the custom dataset that are unseen or have low frequencies in the general training set are identified.
  • no training set of custom data is provided, but a list of frequent words and their frequencies is provided for the custom domain.
  • customization is performed as described below. In other embodiments, customization is performed for all words in the vocabulary regardless of whether they are frequently occurring or not.
  • an output layer 1410 is provided that outputs the probability that the input corresponds to the associated word represented by the output node.
  • the probabilities are adjusted by dividing by the frequency of the word in the general training set and multiplying by the frequency of the word in the custom training set. The resulting values are used as the new word probabilities, and the word with the highest probability after customization is selected as the output of the neural network.
  • the effect of the customization is, roughly, to remove the prior for the word from the general domain and replace it with the prior for the word from the custom domain.
  • the frequency of words in the general training set may be tracked and stored as general training is performed. Words that were unseen in the general training set may be given a small non-zero frequency value to allow the division step to be performed.
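The prior-swapping adjustment can be sketched as follows; the word probabilities and frequencies are illustrative, and `eps` stands in for the small non-zero frequency given to words unseen in the general training set:

```python
def customize(word_probs, general_freq, custom_freq, eps=1e-6):
    """Swap the general-domain prior for the custom-domain prior:
    p_new = p * f_custom / f_general. Words unseen in the general
    training set get a small non-zero frequency so the division
    step can be performed."""
    adjusted = {
        w: p * custom_freq.get(w, 0.0) / general_freq.get(w, eps)
        for w, p in word_probs.items()
    }
    return max(adjusted, key=adjusted.get), adjusted

probs = {"meeting": 0.40, "acme": 0.35}        # raw output-layer probabilities
general = {"meeting": 0.010, "acme": 0.0001}   # word frequencies, general set
custom = {"meeting": 0.008, "acme": 0.0200}    # word frequencies, custom set
best, adjusted = customize(probs, general, custom)
# "acme" wins after customization: its custom-domain prior is far higher.
```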
  • the frequency of the words in the custom domain may be provided.
  • the frequency of words in the custom dataset may be generated by running a custom training set through the general model to obtain a transcription of the custom training set. The frequency of the word may then be determined by parsing the transcription.
  • customization is performed on a per-bigram basis instead of a per-word basis.
  • Bigrams may be formed by combining the current word with the preceding word or succeeding word.
  • the frequency of word bigrams in the general training set is tracked, and the frequency of word bigrams in the custom training set is also determined, using the methods described above.
  • Word probabilities are computed as normal in output layer 1410 .
  • the correct bigram is determined based on the combination of the current word with the preceding word or succeeding word as appropriate.
  • the word probability is then divided by the bigram frequency in the general training set and multiplied by the bigram frequency in the custom training set.
  • the technique of customizing a neural network by using a customization layer may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • FIG. 15 illustrates an example of dynamically training on a general training set to customize a neural network, such as end-to-end speech recognition system 200, end-to-end speech classification system 700, or end-to-end phoneme recognition system 800, for a custom domain.
  • General training set 1510 with audio examples from general domain 1210 and custom training set 1520 with audio examples from custom domain 1220 may be provided.
  • the general training set 1510 may have significantly more data and training samples than custom training set 1520 .
  • the general training set 1510 has tens of thousands, hundreds of thousands, or millions of hours of audio data and the custom training set 1520 has a few hours of audio data or less. Re-training a general model, trained on the general training set 1510 , with the custom training set 1520 may not be effective because there may not be enough custom training data to customize the model.
  • the general training set 1510 is a collection of training subsets 1511 - 1515 collected from various sources. Although five training subsets 1511 - 1515 are illustrated, many more may be used in practice.
  • the training subsets 1511 - 1515 may have different characteristics, such as source (e.g., public dataset, proprietary inhouse data, data acquired from third parties), types of speakers (e.g., mix of male and female, mix of accents), topics (e.g., news, sports, daily conversation), audio quality (e.g., phone conversations, in-person recordings, speaker phones), and so on.
  • Some training subsets 1511 - 1515 may be more similar to the examples in custom training set 1520 and others less similar.
  • Each training subset 1511 - 1515 may have a handle that identifies it.
  • the entire general training set 1510 is used for training the neural network.
  • this approach does not customize the neural network for the custom domain 1220 represented by the custom training set 1520 .
  • some of the custom training data may be set aside as a custom evaluation subset 1522 . Only some of the general training subsets 1511 - 1515 are used for training and the quality of the results are tested against the custom evaluation subset 1522 .
  • the set of general training subsets 1511 - 1515 used for training may be adjusted to improve performance on the custom evaluation subset 1522 .
  • a neural network is trained on general training set 1510 to create a general model and different mixes of general training subsets 1511 - 1515 are used for further training to customize the neural network.
  • An AB testing approach may be taken with different combinations of general training subsets 1511 - 1515 tried according to a selection algorithm, which may use randomization, and the quality of the results measured against the custom evaluation subset 1522 .
  • the combination of general training subsets 1511 - 1515 that provides the lowest word error rate (number of words misidentified) on the custom evaluation subset 1522 may be selected as the best combination to use for customization. That combination may be used for additional training of the neural network to customize it for the custom domain 1220 .
  • a fully dynamic method is used where the mix of general training subsets 1511 - 1515 to train on is never finalized because the mix can continue to change over time.
  • the combination of general training subsets is fully dynamic and is chosen in a way that balances exploration and exploitation on an ongoing basis. This third approach is described in more detail below.
  • a reinforcement learning algorithm is used to dynamically select general training subsets to train on for customization of a neural network.
  • the neural network is initially trained on the general training set 1510 to create a general model.
  • the custom training set 1520 may be divided into three pieces, a custom evaluation subset 1522 , a custom validation subset 1524 , and a custom training subset 1526 . Although the subsets are illustrated as roughly equal in size, they may have varying relative sizes.
  • the reinforcement learning system takes actions, which in this case are selections of a general training subset to train on for a number of training batches, and receives rewards for those actions, which are based on the word error rate on the custom evaluation subset 1522 .
  • a decreased word error rate is a positive reward, and an increased or unchanged word error rate may be a negative reward.
  • the reinforcement learning system may learn a policy for choosing general training subsets to train on in order to improve the word error rate on the custom evaluation subset 1522 and thereby customize the neural network for the custom domain 1220 .
  • the reinforcement learning system has an agent, actions, environment, state, state transition function, reward function, and policy.
  • the agent is the customization system that chooses the next general training subset to train on.
  • the actions are the choice of which general training subset to train on for the next iteration.
  • the environment is an environment that is affected by the agent's actions and comprises the state, state transition function, and reward function.
  • the state is the current neural network state, whose weights are determined by the prior training iterations.
  • the state may also include tracked information about the distribution of past rewards for each action (e.g., choice of general action subset) including the expected rewards for each action and tracked information about uncertainty associated with each action, such as how many times each action has been taken.
  • the state transition function is the function that defines the transition to a new state based on the selected action.
  • the state transition function may be implicitly defined by the act of training the neural network with the selected general training subset to obtain new weights for the neural network.
  • the reward function is a function determining reward values based on the change in word error rate in the custom evaluation subset 1522 after training with the selected general training subset.
  • the reward function outputs the percent change in word error rate as the reward.
  • the reward output by the reward function is a transformed value based on the percent change in word error rate.
  • the policy is a function for selecting the action to take, what general training subset to choose in the next iteration, based on the current state.
  • the reinforcement learning system trains the custom model iteratively. At each iteration, it selects a general training subset 1511 - 1515 to train on. The neural network is trained on the selected general training subset for a number of training batches, where the number of training batches may be configurable. After training, the neural network is tested on the custom evaluation subset 1522 . The word error rate in the custom evaluation set 1522 is measured and stored. The reinforcement learning system may update its policy based on the word error rate.
  • the reinforcement learning system selects the general training subset to train on at the next iteration based on, for example, the distribution of past rewards for each general training subset, expected rewards for each general training subset, uncertainty values associated with each general training subset, and/or the number of times each general training subset has already been trained on. In an embodiment, this process continues indefinitely to iteratively improve the neural network's performance in the custom domain 1220 .
  • the training policy of the reinforcement learning system may be continuously adjusted based on rewards and need not ever reach a “final” policy.
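The iterative loop described above (select a subset, train, measure word error rate, reward the policy) can be sketched as follows. This is a simplified illustration, not the patent's implementation: `policy`, `train_on`, and `evaluate_wer` are hypothetical stand-ins for the selection policy, the training step, and evaluation on the custom evaluation subset 1522.

```python
def customize(policy, subsets, train_on, evaluate_wer,
              batches_per_step=100, steps=1000):
    """Iteratively customize a general model: pick a general training subset,
    train on it, and reward the policy with the relative drop in word error
    rate (WER) measured on the custom evaluation subset."""
    prev_wer = evaluate_wer()
    for _ in range(steps):
        subset = policy.select(subsets)        # action: next subset to train on
        train_on(subset, batches_per_step)     # state transition: new weights
        wer = evaluate_wer()
        reward = (prev_wer - wer) / max(prev_wer, 1e-9)  # positive when WER drops
        policy.update(subset, reward)
        prev_wer = wer
```

Because the policy is updated after every iteration, the loop can run indefinitely without ever fixing a "final" mix of subsets.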
  • a multi-armed bandit algorithm, referred to as a bandit algorithm, is one example of a reinforcement learning system.
  • the multi-armed bandit algorithm provides a policy of which actions to take, where the actions provide differing rewards and the distribution of rewards for each action is not known.
  • the multi-armed bandit problem, addressed by the bandit algorithm, is deciding which action to take at each iteration to balance exploration, that is, learning which actions are best, with exploitation, that is, taking advantage of the best action to maximize the total rewards over time.
  • the multi-armed bandit problem takes its name from a hypothetical problem of choosing which of a set of slot machines to play, where the slot machines pay out at different, unknown rates.
  • a bandit algorithm may be used where the actions for the bandit algorithm are the choice of which general training subset to train on and the rewards for the bandit training algorithm are the change in word error rate on the custom evaluation set 1522 or a function based on that value.
  • the bandit algorithm iteratively chooses general training subsets to train on according to a policy that balances exploration and exploitation.
  • the bandit algorithm may run indefinitely and continuously and dynamically update its policy on an ongoing basis, never stopping at a “final” policy.
  • a bandit algorithm is used to iteratively select general training subsets to train on to customize a neural network for a custom domain 1220 .
  • the bandit algorithm has a scoring function, and the bandit algorithm's policy is to select the general training subset that has the highest score according to the scoring function.
  • the value of the scoring function may be based on the distribution of past rewards for each general training subset, expected rewards for each general training subset, uncertainty values associated with each general training subset, and/or the number of times each general training subset has already been trained on.
  • the value of the scoring function increases with the mean reward observed for the general training subset and decreases with the number of times the general training subset has been chosen.
  • an uncertainty value is stored for each general training subset and increases over time when the subset is not chosen.
  • the value of the scoring function may increase with increases in the uncertainty value of the general training subset.
  • Use of uncertainty values models the uncertainty produced by the non-stationary rewards of this bandit problem.
  • the distribution of rewards from the general training subsets is not fixed over time because the neural network weights are changing as it is trained and so the effect of each general training subset on the neural network will also change.
  • a bandit problem with non-stationary rewards may be referred to as a non-stationary bandit problem and a bandit algorithm configured for addressing a non-stationary bandit problem may be referred to as a non-stationary bandit algorithm.
  • the bandit algorithm selects a general training subset to train on by applying the scoring function to each subset and choosing the highest scoring one.
  • the neural network is trained on the selected general training subset for a number of training batches, where the number of training batches may be configurable.
  • the neural network is tested on the custom evaluation subset 1522 .
  • the word error rate in the custom evaluation set 1522 is measured and stored.
  • the word error rate corresponds to a reward, with reductions in word error rate corresponding to a positive reward and increases in word error rate corresponding to a negative reward, or penalty.
  • Stored information about the distribution of rewards and mean reward for this general training subset may be updated based on the observed word error rate.
  • a counter of the number of times the general training subset was trained on may be incremented.
  • An uncertainty value associated with the selected general training subset may be decreased, and the uncertainty values associated with all other general training subsets, which were not chosen, may be increased.
  • the next iteration then begins with the bandit algorithm selecting the next general training subset to train on. The process may continue indefinitely to iteratively improve the neural network's performance in the custom domain 1220 . No final “best” mix of general training subsets is chosen; rather, the bandit algorithm continues to select general training subsets based on information about the past rewards observed and its measures of uncertainty regarding each subset.
  • the bandit algorithm may be the upper confidence bound (UCB) algorithm, the UCB1 algorithm, the epsilon greedy algorithm, or other bandit algorithms.
  • the scoring function for the bandit algorithm may take the standard upper confidence bound form, score_i(t) = x̄_i + c·√(ln t / n_i), where x̄_i is the mean reward observed for general training subset i, n_i is the number of times subset i has been trained on, and c is a constant controlling the amount of exploration
  • i is the index or handle of the general training subset
  • t is the iteration number
  • the UCB1 algorithm uses the related scoring function score_i(t) = x̄_i + √(2 ln t / n_i)
  • the bandit algorithm may initially iterate through the general training subsets and train on each of them once, and then switch to choosing the general training subset through the scoring function.
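A sketch of such a bandit over training-subset handles, including the initial pass that trains on every subset once before switching to score-based selection. The class and parameter names here are illustrative; the scoring function is the standard UCB1 formula, which the embodiment names as one option.

```python
import math

class UCB1Bandit:
    """Select which general training subset to train on next, balancing
    exploration and exploitation with a UCB1-style upper confidence bound."""
    def __init__(self, handles, c=1.0):
        self.c = c
        self.counts = {h: 0 for h in handles}   # times each subset was trained on
        self.means = {h: 0.0 for h in handles}  # mean observed reward per subset
        self.t = 0

    def select(self):
        self.t += 1
        # Initially iterate through the subsets, training on each once.
        for handle, count in self.counts.items():
            if count == 0:
                return handle
        # Then pick the highest-scoring subset: mean reward plus an
        # uncertainty bonus that grows the longer a subset goes unchosen.
        def score(h):
            return self.means[h] + self.c * math.sqrt(
                2 * math.log(self.t) / self.counts[h])
        return max(self.counts, key=score)

    def update(self, handle, reward):
        self.counts[handle] += 1
        n = self.counts[handle]
        self.means[handle] += (reward - self.means[handle]) / n  # running mean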
  • a reinforcement learning system may be used to select general training subsets to train on to condition a neural network for a custom domain.
  • One reinforcement learning system is implemented with a bandit algorithm.
  • a portion of custom training set 1520 may be reserved as a custom training subset 1526 to further condition the neural network.
  • the neural network may be trained on the custom training subset 1526 in the usual manner, by inputting the values, comparing the outputs to ground-truth results, and adjusting the neural network node weights with backpropagation.
  • a custom validation subset 1524 may be used for validation to independently test the quality of the custom model after it has been customized using the reinforcement learning system or bandit algorithm and optional custom training subset 1526 . Validation may be performed by testing the performance of the neural network on the custom validation subset 1524 using word error rate or other measures.
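Word error rate, the evaluation measure used throughout, is conventionally computed with a word-level edit distance (substitutions, insertions, and deletions divided by the reference length); the patent's shorthand "number of words misidentified" corresponds to the substitution-dominated case. A minimal implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a word-level Levenshtein distance in O(len(hyp)) space."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # d[j]: distance between ref[:i] and hyp[:j]
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev_diag, d[j] = d[j], min(d[j] + 1,          # deletion
                                        d[j - 1] + 1,      # insertion
                                        prev_diag + cost)  # substitution/match
    return d[-1] / max(len(ref), 1)
```

For example, a hypothesis with one substituted word out of three reference words yields a WER of 1/3.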
  • reinforcement learning and/or bandit algorithms for selecting general training subsets to train on and customize for a custom domain, as described herein, may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • FIG. 16 illustrates an example training data augmentation and streaming system 1600 according to an embodiment.
  • it is valuable to augment existing training data by applying one or more augmentations to the data. Augmentations may also be referred to as “effects.”
  • the augmentations expand the dataset to provide more data with more variety and can increase the robustness of the learned model.
  • augmentations are difficult to perform because the number of different combinations of potential augmentations can be combinatorially large.
  • the training dataset itself may already be large and additionally storing all of the augmented versions of the dataset may not be feasible due to the large amount of memory it would occupy.
  • the training data augmentation and streaming system 1600 provides training data augmentation as a service through an Application Programming Interface (API).
  • the system 1600 provides a service that generates augmented training data just-in-time when it is requested by a training process.
  • training data store 1610 stores training data in the form of audio files or other data.
  • the training data store 1610 comprises one or more Redundant Array of Independent Disks (RAID) arrays, which provide fault tolerance.
  • Meta-data store 1620 stores meta-data about the training data sets. It may store information about the name and source of the training data sets and associate names to handles and locations in the training data store 1610 .
  • Computer servers 1640 , 1650 perform the processing necessary to train a machine learning model, such as the neural networks discussed herein.
  • Training processes 1644 , 1646 , 1648 , 1654 , 1656 , 1658 perform training of a neural network such as by accepting training data, performing forward propagation through a neural network, and performing backpropagation based on the results.
  • the training processes may be training the same single neural network in parallel or may be training different neural networks.
  • Training manager 1643 manages the training processes on server 1640 , and training manager 1653 manages the training processes on server 1650 .
  • Training data augmentation system 1642 provides training data augmentation service to the training processes 1644 , 1646 , and 1648 .
  • the training processes 1644 , 1646 , and 1648 communicate with the training data augmentation system 1642 through an API.
  • the API is implemented with UNIX sockets.
  • Training data augmentation system 1652 provides training data augmentation service to the training processes 1654 , 1656 , and 1658 .
  • the connection between servers 1640 , 1650 and the training data store 1610 and meta-data store 1620 is implemented over the Network File System (NFS).
  • Training data augmentation system 1642 waits for a training process 1644 to connect to it using an API call.
  • the training process 1644 connects to the training data augmentation system 1642 , and training process 1644 transmits via an API call an indication of the training dataset that it wants to train on and which augmentations it desires to be applied.
  • the indication of the training dataset may be provided in the form of a handle.
  • the augmentations provided may be reverb, with a selection of kernels; noise from varying noise profiles; background tracks, such as for emulation of background speaking; pitch shifting; tempo shifting; and compression artifacts for any of one or more compression algorithms.
  • the training augmentation system 1642 accesses the meta-data store using the provided handle to identify the location of the requested training data in the training data store 1610 .
  • Training augmentation system 1642 then accesses the training data store 1610 at the identified location to download the requested training data through a streaming process. Streaming provides the data in a continuous flow and allows the data to be processed by the training augmentation system 1642 even before an entire file is downloaded. As portions of the training data are downloaded from the training data store 1610 , the training augmentation system 1642 buffers it in the memory of the server 1640 .
  • Training data augmentation system 1642 monitors the streaming download and determines that training may begin once the amount of data downloaded exceeds a threshold.
  • Training may begin before the entire training dataset is downloaded, by training using the buffered portions.
  • the training data augmentation system 1642 applies the requested augmentations to the buffered data. It sends the augmented training data as a stream to the training process 1644 .
  • the training data augmentation system 1642 continues to stream additional training data from the training data store 1610 . As this data is buffered on server 1640 , training data augmentation system 1642 applies the requested augmentations to the data and streams it to the training process 1644 .
  • the training data augmentation system 1642 receiving streaming training data from the training data store 1610 , applying augmentations to other buffered training data at the training data augmentation system 1642 , and transmitting streaming augmented training data to the training process 1644 may occur concurrently and in parallel.
  • after it has been used for training, the augmented stream of data is deleted.
  • portions of the augmented stream of training data are deleted as soon as the training process 1644 completes training on the portion, and even when streaming of the remainder of the augmented training data from the same training dataset to the training process 1644 continues.
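The buffer-then-stream behavior above can be sketched as a generator pipeline. This is a simplified single-threaded illustration (the embodiment downloads, augments, and streams concurrently); `fetch_chunks` and the augmentation callables are hypothetical stand-ins for the streaming download and effects such as reverb or noise.

```python
def stream_augmented(fetch_chunks, augmentations, start_threshold=4):
    """Buffer streamed training data and yield augmented portions
    just-in-time; yielded portions are consumed and then discarded."""
    buffer, started = [], False
    for chunk in fetch_chunks():            # streaming download from the store
        buffer.append(chunk)
        started = started or len(buffer) >= start_threshold
        while started and buffer:           # training begins before download ends
            portion = buffer.pop(0)
            for augment in augmentations:   # e.g. reverb, noise, pitch shift
                portion = augment(portion)
            yield portion
    while buffer:                           # flush if the stream ended early
        portion = buffer.pop(0)
        for augment in augmentations:
            portion = augment(portion)
        yield portion
```

Because augmented portions are generated on demand and never stored, the combinatorially large space of augmented datasets never has to be materialized.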
  • the buffered, un-augmented training dataset downloaded from the training data store 1610 to server 1640 may be stored temporarily or permanently on server 1640 to provide caching.
  • the training data augmentation system 1642 may check the cache to see if the training dataset is already buffered in local memory of the server 1640 . If the training dataset is already present, the training data augmentation system may use the cached version of the training dataset, instead of fetching the training dataset from the training data store 1610 . If the training dataset is not in the cache, then the training data augmentation system 1642 may initiate a fetch of the training dataset from the training data store 1610 .
  • training datasets are stored as audio files.
  • Training data augmentation system 1642 may optionally perform preprocessing on the training data before applying augmentations.
  • training data augmentation system 1642 performs the functionality of front-end module 201 , 701 , or 801 .
  • the training data augmentation system 1642 decompresses the audio files and performs feature extraction to generate features.
  • the training data augmentation system 1642 may provide the feature data and the corresponding text transcripts for the training audio files to the training processes.
  • the training processes may access the training data augmentation system 1642 through the training manager 1643 .
  • Training data augmentation and streaming system 1600 may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • FIG. 17 illustrates example process 1700 for massively parallelizing the inference processing using neural networks, such as end-to-end speech recognition system 200 , end-to-end speech classification system 700 , or end-to-end phoneme recognition system 800 .
  • Traditional ASR systems do not parallelize well, which may lead to performance difficulties in production systems with many requests.
  • the Hidden Markov Models and Gaussian Mixture Models coupled to language models, as used in traditional ASR, are typically not easy to parallelize.
  • neural networks are well-suited to parallelization, leading to significant advantages for end-to-end neural network systems.
  • a client process submits an audio file 1710 for transcription.
  • This inference task may be transmitted from the client process over a network to a server hosting the end-to-end speech recognition system 200 .
  • a server process identifies locations, which may be specified by timestamps, where the audio file can be split.
  • the server process identifies splitting locations by identifying low-energy points in the audio file, such as locations of relative silence.
  • the low-energy points are determined by applying a convolutional filter.
  • a neural network is trained to learn a convolutional filter that identifies desirable locations in the audio file to split at.
  • the neural network may be trained by providing training examples of audio files and ground-truth timestamps where the audio files were split.
  • the neural network may learn a convolutional filter for determining splitting locations through backpropagation.
  • the split portions of the audio file may be approximately 7-10 seconds in length.
  • the audio file 1710 is split into portions 1711 , 1712 , 1713 .
  • the portions may be referred to as chunks. Although three chunks are illustrated, the audio file 1710 may be split into more or fewer chunks.
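One simple, non-learned way to find such low-energy split points is to scan each 7-10 second span for its quietest window; this is a sketch under that assumption (the embodiment may instead apply a learned convolutional filter), with `samples` as raw audio amplitudes and `rate` as the sample rate.

```python
def split_points(samples, rate, min_len=7.0, max_len=10.0, window=1024):
    """Choose split points at the lowest-energy (quietest) window falling
    between min_len and max_len seconds after the previous split."""
    def energy(start):
        return sum(s * s for s in samples[start:start + window])
    points, start = [], 0
    while start + int(max_len * rate) < len(samples):
        lo = start + int(min_len * rate)
        hi = start + int(max_len * rate)
        best = min(range(lo, hi, window), key=energy)  # quietest candidate
        points.append(best)
        start = best
    return points
```

Splitting at relative silence reduces the chance of cutting a word in half, so the per-chunk transcriptions can be concatenated cleanly later.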
  • the server process applies an index to each chunk to preserve an indication of their order so that the chunks may be reassembled after inference.
  • the index stored is a timestamp of the temporal location of the chunk in the audio file, such as a starting timestamp, ending timestamp, or both.
  • the chunks 1711 , 1712 , 1713 are routed to a scheduler 1720 , which assigns each chunk to a GPU for performing the inference to determine the transcription.
  • the scheduler 1720 may dynamically assign chunks to GPUs based on characteristics of the GPUs and the chunks.
  • the scheduler 1720 may assign chunks based on how busy GPUs are, the size of the GPU's queue of waiting tasks, the processing power of the GPUs, the size of the chunks, and other characteristics.
  • GPUs perform inference processes 1732 , 1742 , 1752 for end-to-end speech recognition, end-to-end speech classification, end-to-end phoneme recognition, or other inference tasks.
  • Each GPU maintains a queue, 1731 , 1741 , 1751 of waiting jobs.
  • a scheduling protocol determines when each GPU begins processing the chunks in its queue.
  • the central scheduler 1720 performs this task for all of the GPUs.
  • the GPUs perform their inference tasks in parallel to each other, thereby allowing massive speedups by converting a single inference task into a set of parallel inference tasks.
  • the scheduling protocol for determining when the GPU begins processing a batch of tasks in its queue is dynamic.
  • the GPU begins processing a batch when the batch in the queue reaches a target batch size.
  • the GPU compares the target batch size with the number of tasks in its queue, or their aggregate size in memory, to determine when to begin processing.
  • the target batch size starts at the maximum size that fits in the GPU memory.
  • the scheduling protocol also maintains a time out, and the GPU begins processing the batch in its queue if the time out is reached, even if the target batch size is not met.
  • the scheduling protocol sets the target batch size to the number of tasks in the queue. However, if no tasks are left in the queue, then the scheduling protocol sets the target batch size to the maximum size that fits in the GPU memory.
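The dynamic batching protocol above might look like the following sketch, where `queue` is a plain list standing in for a GPU's task queue and the names are illustrative: the target batch size tracks the number of waiting tasks, capped by GPU memory capacity, defaulting to full capacity when the queue is empty, and a timeout forces processing of whatever is available.

```python
import time

def next_batch(queue, gpu_capacity, timeout_s=0.05):
    """Dynamic batching: take a batch once the queue holds the target batch
    size, or once the timeout is reached, whichever comes first."""
    # Target the number of waiting tasks, capped by GPU memory capacity;
    # an empty queue defaults the target to the maximum that fits.
    target = min(len(queue), gpu_capacity) or gpu_capacity
    deadline = time.monotonic() + timeout_s
    while len(queue) < target and time.monotonic() < deadline:
        time.sleep(0.001)  # the scheduler may still be adding tasks
        target = min(len(queue), gpu_capacity) or gpu_capacity
    batch, queue[:] = queue[:target], queue[target:]
    return batch
```

The timeout bounds the latency added by waiting for a full batch, while batching as many chunks as fit keeps GPU utilization high.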
  • the inference processes 1732 , 1742 , 1752 may produce inference results, such as transcriptions of the audio chunks 1711 , 1712 , 1713 .
  • the inference results and chunks may be provided to recombination process 1760 .
  • the transcribed text is stitched back together, such as by concatenation, into a single output based on their indices, which may be timestamps.
  • the recombination process 1760 orders the transcribed text in the correct temporal arrangement based on the value of the indices of their corresponding audio chunks in order to produce final output 1762 , which is a transcription of the entire audio input 1710 .
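The recombination step can be as simple as sorting the per-chunk results by their stored index and concatenating; a minimal sketch, with the `start` and `text` field names chosen for illustration:

```python
def recombine(results):
    """Reassemble per-chunk transcriptions into one transcript, ordered by
    the timestamp index applied to each chunk before scheduling."""
    ordered = sorted(results, key=lambda r: r["start"])
    return " ".join(r["text"] for r in ordered)
```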
  • the technique of chunking an input file and dynamically scheduling the chunks for processing by GPUs may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • a trained neural network such as disclosed above may be used for purposes in addition to speech recognition.
  • the internal state of a trained neural network may be used for characterizing speech audio or deriving an internal state representation of the speech audio.
  • an internal state representation is determined based on the internal state of a trained speech recognition neural network while transcribing a speech audio sample.
  • the internal state representation is a concise representation of the internal state of the trained neural network while processing the audio input.
  • the total internal state of a trained neural network may be very large—on the order of hundreds of megabytes of data to describe the entire internal state.
  • the internal state representation obtained by sampling or compressing the total internal state may be significantly smaller, on the order of hundreds of bytes of data.
  • an internal state representation may be 256 bytes derived from an internal state of approximately 300 MB.
  • the internal state representation may be recorded at the time of initial transcription by a trained neural network and stored alongside the original audio.
  • the internal state representations may be associated with the particular frames or timestamps of the original audio that produced them.
  • various discrimination tasks or search tasks may be performed on the original audio by way of the stored internal state representations without needing to run the original audio through a full end-to-end transcription or classification neural network model a second time. That is, many applications in audio classification or search may be performed on the stored audio without processing the original audio with a potentially computationally-intensive speech recognition or classification neural network a second time.
  • the work performed by the initial speech recognition may be leveraged by any future processing of the audio that would otherwise potentially require a computationally intensive process.
  • a classification task may be to determine when speakers in an audio segment change, sometimes referred to as speaker diarization.
  • speech classification may be, for example, to determine a mood, sentiment, accent, or any other quality or feature of the speech audio.
  • a search task may be, for example, to search a corpus of speech audio based on an input segment of speech audio or an input text string.
  • One search task may be, for example, to find segments of audio in the corpus that discuss similar topics as the input speech segment.
  • Another search task may be, for example, to find segments of audio in the corpus that are spoken by the same speaker as the input speech segment, or for speakers with similar speech patterns as the input.
  • some embodiments may characterize speech audio according to the acoustic content of the speech audio or the semantic content of the speech audio. For example, an embodiment may relate to deriving a representation of speech audio that is related to the acoustic content of the speech audio. For example, segments of audio with the same person speaking would have similar representations, while segments of audio with a second person would have a distinct representation.
  • This acoustic representation may be used to, for example, search a corpus of acoustic audio data for particular sounds or acoustic signatures. An application of searching for sounds or acoustic signatures is speaker diarization, for example.
  • a representation of speech audio may be designed to be primarily related to the conceptual content of the speech audio, or the semantic meaning contained therein. For example, segments of speech audio of different people talking about the same subject matter would have similar representations.
  • a mixture of acoustic and semantic meaning may be contained in a representation.
  • Various portions of the representation may be more or less responsive to either acoustic or semantic information from the original speech audio.
  • Such a combined representation may be used in both semantic and acoustic discrimination tasks.
  • a particular segment or slice of a neural network may be selected and summarized or compressed to produce the internal state representation.
  • a portion of a neural network is selected, such as a selection of internal states such as a whole layer, certain portions of a layer, several layers, or portions of several layers. Given this portion of the neural network, a set of low-precision features is derived.
  • One method of deriving a low-precision feature is to quantize the output of an activation function of a node of a neural network.
  • the output of the activation function at each node of the portion may be simplified into a binary representation. That is, any output of the node above a threshold is treated as a first binary value, and any output of the node below the threshold is treated as a second binary value.
  • This low-precision representation may be more resilient to minor changes in the input because similar values may quantize to the same value.
  • Other quantization levels may similarly be used, providing a tradeoff between resultant size of the internal state representation and resolution, among other factors. For example, some embodiments may quantize activation functions into four or eight states.
  • Quantization may be performed by selecting n − 1 thresholds to create a set of n bins where n is the number of quantized states.
  • the real number valued output of the node is binned based on which pair of thresholds the real number valued output falls between and a numerical index of the bin may be used as the quantized value.
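The n-state quantization described above can be sketched with Python's `bisect`, which returns the index of the bin a value falls into; the threshold values below are illustrative. With a single threshold such as 0.5 this reduces to the binary case.

```python
import bisect

def quantize(activations, thresholds):
    """Bin each real-valued node output into one of n states using n - 1
    sorted thresholds; the bin index is the low-precision feature."""
    return [bisect.bisect_right(thresholds, a) for a in activations]
```

For example, `quantize([0.1, 0.7], [0.5])` returns `[0, 1]`, and three thresholds produce a four-state quantization.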
  • FIG. 18 illustrates an example of the process of generating low-precision features.
  • Neural network 1800 is provided and a subset of nodes of the neural network 1800 are selected for generating the features. As shown, nodes may be in the same layer or different layers of a neural network.
  • the outputs are real number values (such as floating point or double precision), but, in an embodiment, are quantized to binary numbers by use of a threshold, such as 0.5.
  • the quantized values are stored in tensor 1810 , where each node corresponds to a fixed location in the tensor 1810 .
  • the tensor 1810 provides a compressed representation of internal state of the neural network 1800 during the inference process.
  • a whole layer of the neural network may be selected for the internal state representation.
  • an internal state representation may be determined from a fully-connected stack that produces a word embedding of the input speech audio.
  • the internal state representation may be determined from second fully-connected stack 205 of the example neural network discussed above. This internal state may provide features that relate to semantic meaning of the speech audio, for example.
  • an internal state representation may be generated from a CNN layer.
  • Such an internal state may contain features related to the acoustic input or acoustic signature of the input speech audio, for example.
  • an internal state representation may be generated from CNN stack 202 of the example neural network discussed above.
  • a low-precision feature may be created from the internal state of a CNN layer, or from each non-linearity at the output of a CNN layer.
  • an internal state representation may be derived from a fully-connected layer that accepts the inputs of a CNN layer, such as first fully-connected layer 203 in the example embodiment discussed above.
  • a mixture of nodes from disparate portions of an internal state of a neural network may be selected for the internal state representation. These selections may include portions of the network from any layer, such that they encompass a range of information contained in the network.
  • an internal state representation may be derived from some nodes from a CNN layer, other nodes from an RNN layer, and other nodes from one or more fully-connected layers, such that the resultant representation contains information from each of these various layers.
  • a selection of which nodes to include in the internal state representation may be produced through a pruning process. For example, a portion of the internal state of a neural network may be set to a null value, and the effect on the output observed. If the output experiences a large change, the portion that was omitted may be of interest for inclusion in an internal state representation.
  • This process may be automated and iterative such that a pruning algorithm may determine an optimal subset of nodes for inclusion in an internal state representation by observing and learning their effect on the change of the output.
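The ablation-style pruning loop described above might look like the following sketch, in which a random linear map stands in for the trained network and contiguous blocks of nodes are nulled one at a time to measure their effect on the output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained network's mapping from internal state to output;
# the weights here are random placeholders, not a real model.
weights = rng.normal(size=(64, 8))
state = rng.normal(size=64)

def network_output(s):
    return weights.T @ s

baseline = network_output(state)

# Ablate each contiguous block of 8 nodes in turn and record how much the
# output moves; blocks with a large effect are candidates for inclusion
# in the internal state representation.
effects = []
for start in range(0, 64, 8):
    pruned = state.copy()
    pruned[start:start + 8] = 0.0  # set the block to a null value
    delta = np.linalg.norm(network_output(pruned) - baseline)
    effects.append((delta, start))

# Keep the most influential blocks.
effects.sort(reverse=True)
selected_blocks = [start for _, start in effects[:4]]
```

An automated pruning algorithm would iterate this procedure, refining the selected subset as it observes output changes.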
  • an approach based on principal component analysis may be used to determine an optimal subset of neural network nodes for inclusion in an internal state representation.
  • the architecture of the neural network may be designed to produce an internal state representation.
  • a neural network may include a fully-connected layer of a comparatively low dimension for the purposes of deriving an internal state representation. This layer may be referred to as a bottleneck feature layer.
  • the bottleneck feature layer is trained in the initial training of the speech recognition neural network to contain all information necessary to produce the output because all information must necessarily flow through the bottleneck layer. In this way, the initial training of the speech recognition neural network model also trains an optimal layer from which a reduced precision internal state representation may be derived.
  • a separate branch or branches of the neural network may be appended to or branched from the speech recognition neural network model and initially trained in parallel with the speech recognition portion. That is, additional outputs are added to the neural network with additional loss functions that train the network to produce a separate output that may be used to produce the internal state representation.
  • This technique is similar to the above bottleneck feature technique, but the output may be separately trained from the speech recognition output. Then, the neural network may produce two sets of outputs including a first output that produces speech transcriptions and a second output that produces a representation of the input that may be used for future processing.
  • this additional network may be an auto-encoder network that is trained to produce an output similar to the input. That is, the auto-encoder is trained alongside the speech recognition neural network with the state of the speech recognition network as an input and the input to the speech recognition network as the training data. Then, the auto-encoder network will learn an output representation most similar to the input. This type of auto-encoder network may then be used to, for example, generate an approximation of the original acoustic input to the speech recognition network based on the low-precision internal state representation.
  • an encoding network may be trained to encode a particular layer or layers of the original speech recognition network, such as a word embedding layer or an audio features layer.
  • a combination of such encoders may be jointly used to produce the internal state representation.
  • the internal state representation may be used to classify audio.
  • a corpus of audio may be transcribed by an end-to-end speech recognition neural network such as described above.
  • an internal state representation may be generated and recorded along with the audio and the corresponding transcription.
  • the internal state representation may contain more information than the corresponding text transcription, but less than the entire internal state of the neural network at the time of transcription.
  • This internal state representation may then be used later to perform novel classification on the original audio data while leveraging the work done previously during transcription.
  • the internal state representation may be used to determine speaker changes in audio, also known as speaker diarization.
  • a corpus of audio has been transcribed with an end-to-end neural network.
  • the original audio, the transcription produced by the end-to-end neural network, and a stream of internal state representations created during transcription are stored together.
  • a second machine learning model may be trained based on a portion of the corpus that has been manually classified.
  • the manually classified portion of the corpus is used as training data for the second machine learning model.
  • the manually classified training data may indicate when speakers change in the audio.
  • the indications may be an indication of an identity, or label, of a specific speaker that is talking or just an indication that a speaker change occurred.
  • the second machine learning model may then be trained based on the internal state representation stream and the training speaker diarization indications.
  • the internal state representation stream is provided as input to the second machine learning model and the training speaker diarization indications are provided as the target output.
  • the second machine learning model may then learn to recognize speaker diarization based on the internal state representation stream. It learns a model for identifying internal state representations corresponding to a speaker change, or a certain speaker identity, and identifying internal state representations not corresponding to a speaker change, or other speaker identities.
  • the rest of the corpus of transcribed audio which lack manual classifications, may then be classified by the second machine learning model based on the previously stored internal state representation stream.
  • the internal state representations corresponding to the non-manually classified audio are input to the second machine learning model.
  • Predicted classifications of the internal state representations are output by the second machine learning model based on the input internal state representations.
  • the predicted classifications may then be matched to the corresponding audio portions or transcription portions associated with those input internal state representations. In this way, the previously computed internal state representation stream may be leveraged by later processing.
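As a rough illustration of training a second model on a stored internal state stream, the following sketch uses synthetic binary vectors in place of real internal state representations and a simple nearest-centroid rule in place of a trained neural network; only the shape of the pipeline is meant to be accurate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins for stored internal state representations, each
# paired with a manual label (1 = a speaker change occurred here).
states = rng.integers(0, 2, size=(200, 64)).astype(float)
labels = np.arange(200) % 2  # placeholder manual classifications

# The manually classified portion of the corpus is the training data.
train_x, train_y = states[:150], labels[:150]
test_x = states[150:]        # the rest lacks manual classifications

# A minimal "second model": nearest-centroid classification over the
# internal state vectors.
centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in (0, 1)])

def classify(state_vec):
    dists = np.linalg.norm(centroids - state_vec, axis=1)
    return int(np.argmin(dists))

# Predicted classifications for the unlabeled internal state stream,
# which can then be matched back to the corresponding audio portions.
predictions = [classify(v) for v in test_x]
```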
  • classification tasks may be performed on the internal state representation. For example, some embodiments may classify the audio into classes such as gender (e.g., male/female), emotion or sentiment (e.g., angry, sad, happy, etc.), speaker identification (i.e., which user is speaking), speaker age, speaker stress or strain, or other such classifications. Because the internal state representation already contains a complex representation of the speech audio, each of these tasks may be performed much more efficiently based on the internal state representation as compared to running a new neural network on the original speech audio.
  • the internal state representation stream may be used for search tasks. For example, rather than searching on transcribed text, a search of a speech audio file may be performed on the internal state representations associated with the speech audio. Because the internal state representations contain more information than text alone, including acoustic and semantic, a search may find more relevant audio segments than one based on only the output text representation of the speech audio.
  • a large corpus of speech audio has been transcribed by a speech recognition neural network such as described above, and an internal state representation derived at the time of the original transcription stored along with the speech audio.
  • a second neural network may then be trained to produce an internal state representation based on a text input. That is, the network accepts as input the text of a word or phrase and produces an internal state representation such as would have been produced by the speech recognition neural network if the word or phrase was present in audio provided to the speech recognition neural network.
  • This second neural network may be trained on the existing data, that is, the corpus of speech audio containing both computed internal state representations and associated text outputs.
  • the second neural network is provided with training examples, where the training examples include an input comprising a text word or phrase and a target output comprising an internal state representation created by the speech recognition neural network when an audio recording of the word or phrase was presented.
  • the second neural network learns a model for producing synthetic internal state representations based on text words or phrases.
  • an input text word or phrase is presented and input to the second neural network, and an internal state representation is produced by the second neural network for the input word or phrase.
  • This produced state representation is a close approximation of what the speech recognition network would have produced if it had been provided audio input that produced the text that was input to the second network.
  • This state representation may then be used as a search input vector.
  • the search input vector is compared to those internal state representation vectors stored in the corpus for similarity to find matches and search results.
  • any method of comparing the representations which may be expressed as vectors, may be used.
  • a dot product vector similarity or cosine similarity may be used to determine a relationship between the search input and the stored internal state representations.
  • Dot product or cosine similarity are examples of vector or tensor distance metrics to measure similarity.
  • the audio associated with the stored internal state representations with the closest matches is the result of the search. In some embodiments, a single search result is returned corresponding to the closest match, and, in other embodiments, a plurality of results are returned.
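For example, a cosine-similarity search over stored internal state vectors might look like the following sketch; the corpus and query vectors are invented, and in practice each corpus vector would be the representation stored alongside an audio segment at transcription time:

```python
import numpy as np

# Stored internal state vectors for a hypothetical transcribed corpus
# (one vector per audio segment; values are illustrative).
corpus = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
])

# Synthetic search input vector, e.g. produced by a second network
# from a text query as described below.
query = np.array([1.0, 0.0, 1.0, 0.0])

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(query, v) for v in corpus]
best_match = int(np.argmax(scores))  # index of the closest audio segment
```

The audio segment associated with `best_match` (or the top few scores) would be returned as the search result.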
  • a classifier may be used to determine similarity between search input vectors and stored internal state vectors. That is, rather than using a dot product or cosine similarity, a measure of similarity may be determined by training a classifier network on search results.
  • This classifier may be a neural network or may be any other classifier such as a support vector machine or a Bayesian network, for example.
  • the classifier may be trained on ground-truth labelled search results, for example. It may accept training examples comprising sets of two internal state vectors as inputs and a target output comprising an indication of whether the internal state vectors are similar or not.
  • the target output is binary, and, in other embodiments, the target output is a real valued measure of similarity.
  • the classifier may be used to identify the closest matches to a search input vector.
  • the search input vector is compared to one or more of the stored internal state vectors by using the classifier to output a similarity value.
  • the audio associated with the most similar or set of most similar stored internal state representations is returned as the result of the search.
  • a blended similarity model may be used that combines mathematical similarity between internal state representations and classifier-based similarity.
  • the technique of generating internal state representations of a neural network based on sampling the outputs of neural network nodes for use in classification, search, or other applications, as described above, may be used in a variety of machine learning applications and is not limited to use for the application of speech recognition.

Abstract

Systems and methods are disclosed for customizing a neural network for a custom dataset, when the neural network has been trained on data from a general dataset. The neural network may comprise an output layer including one or more nodes corresponding to candidate outputs. The values of the nodes in the output layer may correspond to a probability that the candidate output is the correct output for an input. The values of the nodes in the output layer may be adjusted for higher performance when the neural network is used to process data from a custom dataset.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 16/108,107, filed Aug. 22, 2018, which claims the benefit of U.S. Provisional Patent Application No. 62/703,892, filed Jul. 27, 2018, which are each hereby incorporated by reference in their entirety.
  • BACKGROUND
  • Neural networks are machine learning models that may be trained to produce outputs based on an input. Neural networks may include an output layer where one or more nodes of an output layer correspond to candidate outputs, and the value of output nodes is a probability that the candidate output is the correct output for the input.
  • Neural networks are often trained on general training sets. However, training the neural network on a general training set may not produce high quality outputs when the neural network is used on a more specific dataset. Therefore, it would be desirable to provide a mechanism for customizing a neural network that has been trained on a general training set for a specific dataset.
  • SUMMARY
  • Systems and methods are disclosed for customizing the output of a neural network for a custom dataset, when the neural network has been trained on a general training set.
  • One embodiment comprises providing a trained neural network, where the trained neural network includes a plurality of layers each having a plurality of nodes. The trained neural network may include an output layer with nodes corresponding to candidate outputs, wherein the values of the nodes in the output layer correspond to a probability of a candidate output being a correct output corresponding to an input. During inference using the trained neural network, the values of a plurality of nodes in the output layer may be adjusted for a custom model, wherein the custom model is different from a general training set used to generate the trained neural network.
  • One embodiment comprises providing a trained speech recognition neural network, where the trained speech recognition neural network includes a plurality of layers each having a plurality of nodes. The trained speech recognition neural network may include an output layer with nodes corresponding to words of a vocabulary, wherein the values of the nodes in the output layer correspond to a probability of a word in the vocabulary being a correct transcription of an input. For a plurality of words in the vocabulary, the frequency of occurrence of the word in a general training set and the frequency of occurrence of the word in a custom dataset are determined. During inference using the trained speech recognition neural network, for each word in the plurality of words, the probability output by the output node for the word is multiplied by the frequency of occurrence of the word in the custom dataset, and the resulting product is divided by the frequency of occurrence of the word in the general training set to obtain a custom model probability.
  • In an embodiment, a customization layer is provided in a neural network. The customization layer may customize the output of the neural network for a custom vocabulary by adjusting the probabilities of each output of the neural network based on characteristics of the custom vocabulary and a general vocabulary. The customization may be performed based on the observed frequency of each output in the custom vocabulary as compared to the observed frequency of each output in the general vocabulary.
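The frequency-based adjustment described above can be sketched as follows. The word frequencies and raw output probabilities here are invented for illustration, imagining a hypothetical medical-domain custom dataset in which "ecg" is far more common than in general speech:

```python
# Observed word frequencies in a general training set and a custom
# (here, hypothetical medical-domain) dataset. Values are illustrative.
general_freq = {"the": 0.05, "ecg": 0.00001, "cat": 0.001}
custom_freq  = {"the": 0.05, "ecg": 0.002,   "cat": 0.0001}

# Raw probabilities from the output nodes of the trained network.
network_probs = {"the": 0.4, "ecg": 0.1, "cat": 0.3}

# Multiply each probability by the word's custom-dataset frequency,
# divide by its general-training-set frequency, then renormalize so the
# adjusted values remain a probability distribution.
adjusted = {w: p * custom_freq[w] / general_freq[w]
            for w, p in network_probs.items()}
total = sum(adjusted.values())
custom_probs = {w: p / total for w, p in adjusted.items()}
```

In this toy example the rescaling promotes "ecg", which the general model underweights, to the most probable transcription.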
  • In one embodiment, the neural network is an end-to-end speech recognition system, end-to-end speech classification system, or end-to-end phoneme recognition system. In other embodiments, the neural network may perform tasks unrelated to speech recognition.
  • Further areas of applicability of the present disclosure will become apparent from the detailed description, the claims and the drawings. The detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the disclosure.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The present disclosure will become more fully understood from the detailed description and the accompanying drawings, wherein:
  • FIG. 1 illustrates an exemplary network environment where some embodiments of the invention may operate;
  • FIG. 2 illustrates an end-to-end speech recognition system according to an embodiment;
  • FIG. 3 illustrates an example of audio features produced by a front-end module according to an embodiment;
  • FIG. 4 illustrates an example CNN stack architecture according to an embodiment;
  • FIG. 5 illustrates an example RNN stack architecture according to an embodiment;
  • FIG. 6 illustrates an example transcription output of an end-to-end speech recognition system according to an embodiment;
  • FIG. 7 illustrates an end-to-end speech recognition system according to an embodiment.
  • FIG. 8 illustrates an end-to-end phoneme recognition system according to an embodiment.
  • FIG. 9A illustrates an iterative beam search according to an embodiment.
  • FIG. 9B illustrates exemplary radial basis functions used in an iterative beam search according to an embodiment.
  • FIG. 9C illustrates an example use of iterative beam search according to an embodiment.
  • FIG. 10 illustrates an example of looping training samples in a training batch that are shorter than a longest training sample.
  • FIGS. 11A-B illustrate an example attention mechanism for a neural network.
  • FIG. 12 illustrates an example of a general domain and a custom domain.
  • FIG. 13 illustrates an example system for predicting the weights of neural network nodes.
  • FIG. 14 illustrates an example customization layer of a neural network.
  • FIG. 15 illustrates an example method of training a neural network for a custom domain by selecting portions of a general training dataset to train on.
  • FIG. 16 illustrates an example training data augmentation and streaming system.
  • FIG. 17 illustrates an example process for parallelizing an inference task.
  • FIG. 18 illustrates an example method of generating an internal state representation of a neural network.
  • In the drawings, reference numbers may be reused to identify similar and/or identical elements.
  • DETAILED DESCRIPTION
  • Embodiments described herein relate to end-to-end neural network speech recognition systems. Some disclosed embodiments form a single neural network from input to output. Because of this unitary architecture, the disclosed speech recognition systems are able to be trained solely by data driven techniques, eschewing laborious hand-tuning and increasing accuracy.
  • Traditional speech pipelines need tens of people working together to build a model over several months. If one portion of the pipeline is altered, then all interfaces with the standard pipeline may be affected. Embodiments disclosed herein are trained by data-driven techniques only, without the need for human intervention.
  • In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present invention, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.
  • Components, or modules, shown in diagrams are illustrative of embodiments of the invention. It shall also be understood that throughout this disclosure that components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including integrated within a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.
  • Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.
  • Reference in the specification to “one embodiment,” “an embodiment,” “some embodiments,” or “embodiments” means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the invention and may be included in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.
  • The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated. Furthermore, the use of memory, database, information base, data store, tables, hardware, and the like may be used herein to refer to system component or components into which information may be entered or otherwise recorded.
  • Furthermore, it shall be noted that unless otherwise noted: (1) steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) steps may be performed in different orders; and (4) steps may be done concurrently.
  • FIG. 1 illustrates an exemplary network environment 100 where some embodiments of the invention may operate. The network environment 100 may include multiple clients 110, 111 connected to one or more servers 120, 121 via a network 140. Network 140 may include a local area network (LAN), a wide area network (WAN), a telephone network, such as the Public Switched Telephone Network (PSTN), an intranet, the Internet, or a combination of networks. Two clients 110, 111 and two servers 120, 121 have been illustrated for simplicity, though in practice there may be more or fewer clients and servers. Clients and servers may be computer systems of any type. In some cases, clients may act as servers and servers may act as clients. Clients and servers may be implemented as a number of networked computer devices, though they are illustrated as a single entity. Clients may operate web browsers 130, 131, respectively, to display web pages, websites, and other content on the World Wide Web (WWW). Clients 110, 111 may also access content from the network 140 using applications, or apps, rather than web browsers 130, 131. Servers may operate web servers 150, 151, respectively, for serving content over the network 140, such as the web.
  • The apparatuses and methods described in this application may be partially or fully implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on at least one non-transitory tangible computer readable medium. The computer programs may also include and/or rely on stored data.
  • FIG. 2 illustrates an end-to-end speech recognition system 200 according to an embodiment. The example end-to-end speech recognition system 200 illustrated in FIG. 2 is configured to transcribe spoken word into written text. Speech recognition system 200 comprises front-end module 201, convolutional neural network (CNN) stack 202, first fully-connected layer 203, recurrent neural network (RNN) stack 204, second fully-connected layer 205, output neural network stack 206, and optional customization layer 207. In end-to-end speech recognition system 200, each subcomponent connects directly to the next. The entire end-to-end speech recognition system 200 may operate as a single neural network. The input to end-to-end speech recognition system 200 is audio information, and the output is a word-by-word transcription of the input audio.
  • Neural networks comprise a plurality of neural network nodes organized in one or more layers. Each node has one or more inputs, an activation function, and an output. The inputs and output may generally be real number values. The inputs to the node are combined through a linear combination with weights and the activation function is applied to the result to produce the output. The output of a node may be expressed Output=g(W0+W1X1+W2X2+ . . . +WnXn) where Wi are weights, Xi are input values, and g is the activation function. The output may be transmitted as an input to one or more other nodes in subsequent layers. The weights in the linear combination may be referred to as the weights of the node, and each node may have different weights. Neural network nodes may be organized in one or more layers. An input layer may comprise input nodes whose values may correspond to inputs to the neural network, without use of an activation function. An output layer may comprise one or more output nodes corresponding to output from the neural network. Neural network layers other than the input layer and output layer may be hidden layers, and the nodes in those layers may be referred to as hidden nodes.
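The node computation just described can be written out directly. The sigmoid is used here as one possible choice of the activation function g, and the weights, bias, and inputs are arbitrary example values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Output = g(W0 + W1*X1 + ... + Wn*Xn) for a single node: a weighted
# linear combination of the inputs plus a bias term W0, passed through
# the activation function g.
def node_output(weights, bias, inputs, g=sigmoid):
    return g(bias + np.dot(weights, inputs))

# Example: weights [0.5, -0.25], bias 0.1, inputs [1.0, 2.0].
out = node_output(np.array([0.5, -0.25]), 0.1, np.array([1.0, 2.0]))
```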
  • For clarity in explanation, the primary stacks that make up end-to-end speech recognition system 200 may be roughly analogized to components of a traditional ASR system, though the components of end-to-end speech recognition system 200 are not so rigidly defined as in a traditional ASR system. For example, CNN stack 202 detects features of the input audio stream and RNN stack 204 classifies groups of features as words, roughly similar to an acoustic model and a pronunciation dictionary. However, CNN stack 202 does not produce a discrete phoneme stream output, and RNN stack 204 does not expressly use a language model or hand-coded dictionary. Instead, the features produced by CNN stack 202 are entirely learned in the training process, and RNN stack 204 learns relationships between sounds and words through training as well. No hand-coded dictionaries or manual interventions are used throughout. Each layer or stack of end-to-end speech recognition system 200 is described in further detail below.
  • Front-end module 201 produces acoustic features from audio input. Front-end module 201 receives raw audio data and applies a series of transformations and filters to generate acoustic features suitable for speech recognition by the following neural networks. In an embodiment, the input audio is a recording of an utterance that may be segmented on relative silence such that the input audio comprises an entire utterance. An utterance may be one or more words. For example, the input audio may be a 7-10 second long recording of a speaker speaking a word, phrase, or series of words and/or phrases. In some embodiments, the input audio may be an entire sentence. In some embodiments, the input audio is segmented based on time intervals rather than relative silence. In some embodiments, the input audio is segmented based on a combination of features, such as relative silence, time, and other features.
  • Front-end module 201 may filter the input audio to isolate or emphasize frequency bands relevant to speech recognition. For example, front-end module 201 may low-pass filter the input audio at a predetermined frequency to remove high frequency information beyond the range of speech. Similarly, front-end module may filter the input audio with high-pass filters, band-pass filters, dynamic range compressors, dynamic range expanders, or similar audio filtering techniques suitable for processing audio for speech recognition.
  • Front-end module 201 may then segment the input recording of an utterance into a series of frames. For example, the input utterance recording may be split into a series of frames of audio data 10 milliseconds long, such that one second of input audio may be split into 100 frames. In some embodiments, the frames may overlap. For example, one second of input audio may be divided into 100 frames that are 25 milliseconds in length, spaced at 10 millisecond intervals. Any frame duration, spacing, and overlap may be used as appropriate for any given implementation as determined by one skilled in the art.
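A minimal sketch of the framing step, assuming 16 kHz input audio and the 25 ms frame / 10 ms spacing example figures above (partial trailing frames are simply dropped in this sketch):

```python
import numpy as np

sample_rate = 16000                    # 16 kHz input audio (assumed)
frame_len = int(0.025 * sample_rate)   # 25 ms frames -> 400 samples
hop = int(0.010 * sample_rate)         # 10 ms spacing -> 160 samples

audio = np.zeros(sample_rate)          # one second of (silent) audio

# Slice the utterance into overlapping frames at the hop interval,
# dropping any partial final frame for simplicity.
frames = [audio[start:start + frame_len]
          for start in range(0, len(audio) - frame_len + 1, hop)]
```

With these settings one second of audio yields roughly 100 overlapping frames (98 after dropping partial trailing frames), each 400 samples long.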
  • In some embodiments, front-end module 201 may output raw audio information for consumption by subsequent layers. In other embodiments, front-end module 201 may further process the audio frames before outputting. For example, in some embodiments, front-end module 201 generates spectrograms of audio frames. The spectrograms for each frame may then be arranged sequentially, producing a two-dimensional representation of the input audio that reflects the frequency content over time. In this way, the front-end module may generate a visual, two-dimensional representation of the input audio for the following neural networks.
  • In some embodiments, front-end module 201 generates other features of the input audio frames. Examples of feature representations include: log-mel filterbanks, Mel-Frequency Cepstral Coefficients (MFCC), and perceptual linear prediction coefficients, among other similar acoustic feature representations. In an embodiment, an MFCC representation of each frame may be visualized as a linear vector similar to the spectrogram example above, and similarly rotated and stacked side-by-side to produce a 2-dimensional visual representation of the audio input over time.
  • The relevant parameters of front-end module 201 include the number of frames, the width and overlap of frames, the type of features determined, and the number of features per frame. Each of these parameters may be chosen by one skilled in the art for any given implementation.
  • FIG. 3 illustrates an example of audio features produced by a front-end module such as front-end module 201. In FIG. 3, audio input 301 is divided into windows 302 a-n. For the sake of illustration, only some audio windows 302 a-n are illustrated in FIG. 3. In most embodiments, audio windows would either abut or overlap such that the entire audio input is processed. Each window of audio data is then processed by a filter 303. In an embodiment, filter 303 produces an MFCC representation 304 of each window of audio data. For the purposes of illustration, MFCC representations 304 a-n comprise 12 coefficients, but any number of coefficients may be used. As illustrated, the shade of each coefficient in MFCC representations 304 a-n represents an intensity of each coefficient, corresponding to some feature or quality of the audio stream. A plurality of feature representations are joined together to form a single representation 305 of the entire audio input. This representation 305 may be illustrated as a 2-dimensional image as shown in FIG. 3.
  • Representations of greater or less than 1-dimension or 2-dimensions may also be used to represent frames, and frames may be represented in the system as tensors. The term tensor is used to refer to a vector or matrix of any number of dimensions. A tensor may have dimension 0 (scalar), dimension 1 (vector), dimension 2 (2-dimensional matrix), or any higher number of dimensions such as 3, 4, 5, and so on. The multi-dimensional property of some tensors makes them a useful tool for representing neural networks and also the data representations between neural network layers.
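As a brief illustration of these tensor dimensions, using NumPy arrays with hypothetical sizes drawn from the MFCC example above:

```python
import numpy as np

scalar = np.array(3.0)           # dimension 0: a single value
vector = np.zeros(12)            # dimension 1: e.g., 12 MFCCs for one frame
matrix = np.zeros((100, 12))     # dimension 2: 100 frames x 12 coefficients
volume = np.zeros((100, 12, 32)) # dimension 3: e.g., time x features x filters
```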
  • Returning to FIG. 2, CNN stack 202 receives the representation of the audio input from front-end module 201. CNN stack 202 processes the audio features to determine a first set of features. Specifically, CNN stack 202 generates a number of feature maps corresponding to a number of convolutional filters, where each convolutional filter represents some characteristic or feature of the audio input. This step may be regarded as roughly analogous to determining a phoneme representation of the input audio; however, CNN stack 202 does not discretize the output to a set number of acoustic representations. The features determined by CNN stack 202 are not limited to a predetermined set of phonemes. Because it is not so limited, CNN stack 202 can encode a wide range of information.
  • CNN stack 202 may include any number of convolutional layers, each including various size convolutional kernels. The relevant hyperparameters of CNN stack 202 include the number and dimensions of its convolutional layers, the dimension and number of convolutional kernels at each layer, the stride of the convolutional kernels, and the number and function of any pooling layers. Convolutional kernels may be square, such as of size 5×5, or rectangular, such as of size 3×9, for example. Rectangular convolutional kernels that are ‘narrow’ along the time-axis may be more sensitive to features that are spread out over a wide range of frequencies but local to a short time period. Similarly, rectangular convolutional kernels that are ‘wider’ along the time-axis may detect acoustic features that are confined to a relatively narrow audio band but are of longer duration in time. Convolutional kernels may also be referred to as windows, filters, or feature detectors.
  • In an embodiment, the size of the convolutional kernel also determines the number of connections between the input layer and at least the first hidden layer of neural network nodes of the CNN. Each node in the first hidden layer of the CNN has an input edge from each of the input values in the convolutional kernel centered on that node. For example, if the convolutional kernel has size 5×5, then a hidden neural network node in the first hidden layer has 25 inbound edges, one from each of the input values in a 5×5 square in the vicinity of the neural network node, and the hidden neural network node does not have inbound edges from other input values outside of the convolutional kernel. In an embodiment, the subsequent hidden layers of the same CNN stack or later CNN stacks operate in the same manner, but the inbound edges come not from the input values but from the preceding CNN layer. Each subsequent neural network node in the CNN stack has inbound connections from preceding CNN nodes in only a local area defined around the subsequent neural network node, where the local area may be defined by the size of the convolutional kernel. This property also implies that a given hidden layer node of a CNN also only has outbound edges to hidden layer nodes of the next layer that are in the vicinity of the given hidden layer node. The outbound connections of a hidden layer node may also correspond to the size of the convolutional kernel.
  • A CNN is one type of locally connected neural network because the neural network nodes of each layer are connected only to nodes of the preceding layer of the neural network that are in the local vicinity of the neural network nodes. Moreover, a CNN may also be referred to as one type of sparsely connected neural network because the edges are sparse, meaning that most neural network nodes in a layer are not connected to the majority of neural network nodes in the following layer. The aforementioned definitions may exclude the output or input layer as necessary given that the input layer has no preceding layer and the output layer has no subsequent layer. A CNN is only one type of locally connected or sparsely connected neural network, and there are other types of locally connected or sparsely connected neural networks.
  • Individual convolutional layers may produce an output activation map that is approximately the same dimensionality as the input to the layer. In other words, the convolutional kernel may operate on all or nearly all input values to a convolutional layer. Convolutional layers may also incorporate a stride factor wherein the convolutional kernel may be shifted by 2 or more pixels per iteration and produce an activation map of a correspondingly reduced dimensionality. Stride factors for each layer of CNN stack 202 may be determined by one of skill in the art for each implementation.
  • CNN stack 202 may include pooling layers in between convolutional layers. Pooling layers are another mechanism to reduce dimensionality. For example, a pooling layer may operate on a 2×2 window of an activation map with a stride of 2 and select the maximum value within the window, referred to as a max pooling operation. This example pooling layer reduces the dimensionality of an activation map by a factor of 4. Pooling layers of other dimensions, for example 1×2 or 1×3, may also be used between convolutional layers to reduce dimensionality.
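The max pooling operation described above can be sketched as follows; this is an illustrative implementation with non-overlapping windows, not the system's own code:

```python
import numpy as np

def max_pool(activation: np.ndarray, pool_h: int, pool_w: int) -> np.ndarray:
    """Max pooling with stride equal to the pool size (non-overlapping windows)."""
    h, w = activation.shape
    trimmed = activation[: h - h % pool_h, : w - w % pool_w]
    blocks = trimmed.reshape(h // pool_h, pool_h, w // pool_w, pool_w)
    return blocks.max(axis=(1, 3))  # maximum within each pooling window

act = np.arange(16.0).reshape(4, 4)
pooled = max_pool(act, 2, 2)  # 2x2 windows, stride 2: 4x4 -> 2x2 (factor of 4)
```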
  • In some embodiments, the input to CNN stack 202 is all frames of audio features produced by front-end module 201 and no segmenting or windowing is involved. In these embodiments, convolutional kernel dimension, stride, and pooling dimensions may be selected so as to retain temporal information. In an embodiment, this is accomplished by reducing dimensionality only in the frequency dimension, such that the output of CNN stack 202 has a time dimension equal to that of its input. In any embodiment, CNN stack 202 produces a set of features corresponding to sounds in the audio input.
  • In some embodiments, the input to CNN stack 202 is a segment of frames of audio features produced by front-end module 201. For each output frame, a context of frames before and/or after the output frame may be included in the segment. For example, for each frame of audio, CNN stack 202 may operate on a ‘window’ of the 5 previous frames and the following 5 frames, for a total of 11 frames. In this example, if there are 40 audio features per frame, CNN stack 202 would then operate on an input having dimensions of 11×40. Through selection of the hyperparameters for CNN stack 202, the output for a segment may be dimensioned smaller in the time dimension than its input. In other words, CNN stack 202 may resize in the temporal dimension so as to produce a different dimensioned output for each input segment of frames. For example, an embodiment of CNN stack 202 may have an input of dimension 11×40 and an output for each feature of width 1 in the time dimension.
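The 11-frame, 40-feature windowing in this example can be sketched as follows; the edge-padding strategy (repeating the first or last frame at the boundaries) is an assumption, as the passage does not specify how boundary frames are handled:

```python
import numpy as np

NUM_FEATURES = 40  # audio features per frame, as in the example above
CONTEXT = 5        # frames of context on each side of the output frame

def make_segment(features: np.ndarray, center: int) -> np.ndarray:
    """Gather an 11x40 segment: 5 frames before, the center frame, 5 frames after.
    Edge frames are padded by repeating the first/last frame (an assumption)."""
    num_frames = features.shape[0]
    idx = np.clip(np.arange(center - CONTEXT, center + CONTEXT + 1), 0, num_frames - 1)
    return features[idx]

features = np.random.randn(100, NUM_FEATURES)
segment = make_segment(features, center=50)  # 11x40 input to the CNN stack
```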
  • FIG. 4 illustrates an example CNN stack architecture according to an embodiment. Acoustic feature representation 401 may be a representation such as an MFCC representation as illustrated in FIG. 3. Each horizontal division is a frame, and each vertical division indicates a different MFCC coefficient value. In the illustration, a highlighted window 403 of 7 frames is centered around a central frame 402. This segment of frames is then processed by one or more convolutional and pooling neural network layers that make up a convolutional neural network stack such as CNN stack 202 discussed above. In FIG. 4, a single convolutional kernel 404 is illustrated, along with a number of network layers as illustrated by network layers 403 a-c. After a number of network layers, a final dataset 404 is produced corresponding to a number of features that describe input frame 402. As illustrated, the final dataset 404 may be a volume with a first dimension corresponding to time, a second dimension corresponding to features of the audio at a point in time, such as frequencies or coefficients, and a third dimension corresponding to various filters. The illustrated number and arrangement of datasets and layers is for illustrative purposes only; it is to be understood that any combination of convolutional and/or pooling layers would be used in an implementation as determined by one of skill in the art.
  • Returning to FIG. 2, first fully-connected layer 203 receives features from CNN stack 202 and produces a second set of features. A fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network. A fully-connected layer 203 comprises one or more fully-connected neural networks placed end-to-end. The term fully-connected comes from the fact that each layer is fully-connected to the subsequent layer. A fully-connected neural network is one kind of densely connected neural network, where a densely connected neural network is one where most of the nodes in each layer of the neural network have edge connections to most of the nodes in the subsequent layer. The aforementioned definitions may exclude the output layer which has no outbound connections.
  • In an embodiment, the first fully-connected layer 203 is implemented as a fully-connected neural network that is repeated across the entire segment that is output by the CNN stack 202, and each copy of the fully-connected neural network accepts as input a single strided frame. Strided frame refers to the frames output by the CNN stack 202, which may be obtained by slicing the final dataset 404 in the time dimension so that each strided frame refers to a single point in time. There may be fewer strided frames than input frames to the CNN stack 202 due to striding or pooling, though in some embodiments they could be the same in number. Each strided frame retains features of the audio at the point in time and features in the depth dimension created by the various convolutional filters. Each copy of the fully-connected neural network shares the same parameters, in particular each of the weights of all of the nodes of the fully-connected neural network, which allows for computational and memory efficiency because the size of the fully-connected neural network corresponds to a single strided frame rather than the segment and one copy of the fully-connected neural network may be stored and reused. It should be understood that the repetition of the fully-connected neural network across the segment is a reuse of the neural network per strided frame and would not require actually creating a separate copy of the neural network in memory per strided frame. The output of each fully-connected neural network is a tensor comprising features of the strided frame, which is input into the following layer.
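The weight sharing described above amounts to one matrix product applied across the time axis. A minimal sketch with hypothetical dimensions (25 strided frames, 128 inputs, 64 outputs, ReLU activation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 25 strided frames from the CNN stack, each flattened
# to 128 values, mapped by one shared fully-connected layer to 64 features.
strided_frames = rng.standard_normal((25, 128))
W = rng.standard_normal((128, 64))  # one weight matrix shared by every frame
b = np.zeros(64)

# Applying the same (W, b) at every time step is the "repeated" fully-connected
# network: a single matrix product over the time axis, no per-frame copies.
fc_out = np.maximum(strided_frames @ W + b, 0.0)  # ReLU activation
```

Because `W` and `b` are reused per strided frame, only one copy of the layer's parameters needs to be stored, as noted above.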
  • First fully-connected layer 203 serves several functions. First, the dimensionality of the first fully-connected layer 203 may be selected so as to resize the output of CNN stack 202. Second, the fully-connected stack may learn additional features that CNN stack 202 is not able to detect.
  • First fully-connected layer 203 may resize the output of CNN stack 202 for consumption by the subsequent stack. For example, CNN stack 202 may produce a high dimensioned output based on the number of feature maps used and the frequency context of the output. In some embodiments, the first fully-connected layer 203 may reduce the dimension of this output to reduce the number of parameters subsequent stacks need to process. Further, this flexibility allows various implementations to optimize the hyperparameters of various stacks independently of one another while retaining compatibility between stacks.
  • First fully-connected layer 203 may also learn additional features. In some embodiments, first fully-connected layer 203 may learn features that CNN stack 202 is not sensitive to. For example, the first fully-connected layer 203 is not limited to local connections between nodes, so concepts that require considering tensor values that are distant may be learned. Moreover, the first fully-connected layer 203 may combine information collected from multiple different feature maps generated by different convolutional kernels.
  • The output of the CNN stack 202 and first fully-connected layer 203 may be thought of as roughly analogous to a phoneme representation of the input audio sequence, even though no hardcoded phoneme model is used. The similarity is that these network layers produce an output that describes the acoustic features of the input audio in sequence. In embodiments where the audio was segmented or windowed prior to the CNN stack 202, the output is a series of short temporal axis slices corresponding to acoustic features in each audio segment or window. In embodiments where the CNN stack 202 operate on the entirety of the audio input, the output of first fully-connected layer 203 is a representation of the activation of acoustic features over the entire time of the input. In any embodiment, the output from CNN stack 202 and first fully-connected layer 203 is a set of features that describe acoustic features of the audio input.
  • Recurrent Neural Network (RNN) stack 204 receives these features from first fully-connected stack 203 and produces a third set of features. In an embodiment, the input features comprise a set of tensors 501 a-n with one tensor corresponding to each strided frame, and the corresponding tensor produced by the first fully-connected layer representing features of the associated strided frame. Each of the tensors 501 a-n is generated from the fully-connected neural network that operates per strided frame produced by the CNN stack 202. All of the tensors may be iterated over by the RNN stack 204 in order to process the information in a sequential, temporal manner. RNN stack 204 may be regarded as roughly analogous to a language model in that it receives acoustic features and outputs features related to words that correspond to acoustic features. RNN stack 204 may include various types of recurrent neural network layers, such as Long Short-Term Memory (LSTM) neural network layers and/or Gated Recurrent Unit (GRU) neural network layers. LSTM and GRU type recurrent neural network cells and layers include mechanisms for retaining or discarding information from previous frames when updating their hidden states.
  • LSTM and GRU type RNNs include at least one back loop where the output activation of a neural network enters as an input to the neural network at the next time step. In other words, the output activation of at least one neural network node is an input to at least one neural network node of the same or a prior layer in a successive time step. More specifically, the LSTM or GRU computes a hidden state, comprising a vector, through a series of mathematical operations, which is produced as an output of the neural network at each time step. The hidden state is passed as an input to the next time step of the LSTM or GRU. In an embodiment, an LSTM has three inputs at a particular time step: the hidden state passed from the previous time step, the output tensor value of the previous time step, and the input frame or tensor representation of the frame of the current time step. At each time step, the LSTM produces both a hidden state and an output tensor value. In an embodiment, a GRU has two inputs at a particular time step: the hidden state passed from the previous time step and the input frame or tensor representation of the frame of the current time step. In a GRU, the hidden state and output tensor value are the same tensor, and thus only a single tensor value is output.
  • In an embodiment, the LSTM may comprise a forget gate layer comprising a neural network layer with a sigmoid activation function and a pointwise multiplication gate for determining which elements of the input hidden state to preserve. The LSTM may comprise an update gate layer comprising a neural network layer with a sigmoid activation function and a neural network layer with a tanh activation function that are both input to a pointwise multiplication gate. The product may be input to a pointwise addition gate with the hidden state to add data to the hidden state. The LSTM may comprise an output gate layer comprising a neural network layer with a sigmoid activation function input to a pointwise multiplication gate, with the other input being the hidden state after being passed through the tanh function. The result of this operation may be output as the tensor output of the LSTM at the current time step. Other implementations and variations of an LSTM may also be used, and the LSTM is not limited to this embodiment.
  • In an embodiment, the GRU may comprise an update gate layer for determining how much information from the prior hidden state to pass on to the future. The update gate layer may comprise a pointwise addition gate and a neural network layer with a sigmoid activation function. The GRU may comprise a reset gate layer for deciding how much prior hidden state information to forget. The reset gate layer may comprise a pointwise addition gate and a neural network layer with a sigmoid activation function. Other implementations and variations of a GRU may also be used, and the GRU is not limited to this embodiment.
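A minimal GRU time step consistent with the description above (the hidden state and output are the same tensor) might look like the following sketch; the sizes and random initialization are illustrative only, and bias terms are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU time step; the new hidden state is also the output tensor."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)             # update gate: how much new info to take
    r = sigmoid(x @ Wr + h @ Ur)             # reset gate: how much history to forget
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate hidden state
    return (1.0 - z) * h + z * h_cand        # blend old state with candidate

rng = np.random.default_rng(1)
D_IN, D_HID = 64, 32  # hypothetical input and hidden sizes
params = [rng.standard_normal((D_IN, D_HID)) * 0.1 if i % 2 == 0
          else rng.standard_normal((D_HID, D_HID)) * 0.1 for i in range(6)]
h = np.zeros(D_HID)
for x in rng.standard_normal((10, D_IN)):  # iterate over 10 frames in sequence
    h = gru_step(x, h, params)
```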
  • RNN stack 204 processes the tensors representing the strided frames in sequence, and its output for each strided frame is dependent on previously processed frames. RNN stack 204 may include either unidirectional or bidirectional RNN layers. Unidirectional RNN layers operate in one direction in time, such that current frame predictions are only based on previously observed inputs. Bidirectional RNN layers are trained both forward in time and backward in time. Bidirectional RNNs may therefore make current-frame predictions based on both preceding frames and following frames. In a unidirectional RNN, the tensors corresponding to frames are processed sequentially by the RNN in a single direction such as front to back or back to front. In a bidirectional RNN, the tensors corresponding to frames may be processed in both directions, front to back and back to front, with the information produced from the forward and backward runs combined at the end of processing, such as by concatenation, addition, or other operations.
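The forward/backward combination of a bidirectional RNN can be sketched as follows; the recurrent cell here is a plain tanh stand-in rather than a trained LSTM or GRU, and concatenation is one of the combination operations mentioned above:

```python
import numpy as np

def run_rnn(frames, step_fn, h0):
    """Run a recurrent step function over frames, collecting the state per step."""
    h, states = h0, []
    for x in frames:
        h = step_fn(x, h)
        states.append(h)
    return np.stack(states)

# A stand-in recurrent cell (illustrative only, not a trained LSTM/GRU).
rng = np.random.default_rng(2)
W, U = rng.standard_normal((8, 16)) * 0.1, rng.standard_normal((16, 16)) * 0.1
step = lambda x, h: np.tanh(x @ W + h @ U)

frames = rng.standard_normal((20, 8))
forward = run_rnn(frames, step, np.zeros(16))               # front to back
backward = run_rnn(frames[::-1], step, np.zeros(16))[::-1]  # back to front, realigned
bidirectional = np.concatenate([forward, backward], axis=1)  # combined by concatenation
```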
  • FIG. 5 illustrates an example RNN stack architecture according to an embodiment. Features 501 a-n are received from first fully-connected layer 203. In an embodiment, each of features 501 a-n corresponds to a single strided frame. These features are input into recurrent neural network 502. Recurrent neural network 502 is illustrated as ‘unrolled’ network elements 502 a-n, each corresponding to the input from one of features 501 a-n, to show the temporal operation of RNN 502 at each time step. Recurrent neural network 502 is a bidirectional recurrent neural network, as illustrated by the bidirectional arrows connecting elements 502 a-n. The diagram shows that data is passed from the RNN at the prior time step to the next time step. As a bidirectional RNN, data is passed from the RNN at the successive time step to the prior time step in a backward pass through the features 501 a-n. Other embodiments may utilize unidirectional RNN architectures. While recurrent neural network 502 is illustrated as a single layer for the purposes of illustration, it is to be understood that the recurrent network may include any number of layers. For each time step, recurrent neural network 502 produces a set of features related to a word prediction 503 a-n at that time step. This set of features is expressed as a tensor or vector output and is directly input to subsequent layers.
  • Returning to FIG. 2, a second fully-connected stack 205 receives the output features from RNN stack 204 and produces a word embedding. Similar to first fully-connected stack 203, second fully-connected stack 205 serves several functions. In an embodiment, second fully-connected stack 205 reduces the dimensionality of the output of RNN stack 204 to something more concise. In an embodiment, second fully-connected stack 205 produces a word embedding of significantly reduced dimension compared to the output of RNN stack 204. This word embedding contains information related to the word predicted for a given time frame, and also information regarding words around the predicted word.
  • This word embedding, or word vector, representation is then passed to output stack 206. Output stack 206 has an output node for each word of a vocabulary and a blank or null output. For each frame of input audio data, output stack 206 produces a probability distribution over its output nodes for a word transcription or a null output. For each spoken word in the input audio, one frame of the output sequence is desired to have a high-probability prediction for a word of the vocabulary. All other frames of audio data that correspond to the word are desired to contain the null or blank output. The alignment of a word prediction with the audio of the word is dependent on the hyperparameters of the various stacks and the data used for training. For example, if the recurrent stack is unidirectional, the word prediction must come after a sufficient number of audio frames corresponding to the word have been processed, likely near or around the end of the spoken word. If the recurrent stack is bidirectional, the alignment of the word prediction may be more towards the middle of the spoken word, for example. The learned alignments are dependent on the training data used. If the training data have word transcriptions aligned to the beginning of words, the RNN stack will learn a similar alignment.
  • FIG. 6 illustrates an example output of a transcription from an example output stack of an example end-to-end speech recognition system. The output stack will produce a prediction of which word corresponds to the audio for each time frame. Here, the output 600 for an example time frame is illustrated as a table with words in the first column and corresponding probabilities in the second column. In this example, the word “Carrot” has the highest prediction for this time frame with a weighted prediction of 0.90, or 90% likelihood.
  • Returning to FIG. 2, in some embodiments, a complete transcription output may be determined from the output of end-to-end speech recognition system 200 by choosing the highest probability predicted word at each frame. In some embodiments, the output probabilities of end-to-end speech recognition system 200 may be modified by a customization layer 207 based on a set of custom prior probabilities to tailor the transcription behavior for certain applications. In this way, a single, general training set may be used for a number of different applications that have varying prior probabilities.
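Choosing the highest-probability prediction per frame, discarding blank outputs, and merging consecutive repeats of the same word can be sketched as follows; the per-frame distributions are invented for illustration, and the repeat-merging step is an assumption about how adjacent high-probability frames for one word would be handled:

```python
BLANK = "_"  # the null/blank output

def frames_to_transcript(frame_predictions):
    """Pick the highest-probability word per frame, then drop blanks and
    collapse consecutive repeats of the same word into one occurrence."""
    best = [max(dist, key=dist.get) for dist in frame_predictions]
    words, prev = [], None
    for w in best:
        if w != BLANK and w != prev:
            words.append(w)
        prev = w
    return " ".join(words)

# Hypothetical per-frame distributions for the utterance "the carrot".
frames = [
    {"the": 0.7, "carrot": 0.1, BLANK: 0.2},
    {"the": 0.2, "carrot": 0.1, BLANK: 0.7},
    {"the": 0.1, "carrot": 0.8, BLANK: 0.1},
    {"the": 0.0, "carrot": 0.6, BLANK: 0.4},
    {"the": 0.1, "carrot": 0.1, BLANK: 0.8},
]
transcript = frames_to_transcript(frames)
```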
  • Customization layer 207 may be useful, for example, to resolve ambiguities between homophones, to increase priors for words that rarely occur in the training data but are expected to occur frequently in a particular application, or to emphasize particular proper nouns that are expected to occur frequently. In an embodiment, the custom priors applied may be determined from a statistical analysis of a corpus of data. For example, if end-to-end speech recognition system 200 is employed by a particular company, documents from that company may be analyzed to determine relative frequency of words. The output of end-to-end speech recognition system 200 may then be modified by these custom priors to reflect the language usage of the company. In this way, end-to-end speech recognition system 200 may be trained once on a general training dataset and customized for a number of particular use cases while using the same trained model.
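Applying custom priors to the model's output distribution might look like the following sketch, in which the prior values are hypothetical relative frequencies taken from an application-specific corpus:

```python
def apply_custom_priors(distribution, priors):
    """Scale the model's output probabilities by application-specific priors
    and renormalize, without retraining the underlying model."""
    scaled = {word: p * priors.get(word, 1.0) for word, p in distribution.items()}
    total = sum(scaled.values())
    return {word: p / total for word, p in scaled.items()}

# Homophones: the general model slightly prefers "carat", but this hypothetical
# deployment (say, a grocery application) expects "carrot" far more often.
model_output = {"carrot": 0.45, "carat": 0.55}
priors = {"carrot": 5.0, "carat": 1.0}  # hypothetical relative frequencies
customized = apply_custom_priors(model_output, priors)
```

Because the priors are applied after the model, the same trained model can serve multiple deployments with different prior tables, as described above.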
  • FIG. 7 illustrates an end-to-end speech classification system 700 according to an embodiment. The example end-to-end speech classification system 700 illustrated in FIG. 7 is configured to classify spoken words into a set of classifications rather than generate a transcription. For example, end-to-end speech classification system 700 may classify a spoken word or set of words into classes such as semantic topic (e.g., sports, politics, news), gender (e.g., male/female), emotion or sentiment (e.g., angry, sad, happy, etc.), speaker identification (i.e., which user is speaking), speaker age, speaker stress or strain, or other such classifications.
  • An advantage of the disclosed neural network architecture over traditional ASR systems using discrete components is that the same neural network architecture described above may be repurposed to learn classifications, instead of speech recognition. The neural network architecture learns the appropriate features automatically instead of requiring hand tuning. As such, the architecture of end-to-end speech classification system 700 is identical to that of end-to-end speech recognition system 200 as illustrated in FIG. 2 except for the output neural network stack 706. Front-end module 701 may be identical to front-end module 201, CNN stack 702 may be identical to CNN stack 202, first fully-connected layer 703 may be identical to first fully-connected layer 203, and RNN stack 704 may be identical to RNN stack 204. While the identity and order of the components may be the same, the hyperparameters and number and order of hidden nodes of each particular layer or stack may be separately tuned for the classification task. The configuration of each implementation will depend on the particular categorization goal and various implementation concerns such as efficacy, efficiency, computing platform, and other such factors. Similarly, the trained hidden nodes of any layer or component are learned through the training process and may differ between various implementations. For example, the convolutional kernels used by a gender classification implementation may be very different than those used by a transcription implementation.
  • The architecture and implementation details of end-to-end speech recognition system 200 as shown in FIGS. 2-6 and as described in the related sections of the description may also be used for end-to-end classification system 700, aside from a change to the output neural network stack 706. In other words, end-to-end speech recognition system 200 may be used for speech classification by simply changing the output layer, removing output network 206 and replacing it with output network 706.
  • One difference between end-to-end speech classification system 700 and end-to-end speech recognition system 200 is the output neural network stack 706. The output neural network stack 706 of end-to-end speech classification system 700 contains categories related to the classification scheme being used rather than words in a vocabulary. As an example, an output neural network stack 706 of an example end-to-end speech classification system 700 may have two output nodes, one for male and one for female. Alternatively, a single output node may be used for the binary classification of male or female. The output of this example would be to classify a spoken word as either male or female. Any number of classifications may be used to classify speech by output neural network stack 706. For multi-class classification, such as semantic topic, emotion or sentiment, speaker identification, speaker age, or speaker stress or strain, a single output node may be provided in output layer 706 for each potential classification, where the value of each output node is the probability that the spoken word or words corresponds to the associated classification. While not illustrated, there may be a customization layer that modifies the output of output neural network stack 706 similar to customization layer 207 discussed in connection with FIG. 2. A customization layer may alter predicted outputs based on some external guidance, similar to customization layer 207.
  • FIG. 8 illustrates an end-to-end phoneme recognition system 800 according to an embodiment. The example end-to-end phoneme recognition system 800 illustrated in FIG. 8 is configured to generate a set of phonemes from audio rather than generate a transcription. For example, end-to-end phoneme recognition system 800 may generate a sequence of phonemes corresponding to spoken words rather than a transcription of the words. A useful application of the end-to-end phoneme recognition system 800 is for addressing the text alignment problem, in other words, aligning an audio file with a set of text that is known to correspond to the audio. Text alignment may be used to split training examples that comprise lengthy audio files with lengthy corresponding text transcripts into shorter training examples that are easier to fit into computer memory. By performing text alignment, portions of the audio file may be associated with their corresponding portions of the text transcript. These portions may then be extracted or used as points of division and used as shorter training examples.
  • As described above, the disclosed neural network architecture has the advantage over traditional ASR systems of being able to be repurposed to other classification-type tasks without hand tuning. The architecture of end-to-end phoneme recognition system 800 is identical to that of end-to-end speech recognition system 200 as illustrated in FIG. 2 and end-to-end speech classification system 700 as illustrated in FIG. 7 except for the output neural network stack 806. Front-end module 801 may be identical to front-end module 201, CNN stack 802 may be identical to CNN stack 202, first fully-connected layer 803 may be identical to first fully-connected layer 203, and RNN stack 804 may be identical to RNN stack 204. While the identity and order of the components may be the same, the hyperparameters and number and order of hidden nodes of each particular layer or stack may be separately tuned for the phoneme recognition task. The configuration of the implementation will depend on implementation concerns such as efficacy, efficiency, computing platform, and other such factors. Similarly, the trained hidden nodes of any layer or component are learned through the training process and may differ between various implementations. For example, the convolutional kernels used by a phoneme recognition implementation may be very different than those used by a transcription implementation.
  • The architecture and implementation details of end-to-end speech recognition system 200 and end-to-end speech classification system 700 as shown in FIGS. 2-7 and as described in the related sections of the description may also be used for end-to-end phoneme recognition system 800, aside from a change to the output neural network stack 806. In other words, end-to-end speech recognition system 200 may be used for phoneme recognition by simply changing the output layer, removing output network 206 and replacing it with output network 806.
  • One difference between end-to-end speech recognition system 200 and end-to-end phoneme recognition system 800 is the output neural network stack 806. The output neural network stack 806 of end-to-end phoneme recognition system 800 contains phonemes rather than words in a vocabulary. In an embodiment, one output node may be provided in the output layer 806 per phoneme, where the value of each output node is the probability that the audio input corresponds to the associated phoneme. In one embodiment, 40 phonemes may be provided via a total of 40 nodes in the output layer 806. In an embodiment, other numbers of phonemes may be provided such as 26, 36, 42, or 44. While not illustrated, there may be a customization layer that modifies the output of output neural network stack 806 similar to customization layer 207 discussed in connection with FIG. 2. A customization layer may alter predicted outputs based on some external guidance, similar to customization layer 207.
  • The phoneme recognition system 800 may be used to perform text alignment. An audio file and a corresponding text transcript are provided, and it is desired to match the corresponding audio features to the appropriate text. Initially, the audio file may be processed through phoneme recognition system 800 to produce a predicted sequence of audio phonemes. The text file may also be processed to translate the textual words to text phonemes. The text file may be converted to phonemes by iterating over the text and using known mappings of words to the corresponding phonemes. Alternatively, mappings from syllables to phonemes or from sequences of characters to phonemes may be used and may be applied to the text iteratively.
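The word-to-phoneme translation described above can be sketched as a dictionary lookup. The toy `PRONUNCIATIONS` lexicon and the function name below are illustrative assumptions, not part of the disclosure; a full system would also apply the syllable- or character-sequence fallbacks mentioned above.

```python
# Toy pronunciation lexicon: illustrative only, not a real phoneme inventory.
PRONUNCIATIONS = {
    "the": ["DH", "AH"],
    "sixth": ["S", "IH", "K", "S", "TH"],
    "sick": ["S", "IH", "K"],
}

def text_to_phonemes(transcript, lexicon=PRONUNCIATIONS):
    """Translate a text transcript to text phonemes by iterating over the
    words and applying a known word-to-phoneme mapping."""
    phonemes = []
    for word in transcript.lower().split():
        if word in lexicon:
            phonemes.extend(lexicon[word])
        # A full system would fall back to syllable- or character-sequence
        # mappings for words missing from the lexicon, as described above.
    return phonemes
```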
  • FIG. 9A illustrates an iterative beam search that is used in some embodiments. In the first iteration of the iterative beam search, the mapping of the audio phonemes and text phonemes may be set in a few possible ways. First, the text phonemes could be assumed to be evenly spaced in time and mapped to the audio phoneme at the corresponding time stamp of the audio file. Second, an estimated distribution of text phonemes over time may be determined based on the rate of speech in the audio file and regions of dead silence or high density talking. An estimated time stamp for each text phoneme may be derived based on this distribution, and each text phoneme may then be mapped to the audio phoneme at the corresponding time stamp of the audio file. Third, the audio phonemes and text phonemes could be matched one-to-one starting from the beginning of the audio phonemes and beginning of the text phonemes until the number of phonemes is exhausted. The first iteration of the iterative beam search is represented by the starting node of the search at layer 901.
  • At each iteration of the iterative beam search, the mappings or alignments of audio phonemes to text phonemes from the prior iteration are used as a starting point and then changed to create multiple new mappings or alignments, which are known as candidates. The candidates are scored and the n best-scoring candidates are selected for expansion at the next level of the iterative beam search, where n is the branching factor of the iterative beam search. By expanding only n candidates at each level, the algorithm avoids having to expand an exponential number of candidate nodes, which could be the case if a traditional breadth-first search is used.
  • Layer 902, for example, is the next layer following starting layer 901 of the iterative beam search. Each of the nodes at layer 902 is generated by adjusting the alignment provided at the starting node in layer 901. The best n nodes in layer 902 are selected according to a heuristic scoring function as shown by nodes highlighted by the rectangles in FIG. 9A. Candidates at layer 903 are created by using the selected best n nodes at layer 902 as a starting point and adjusting the alignments provided at those nodes. Nodes at layer 902 that were not selected for the set of best n are not expanded and not used as the starting point for adjustments. Therefore, iterative beam search is not guaranteed to find the optimal solution because it prunes parts of the tree during the search. However, the iterative beam search performs well in practice and is computationally efficient.
  • At layer 903, the candidates are again scored and the n best scoring are again expanded for the next level. The process may continue until a stopping condition is reached. In an embodiment, the process stops when the number of matching phonemes between the audio phonemes and text phonemes does not change at the next iteration.
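The expand-score-prune loop described above can be sketched generically. In this sketch, `expand` and `score` stand in for the candidate-generation and heuristic scoring steps; all names are illustrative assumptions, not the disclosure's implementation.

```python
def iterative_beam_search(initial, expand, score, n, max_iters=50):
    """Generic iterative beam search keeping the n best candidates per level.

    expand(parent) yields adjusted candidate alignments; score(child, parent)
    is the heuristic scoring function (higher is better).
    """
    frontier = [(score(initial, initial), initial)]
    best_score, best = frontier[0]
    for _ in range(max_iters):
        candidates = []
        for _, parent in frontier:
            for child in expand(parent):
                candidates.append((score(child, parent), child))
        if not candidates:
            break
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        frontier = candidates[:n]  # prune: only the n best nodes are expanded
        if frontier[0][0] <= best_score:
            break                  # stopping condition: score no longer improves
        best_score, best = frontier[0]
    return best
```

With a branching factor of n, each level expands at most n nodes rather than the exponentially growing set a breadth-first search would visit.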
  • A novel feature of the iterative beam search is the use of the parent alignment from the prior iteration as a hint to the nodes at the next level. The hint increases the score of candidates that are closer to the alignment of the prior mapping and decreases the score of candidates that are farther from the alignment of the prior mapping. In an embodiment, the hint is implemented by increasing the value of the scoring function when a candidate alignment changes little from its parent alignment but decreasing the value of the scoring function when a candidate alignment changes a lot from its parent alignment.
  • In an embodiment, the scoring function for evaluating candidate alignments produces a score based on the number of matching phonemes, that is, the number of audio phonemes and text phonemes that are mapped to each other and are the same phoneme; the number of missed phonemes, meaning the number of audio phonemes or text phonemes that are not mapped to any phoneme in the other set; and the distance from the hint, where the hint is the alignment at the parent iteration of the beam search. In an embodiment, the distance from the hint is evaluated by iterating over the audio phonemes or text phonemes and producing a score for each of the phonemes. The score is higher when the audio phoneme or text phoneme has stayed in the same position or changed position only a little and lower when the audio phoneme or text phoneme has moved to a significantly farther position, where the distance may be measured by, for example, time or number of phoneme positions moved. The per-phoneme scores are then combined, such as by summation, to produce a score for the distance from the hint. The hint may act as a weight keeping the children alignments closer to the parent alignment.
  • As illustrated in FIG. 9B, in an embodiment, the distance score for phonemes may be implemented with a radial basis function (RBF). In an embodiment, the RBF accepts as input the distance between the phoneme at its parent location and its current location in the new candidate alignment. When the distance is zero, the RBF is at its peak value. The RBF is symmetric around the origin, and the value may drop steeply for input values farther from the origin. In an embodiment, the parameters of the RBF may be adjusted between iterations of the iterative beam search to make the curve steeper at later iterations of the beam search. As a result, the penalty in the scoring function for the phoneme's current location not matching its location in the parent alignment increases in later iterations. The effect is to allow the iterative beam search to make relatively large adjustments to the alignment in initial iterations but to reduce the amount of change in the alignments in later iterations. FIG. 9B illustrates two RBFs, a broader RBF on the left that may be used in earlier iterations of the iterative beam search and a steeper RBF on the right that may be used in later iterations of the iterative beam search. The illustrated RBFs are exemplary and other RBFs and non-RBF functions may be used for scoring distance between a phoneme's prior alignment and the current alignment.
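A minimal sketch of such a distance score, assuming a Gaussian RBF (the disclosure does not fix a particular RBF); a smaller `width` parameter yields the steeper curve used in later iterations.

```python
import math

def rbf_distance_score(distance, width):
    """Gaussian RBF: peaks at distance 0, symmetric around the origin,
    and falls off more steeply for smaller values of `width`."""
    return math.exp(-(distance / width) ** 2)

# The same 2-position displacement is penalized lightly by a broad RBF
# (early iterations) and heavily by a steep RBF (later iterations).
early_score = rbf_distance_score(2.0, width=4.0)
late_score = rbf_distance_score(2.0, width=1.0)
```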
  • FIG. 9C illustrates an embodiment of the text alignment algorithm using iterative beam search using a well-known tongue twister. In the initial iteration, a mapping between audio phonemes and text phonemes is created. The initial mapping is close but not exactly correct. In the subsequent iteration, the alignments of the phonemes are adjusted from the initial mapping and the new candidate alignments are rescored. A candidate alignment 1A is created, which matches the phonemes for "the" and "sixth" but misses several other phonemes and has unmatched phonemes for "sixth," "sheep's", and "sick." Moreover, the candidate alignment 1A moves the phonemes two words to the right from the parent alignment, which is lower scoring than if the phonemes were moved a smaller distance. In an embodiment, candidate alignment 1B has a higher score, according to the heuristic scoring function, than candidate alignment 1A. It matches a higher number of phonemes and has no missing phonemes. Moreover, the phonemes were moved a smaller distance from the location of the phonemes in the parent alignment (only moved one word to the left). The example shown in FIG. 9C is illustrative only and other embodiments may operate in a different manner and use different scoring functions.
  • Iterative beam search may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • Turning to the method of training the neural networks, in some embodiments, all layers and stacks of an end-to-end speech recognition system 200, end-to-end speech classification system 700, or end-to-end phoneme recognition system 800 are jointly trained as a single neural network. For example, end-to-end speech recognition system 200, end-to-end speech classification system 700, or end-to-end phoneme recognition system 800 may be trained as a whole, based on training data that contains audio and an associated ground-truth output, such as a transcription. In some embodiments, training may use stochastic gradient descent with initial weights randomly initialized. In an embodiment, training may use back propagation to adjust the weights of the neural network nodes in the neural network layers by using the partial derivative of a loss function. In one embodiment, the loss function may be represented by
  • $$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\big(h_\theta(x^{(i)})\big)_k + \big(1 - y_k^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)_k\right]$$
  • The value of the loss function depends on the training examples used and the difference between the output of the system 200, system 700, or system 800 and the known ground-truth value for each training example. An optional regularization expression may be added to the loss function in which case the value of the loss function may also depend on the magnitude of the weights of the neural network. Backpropagation may be used to compute the partial derivative of the loss function with respect to each weight of each node of each layer of the neural network, starting from the final layer and iteratively processing the layers from back to front. Each of the weights may then be updated according to the computed partial derivative by using, for example, gradient descent. For example, a percentage of the weight's partial derivative, or gradient, may be subtracted from the weight, where the percentage is determined by a configurable learning rate.
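Assuming the network outputs and ground-truth labels are held in NumPy arrays, the cross-entropy loss above can be computed directly; this sketch omits the optional regularization expression, and the names are illustrative.

```python
import numpy as np

def cross_entropy_loss(h, y):
    """Loss J(theta) over m training examples and K output nodes.

    h: (m, K) array of network outputs h_theta(x^(i)), each in (0, 1).
    y: (m, K) array of ground-truth labels y_k^(i) in {0, 1}.
    """
    m = h.shape[0]
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)) / m
```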
  • In an embodiment, training is performed on a batch of utterances at a time. In some embodiments, the utterances in a training batch must be of the same length. Having samples of the same length may simplify tensor operations performed in the forward propagation and backward propagation stages, which may be implemented in part through matrix multiplications with matrices of fixed dimension. For the matrix operations to be performed, it may be necessary that each of the training samples have the same length. The batch of training samples may be created by splitting an audio file into utterances, such as 7-10 second long portions which may correspond to a word, phrase, or series of words and/or phrases. In an audio file, naturally some utterances may be longer or shorter than others. In an embodiment where training samples must be the same length, techniques may be used to adjust the length of some of the samples.
  • In the past, the length of training samples has been adjusted by padding shorter samples with zeros or other special characters indicating no data. While this allows creating training samples of the same size, the zeros or special characters may lead to artifacts in the model and cause slower training.
  • FIG. 10 illustrates an example of looping each of the shorter training samples in a training batch so that the shorter training samples are repeated until they are the same length as the longest training sample. A set of training samples is created by splitting an audio file. The training samples are processed by front-end module 201 to create a sequence of frames comprising each training sample, where the frames may be of any of the types described above such as log-mel filterbanks, MFCC, perceptual linear prediction coefficients, or spectrograms. Each of the training samples may be stored as a row of tensor 1000 to create a training batch. The length of the tensor 1000 in the time dimension is determined by the length of the longest sample 1002 in terms of the number of frames. Longest sample 1002 is not changed. Each of the shorter samples 1001, 1003, 1004, 1005, 1006 in the batch are repeated until they are the same length as the longest sample 1002, so that every row of the tensor has the same length. The shorter samples are repeated exactly in all of their elements starting from the first element through the last element of the sample. When the length of a sample does not divide evenly into the length of the tensor, the last repetition of the sample may only be a partial repetition until the desired length is reached. The partial repetition is a repetition of the shorter sample starting from the first element and iteratively repeating through subsequent elements of the sample until the desired length is reached. In an embodiment, shorter sample 1001 is repeated k times where
  • $k = \lfloor N / M \rfloor$
  • where N is the length of the longest sample and M is the length of shorter sample 1001, and the last repetition of shorter sample 1001 is of length Z = N mod M. Although only two dimensions of the tensor 1000 are illustrated, the tensor 1000 may have many more dimensions. For example, each row may be a multi-dimensional tensor, such as when the frames in the rows are multi-dimensional tensors.
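The looping scheme can be sketched for one-dimensional samples as follows; multi-dimensional frames would be tiled along the time axis only, and the function name is illustrative.

```python
import numpy as np

def loop_pad_batch(samples):
    """Repeat each shorter sample until it matches the longest sample's length,
    with a partial final repetition when lengths do not divide evenly."""
    n = max(len(s) for s in samples)       # N: frames in the longest sample
    rows = []
    for s in samples:
        reps = -(-n // len(s))             # ceil(N / M) full-or-partial copies
        rows.append(np.tile(s, reps)[:n])  # truncate the last, partial copy
    return np.stack(rows)
```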
  • In an embodiment, the training samples of a training batch are stored as rows in a single tensor. In other embodiments, the training samples are not stored in a single tensor. For example, the training samples may be stored as a list or set and input into the neural network one by one. In an embodiment, the CNN layer (such as CNN layer 202, CNN layer 702, or CNN layer 802) is of a fixed size. In an embodiment, the CNN layer accepts input tensor representations up to a fixed length, and the longest sample in a training batch is selected to be less than the fixed length of the CNN layer.
  • In an embodiment, during training, a ground-truth output value may be provided in tensor 1000 attached to each of the frames of the training samples in tensor 1000. In this embodiment, the ground-truth output values may also be repeated for the shorter samples, when the frames of the shorter samples are repeated in tensor 1000. In an embodiment, a second tensor, separate from tensor 1000, is provided with the ground-truth output values, instead of storing the ground-truth values in tensor 1000. The ground-truth output values in the second tensor may be repeated for shorter samples just as with tensor 1000. However, in other embodiments, the ground-truth output values in the second tensor are not repeated, even though the corresponding training samples in tensor 1000 are repeated.
  • Padding the shorter training samples by repetition has several advantages over padding with zeros or special characters indicating no data. When zeros or other meaningless data is used, no information is encoded and computation time is wasted in processing that data, leading to slower learning or model convergence. By repeating the input sequence, the neural network can learn from all elements of the input, and there is no meaningless or throw-away padding present. The result is faster convergence and learning, better computational utilization, and better behaved and regularized models.
  • Although looping of shorter samples in a batch was described above with reference to training, the repetition of shorter samples to be the same length as a longest sequence may also be performed during inference. In some embodiments, inference is performed on a tensor similar to tensor 1000 with multiple samples obtained by splitting an audio file. Each sample may be stored in a row of the tensor. The same process described above for training may be applied during inference. A longest sample may be unchanged, and each of the shorter samples may be repeated until they are the same length as the longest sample so that every row of the tensor is the same length. The tensor, with the repetitions of shorter samples, may be input to the neural network for inference.
  • The technique of looping shorter training samples in a training batch may be used in a variety of machine learning applications and is not limited to use for the application of speech recognition.
  • FIGS. 11A-B illustrate an example attention mechanism for a neural network, called "Neural Network Memory," that may be used in end-to-end speech recognition system 200, end-to-end speech classification system 700, end-to-end phoneme recognition system 800, or other neural networks. One problem with neural networks and other machine learning techniques is that the size of the machine learning model constrains the amount of knowledge that can be learned. It is one version of the mathematical pigeonhole principle, which states that if n items are put into m containers, with n>m, then one container must contain more than one item. In the same way, a machine learning model that is trying to learn a complex decision boundary on a large amount of data cannot, in general, learn the complex decision boundary exactly if the machine learning model is significantly smaller in size than the amount of data being trained on. As the complexity of the decision boundary exceeds the size of what can be easily expressed in the size of the model, various components of the neural network, such as weights and hidden nodes, become overloaded and must try to learn more than one function, causing the learning rate of the neural network to slow down significantly over time as more training examples are seen. In some cases, the quality of the machine learning model that is learned by the neural network may plateau or even become worse.
  • Neural Network Memory addresses this problem by creating an expert knowledge store, which is a data store in memory, that stores expert neural network layer portions that may be inserted into the neural network at the right time. In an embodiment, the expert knowledge store is a database. The expert neural network layer portions may be a portion of a neural network layer or an entire neural network layer. The expert neural network layer portions may learn specialized functions that apply in specific conditions and be swapped in and out of the neural network automatically when those conditions are detected.
  • Example neural network 1100 is a fully-connected neural network with multiple layers of hidden states. Neural network layer portion 1110 is a selector and neural network layer portion 1120 is a gap with no hidden nodes and that is filled by swapping expert neural network layer portions in and out. After an audio file is input to the neural network system, whether for training or inference, forward propagation occurs as normal. When the gap 1120 is reached, forward propagation cannot continue until an expert layer is inserted. In order to select the expert layer, forward propagation occurs through selector neural network layer portion 1110 as normal. The activation outputs of the nodes of the selector neural network layer portion 1110 are used as a query to find the expert neural network layer to insert into gap 1120. Expert knowledge store 1130 stores selectors 1115 that each serve as an index for one expert neural network layer portion 1125 that corresponds to the selector. Each expert neural network layer may comprise the weights for the inbound edges to the nodes of the expert neural network layer and the activation function of the nodes.
  • In an embodiment, the activation outputs of the nodes of the selector neural network layer portion 1110 are stored in a tensor. The activation outputs are output from the activation function of each node. Each element of the tensor may correspond to one node output. In selector neural network layer portion 1110 there are three nodes, which means that there are three output values stored in the tensor. The tensor of activation outputs is compared with all of the selectors 1115 in the expert knowledge store 1130. In an embodiment, the comparison is performed by using a distance metric. In an embodiment, the distance metric is the cosine similarity between the tensor of activation outputs and a selector 1115. In an embodiment, the distance metric is the dot product between the tensor of activation outputs and a selector 1115. The closest selector 1115 according to the distance metric is chosen as the correct row of the expert knowledge store. The expert neural network layer associated with the closest selector 1115 is then inserted into the neural network 1100 in the gap 1120. After insertion of the expert neural network layer into the gap 1120, forward propagation continues through the neural network 1100 just as if the expert neural network layer were a permanent layer of the neural network 1100. If the neural network 1100 is performing inference, then after neural network 1100 produces its output, the expert neural network layer may be deleted from portion 1120 so that portion 1120 is once again empty and ready to be filled in at the next iteration. If the neural network 1100 is performing training, then training of the expert neural network layer and the selector may be performed. In an embodiment, after forward propagation is completed, the output of the neural network may be compared with the ground-truth output associated with the input. 
Backpropagation is performed based on the difference between those two values, the ground-truth output and the actual output of the neural network. The backpropagation is performed through the expert neural network layer inserted into gap 1120 just as if the expert neural network layer was a permanent part of neural network 1100 and adjusts the weights of each of the nodes of the expert neural network layer through training. After backpropagation, the updated expert neural network layer is stored back in the expert knowledge store, overwriting the prior version. The backpropagation trains the expert neural network layer to become more accurate, for those conditions where it is inserted in the network, and allows it to become specialized for particular use cases. In addition, the selector associated with the expert neural network layer is trained to become more similar to the tensor of activation outputs from selector neural network layer portion 1110. This process allows the selectors to become specialized to the correct conditions. In an embodiment, the selector is adjusted pointwise to become more similar to the values of the tensor of activation outputs from selector neural network layer portion 1110, such as by reducing the distance between the selector and tensor in vector space. A selector learning rate may be set to control the rate at which selectors are adjusted and may be a scalar value. In an embodiment, the values of the selector are changed by a percentage of the distance between the selector and the tensor of activation outputs multiplied by the selector learning rate. In an embodiment, the values of the selector are changed by a fixed value in the direction of the tensor of activation outputs multiplied by the selector learning rate.
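The selection and selector-update steps might be sketched as follows, assuming cosine similarity as the distance metric (one of the metrics named above); all names are illustrative.

```python
import numpy as np

def select_expert(query, selectors, experts):
    """Choose the expert layer whose selector is most similar to the tensor
    of activation outputs (`query`), using cosine similarity."""
    q = query / np.linalg.norm(query)
    similarities = [float(np.dot(q, s / np.linalg.norm(s))) for s in selectors]
    return experts[int(np.argmax(similarities))]

def update_selector(selector, query, selector_learning_rate):
    """Move the chosen selector toward the query by a fraction of the distance,
    so it specializes to the conditions under which it was selected."""
    return selector + selector_learning_rate * (query - selector)
```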
  • In neural network 1100, the selector neural network layer portion 1110 and gap 1120 for inserting the expert neural network layer are two halves of the same neural network layer. In other embodiments, the relative location of these portions may be different. They can be of different sizes and do not need to be exactly half of a neural network layer. Moreover, the selector neural network layer portion 1110 and the gap 1120 are not required to be in the same layer.
  • In an embodiment, Neural Network Memory may be used in neural network 1150 where the selector neural network layer 1160 is a full neural network layer and a gap 1170 for insertion for an expert neural network layer is a full neural network layer. The process described with respect to neural network 1100 is the same, except that the expert knowledge store 1180 stores selectors corresponding to activation outputs for an entire layer and the expert neural network layer portions are entire neural network layers. In neural network 1150, the selector neural network layer 1160 directly precedes the portion 1170 for inserting the expert neural network layer. In other embodiments, the selector neural network layer 1160 and the gap 1170 for inserting the expert neural network layer may be in different relative locations.
  • In one embodiment, Neural Network Memory is used in the first fully-connected layer 203, 703, 803. In an embodiment, Neural Network Memory is used in the second fully-connected layer 205, 705, 805. Although Neural Network Memory has been illustrated in fully-connected neural networks 1100, 1150 it may be used in any other form of neural network, such as CNN layers 202, 702, 802 or RNN layers 204, 704, 804. Moreover, multiple selector neural network layers and gaps for inserting expert neural network layers may exist in the same neural network.
  • In an embodiment, the size of expert knowledge store 1130, 1180 increases over time as more training examples are seen by the neural network. As more training is performed, more expert neural network layers are expected to be needed to address the pigeonhole principle. In an embodiment, a counter stores the number of training examples that have been run through the neural network. The counter is incremented with each new training example. A threshold, which may be a threshold value or threshold function, defines the points at which the expert knowledge store increases in size. When the counter of training examples exceeds the threshold, one or more new rows are added to the expert knowledge store. Each row includes a selector and an associated expert neural network layer. New selectors and expert neural network layers may be initialized to random values, may be initialized as an average of the rows above it, or may be initialized with values from existing neural network layer portions of the neural network. In an embodiment, the growth rate at which new rows are added to the expert knowledge store 1130, 1180 decreases over time. The growth rate is, for example, the rate at which new expert neural network layers are added to the store. As more training examples are seen, the rate at which new information is learned is expected to decrease because more and more of the variations in the training data will have already been seen. In an embodiment, the growth rate at which rows are added to the expert knowledge store 1130, 1180 is inversely proportional to the total number of training examples ever processed by the neural network.
  • Neural Network Memory may be used in a variety of machine learning applications and is not limited to use for the application of speech recognition.
  • FIG. 12 illustrates an example of a general domain 1210 and a custom domain 1220. Neural networks, such as end-to-end speech recognition system 200, end-to-end speech classification system 700, and end-to-end phoneme recognition system 800, may be trained on a general dataset, which trains them to perform in a general domain 1210 for multiple possible applications or situations. In an embodiment, the general domain 1210 is the domain learned by learning across a set of training examples that come from a plurality of different datasets. The different datasets may be aggregated into a general training set. Advantages of training a neural network for a general domain 1210 include the ability to use more training data and also building a model that may work well in multiple situations. However, it may also be desirable to train a neural network, such as end-to-end speech recognition system 200, end-to-end speech classification system 700, and end-to-end phoneme recognition system 800, specifically for a custom domain 1220. A custom domain 1220 may differ from the general domain 1210 in numerous aspects, such as frequencies of words, classifications, and phonemes, audio features (such as background noise, accents, and so on), pronunciations, new words that are present in the custom domain 1220 but unseen in the general domain 1210, and other aspects. The statistical distribution of audio examples in general domain 1210 may differ from the distribution in custom domain 1220. It may be desirable to customize the neural network for the custom domain 1220, which can potentially improve performance significantly in the custom domain 1220. In some embodiments, the custom domain 1220 may include a set of training examples from the custom domain 1220. 
However, in some embodiments, a training set may not be available for custom domain 1220 and only some information about the distribution in custom domain 1220 may be known, such as a list of frequent words and their frequencies. The neural network trained on the general training set may be referred to as the general model and the neural network customized for the custom domain may be referred to as the custom model.
  • An example of a custom domain 1220 for speech recognition is performing speech recognition on the phone calls of a particular company. Some words in the custom domain 1220 are likely to have a higher frequency in the domain of phone calls for the company than for general speech recordings. It is likely that the name of the company and names of employees will occur with higher frequency in the custom domain 1220 than in general. Moreover, some words in the custom domain may not exist in a general training set, such as the names of the company's products or brands.
  • In the past, customization for custom domain 1220 has been performed by first training a neural network with a general training set to build a general model and then training the neural network on a set of training examples from the custom domain 1220 to customize it. Significant downsides of this approach are that there may not be sufficient data from the custom domain 1220 to customize the neural network by training and that the process of re-training may be slow. Techniques herein address this problem and allow more effective customization of a neural network for a custom domain 1220, more quickly and even when only limited custom training data is available.
  • FIG. 13 illustrates an example supervised learning approach for predicting the weights of neural network nodes to improve performance in a custom domain. The predicted weights may be used to replace weights in a neural network that has been trained on a general training set in order to customize the neural network for a custom domain. A machine learning model, separate from the neural network, is trained to predict weights of nodes in the neural network based on phonemes and the frequency of a word. The approach may be used for words that are unseen in the general domain or for words that are seen in the general domain but are more frequent in the custom domain.
  • In an embodiment, a neural network layer is selected for which new weights will be predicted. In one embodiment, the output layer, such as output layers 206, 706, 806, is selected. The predicted weights will be the weights of the node, which are the weights applied to input values to the node prior to application of the activation function. A weights predictor 1320, which is a machine learning model, is provided. The weights predictor 1320 is trained to predict neural network node weights for a particular word in the vocabulary based on the phonetic representation of the word and its frequency in the general domain. In an embodiment, the weights predictor 1320 is trained by iterating over all of the words of the vocabulary and inputting tensor 1310 comprising the concatenation of a one-hot encoding 1302 of the phonetic representation of the word and the frequency 1304 of the word in the general training set, which may be normalized such as by log normalization, into predictor 1320. The one-hot encoding has zeroes in all positions except for one location having a one representing the phonetic representation of the word. The resulting sparse input vector has two non-zero values, the one-hot encoded location representing the phonetic representation and a value representing the frequency of the word in the general domain. Based on the input vector 1310, the weights predictor 1320 generates output vector 1330 representing the weights for this word in the selected neural network layer. In one embodiment, the predicted weights are the weights for the output node for the word.
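Constructing the sparse input vector 1310 might look like the following sketch, assuming log normalization of the frequency via log(1 + f); the function and parameter names are illustrative.

```python
import numpy as np

def build_predictor_input(phonetic_index, num_phonetic_codes, frequency):
    """Concatenate a one-hot encoding of the word's phonetic representation
    with its log-normalized frequency into one input vector."""
    one_hot = np.zeros(num_phonetic_codes)
    one_hot[phonetic_index] = 1.0
    return np.concatenate([one_hot, [np.log1p(frequency)]])
```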
  • In one embodiment, the weights predictor 1320 is a linear regression model. When using linear regression, the predictor 1320 may be trained using a least squares fit. The target values for training examples are the neural network node weights in the general model. Generated values of the predictor 1320 may be compared to the true neural network node weights in the general model and the differences reduced using the least squares method. In one embodiment, the weights predictor 1320 is a neural network, which may have one or more layers. The weights predictor 1320 may be trained using backpropagation. Generated values of the predictor 1320 may be compared to the true neural network node weights in the general model and the weights of the predictor 1320 may be adjusted by backpropagation and gradient descent. The weights predictor 1320 may be other regression models such as polynomial regression, logistic regression, nonlinear regression, and so on.
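For the linear-regression variant, a least-squares fit might look like the toy sketch below. The dimensions and data are invented for illustration; in practice the inputs would be the phonetic/frequency tensors and the targets would be the general model's node weights.

```python
import numpy as np

# Toy stand-ins: one input row per vocabulary word (one-hot phonetics plus
# frequency), target rows taken from the trained general model's node weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))      # predictor inputs, one per word
W_true = rng.normal(size=(8, 4))  # underlying linear map (unknown in practice)
Y = X @ W_true                    # "true" node weights in the general model

# Least-squares fit: minimize the squared differences between generated
# values and the true node weights.
W_fit, *_ = np.linalg.lstsq(X, Y, rcond=None)
predicted_weights = X @ W_fit     # predicted node weights for each word
```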
  • In an embodiment, a training set is provided for a custom domain. The training set comprises audio files and corresponding text transcripts. Frequent words in the custom dataset that are unseen or have low frequencies in the general training set are identified. In other embodiments, no training set of custom data is provided, but a list of frequent words and their frequencies is provided for the custom domain. For each of the frequent words that are unseen or have low frequencies in the general model, a set of weights is predicted. A one-hot encoding is created for the phonetic representation of the word, and the frequency of the word in the custom domain, optionally normalized such as by log normalization, is concatenated to the one-hot encoding. The resulting vector is input into the weights predictor 1320. The output vector provides the predicted weights. The predicted weights are used to replace the weights of the corresponding layer of the neural network in order to customize the neural network for the custom domain. If a word was unseen in the general training set, then a new node is added to the output layer and the weights of the node are initialized to be the predicted weights. In some embodiments, customized weights are predicted for all words in the vocabulary and not just words that occur with high frequency. Optionally, the neural network may be further trained on training examples that come from the custom domain.
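One way the predicted weights might be applied to the output layer is sketched below. The matrix layout (one row of node weights per vocabulary word) is an assumption for illustration, not specified in the text.

```python
import numpy as np

def apply_predicted_weights(output_weights, word_index, predicted):
    """Customize the output layer: replace the weights of an existing output
    node with the predicted weights, or, when the word was unseen in the
    general training set, append a new output node initialized to them.

    `output_weights` is assumed to be a (num_words, fan_in) matrix."""
    if word_index < output_weights.shape[0]:
        output_weights[word_index] = predicted      # seen word: replace weights
        return output_weights
    return np.vstack([output_weights, predicted])   # unseen word: add a new node

layer = np.zeros((3, 4))
layer = apply_predicted_weights(layer, 1, np.ones(4))       # replace node 1
layer = apply_predicted_weights(layer, 3, np.full(4, 2.0))  # add node for new word
```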
  • In a variation, the input tensor 1310 to weights predictor 1320 also includes bigram information. The bigram information characterizes information about words frequently occurring immediately adjacent to the left or right of the word. In an embodiment, the bigram information may be a vector with one entry per word of the vocabulary and the value at each location represents the probability that the word appears adjacent to the current word. The bigram vector may be concatenated to input tensor 1310. In this variation, the weights predictor 1320 may be trained by computing the bigram information in the general training set for each word of the vocabulary, concatenating that to the input tensors 1310 for each word, and training on all of the words of the vocabulary as described above. During inference, bigram information may be collected based on the rate of co-occurrence as adjacent words in the custom domain, which may either be provided or be computed from a custom training set. The bigram information is attached to the input tensor 1310 during inference. The predicted output weights are used in the same way as described above.
  • The technique of predicting neural network node weights, as described herein, may be used in a variety of machine learning applications and is not limited to use for the application of speech recognition.
  • FIG. 14 illustrates an example unsupervised learning approach for customizing a neural network for a custom domain by using a customization layer, such as customization layer 207. As described above, some words may occur with higher frequency or lower frequency in a custom domain than in the general domain. Customization layer 207 may change the probability that words are produced according to these frequencies. For example, the concept of prior probability, also called a prior, refers to the probability of an occurrence before any observations are made. Statistically, the prior probability should be taken into account in the probabilities of words generated by the neural network.
  • In an embodiment, frequent words in the custom dataset that are unseen or have low frequencies in the general training set are identified. In other embodiments, no training set of custom data is provided, but a list of frequent words and their frequencies is provided for the custom domain. For each of the frequent words that are unseen or have low frequencies in the general model, customization is performed as described below. In other embodiments, customization is performed for all words in the vocabulary regardless of whether they are frequently occurring or not.
  • In example neural network 1400, an output layer 1410 is provided that outputs the probability that the input corresponds to the associated word represented by the output node. In step 1420, corresponding to customization layer 207, the probabilities are adjusted by dividing by the frequency of the word in the general training set and multiplying by the frequency of the word in the custom training set. The resulting values are used as the new word probabilities, and the word with the highest probability after customization is selected as the output of the neural network. The effect of the customization is, roughly, to remove the prior for the word from the general domain and replace it with the prior for the word from the custom domain.
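The prior-swapping step can be written directly; the frequencies below are invented for illustration.

```python
import numpy as np

def customize_probabilities(word_probs, general_freq, custom_freq):
    """Customization layer: divide each output probability by the word's
    frequency in the general training set (removing the general-domain prior)
    and multiply by its frequency in the custom training set (installing the
    custom-domain prior)."""
    return word_probs / general_freq * custom_freq

word_probs = np.array([0.5, 0.3, 0.2])     # output-layer probabilities
general_freq = np.array([0.4, 0.4, 0.2])   # word frequencies, general domain
custom_freq = np.array([0.1, 0.6, 0.3])    # word frequencies, custom domain
adjusted = customize_probabilities(word_probs, general_freq, custom_freq)
selected = int(np.argmax(adjusted))        # word chosen after customization
```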
  • In an embodiment, the frequency of words in the general training set may be tracked and stored as general training is performed. Words that were unseen in the general training set may be given a small non-zero frequency value to allow the division step to be performed. In some embodiments, the frequency of the words in the custom domain may be provided. In other embodiments, the frequency of words in the custom dataset may be generated by running a custom training set through the general model to obtain a transcription of the custom training set. The frequency of the word may then be determined by parsing the transcription.
  • In a variation, customization is performed on a per-bigram basis instead of a per-word basis. Bigrams may be formed by combining the current word with the preceding word or succeeding word. The frequency of word bigrams in the general training set is tracked, and the frequency of word bigrams in the custom training set is also determined, using the methods described above. Word probabilities are computed as normal in output layer 1410. In a customization step, the correct bigram is determined based on the combination of the current word with the preceding word or succeeding word as appropriate. The word probability is then divided by the bigram frequency in the general training set and multiplied by the bigram frequency in the custom training set.
  • The technique of customizing a neural network by using a customization layer, as described herein, may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • FIG. 15 illustrates an example of dynamically training on a general training set to customize a neural network, such as end-to-end speech recognition system 200, end-to-end speech classification system 700, and end-to-end phoneme recognition system 800, for a custom domain. General training set 1510 with audio examples from general domain 1210 and custom training set 1520 with audio examples from custom domain 1220 may be provided. The general training set 1510 may have significantly more data and training samples than custom training set 1520. In an embodiment, the general training set 1510 has tens of thousands, hundreds of thousands, or millions of hours of audio data and the custom training set 1520 has a few hours of audio data or less. Re-training a general model, trained on the general training set 1510, with the custom training set 1520 may not be effective because there may not be enough custom training data to customize the model.
  • In an embodiment, the general training set 1510 is a collection of training subsets 1511-1515 collected from various sources. Although five training subsets 1511-1515 are illustrated, many more may be used in practice. The training subsets 1511-1515 may have different characteristics, such as source (e.g., public dataset, proprietary in-house data, data acquired from third parties), types of speakers (e.g., mix of male and female, mix of accents), topics (e.g., news, sports, daily conversation), audio quality (e.g., phone conversations, in-person recordings, speaker phones), and so on. Some training subsets 1511-1515 may be more similar to the examples in custom training set 1520 and others less similar. Each training subset 1511-1515 may have a handle that identifies it.
  • In a first approach, the entire general training set 1510 is used for training the neural network. However, this approach does not customize the neural network for the custom domain 1220 represented by the custom training set 1520. Instead, in an embodiment, some of the custom training data may be set aside as a custom evaluation subset 1522. Only some of the general training subsets 1511-1515 are used for training, and the quality of the results is tested against the custom evaluation subset 1522. The set of general training subsets 1511-1515 used for training may be adjusted to improve performance on the custom evaluation subset 1522. In a second approach, a neural network is trained on general training set 1510 to create a general model and different mixes of general training subsets 1511-1515 are used for further training to customize the neural network. An AB testing approach may be taken with different combinations of general training subsets 1511-1515 tried according to a selection algorithm, which may use randomization, and the quality of the results measured against the custom evaluation subset 1522. The combination of general training subsets 1511-1515 that provides the lowest word error rate (number of words misidentified) on the custom evaluation subset 1522 may be selected as the best combination to use for customization. That combination may be used for additional training of the neural network to customize it for the custom domain 1220. In a third approach, a fully dynamic method is used where the mix of general training subsets 1511-1515 to train on is never finalized because the mix can continue to change over time. The combination of general training subsets is fully dynamic and is chosen in a way that balances exploration and exploitation on an ongoing basis. This third approach is described in more detail below.
  • In an embodiment, a reinforcement learning algorithm is used to dynamically select general training subsets to train on for customization of a neural network. The neural network is initially trained on the general training set 1510 to create a general model. The custom training set 1520 may be divided into three pieces, a custom evaluation subset 1522, a custom validation subset 1524, and a custom training subset 1526. Although the subsets are illustrated as roughly equal in size, they may have varying relative sizes. The reinforcement learning system takes actions, which in this case are selections of a general training subset to train on for a number of training batches, and receives rewards for those actions, which are the word error rate on the custom evaluation subset 1522. A decreased word error rate is a positive reward, and an increased or unchanged word error rate may be a negative reward. The reinforcement learning system may learn a policy for choosing general training subsets to train on in order to improve the word error rate on the custom evaluation subset 1522 and thereby customize the neural network for the custom domain 1220.
  • In an embodiment, the reinforcement learning system has an agent, actions, environment, state, state transition function, reward function, and policy. In an embodiment, the agent is the customization system that chooses the next general training subset to train on. In an embodiment, the actions are the choice of which general training subset to train on for the next iteration. In an embodiment, the environment is an environment that is affected by the agent's actions and comprises the state, state transition function, and reward function. In an embodiment, the state is the current neural network state, whose weights are determined by the prior training iterations. The state may also include tracked information about the distribution of past rewards for each action (e.g., choice of general action subset) including the expected rewards for each action and tracked information about uncertainty associated with each action, such as how many times each action has been taken. In an embodiment, the state transition function is the function that defines the transition to a new state based on the selected action. The state transition function may be implicitly defined by the act of training the neural network with the selected general training subset to obtain new weights for the neural network. In an embodiment, the reward function is a function determining reward values based on the change in word error rate in the custom evaluation subset 1522 after training with the selected general training subset. In some embodiments, the reward function outputs the percent change in word error rate as the reward. In other embodiments, the reward output by the reward function is a transformed value based on the percent change in word error rate. In an embodiment, the policy is a function for selecting the action to take, what general training subset to choose in the next iteration, based on the current state.
  • In an embodiment, the reinforcement learning system trains the custom model iteratively. At each iteration, it selects a general training subset 1511-1515 to train on. The neural network is trained on the selected general training subset for a number of training batches, where the number of training batches may be configurable. After training, the neural network is tested on the custom evaluation subset 1522. The word error rate in the custom evaluation set 1522 is measured and stored. The reinforcement learning system may update its policy based on the word error rate. The reinforcement learning system then selects the general training subset to train on at the next iteration based on, for example, the distribution of past rewards for each general training subset, expected rewards for each general training subset, uncertainty values associated with each general training subset, and/or the number of times each general training subset has already been trained on. In an embodiment, this process continues indefinitely to iteratively improve the neural network's performance in the custom domain 1220. The training policy of the reinforcement learning system may be continuously adjusted based on rewards and need not ever reach a “final” policy.
  • A multi-armed bandit algorithm, referred to as a bandit algorithm, is one example of a reinforcement learning system. The multi-armed bandit algorithm provides a policy of which actions to take, where the actions provide differing rewards and the distribution of rewards for each action is not known. The multi-armed bandit problem, addressed by the bandit algorithm, is deciding which action to take at each iteration to balance exploration, that is learning which actions are the best, with exploitation, that is taking advantage of the best action to maximize the total rewards over time. The multi-armed bandit problem takes its name from a hypothetical problem of choosing which of a set of slot machines to play, where the slot machines pay out at different, unknown rates. In an embodiment, a bandit algorithm may be used where the actions for the bandit algorithm are the choice of which general training subset to train on and the rewards for the bandit training algorithm are the change in word error rate on the custom evaluation set 1522 or a function based on that value. The bandit algorithm iteratively chooses general training subsets to train on according to a policy that balances exploration and exploitation. The bandit algorithm may run indefinitely and continuously and dynamically update its policy on an ongoing basis, never stopping at a “final” policy.
  • In an embodiment, a bandit algorithm is used to iteratively select general training subsets to train on to customize a neural network for a custom domain 1220. In one embodiment, the bandit algorithm has a scoring function, and the bandit algorithm's policy is to select the general training subset that has the highest score according to the scoring function. The value of the scoring function may be based on the distribution of past rewards for each general training subset, expected rewards for each general training subset, uncertainty values associated with each general training subset, and/or the number of times each general training subset has already been trained on. In one embodiment, the value of the scoring function increases with the mean reward observed for the general training subset and decreases with the number of times the general training subset has been chosen. In an embodiment, an uncertainty value is stored for each general training subset and increases over time when the subset is not chosen. The value of the scoring function may increase with increases in the uncertainty value of the general training subset. Use of uncertainty values models the uncertainty produced by the non-stationary rewards of this bandit problem. The distribution of rewards from the general training subsets is not fixed over time because the neural network weights are changing as it is trained and so the effect of each general training subset on the neural network will also change. A bandit problem with non-stationary rewards may be referred to as a non-stationary bandit problem and a bandit algorithm configured for addressing a non-stationary bandit problem may be referred to as a non-stationary bandit algorithm.
  • In an embodiment, at each iteration, the bandit algorithm selects a general training subset to train on by applying the scoring function to each subset and choosing the highest scoring one. The neural network is trained on the selected general training subset for a number of training batches, where the number of training batches may be configurable. After training, the neural network is tested on the custom evaluation subset 1522. The word error rate in the custom evaluation set 1522 is measured and stored. The word error rate corresponds to a reward, with reductions in word error rate corresponding to a positive reward and increases in word error rate corresponding to a negative reward, or penalty. Stored information about the distribution of rewards and mean reward for this general training subset may be updated based on the observed word error rate. A counter of the number of times the general training subset was trained on may be incremented. An uncertainty value associated with the selected general training subset may be decreased, and the uncertainty values associated with all other general training subsets, which were not chosen, may be increased. The next iteration then begins with the bandit algorithm selecting the next general training subset to train on. The process may continue indefinitely to iteratively improve the neural network's performance in the custom domain 1220. No final “best” mix of general training subsets is chosen, rather the bandit algorithm continues to select the general training subsets based on information about the past rewards observed and its measures for uncertainty regarding each subset.
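Putting the iteration together, the loop might be shaped roughly as below. This is a schematic sketch: `train_on` and `word_error_rate` are hypothetical callbacks standing in for neural-network training and evaluation on the custom evaluation subset, and the decay factors for the uncertainty values are invented.

```python
def bandit_customization_loop(subsets, train_on, word_error_rate, iterations):
    """Iteratively pick a general training subset, train on it, score the
    word-error-rate change as a reward, and update per-subset statistics."""
    stats = {s: {"mean_reward": 0.0, "count": 0, "uncertainty": 1.0} for s in subsets}
    prev_wer = word_error_rate()
    for _ in range(iterations):
        # Placeholder policy: favor high mean reward plus high uncertainty.
        chosen = max(subsets, key=lambda s: stats[s]["mean_reward"] + stats[s]["uncertainty"])
        train_on(chosen)
        wer = word_error_rate()
        reward = prev_wer - wer        # a WER reduction is a positive reward
        st = stats[chosen]
        st["count"] += 1
        st["mean_reward"] += (reward - st["mean_reward"]) / st["count"]
        st["uncertainty"] *= 0.5       # the chosen subset becomes less uncertain
        for other in subsets:
            if other != chosen:
                stats[other]["uncertainty"] *= 1.1  # unchosen subsets grow uncertain
        prev_wer = wer
    return stats
```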
  • The bandit algorithm may be the upper confidence bound (UCB) algorithm, the UCB1 algorithm, the epsilon greedy algorithm, or other bandit algorithms. In one embodiment, the scoring function for the bandit algorithm is given by
  • $\mathrm{UCB}_{i,t} := \hat{\mu}_{i,t} + \sqrt{\dfrac{\ln t}{n_{i,t}}}$
  • where i is the index or handle of the general training subset, t is the iteration number, $n_{i,t}$ is the number of times that subset i has been chosen through iteration t, and
  • $\hat{\mu}_{i,t} = \dfrac{\sum_{s=1}^{t} \mathbb{1}[I_s = i]\, r_s}{n_{i,t}}$
  • is the mean reward observed for the general training subset in past iterations. In the aforementioned equation, the $I_s$ term is the choice of general training subset at time s and $r_s$ is the reward received at iteration s. As seen from the equation, the exemplary scoring function has one term that is the expected reward for the general training subset and one term that is inversely related to the number of times that the general training subset has been chosen, and the two terms are combined by addition.
  • $\mathrm{UCB1}_{i,t} := \hat{\mu}_{i,t} + \sqrt{\dfrac{2 \log t}{n_{i,t}}}$
  • In other embodiments, other scoring functions may be used. In an embodiment, the bandit algorithm may initially iterate through the general training subsets and train on each of them once, and then switch to choosing the general training subset through the scoring function.
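A direct transcription of the UCB scoring rule above might look like the following; the subset statistics passed in are illustrative.

```python
import math

def ucb_score(mean_reward, times_chosen, t):
    """UCB score for one general training subset: the mean observed reward plus
    an exploration bonus that shrinks the more often the subset is chosen."""
    return mean_reward + math.sqrt(math.log(t) / times_chosen)

def select_subset(mean_rewards, counts, t):
    """Policy: choose the index of the subset with the highest UCB score."""
    scores = [ucb_score(mu, n, t) for mu, n in zip(mean_rewards, counts)]
    return scores.index(max(scores))

# Two subsets chosen equally often: the one with the higher mean reward wins.
choice = select_subset([0.1, 0.5], [10, 10], t=20)
```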
  • As described above, a reinforcement learning system may be used to select general training subsets to train on to condition a neural network for a custom domain. One reinforcement learning system is implemented with a bandit algorithm. Optionally, a portion of custom training set 1520 may be reserved as a custom training subset 1526 to further condition the neural network. The neural network may be trained on the custom training subset 1526 in the usual manner, by inputting the values, comparing the outputs to ground-truth results, and adjusting the neural network node weights with backpropagation. Moreover, a custom validation subset 1524 may be used for validation to independently test the quality of the custom model after it has been customized using the reinforcement learning system or bandit algorithm and optional custom training subset 1526. Validation may be performed by testing the performance of the neural network on custom validation subset 1524 on word error rate or other measures.
  • The use of reinforcement learning and/or bandit algorithms for selecting general training subsets to train on and customize for a custom domain, as described herein, may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • FIG. 16 illustrates an example training data augmentation and streaming system 1600 according to an embodiment. In some embodiments, it is valuable to augment existing training data by applying one or more augmentations to the data. Augmentations may also be referred to as “effects.” The augmentations expand the dataset to provide more data with more variety and can increase the robustness of the learned model. In traditional systems, augmentations are difficult to perform because the number of different combinations of potential augmentations can be combinatorially large. The training dataset itself may already be large and additionally storing all of the augmented versions of the dataset may not be feasible due to the large amount of memory it would occupy. To address this problem, the training data augmentation and streaming system 1600 provides training data augmentation as a service through an Application Programming Interface (API). The system 1600 provides a service that generates augmented training data just-in-time when it is requested by a training process.
  • In the system 1600, training data store 1610 stores training data in the form of audio files or other data. In an embodiment, the training data store 1610 comprises one or more Redundant Array of Independent Disks (RAID) arrays, which provide fault tolerance. Meta-data store 1620 stores meta-data about the training data sets. It may store information about the name and source of the training data sets and associate names to handles and locations in the training data store 1610. Computer servers 1640, 1650 perform the processing necessary to train a machine learning model, such as the neural networks discussed herein. Training processes 1644, 1646, 1648, 1654, 1656, 1658 perform training of a neural network such as by accepting training data, performing forward propagation through a neural network, and performing backpropagation based on the results. The training processes may be training the same single neural network in parallel or may be training different neural networks. Training manager 1643 manages the training processes on server 1640, and training manager 1653 manages the training processes on server 1650. Training data augmentation system 1642 provides training data augmentation service to the training processes 1644, 1646, and 1648. In an embodiment, the training processes 1644, 1646, and 1648 communicate with the training data augmentation system 1642 through an API. In an embodiment, the API is implemented with UNIX sockets. Training data augmentation system 1652 provides training data augmentation service to the training processes 1654, 1656, and 1658. In an embodiment, the connection between servers 1640, 1650 and the training data store 1610 and meta-data store 1620 is implemented over the Network File System (NFS).
  • An embodiment will be described with respect to training data augmentation system 1642; training data augmentation system 1652 operates in the same manner. Training data augmentation system 1642 waits for a training process 1644 to connect to it using an API call. The training process 1644 connects to the training data augmentation system 1642, and training process 1644 transmits via an API call an indication of the training dataset that it wants to train on and which augmentations it desires to be applied. The indication of the training dataset may be provided in the form of a handle. In an embodiment, the augmentations provided may be reverb, with a selection of kernels; noise from varying noise profiles; background tracks, such as for emulation of background speaking; pitch shifting; tempo shifting; and compression artifacts for any of one or more compression algorithms. The training augmentation system 1642 accesses the meta-data store using the provided handle to identify the location of the requested training data in the training data store 1610. Training augmentation system 1642 then accesses the training data store 1610 at the identified location to download the requested training data through a streaming process. Streaming provides the data in a continuous flow and allows the data to be processed by the training augmentation system 1642 even before an entire file is downloaded. As portions of the training data are downloaded from the training data store 1610, the training augmentation system 1642 buffers them in the memory of the server 1640. Training data augmentation system 1642 monitors the streaming download and determines that sufficient data has been downloaded to begin training when the amount of data downloaded exceeds a threshold. Training may begin before the entire training dataset is downloaded, by training using the buffered portions.
Once sufficient training data is buffered on the server 1640, the training data augmentation system 1642 applies the requested augmentations to the buffered data. It sends the augmented training data as a stream to the training process 1644. The training data augmentation system 1642 continues to stream additional training data from the training data store 1610. As this data is buffered on server 1640, training data augmentation system 1642 applies the requested augmentations to the data and streams it to the training process 1644. Receiving streaming training data from the training data store 1610, applying augmentations to other buffered training data, and transmitting streaming augmented training data to the training process 1644 may all occur concurrently and in parallel at the training data augmentation system 1642. After the training process 1644 has completed training on the augmented version of the data, the augmented stream of data is deleted. In an embodiment, portions of the augmented stream of training data are deleted as soon as the training process 1644 completes training on the portion, even while streaming of the remainder of the augmented training data from the same training dataset to the training process 1644 continues.
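The just-in-time pipeline can be pictured as a generator: augment each buffered chunk as it arrives and hand it to the training process, keeping nothing once it has been consumed. This is a schematic sketch; the augmentation functions are placeholders.

```python
def augmented_stream(buffered_chunks, augmentations):
    """Yield augmented training data chunk by chunk. Each augmented chunk
    exists only until the training process consumes it, so the combinatorially
    large set of augmented datasets is never stored in full."""
    for chunk in buffered_chunks:       # streamed in from the training data store
        for augment in augmentations:
            chunk = augment(chunk)
        yield chunk                     # streamed out to the training process

# Placeholder augmentation: double the amplitude of raw samples.
louder = lambda samples: [s * 2 for s in samples]
for augmented in augmented_stream([[1, 2], [3, 4]], [louder]):
    pass  # the training process would train on `augmented` here
```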
  • The buffered, un-augmented training dataset downloaded from the training data store 1610 to server 1640 may be stored temporarily or permanently on server 1640 to provide caching. When training process 1646 requests to train on the same training data, the training data augmentation system 1642 may check the cache to see if the training dataset is already buffered in local memory of the server 1640. If the training dataset is already present, the training data augmentation system may use the cached version of the training dataset, instead of fetching the training dataset from the training data store 1610. If the training dataset is not in the cache, then the training data augmentation system 1642 may initiate a fetch of the training dataset from the training data store 1610.
  • In an embodiment, the training datasets are stored as audio files. Training data augmentation system 1642 may optionally perform preprocessing on the training data before applying augmentations. In an embodiment, training data augmentation system 1642 performs the functionality of front-end module 201, 701, or 801. In one embodiment, the training data augmentation system 1642 decompresses the audio files and performs feature extraction to generate features. The training data augmentation system 1642 may provide the feature data and the corresponding text transcripts for the training audio files to the training processes. In one embodiment, the training processes may access the training data augmentation system 1642 through the training manager 1643.
  • Training data augmentation and streaming system 1600 may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • FIG. 17 illustrates example process 1700 for massively parallelizing the inference processing using neural networks, such as end-to-end speech recognition system 200, end-to-end speech classification system 700, or end-to-end phoneme recognition system 800. Traditional ASR systems do not parallelize well, which may lead to performance difficulties in production systems with many requests. For example, the Hidden Markov Models and Gaussian Mixture Models coupled to language models, as used in traditional ASR, are typically not easy to parallelize. On the other hand, neural networks are well-suited to parallelization, leading to significant advantages for end-to-end neural network systems.
  • In example process 1700, a client process submits an audio file 1710 for transcription. This inference task may be transmitted from the client process over a network to a server hosting the end-to-end speech recognition system 200. A server process identifies locations, which may be identified by timestamps, where the audio file can be split. In an embodiment, the server process identifies splitting locations by identifying low-energy points in the audio file, such as locations of relative silence. In an embodiment, the low-energy points are determined by applying a convolutional filter. In another embodiment, a neural network is trained to learn a convolutional filter that identifies desirable locations in the audio file to split at. The neural network may be trained by providing training examples of audio files and ground-truth timestamps where the audio files were split. The neural network may learn a convolutional filter for determining splitting locations through backpropagation. In an embodiment, the split portions of the audio file may be approximately 7-10 seconds in length.
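A minimal version of the convolutional low-energy detector might look like this; the window size, threshold, and signal are invented for illustration.

```python
import numpy as np

def find_split_points(samples, window=4, threshold=0.1):
    """Smooth the absolute signal with a moving-average convolution and return
    the sample indices whose energy falls below a threshold; these low-energy
    points (relative silence) are candidate locations to split the audio."""
    kernel = np.ones(window) / window
    energy = np.convolve(np.abs(samples), kernel, mode="same")
    return np.where(energy < threshold)[0]

signal = np.array([0.9, 0.8, 0.0, 0.0, 0.0, 0.0, 0.7, 0.9])
splits = find_split_points(signal)  # indices inside the silent middle region
```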
  • The audio file 1710 is split into portions 1711, 1712, 1713. The portions may be referred to as chunks. Although three chunks are illustrated, the audio file 1710 may be split into more or fewer chunks. The server process applies an index to each chunk to preserve an indication of their order so that the chunks may be reassembled after inference. In an embodiment, the index stored is a timestamp of the temporal location of the chunk in the audio file, such as a starting timestamp, ending timestamp, or both.
  • The chunks 1711, 1712, 1713 are routed to a scheduler 1720, which assigns each chunk to a GPU for performing the inference to determine the transcription. The scheduler 1720 may dynamically assign chunks to GPUs based on characteristics of the GPUs and the chunks. The scheduler 1720 may assign chunks based on how busy GPUs are, the size of the GPU's queue of waiting tasks, the processing power of the GPUs, the size of the chunks, and other characteristics.
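A simple version of such a scheduler can be sketched as a greedy least-loaded assignment. This is an illustrative stand-in for scheduler 1720, not the disclosed implementation: each (index, size) chunk goes to the GPU with the least total queued work, using a heap to find that GPU efficiently. A real scheduler would also weigh GPU processing power, memory, and current business, as described above.

```python
import heapq

def assign_chunks(chunks, n_gpus):
    """Route each (index, size) chunk to the GPU with the smallest
    total queued work (greedy least-loaded scheduling sketch)."""
    # Min-heap of (queued_work, gpu_id); ties resolve to lower gpu_id.
    heap = [(0, gpu) for gpu in range(n_gpus)]
    heapq.heapify(heap)
    queues = {gpu: [] for gpu in range(n_gpus)}
    for index, size in chunks:
        work, gpu = heapq.heappop(heap)
        queues[gpu].append(index)                 # enqueue chunk index
        heapq.heappush(heap, (work + size, gpu))  # account for its size
    return queues
```

For example, a large first chunk would occupy one GPU while several small chunks fill the other.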
  • GPUs perform inference processes 1732, 1742, 1752 for end-to-end speech recognition, end-to-end speech classification, end-to-end phoneme recognition, or other inference tasks. Each GPU maintains a queue (1731, 1741, 1751) of waiting jobs. A scheduling protocol determines when each GPU begins processing the chunks in its queue. In an embodiment, there is a separate scheduler per GPU that assigns the GPU to start processing the tasks in its queue. In another embodiment, the central scheduler 1720 performs this task for all of the GPUs. The GPUs perform their inference tasks in parallel to each other, thereby allowing massive speedups by converting a single inference task into a set of parallel inference tasks.
  • In an embodiment, the scheduling protocol for determining when the GPU begins processing a batch in its queue is dynamic. The GPU begins processing a batch when the batch in the queue reaches a target batch size. The GPU compares the target batch size with the number of tasks in its queue, or their aggregate size in memory, to determine when to begin processing. In an embodiment, the target batch size starts at the maximum size that fits in the GPU memory. The scheduling protocol also maintains a time out, and the GPU begins processing the batch in its queue if the time out is reached, even if the target batch size is not met. After the GPU finishes processing the batch, if there are tasks left in the queue, then the scheduling protocol sets the target batch size to the number of tasks in the queue. However, if no tasks are left in the queue, then the scheduling protocol sets the target batch size to the maximum size that fits in the GPU memory.
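One step of this batching protocol can be expressed directly from the description above. This is a minimal sketch with hypothetical names: a batch launches when the queue reaches the target size or the timeout fires, and the target is then reset to the leftover queue length, or to the maximum batch size if the queue drained.

```python
def next_batch(queue, target, max_batch, timed_out):
    """One step of the dynamic batching protocol.

    Launch a batch when the queue reaches the target size, or when the
    timeout fires with work pending; afterwards, the new target is the
    leftover queue length (or max_batch if the queue drained).
    Returns (batch, remaining_queue, new_target).
    """
    if len(queue) >= target or (timed_out and queue):
        batch, rest = queue[:max_batch], queue[max_batch:]
        new_target = len(rest) if rest else max_batch
        return batch, rest, new_target
    return [], queue, target  # keep waiting for more tasks
```

The reset-to-leftover rule means a straggler batch is processed promptly on the next step rather than waiting for the queue to refill to the maximum.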
  • The inference processes 1732, 1742, 1752 may produce inference results, such as transcriptions of the audio chunks 1711, 1712, 1713. The inference results and chunks may be provided to recombination process 1760. The transcribed text segments are stitched back together, such as by concatenation, into a single output based on their indices, which may be timestamps. The recombination process 1760 orders the transcribed text in the correct temporal arrangement based on the value of the indices of their corresponding audio chunks in order to produce final output 1762, which is a transcription of the entire audio input 1710.
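Because each chunk carried its index from the splitting step, recombination reduces to a sort-and-join. A minimal sketch, assuming results arrive as (start_timestamp, text) pairs:

```python
def recombine(results):
    """Stitch (start_timestamp, text) inference results back into one
    transcript, ordered by the index stored at split time."""
    return " ".join(text for _, text in sorted(results))
```

This ordering step is what makes out-of-order parallel completion safe: GPUs may finish chunks in any order without affecting the final transcript.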
  • The technique of chunking an input file and dynamically scheduling the chunks for processing by GPUs may be used in a variety of machine learning applications and is not limited to use with neural networks or for the application of speech recognition.
  • A trained neural network such as disclosed above may be used for purposes in addition to speech recognition. For example, the internal state of a trained neural network may be used for characterizing speech audio or deriving an internal state representation of the speech audio.
  • In an embodiment, an internal state representation is determined based on the internal state of a trained speech recognition neural network while transcribing a speech audio sample. The internal state representation is a concise representation of the internal state of the trained neural network while processing the audio input. The total internal state of a trained neural network may be very large—on the order of hundreds of megabytes of data to describe the entire internal state. In some embodiments, the internal state representation obtained by sampling or compressing the total internal state may be significantly smaller, on the order of hundreds of bytes of data. In an example, an internal state representation may be 256 bytes derived from an internal state of approximately 300 MB.
  • The internal state representation may be recorded at the time of initial transcription by a trained neural network and stored alongside the original audio. The internal state representations may be associated with the particular frames or timestamps of the original audio that produced them. Then, at a later time, various discrimination tasks or search tasks may be performed on the original audio by way of the stored internal state representations without needing to run the original audio through a full end-to-end transcription or classification neural network model a second time. That is, many applications in audio classification or search may be performed on the stored audio without processing the original audio with a potentially computationally-intensive speech recognition or classification neural network a second time. The work performed by the initial speech recognition may be leveraged by any future processing of the audio that would otherwise potentially require a computationally intensive process.
  • The types of tasks that may use this stored internal state representation include classification and search tasks. For example, a classification task may be to determine when speakers in an audio segment change, sometimes referred to as speaker diarization. Another example of speech classification may be, for example, to determine a mood, sentiment, accent, or any other quality or feature of the speech audio. A search task may be, for example, to search a corpus of speech audio based on an input segment of speech audio or an input text string. One search task may be, for example, to find segments of audio in the corpus that discuss similar topics as the input speech segment. Another search task may be, for example, to find segments of audio in the corpus that are spoken by the same speaker as the input speech segment, or for speakers with similar speech patterns as the input.
  • Depending on the particular implementation, some embodiments may characterize speech audio according to the acoustic content of the speech audio or the semantic content of the speech audio. For example, an embodiment may derive a representation of speech audio that is related to the acoustic content of the speech audio: segments of audio with the same person speaking would have similar representations, while segments of audio with a second person would have a distinct representation. This acoustic representation may be used to, for example, search a corpus of acoustic audio data for particular sounds or acoustic signatures. An application of searching for sounds or acoustic signatures is speaker diarization, for example.
  • In some embodiments, a representation of speech audio may be designed to be primarily related to the conceptual content of the speech audio, or the semantic meaning contained therein. For example, segments of speech audio of different people talking about the same subject matter would have similar representations.
  • In some embodiments, a mixture of acoustic and semantic meaning may be contained in a representation. Various portions of the representation may be more or less responsive to either acoustic or semantic information from the original speech audio. Such a combined representation may be used in both semantic and acoustic discrimination tasks.
  • Several different embodiments illustrate varying approaches and techniques used to select and determine the internal state representation. In some embodiments, a particular segment or slice of a neural network may be selected and summarized or compressed to produce the internal state representation. In an embodiment, a portion of a neural network is selected, such as a selection of internal states such as a whole layer, certain portions of a layer, several layers, or portions of several layers. Given this portion of the neural network, a set of low-precision features is derived.
  • One method of deriving a low-precision feature is to quantize the output of an activation function of a node of a neural network. For example, in an embodiment, the output of the activation function at each node of the portion may be simplified into a binary representation. That is, any output of the node above a threshold is treated as a first binary value, and any output of the node below the threshold is treated as a second binary value. This low-precision representation may be more resilient to minor changes in the input because similar values may quantize to the same value. Other quantization levels may similarly be used, providing a tradeoff between resultant size of the internal state representation and resolution, among other factors. For example, some embodiments may quantize activation functions into four or eight states. Quantization may be performed by selecting n−1 thresholds to create a set of n bins where n is the number of quantized states. The real number valued output of the node is binned based on which pair of thresholds the real number valued output falls between and a numerical index of the bin may be used as the quantized value.
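The thresholding scheme above maps directly onto binning with NumPy. This sketch is illustrative, not the patented mechanism: n−1 thresholds create n bins, and `np.digitize` returns the numerical index of the bin each activation falls into, which serves as the quantized value.

```python
import numpy as np

def quantize(activations, thresholds):
    """Quantize real-valued node activations into low-precision bin
    indices. len(thresholds) == n - 1 thresholds yield n bins; the
    returned index of each bin is the quantized value."""
    return np.digitize(activations, thresholds)
```

With a single threshold of 0.5 this reproduces the binary case; three thresholds yield the four-state quantization mentioned above.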
  • FIG. 18 illustrates an example of the process of generating low-precision features. Neural network 1800 is provided and a subset of nodes of the neural network 1800 are selected for generating the features. As shown, nodes may be in the same layer or different layers of a neural network. During the inference process to transcribe audio, the output of the activation function of each node in the subset of nodes is recorded, as shown for node 1820 and other nodes. The outputs are real number values (such as floating point or double precision), but, in an embodiment, are quantized to binary numbers by use of a threshold, such as 0.5. The quantized values are stored in tensor 1810, where each node corresponds to a fixed location in the tensor 1810. The tensor 1810 provides a compressed representation of internal state of the neural network 1800 during the inference process.
  • In some embodiments, a whole layer of the neural network may be selected for the internal state representation. In an example, an internal state representation may be determined from a fully-connected stack that produces a word embedding of the input speech audio. For example, the internal state representation may be determined from second fully-connected stack 205 of the example neural network discussed above. This internal state may provide features that relate to semantic meaning of the speech audio, for example.
  • In an embodiment, an internal state representation may be generated from a CNN layer. Such an internal state may contain features related to the acoustic input or acoustic signature of the input speech audio, for example. For example, an internal state representation may be generated from CNN stack 202 of the example neural network discussed above. In one example, a low-precision feature may be created from the internal state of a CNN layer, or from each non-linearity at the output of a CNN layer. In an embodiment, an internal state representation may be derived from a fully-connected layer that accepts the inputs of a CNN layer, such as first fully-connected layer 203 in the example embodiment discussed above.
  • In some embodiments, a mixture of nodes from disparate portions of an internal state of a neural network may be selected for the internal state representation. These selections may include portions of the network from any layer, such that they encompass a range of information contained in the network. For example, an internal state representation may be derived from some nodes from a CNN layer, other nodes from an RNN layer, and other nodes from one or more fully-connected layers, such that the resultant representation contains information from each of these various layers.
  • In one embodiment, a selection of which nodes to include in the internal state representation may be produced through a pruning process. For example, a portion of the internal state of a neural network may be set to a null value, and the effect on the output observed. If the output experiences a large change, the portion that was omitted may be of interest for inclusion in an internal state representation. This process may be automated and iterative such that a pruning algorithm may determine an optimal subset of nodes for inclusion in an internal state representation by observing and learning their effect on the change of the output. Similarly, an approach based on principal component analysis may be used to determine an optimal subset of neural network nodes for inclusion in an internal state representation.
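The ablation-style pruning described above can be sketched as follows. This is a toy illustration under simplifying assumptions: `forward` stands in for the downstream layers of the real network, the internal state is a flat vector, and output change is measured as an L1 difference. None of these names come from the patent.

```python
def rank_nodes_by_ablation(forward, state, k):
    """Score each internal-state element by zeroing it and measuring
    the change in the network output; return the k most influential
    indices (candidates for the internal state representation)."""
    baseline = forward(state)
    scores = []
    for i in range(len(state)):
        ablated = list(state)
        ablated[i] = 0.0                       # set this node to null
        out = forward(ablated)
        change = sum(abs(a - b) for a, b in zip(baseline, out))
        scores.append((change, i))
    scores.sort(reverse=True)                  # largest output change first
    return [i for _, i in scores[:k]]
```

An automated pruning loop would repeat this ranking iteratively, re-evaluating after each removal rather than scoring all nodes once.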
  • In some embodiments, the architecture of the neural network may be designed to produce an internal state representation. For example, in an embodiment, a neural network may include a fully-connected layer of a comparatively low dimension for the purposes of deriving an internal state representation. This layer may be referred to as a bottleneck feature layer. The bottleneck feature layer is trained in the initial training of the speech recognition neural network to contain all information necessary to produce the output because all information must necessarily flow through the bottleneck layer. In this way, the initial training of the speech recognition neural network model also trains an optimal layer from which a reduced precision internal state representation may be derived.
  • In another example, a separate branch or branches of the neural network may be appended to or branched from the speech recognition neural network model and initially trained in parallel with the speech recognition portion. That is, additional outputs are added to the neural network with additional loss functions that train the network to produce a separate output that may be used to produce the internal state representation. This technique is similar to the above bottleneck feature technique, but the output may be separately trained from the speech recognition output. Then, the neural network may produce two sets of outputs including a first output that produces speech transcriptions and a second output that produces a representation of the input that may be used for future processing.
  • In some embodiments, this additional network may be an auto-encoder network that is trained to produce an output similar to the input. That is, the auto-encoder is trained alongside the speech recognition neural network with the state of the speech recognition network as an input and the input to the speech recognition network as the training data. Then, the auto-encoder network will learn an output representation most similar to the input. This type of auto-encoder network may then be used to, for example, generate an approximation of the original acoustic input to the speech recognition network based on the low-precision internal state representation.
  • Other configurations of additional encoding networks may be used to produce the internal state representation. For example, an encoding network may be trained to encode a particular layer or layers of the original speech recognition network, such as a word embedding layer or an audio features layer. In some embodiments, a combination of such encoders may be jointly used to produce the internal state representation.
  • Once the internal state representation is determined, by any method described above, it may be used for future processing tasks. For example, in some embodiments, the internal state representation may be used to classify audio. A corpus of audio may be transcribed by an end-to-end speech recognition neural network such as described above. During the initial transcription, an internal state representation may be generated and recorded along with the audio and the corresponding transcription. The internal state representation may contain more information than the corresponding text transcription, but less than the entire internal state of the neural network at the time of transcription. This internal state representation may then be used later to perform novel classification on the original audio data while leveraging the work done previously during transcription. For example, the internal state representation may be used to determine speaker changes in audio, also known as speaker diarization.
  • In an embodiment, a corpus of audio has been transcribed with an end-to-end neural network. The original audio, the transcription produced by the end-to-end neural network, and a stream of internal state representations created during transcription are stored together. At a later time, a second machine learning model may be trained based on a portion of the corpus that has been manually classified. The manually classified portion of the corpus is used as training data for the second machine learning model. For example, in a speaker diarization embodiment, the manually classified training data may indicate when speakers change in the audio. The indications may be an indication of an identity, or label, of a specific speaker that is talking or just an indication that a speaker change occurred. The second machine learning model may then be trained based on the internal state representation stream and the training speaker diarization indications. The internal state representation stream is provided as input to the second machine learning model and the training speaker diarization indications are provided as the target output. The second machine learning model may then learn to recognize speaker diarization based on the internal state representation stream. It learns a model for identifying internal state representations corresponding to a speaker change, or a certain speaker identity, and identifying internal state representations not corresponding to a speaker change, or other speaker identities. The rest of the corpus of transcribed audio, which lack manual classifications, may then be classified by the second machine learning model based on the previously stored internal state representation stream. The internal state representations corresponding to the non-manually classified audio are input to the second machine learning model. 
Predicted classifications of the internal state representations are output by the second machine learning model based on the input internal state representations. The predicted classifications may then be matched to the corresponding audio portions or transcription portions associated with those input internal state representations. In this way, the previously computed internal state representation stream may be leveraged by later processing.
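The second-model workflow above can be illustrated with a deliberately simple classifier. This is a toy stand-in, not the disclosed second machine learning model: it learns one centroid per label over the manually classified internal-state representations, then labels the unclassified remainder by nearest centroid. All function names are hypothetical.

```python
import numpy as np

def train_centroids(reps, labels):
    """Fit a toy second-stage classifier: one mean vector (centroid)
    per label over the stored internal-state representations."""
    return {lab: np.mean([r for r, l in zip(reps, labels) if l == lab], axis=0)
            for lab in set(labels)}

def classify(rep, centroids):
    """Label an unclassified representation by its nearest centroid."""
    rep = np.asarray(rep, dtype=float)
    return min(centroids, key=lambda lab: np.linalg.norm(rep - centroids[lab]))
```

In the diarization example, the labels would be speaker identities (or a change/no-change flag), and the predicted labels would then be matched back to the audio portions via the stored timestamps.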
  • Other such classification tasks may be performed on the internal state representation. For example, some embodiments may classify the audio into classes such as gender (e.g., male/female), emotion or sentiment (e.g., angry, sad, happy, etc.), speaker identification (i.e., which user is speaking), speaker age, speaker stress or strain, or other such classifications. Because the internal state representation already contains a complex representation of the speech audio, each of these tasks may be done much more efficiently based on the internal state representation as compared to running a new neural network on the original speech audio.
  • In some embodiments, the internal state representation stream may be used for search tasks. For example, rather than searching on transcribed text, a search of a speech audio file may be performed on the internal state representations associated with the speech audio. Because the internal state representations contain more information than text alone, including acoustic and semantic information, a search may find more relevant audio segments than one based on only the output text representation of the speech audio.
  • In an embodiment, a large corpus of speech audio has been transcribed by a speech recognition neural network such as described above, and an internal state representation derived at the time of the original transcription stored along with the speech audio. A second neural network may then be trained to produce an internal state representation based on a text input. That is, the network accepts as input the text of a word or phrase and produces an internal state representation such as would have been produced by the speech recognition neural network if the word or phrase was present in audio provided to the speech recognition neural network. This second neural network may be trained on the existing data, that is, the corpus of speech audio containing both computed internal state representations and associated text outputs. During training, the second neural network is provided with training examples, where the training examples include an input comprising a text word or phrase and a target output comprising an internal state representation created by the speech recognition neural network when an audio recording of the word or phrase was presented. The second neural network learns a model for producing synthetic internal state representations based on text words or phrases. During a search, an input text word or phrase is presented and input to the second neural network, and an internal state representation is produced by the second neural network for the input word or phrase. This produced state representation is a close approximation of what the speech recognition network would have produced if it had been provided audio input that produced the text that was input to the second network. This state representation may then be used as a search input vector. The search input vector is compared to those internal state representation vectors stored in the corpus for similarity to find matches and search results.
  • Any method of comparing the representations, which may be expressed as vectors, may be used. For example, a dot product vector similarity or cosine similarity may be used to determine a relationship between the search input and the stored internal state representations. Dot product or cosine similarity are examples of vector or tensor distance metrics to measure similarity. The audio associated with the stored internal state representations with the closest matches is the result of the search. In some embodiments, a single search result is returned corresponding to the closest match, and, in other embodiments, a plurality of results are returned.
  • In an embodiment, a classifier may be used to determine similarity between search input vectors and stored internal state vectors. That is, rather than using a dot product or cosine similarity, a measure of similarity may be determined by training a classifier network on search results. This classifier may be a neural network or may be any other classifier such as a support vector machine or a Bayesian network, for example. The classifier may be trained on ground-truth labelled search results, for example. It may accept training examples comprising sets of two internal state vectors as inputs and a target output comprising an indication of whether the internal state vectors are similar or not. In some embodiments, the target output is binary, and, in other embodiments, the target output is a real valued measure of similarity. After training, the classifier may be used to identify the closest matches to a search input vector. The search input vector is compared to one or more of the stored internal state vectors by using the classifier to output a similarity value. The audio associated with the most similar or set of most similar stored internal state representations is returned as the result of the search. In addition, a blended similarity model may be used that combines mathematical similarity between internal state representations and classifier-based similarity.
  • The technique of generating internal state representations of a neural network based on sampling the outputs of neural network nodes for use in classification, search, or other applications, as described above, may be used in a variety of machine learning applications and is not limited to use for the application of speech recognition.
  • The foregoing description is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. The broad teachings of the disclosure can be implemented in a variety of forms. Therefore, while this disclosure includes particular examples, the true scope of the disclosure should not be so limited since other modifications will become apparent upon a study of the drawings, the specification, and the following claims.

Claims (20)

What is claimed is:
1. A method for customizing a neural network trained on a general dataset to a custom dataset, the method comprising:
providing a trained speech recognition neural network, the trained speech recognition neural network including a plurality of layers each having a plurality of nodes, the trained speech recognition neural network including an output layer with nodes corresponding to words of a vocabulary, the nodes of the output layer outputting values, wherein the values output by the nodes in the output layer correspond to a probability of the corresponding word in the vocabulary being a correct transcription of an input;
for a plurality of words in the vocabulary, determining the frequency of occurrence of the word in a general training set and the frequency of occurrence of the word in a custom dataset;
during inference using the trained speech recognition neural network, for each word in the plurality of words, adjusting the value output by the output node for the word based on the frequency of occurrence of the word in the custom dataset and the frequency of occurrence of the word in the general training set to obtain a custom model probability; and
generating a transcription of a spoken input based on the custom model probability.
2. The method of claim 1, wherein the plurality of words comprises all of the words in the vocabulary.
3. The method of claim 1, wherein the frequency of occurrence of the word in the general training set is set to a threshold minimum value if the word does not appear in the general training set.
4. The method of claim 1, wherein the trained speech recognition neural network includes one or more fully-connected neural network layers.
5. The method of claim 1, wherein the trained speech recognition neural network includes one or more locally connected neural network layers.
6. The method of claim 5, wherein the trained speech recognition neural network includes one or more recurrent neural network layers.
7. The method of claim 6, wherein the trained speech recognition neural network has been trained in an end-to-end training process including backpropagation through each of its layers.
8. A non-transitory computer-readable medium comprising instructions for:
providing a trained speech recognition neural network, the trained speech recognition neural network including a plurality of layers each having a plurality of nodes, the trained speech recognition neural network including an output layer with nodes corresponding to words of a vocabulary, the nodes of the output layer outputting values, wherein the values output by the nodes in the output layer correspond to a probability of the corresponding word in the vocabulary being a correct transcription of an input;
for a plurality of words in the vocabulary, determining the frequency of occurrence of the word in a general training set and the frequency of occurrence of the word in a custom dataset;
during inference using the trained speech recognition neural network, for each word in the plurality of words, adjusting the value output by the output node for the word based on the frequency of occurrence of the word in the custom dataset and the frequency of occurrence of the word in the general training set to obtain a custom model probability; and
generating a transcription of a spoken input based on the custom model probability.
9. The non-transitory computer-readable medium of claim 8, wherein the plurality of words comprises all of the words in the vocabulary.
10. The non-transitory computer-readable medium of claim 8, wherein the frequency of occurrence of the word in the general training set is set to a threshold minimum value if the word does not appear in the general training set.
11. The non-transitory computer-readable medium of claim 8, wherein the trained speech recognition neural network includes one or more fully-connected neural network layers.
12. The non-transitory computer-readable medium of claim 8, wherein the trained speech recognition neural network includes one or more locally connected neural network layers.
13. The non-transitory computer-readable medium of claim 12, wherein the trained speech recognition neural network includes one or more recurrent neural network layers.
14. The non-transitory computer-readable medium of claim 13, wherein the trained speech recognition neural network has been trained in an end-to-end training process including backpropagation through each of its layers.
15. A non-transitory computer-readable medium comprising instructions for:
providing a trained speech recognition neural network, the trained speech recognition neural network including a plurality of layers each having a plurality of nodes, the trained speech recognition neural network including an output layer with nodes corresponding to words of a vocabulary, the nodes of the output layer outputting values, wherein the values output by the nodes in the output layer correspond to a probability of the corresponding word in the vocabulary being a correct transcription of an input;
during inference using the trained speech recognition neural network, adjusting the values output by a plurality of nodes in the output layer based on the frequency of occurrence of the corresponding word in a general training set and a custom dataset to obtain a custom model probability; and
generating a transcription of a spoken input based on the custom model probability.
16. The non-transitory computer-readable medium of claim 15, further comprising instructions for adjusting the values of all of the nodes in the output layer based on the frequency of occurrence of the corresponding word in the general training set and the custom dataset to obtain the custom model probability.
17. The non-transitory computer-readable medium of claim 15, wherein the frequency of occurrence of the word in the general training set is set to a threshold minimum value if the word does not appear in the general training set.
18. The non-transitory computer-readable medium of claim 15, wherein the trained speech recognition neural network includes one or more locally connected neural network layers.
19. The non-transitory computer-readable medium of claim 18, wherein the trained speech recognition neural network includes one or more recurrent neural network layers.
20. The non-transitory computer-readable medium of claim 19, wherein the trained speech recognition neural network has been trained in an end-to-end training process including backpropagation through each of its layers.
US16/232,652 2018-07-27 2018-12-26 Augmented generalized deep learning with special vocabulary Active US10540959B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/232,652 US10540959B1 (en) 2018-07-27 2018-12-26 Augmented generalized deep learning with special vocabulary

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201862703892P 2018-07-27 2018-07-27
US16/108,107 US10210860B1 (en) 2018-07-27 2018-08-22 Augmented generalized deep learning with special vocabulary
US16/232,652 US10540959B1 (en) 2018-07-27 2018-12-26 Augmented generalized deep learning with special vocabulary

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/108,107 Continuation US10210860B1 (en) 2018-07-27 2018-08-22 Augmented generalized deep learning with special vocabulary

Publications (2)

Publication Number Publication Date
US10540959B1 US10540959B1 (en) 2020-01-21
US20200035219A1 true US20200035219A1 (en) 2020-01-30

Family

ID=65322747

Family Applications (8)

Application Number Title Priority Date Filing Date
US16/108,107 Active US10210860B1 (en) 2018-07-27 2018-08-22 Augmented generalized deep learning with special vocabulary
US16/108,110 Active US10720151B2 (en) 2018-07-27 2018-08-22 End-to-end neural networks for speech recognition and classification
US16/108,109 Active US10380997B1 (en) 2018-07-27 2018-08-22 Deep learning internal state index-based search and classification
US16/232,652 Active US10540959B1 (en) 2018-07-27 2018-12-26 Augmented generalized deep learning with special vocabulary
US16/417,722 Active US10847138B2 (en) 2018-07-27 2019-05-21 Deep learning internal state index-based search and classification
US16/887,866 Active US11367433B2 (en) 2018-07-27 2020-05-29 End-to-end neural networks for speech recognition and classification
US17/073,149 Active 2039-02-18 US11676579B2 (en) 2018-07-27 2020-10-16 Deep learning internal state index-based search and classification
US18/208,454 Pending US20230317062A1 (en) 2018-07-27 2023-06-12 Deep learning internal state index-based search and classification


Country Status (1)

Country Link
US (8) US10210860B1 (en)


Family Cites Families (141)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4994983A (en) * 1989-05-02 1991-02-19 Itt Corporation Automatic speech recognition system using seed templates
EP0513652A2 (en) * 1991-05-10 1992-11-19 Siemens Aktiengesellschaft Method for modelling similarity function using neural network
US5390278A (en) * 1991-10-08 1995-02-14 Bell Canada Phoneme based speech recognition
US5787393A (en) * 1992-03-30 1998-07-28 Seiko Epson Corporation Speech recognition apparatus using neural network, and learning method therefor
JP2924555B2 (en) 1992-10-02 1999-07-26 三菱電機株式会社 Speech recognition boundary estimation method and speech recognition device
JP2986345B2 (en) 1993-10-18 1999-12-06 インターナショナル・ビジネス・マシーンズ・コーポレイション Voice recording indexing apparatus and method
US5594834A (en) * 1994-09-30 1997-01-14 Motorola, Inc. Method and system for recognizing a boundary between sounds in continuous speech
KR0173923B1 (en) 1995-12-22 1999-04-01 양승택 Phoneme Segmentation Using Multi-Layer Neural Networks
US6076056A (en) 1997-09-19 2000-06-13 Microsoft Corporation Speech recognition system for recognizing continuous and isolated speech
US7177795B1 (en) * 1999-11-10 2007-02-13 International Business Machines Corporation Methods and apparatus for semantic unit based automatic indexing and searching in data archive systems
US6560582B1 (en) * 2000-01-05 2003-05-06 The United States Of America As Represented By The Secretary Of The Navy Dynamic memory processor
WO2005034086A1 (en) 2003-10-03 2005-04-14 Asahi Kasei Kabushiki Kaisha Data processing device and data processing device control program
US7206389B1 (en) * 2004-01-07 2007-04-17 Nuance Communications, Inc. Method and apparatus for generating a speech-recognition-based call-routing system
JP5223673B2 (en) * 2006-06-29 2013-06-26 日本電気株式会社 Audio processing apparatus and program, and audio processing method
US8504361B2 (en) * 2008-02-07 2013-08-06 Nec Laboratories America, Inc. Deep neural networks and methods for using same
US8060513B2 (en) * 2008-07-01 2011-11-15 Dossierview Inc. Information processing with integrated semantic contexts
US8364481B2 (en) 2008-07-02 2013-01-29 Google Inc. Speech recognition with parallel recognition tasks
US8645131B2 (en) 2008-10-17 2014-02-04 Ashwin P. Rao Detecting segments of speech from an audio stream
US9424246B2 (en) * 2009-03-30 2016-08-23 Touchtype Ltd. System and method for inputting text into electronic devices
US20100299131A1 (en) * 2009-05-21 2010-11-25 Nexidia Inc. Transcript alignment
US9117453B2 (en) 2009-12-31 2015-08-25 Volt Delta Resources, Llc Method and system for processing parallel context dependent speech recognition results from a single utterance utilizing a context database
US9031844B2 (en) 2010-09-21 2015-05-12 Microsoft Technology Licensing, Llc Full-sequence training of deep structures for speech recognition
US8756061B2 (en) * 2011-04-01 2014-06-17 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
US10672399B2 (en) * 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
US10453479B2 (en) 2011-09-23 2019-10-22 Lessac Technologies, Inc. Methods for aligning expressive speech utterances with text and systems therefor
US9916538B2 (en) * 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
US8873813B2 (en) * 2012-09-17 2014-10-28 Z Advanced Computing, Inc. Application of Z-webs and Z-factors to analytics, search engine, learning, recognition, natural language, and other utilities
US8996371B2 (en) 2012-03-29 2015-03-31 Nice-Systems Ltd. Method and system for automatic domain adaptation in speech recognition applications
US8996372B1 (en) * 2012-10-30 2015-03-31 Amazon Technologies, Inc. Using adaptation data with cloud-based speech recognition
US9240184B1 (en) 2012-11-15 2016-01-19 Google Inc. Frame-level combination of deep neural network and gaussian mixture models
US9477925B2 (en) 2012-11-20 2016-10-25 Microsoft Technology Licensing, Llc Deep neural networks training for speech and pattern recognition
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
US9177550B2 (en) 2013-03-06 2015-11-03 Microsoft Technology Licensing, Llc Conservatively adapting a deep neural network in a recognition system
US9454958B2 (en) 2013-03-07 2016-09-27 Microsoft Technology Licensing, Llc Exploiting heterogeneous data in deep neural network-based speech recognition systems
US9842585B2 (en) * 2013-03-11 2017-12-12 Microsoft Technology Licensing, Llc Multilingual deep neural network
US9558743B2 (en) 2013-03-15 2017-01-31 Google Inc. Integration of semantic context information
JP6164639B2 (en) * 2013-05-23 2017-07-19 国立研究開発法人情報通信研究機構 Deep neural network learning method and computer program
US10867597B2 (en) * 2013-09-02 2020-12-15 Microsoft Technology Licensing, Llc Assignment of semantic labels to a sequence of words using neural network architectures
US9519859B2 (en) * 2013-09-06 2016-12-13 Microsoft Technology Licensing, Llc Deep structured semantic model produced using click-through data
US9484025B2 (en) 2013-10-15 2016-11-01 Toyota Jidosha Kabushiki Kaisha Configuring dynamic custom vocabulary for personalized speech recognition
US9620145B2 (en) 2013-11-01 2017-04-11 Google Inc. Context-dependent state tying using a neural network
US9514753B2 (en) * 2013-11-04 2016-12-06 Google Inc. Speaker identification using hash-based indexing
US9715660B2 (en) 2013-11-04 2017-07-25 Google Inc. Transfer learning for deep neural network based hotword detection
US10019985B2 (en) * 2013-11-04 2018-07-10 Google Llc Asynchronous optimization for sequence training of neural networks
US9401143B2 (en) * 2014-03-24 2016-07-26 Google Inc. Cluster specific speech model
CN113255885A (en) * 2014-04-11 2021-08-13 谷歌有限责任公司 Parallelizing training of convolutional neural networks
US20150310862A1 (en) 2014-04-24 2015-10-29 Microsoft Corporation Deep learning for semantic parsing including semantic utterance classification
US20160034811A1 (en) 2014-07-31 2016-02-04 Apple Inc. Efficient generation of complementary acoustic models for performing automatic speech recognition system combination
US10089580B2 (en) * 2014-08-11 2018-10-02 Microsoft Technology Licensing, Llc Generating and using a knowledge-enhanced model
CN105960672B (en) 2014-09-09 2019-11-26 微软技术许可有限责任公司 Variable component deep neural network for Robust speech recognition
US9646634B2 (en) 2014-09-30 2017-05-09 Google Inc. Low-rank hidden input layer for speech recognition neural network
US20160132787A1 (en) 2014-11-11 2016-05-12 Massachusetts Institute Of Technology Distributed, multi-model, self-learning platform for machine learning
US9996768B2 (en) * 2014-11-19 2018-06-12 Adobe Systems Incorporated Neural network patch aggregation and statistics
US10540957B2 (en) 2014-12-15 2020-01-21 Baidu Usa Llc Systems and methods for speech transcription
US9607618B2 (en) * 2014-12-16 2017-03-28 Nice-Systems Ltd Out of vocabulary pattern learning
EP3234946A1 (en) 2014-12-17 2017-10-25 Intel Corporation System and method of automatic speech recognition using parallel processing for weighted finite state transducer-based speech decoding
US9721559B2 (en) 2015-04-17 2017-08-01 International Business Machines Corporation Data augmentation method based on stochastic feature mapping for automatic speech recognition
US9524716B2 (en) * 2015-04-17 2016-12-20 Nuance Communications, Inc. Systems and methods for providing unnormalized language models
US10909329B2 (en) 2015-05-21 2021-02-02 Baidu Usa Llc Multilingual image question answering
US9595002B2 (en) 2015-05-29 2017-03-14 Sas Institute Inc. Normalizing electronic communications using a vector having a repeating substring as input for a neural network
US10360911B2 (en) 2015-06-01 2019-07-23 AffectLayer, Inc. Analyzing conversations to automatically identify product features that resonate with customers
US10121471B2 (en) 2015-06-29 2018-11-06 Amazon Technologies, Inc. Language model speech endpointing
US20170032245A1 (en) * 2015-07-01 2017-02-02 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Providing Reinforcement Learning in a Deep Learning System
US9786270B2 (en) 2015-07-09 2017-10-10 Google Inc. Generating acoustic models
US10878320B2 (en) * 2015-07-22 2020-12-29 Qualcomm Incorporated Transfer learning in neural networks
US10026396B2 (en) * 2015-07-28 2018-07-17 Google Llc Frequency warping in a speech recognition system
US10262654B2 (en) * 2015-09-24 2019-04-16 Microsoft Technology Licensing, Llc Detecting actionable items in a conversation among participants
US10504010B2 (en) * 2015-10-02 2019-12-10 Baidu Usa Llc Systems and methods for fast novel visual concept learning from sentence descriptions of images
US10796335B2 (en) * 2015-10-08 2020-10-06 Samsung Sds America, Inc. Device, method, and computer readable medium of generating recommendations via ensemble multi-arm bandit with an LPBoost
US10733979B2 (en) * 2015-10-09 2020-08-04 Google Llc Latency constraints for acoustic modeling
KR102313028B1 (en) 2015-10-29 2021-10-13 삼성에스디에스 주식회사 System and method for voice recognition
CN108475505B (en) * 2015-11-12 2023-03-17 谷歌有限责任公司 Generating a target sequence from an input sequence using partial conditions
US10332509B2 (en) 2015-11-25 2019-06-25 Baidu USA, LLC End-to-end speech recognition
US10318008B2 (en) 2015-12-15 2019-06-11 Purdue Research Foundation Method and system for hand pose detection
US9792896B2 (en) * 2015-12-15 2017-10-17 Facebook, Inc. Providing intelligent transcriptions of sound messages in a messaging application
US10013640B1 (en) * 2015-12-21 2018-07-03 Google Llc Object recognition from videos using recurrent neural networks
US20170186446A1 (en) 2015-12-24 2017-06-29 Michal Wosk Mouth proximity detection
US10032463B1 (en) * 2015-12-29 2018-07-24 Amazon Technologies, Inc. Speech processing with learned representation of user interaction history
KR102434604B1 (en) * 2016-01-05 2022-08-23 한국전자통신연구원 Voice recognition terminal, voice recognition server and voice recognition method performing a personalized voice recognition for performing personalized voice recognition
US10373073B2 (en) 2016-01-11 2019-08-06 International Business Machines Corporation Creating deep learning models using feature augmentation
US11264044B2 (en) * 2016-02-02 2022-03-01 Nippon Telegraph And Telephone Corporation Acoustic model training method, speech recognition method, acoustic model training apparatus, speech recognition apparatus, acoustic model training program, and speech recognition program
WO2017142629A1 (en) * 2016-02-18 2017-08-24 Google Inc. Image classification neural networks
US9799327B1 (en) * 2016-02-26 2017-10-24 Google Inc. Speech recognition with attention-based recurrent neural networks
KR102158743B1 (en) 2016-03-15 2020-09-22 한국전자통신연구원 Data augmentation method for spontaneous speech recognition
US11049495B2 (en) 2016-03-18 2021-06-29 Fluent.Ai Inc. Method and device for automatically learning relevance of words in a speech recognition system
US10373612B2 (en) * 2016-03-21 2019-08-06 Amazon Technologies, Inc. Anchored speech detection and speech recognition
US10909450B2 (en) 2016-03-29 2021-02-02 Microsoft Technology Licensing, Llc Multiple-action computational model training and operation
US9984682B1 (en) 2016-03-30 2018-05-29 Educational Testing Service Computer-implemented systems and methods for automatically generating an assessment of oral recitations of assessment items
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US10255905B2 (en) 2016-06-10 2019-04-09 Google Llc Predicting pronunciations with word stress
CN107492382B (en) 2016-06-13 2020-12-18 阿里巴巴集团控股有限公司 Voiceprint information extraction method and device based on neural network
JP6671020B2 (en) 2016-06-23 2020-03-25 パナソニックIpマネジメント株式会社 Dialogue act estimation method, dialogue act estimation device and program
US11449744B2 (en) * 2016-06-23 2022-09-20 Microsoft Technology Licensing, Llc End-to-end memory networks for contextual language understanding
US9715496B1 (en) * 2016-07-08 2017-07-25 Asapp, Inc. Automatically responding to a request of a user
US9984683B2 (en) 2016-07-22 2018-05-29 Google Llc Automatic speech recognition using multi-dimensional models
US9972339B1 (en) * 2016-08-04 2018-05-15 Amazon Technologies, Inc. Neural network based beam selection
US10657437B2 (en) * 2016-08-18 2020-05-19 International Business Machines Corporation Training of front-end and back-end neural networks
US20180060724A1 (en) * 2016-08-25 2018-03-01 Microsoft Technology Licensing, Llc Network Morphism
CN107785015A (en) 2016-08-26 2018-03-09 阿里巴巴集团控股有限公司 A kind of audio recognition method and device
US10679643B2 (en) 2016-08-31 2020-06-09 Gregory Frederick Diamos Automatic audio captioning
US10154051B2 (en) * 2016-08-31 2018-12-11 Cisco Technology, Inc. Automatic detection of network threats based on modeling sequential behavior in network traffic
US11080591B2 (en) 2016-09-06 2021-08-03 Deepmind Technologies Limited Processing sequences using convolutional neural networks
US9824692B1 (en) 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
US10255910B2 (en) * 2016-09-16 2019-04-09 Apptek, Inc. Centered, left- and right-shifted deep neural networks and their combinations
US10482336B2 (en) 2016-10-07 2019-11-19 Noblis, Inc. Face recognition and image search system using sparse feature vectors, compact binary vectors, and sub-linear search
US11205110B2 (en) 2016-10-24 2021-12-21 Microsoft Technology Licensing, Llc Device/server deployment of neural network data entry system
US20180129937A1 (en) * 2016-11-04 2018-05-10 Salesforce.Com, Inc. Quasi-recurrent neural network
WO2018088794A2 (en) * 2016-11-08 2018-05-17 Samsung Electronics Co., Ltd. Method for correcting image by device and device therefor
US10170110B2 (en) * 2016-11-17 2019-01-01 Robert Bosch Gmbh System and method for ranking of hybrid speech recognition results with neural networks
US9911413B1 (en) 2016-12-28 2018-03-06 Amazon Technologies, Inc. Neural latent variable model for spoken language understanding
US10170107B1 (en) 2016-12-29 2019-01-01 Amazon Technologies, Inc. Extendable label recognition of linguistic input
US10490182B1 (en) * 2016-12-29 2019-11-26 Amazon Technologies, Inc. Initializing and learning rate adjustment for rectifier linear unit based artificial neural networks
US11010431B2 (en) 2016-12-30 2021-05-18 Samsung Electronics Co., Ltd. Method and apparatus for supporting machine learning algorithms and data pattern matching in ethernet SSD
US20180189641A1 (en) * 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Hardware accelerator engine
KR20180080446A (en) 2017-01-04 2018-07-12 Samsung Electronics Co., Ltd. Voice recognizing method and voice recognizing apparatus
US11250311B2 (en) * 2017-03-15 2022-02-15 Salesforce.Com, Inc. Deep neural network-based decision network
KR102415508B1 (en) * 2017-03-28 2022-07-01 Samsung Electronics Co., Ltd. Convolutional neural network processing method and apparatus
US10819724B2 (en) * 2017-04-03 2020-10-27 Royal Bank Of Canada Systems and methods for cyberbot network detection
WO2018184102A1 (en) * 2017-04-03 2018-10-11 Royal Bank Of Canada Systems and methods for malicious code detection
US10346944B2 (en) * 2017-04-09 2019-07-09 Intel Corporation Machine learning sparse computation mechanism
US11640526B2 (en) * 2017-05-23 2023-05-02 Intel Corporation Methods and apparatus for enhancing a neural network using binary tensor and scale factor pairs
US11562287B2 (en) * 2017-10-27 2023-01-24 Salesforce.Com, Inc. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning
US10657259B2 (en) * 2017-11-01 2020-05-19 International Business Machines Corporation Protecting cognitive systems from gradient based attacks through the use of deceiving gradients
US11270187B2 (en) * 2017-11-07 2022-03-08 Samsung Electronics Co., Ltd Method and apparatus for learning low-precision neural network that combines weight quantization and activation quantization
US20190180183A1 (en) * 2017-12-12 2019-06-13 Amazon Technologies, Inc. On-chip computational network
US10593321B2 (en) * 2017-12-15 2020-03-17 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for multi-lingual end-to-end speech recognition
US10657426B2 (en) * 2018-01-25 2020-05-19 Samsung Electronics Co., Ltd. Accelerating long short-term memory networks via selective pruning
US11741346B2 (en) * 2018-02-08 2023-08-29 Western Digital Technologies, Inc. Systolic neural network engine with crossover connection optimization
US20190266246A1 (en) * 2018-02-23 2019-08-29 Microsoft Technology Licensing, Llc Sequence modeling via segmentations
US10629193B2 (en) * 2018-03-09 2020-04-21 Microsoft Technology Licensing, Llc Advancing word-based speech recognition processing
US10740647B2 (en) * 2018-03-14 2020-08-11 Adobe Inc. Detecting objects using a weakly supervised model
US10836379B2 (en) * 2018-03-23 2020-11-17 Sf Motors, Inc. Multi-network-based path generation for vehicle parking
US10902302B2 (en) * 2018-04-23 2021-01-26 International Business Machines Corporation Stacked neural network framework in the internet of things
US10621991B2 (en) * 2018-05-06 2020-04-14 Microsoft Technology Licensing, Llc Joint neural network for speaker recognition
US10714122B2 (en) * 2018-06-06 2020-07-14 Intel Corporation Speech classification of audio for wake on voice
US10705892B2 (en) * 2018-06-07 2020-07-07 Microsoft Technology Licensing, Llc Automatically generating conversational services from a computing application
US10665222B2 (en) * 2018-06-28 2020-05-26 Intel Corporation Method and system of temporal-domain feature extraction for automatic speech recognition
US11210475B2 (en) * 2018-07-23 2021-12-28 Google Llc Enhanced attention mechanisms
US10210860B1 (en) * 2018-07-27 2019-02-19 Deepgram, Inc. Augmented generalized deep learning with special vocabulary
US10650807B2 (en) * 2018-09-18 2020-05-12 Intel Corporation Method and system of neural network keyphrase detection
US10909970B2 (en) * 2018-09-19 2021-02-02 Adobe Inc. Utilizing a dynamic memory network to track digital dialog states and generate responses

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11514368B2 (en) * 2019-03-29 2022-11-29 Advanced New Technologies Co., Ltd. Methods, apparatuses, and computing devices for trainings of learning models
CN111611892A (en) * 2020-05-14 2020-09-01 青岛翰林汇力科技有限公司 Comprehensive intelligent deep learning method applying neural network

Also Published As

Publication number Publication date
US20200294492A1 (en) 2020-09-17
US20200035222A1 (en) 2020-01-30
US10380997B1 (en) 2019-08-13
US20230317062A1 (en) 2023-10-05
US20210035565A1 (en) 2021-02-04
US10847138B2 (en) 2020-11-24
US20200035224A1 (en) 2020-01-30
US10720151B2 (en) 2020-07-21
US10210860B1 (en) 2019-02-19
US11367433B2 (en) 2022-06-21
US10540959B1 (en) 2020-01-21
US11676579B2 (en) 2023-06-13

Similar Documents

Publication Publication Date Title
US11676579B2 (en) Deep learning internal state index-based search and classification
US11210475B2 (en) Enhanced attention mechanisms
EP2727103B1 (en) Speech recognition using variable-length context
US9336771B2 (en) Speech recognition using non-parametric models
Lozano-Diez et al. An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition
Hakkani-Tür et al. Beyond ASR 1-best: Using word confusion networks in spoken language understanding
US11205444B2 (en) Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
US20170148433A1 (en) Deployed end-to-end speech recognition
US11521071B2 (en) Utilizing deep recurrent neural networks with layer-wise attention for punctuation restoration
US20190385610A1 (en) Methods and systems for transcription
US20220189456A1 (en) Unsupervised Learning of Disentangled Speech Content and Style Representation
Sarthak et al. Spoken language identification using convnets
EP3739570A1 (en) Attention-based neural sequence to sequence mapping applied to speech synthesis and vocal translation
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN117037789B (en) Customer service voice recognition method and device, computer equipment and storage medium
US20230252994A1 (en) Domain and User Intent Specific Disambiguation of Transcribed Speech
US20230237990A1 (en) Training speech processing models using pseudo tokens
Protserov et al. Segmentation of Noisy Speech Signals
CN114783413A (en) Method, device, system and equipment for re-scoring language model training and voice recognition
Humayun et al. A review of social background profiling of speakers from speech accents
WO2023234942A1 (en) Spoken language understanding using machine learning

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

AS Assignment

Owner name: DEEPGRAM, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SYPNIEWSKI, ADAM;WARD, JEFF;STEPHENSON, SCOTT;SIGNING DATES FROM 20180817 TO 20180820;REEL/FRAME:047875/0886

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

Free format text: ENTITY STATUS SET TO MICRO (ORIGINAL EVENT CODE: MICR); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2551); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

Year of fee payment: 4