US9892726B1 - Class-based discriminative training of speech models - Google Patents
- Publication number
- US9892726B1
- Authority
- US
- United States
- Prior art keywords
- mixture model
- subspace
- gaussian mixture
- subspace matrix
- mean vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L2015/0631—Creating reference templates; Clustering
- G10L2015/088—Word spotting
- G10L2015/223—Execution procedure of a spoken command
Definitions
- Models representing data relationships and patterns may accept audio data input (sometimes referred to as an input vector), and produce output (sometimes referred to as an output vector) that corresponds to the input in some way.
- a model is used to generate a probability or set of probabilities that the input corresponds to a particular language unit (e.g., phoneme, phoneme portion, triphone, word, n-gram, part of speech, etc.).
- an automatic speech recognition (“ASR”) system may utilize various models, such as an acoustic model and a language model, to recognize speech.
- the acoustic model is used to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance.
- the language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance.
- ASR systems commonly utilize Gaussian mixture models (“GMMs”) to model acoustic input.
- Features can be extracted from an utterance in the form of feature vectors, which include one or more numbers that describe the audio input.
- the feature vectors can be processed using GMMs to determine the most likely word or subword unit that was spoken and that resulted in the corresponding feature vectors extracted.
- GMMs are trained, using training data, to maximize the likelihood of the correct words or subword units corresponding to the feature vectors of the training data.
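The likelihood computation described above can be sketched in Python. This is a minimal illustration only, assuming a diagonal-covariance GMM; the function names and the log-sum-exp formulation are this sketch's own conventions, not taken from the patent:

```python
import math

def log_gaussian_diag(x, mean, var):
    # log N(x; mean, diag(var)) for one diagonal-covariance Gaussian
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    # log sum_k w_k * N(x; mu_k, Sigma_k), via log-sum-exp for stability
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    mx = max(terms)
    return mx + math.log(sum(math.exp(t - mx) for t in terms))

# A single standard Gaussian at the origin scores a zero vector at
# -0.5 * log(2 * pi) per dimension.
print(gmm_log_likelihood([0.0], [1.0], [[0.0]], [[1.0]]))
```

Training to maximize the likelihood of labeled data then amounts to adjusting `weights`, `means`, and `variances` so that scores like this are maximized over the training feature vectors.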
- FIG. 1 is a flow diagram of an illustrative process for learning a discriminative model parameter subspace according to some embodiments.
- FIG. 2 is a block diagram of an illustrative statistical modeling environment including a model training server, a speech recognition server, and various user computing devices according to some embodiments.
- FIG. 3 is a flow diagram of an illustrative process for generating data for use in learning a discriminative model parameter subspace according to some embodiments.
- FIG. 4 is a flow diagram of an illustrative process for using a model modified with a discriminative model parameter subspace according to some embodiments.
- FIG. 5 is a block diagram of an illustrative computing system configured to modify and use models according to some embodiments.
- input may be classified as any of a finite set of classes that can be determined from audio data, such as: speech or non-speech; one of multiple (e.g., two or more) different keywords; speech from a male speaker or a female speaker; etc.
- the input may be classified using models for the individual classes.
- the models may be modified based on a selected portion of the model parameters. For example, the models may be trained, adapted, adjusted, etc. using a portion or subset of a total parameter space, such as a low-dimension subspace of the total parameter space (e.g., a subspace with fewer dimensions than the total parameter space).
- the subspace may be learned specifically to discriminate between the possible classes (e.g., points in the subspace, corresponding to the various classes, are determined such that a model modified with the points is better able to discriminate the various classes).
- a GMM may be used to model statistical data computed from audio signals, such as a GMM in which the individual components are Gaussian distributions of statistical features computed from audio signals.
- the individual Gaussians may be weighted with respect to each other, and may be defined by parameter information regarding the mean and the variance of the Gaussian distribution, such as a mean vector and a covariance matrix.
- a GMM may be defined by a set of weights and corresponding parameters for its component Gaussians.
- One common way, among others, of modifying the Gaussians of a GMM is to adjust their mean vectors.
- modifying a model may be referred to as “adapting” or “adjusting” the model.
- the mean vectors for all components of the GMM may be concatenated to obtain the total parameter space for the mean vectors of the GMM.
- a subspace of that total parameter space may then be identified by computing the best possible coordinate within the subspace for each segment of input data being used to train the GMM.
- the subspace can then be adjusted such that the likelihood of the training data is maximized when the components of the GMM are modified based on the subspace.
- an illustrative GMM has two Gaussians, each modeling a distribution in 10-dimensional space.
- the mean vectors for each of the two Gaussians are therefore 10-dimensional vectors (e.g., vectors with 10 separate values corresponding to the means of the 10 dimensions of the distribution).
- One way to modify the GMM is to determine new mean vectors anywhere within a 20-dimensional space defined by the combination of the two 10-dimensional mean vectors of the component Gaussians.
- Another way to modify the GMM is to place a constraint on the total parameter space within which the new mean vectors may be determined (e.g., the 20-dimensional space), thereby using only a subspace of the total parameter space of the GMM (e.g., some lower-dimensional subspace of the 20-dimensional space).
- the subspace can be selected using points that maximize the likelihood of the training data being used.
- a subspace matrix defining the subspace may be added to the mean vectors to modify (or “adapt” or “adjust”) the GMM.
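The adaptation step can be sketched as follows. This is a hedged illustration: `adapt_means` is not code from the patent, and it assumes the convention stated above that the quantity added to the concatenated mean vectors is the product of the subspace matrix T and a subspace point q:

```python
def adapt_means(means, T, q):
    # means: per-component mean vectors, e.g. two 10-dim vectors.
    # Concatenate them into the total parameter space, add T @ q
    # (T has one row per total-space dimension, one column per
    # subspace dimension), then split back into per-component means.
    flat = [v for mean in means for v in mean]
    shift = [sum(row[k] * q[k] for k in range(len(q))) for row in T]
    adapted = [f + s for f, s in zip(flat, shift)]
    d = len(means[0])
    return [adapted[i * d:(i + 1) * d] for i in range(len(means))]

# Two 2-dim components and a 1-dim subspace: q scales one fixed
# direction through the 4-dim total parameter space.
T = [[1.0], [0.0], [0.0], [1.0]]
print(adapt_means([[0.0, 0.0], [1.0, 1.0]], T, [0.5]))
```

Because q has fewer dimensions than the concatenated means, every adapted model is constrained to lie in the low-dimensional subspace spanned by the columns of T.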
- models modified using these and other techniques are not able to classify input data because they are not designed to deal with classes, or they are designed with the assumption that all data is properly classified in a single class. Thus, models modified using such existing techniques are not able to adequately discriminate between classes of input data.
- modifying the model may include adding a subspace matrix defining the learned subspace (or adding data derived from the subspace matrix, such as a product of the matrix and a particular data value) to the mean vectors of the model.
- the training data may include a set of data segments (e.g., each data segment may correspond to an utterance, and may be composed of one or more feature vectors) and correct class labels for the data segments.
- a discriminative objective function (e.g., a function that corresponds to the objective of the process: to discriminate classes) can use data generated using the generative model, or values derived therefrom, to learn the desired subspace that maximizes the probability of the correct classes for the training data.
- data is generated by randomly selecting a class, selecting a Gaussian from an existing GMM that models the selected class, generating a point in an initial guess for the subspace, and then generating values based on the selected Gaussian modified using the initial guess for the subspace and the point generated in the subspace.
- a generative model can be used to optimize the discriminative objective function.
- the subspace is learned such that modifying the model within the learned subspace improves the probability that the model correctly classifies data.
- the discriminative objective function, also referred to simply as the “objective function,” may include one or more hidden or latent variables and any number of observed variables.
- an auxiliary function with the same gradient (e.g., derivative, or slope and direction of maximum increase in function value) as the objective function may be used in place of the objective function for convenience, such as when performing computations with the auxiliary function is easier or less resource-intensive than performing computations with the objective function.
- the hidden variables of the objective function may include, among others, the mean of the posterior distribution of the subspace points corresponding to the individual data segments for a given class, the posterior probability of the correct class for a given data segment, and the like. Using the data described above and in greater detail below, such hidden variables can be computed and then used in the objective function (or auxiliary function) to learn the subspace.
- the class models can be used to classify input data into particular classes. For example, a keyword spotting system may use such class models to detect when a particular keyword has been spoken by classifying audio input as including the particular keyword out of multiple possible keywords, as including or excluding the particular keyword, etc.
- the class models may be used to determine the gender of a speaker so that speech recognition models tailored for the speaker's gender can be used.
- the class models may be used to determine whether input contains speech of any kind, such that speech recognition systems only proceed to perform speech recognition on input that is classified as human speech (as opposed to non-speech).
- FIG. 1 depicts an illustrative process 100 for learning a discriminative subspace that can be used to modify class models for improved class discrimination.
- the process 100 may be used to learn the subspace such that when a model is modified using points in that subspace, the probability of correct classification using the model is maximized or otherwise improved.
- the process 100 begins at block 102 .
- the process 100 may be embodied in a set of executable program instructions stored on a computer-readable medium, such as one or more disk drives, of a computing system, such as the model training system 202 shown in FIG. 2 and described in greater detail below.
- the executable program instructions can be loaded into memory, such as RAM, and executed by one or more processors of the computing system.
- the system executing the process 100 can obtain models for the various classes of input data to be classified.
- the models may be previously-generated or previously-trained GMMs that include any number of individual Gaussians.
- the GMM for a given class may have been generated using training data for the class, and the GMM may have been trained to maximize the likelihood of the training data for the class.
- the system executing the process 100 can determine an initial subspace T 0 of the overall parameter space for the GMMs to use as a starting point in the subspace learning process.
- the initial subspace may be a subspace matrix T 0 that is randomly initialized.
- the subspace matrix T 0 may be intelligently initialized. For example, techniques such as principal component analysis (“PCA”) may be used to initialize the subspace by applying a statistical transformation to values from the total parameter space.
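A PCA-style initialization of T 0 can be sketched in plain Python via power iteration on the covariance of sample vectors from the total parameter space. The helper below is illustrative only; the patent does not specify this procedure, and a real system would likely use a linear-algebra library instead:

```python
import math
import random

def pca_init_subspace(samples, k, iters=100, seed=0):
    # Estimate the top-k principal directions of `samples` by deflated
    # power iteration and stack them as the columns of T0 (dims x k).
    rng = random.Random(seed)
    d = len(samples[0])
    mean = [sum(s[i] for s in samples) / len(samples) for i in range(d)]
    centered = [[s[i] - mean[i] for i in range(d)] for s in samples]
    cov = [[sum(c[i] * c[j] for c in centered) / len(centered)
            for j in range(d)] for i in range(d)]
    directions = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        # deflate: remove the found direction's variance from cov
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                  for i in range(d))
        for i in range(d):
            for j in range(d):
                cov[i][j] -= lam * v[i] * v[j]
        directions.append(v)
    return [[directions[c][i] for c in range(k)] for i in range(d)]
```

Seeding the matrix with the directions of greatest variance in the parameter samples gives the learning loop a better starting point than a purely random T 0.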
- the process of learning a better (e.g., more discriminative) subspace matrix T final to use in modifying the class GMMs may be based on an objective function (e.g., a function that corresponds to the objective of the process: to accurately discriminate classes).
- the gradient of the objective function may be computed with respect to the initial guess T 0 for the subspace matrix, and the process may be repeated to adjust the subspace matrix as described in greater detail below.
- the objective function may be given by equation [1], below:
- C s is a random variable representing the class of the s th data segment determined using the model
- y s is the true class label of the s th data segment
- X s is the set of feature vectors for the s th segment
- λ is the set of known parameters for the class GMM (weights, mean vectors and covariance matrices for the component Gaussians) and also the subspace matrix T currently being used (initially, T 0 ).
- the sample objective function may be constructed as the difference of two terms: the first term corresponds to the log probability of observing feature vectors for the data segment and also determining the correct class for that data segment using the model with parameters λ, and the second term corresponds to the log probability of observing the feature vectors for the data segment using the model with parameters λ.
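The difference of those two terms reduces to the log posterior of the correct class for each segment. A hedged per-segment sketch, where `class_loglikes[j]` stands for log p(X s | C s = j, λ) and `priors[j]` for the class prior (both names are this sketch's, not the patent's):

```python
import math

def discriminative_objective(class_loglikes, priors, true_class):
    # First term: log p(X_s, C_s = y_s | lambda)
    joint_correct = math.log(priors[true_class]) + class_loglikes[true_class]
    # Second term: log p(X_s | lambda) = log sum_j p(C_s=j) p(X_s | C_s=j),
    # computed with log-sum-exp for numerical stability
    terms = [math.log(p) + ll for p, ll in zip(priors, class_loglikes)]
    mx = max(terms)
    marginal = mx + math.log(sum(math.exp(t - mx) for t in terms))
    # The difference is the log posterior of the correct class; the full
    # objective would sum this quantity over all training segments.
    return joint_correct - marginal
```

The objective is never positive (a posterior cannot exceed 1), and it approaches zero as the model grows more confident in the correct class.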
- the gradient of the objective function may be computed with respect to the initialized subspace matrix in order to learn a better (e.g., more discriminative) subspace matrix.
- an auxiliary function (a function that can be shown to have the same gradient) may be used instead.
- the weak sense auxiliary function of the objective function in equation [1] may be constructed as the difference between the auxiliary functions of the individual terms in equation [1], as shown in equation [2], below:
- the initial subspace T 0 may be part of the initial model parameters ⁇ 0 shown in equation [2].
- one or more hidden variables (e.g., variables which cannot be directly observed from the data) may be computed: the posterior distribution of the subspace points q s may be computed, the posterior probability of the correct class y s for a given data segment s may be computed, etc., using data generated below.
- the system executing the process 100 can generate data using the class models modified with the current subspace matrix T.
- FIG. 3 , described in greater detail below, illustrates an example process for generating data using the class models modified with the subspace matrix.
- Such a generative model can be used for computing the hidden variables.
- block 108 may be omitted or replaced with an alternative method of obtaining data from which to compute the hidden variables described below.
- the system executing the process 100 can use the data generated above to, among other things, compute the hidden variables of the auxiliary function.
- the specific Gaussian component Z s used to generate the feature vectors X s for the s th segment, the subspace point q s randomly generated and used to modify the Gaussian component Z s , and the correct class y s for the s th segment may be used to compute the mean posterior distribution of subspace points, the posterior probability of the correct class, etc.
- Equation [3], below, gives the posterior distribution of the subspace point q s for a given class j:
f ( q s | X s , Z s = j, C s = j ) ∼ N ( · ; μ q s j , Σ q s j ) [3]
where μ q s j corresponds to the mean of the distribution, and Σ q s j corresponds to the covariance matrix of the distribution. Equation [4], below, gives the posterior probability of the correct class for a given data segment X s :
- the system executing the process 100 can compute the gradient of the objective function (or auxiliary function) with respect to the current subspace matrix T.
- the gradient of the auxiliary function may be computed with respect to T 0 . If the subspace has been updated to T 1 , then the gradient of the auxiliary function may be computed with respect to T 1 , and so on.
- the system executing the process 100 can update the subspace matrix using the gradient computed above to generate T 1 (when the gradient above is computed with respect to T 0 ), T 2 (when the gradient above is computed with respect to T 1 ), and so on.
- One method for updating a subspace matrix T i for a given class j is given by equation [5], below:
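Equation [5] itself is not reproduced in this excerpt. As a hedged stand-in, a generic gradient-ascent update of the subspace matrix — one plausible form of such an update, not necessarily the patent's equation [5] — looks like:

```python
def update_subspace(T_i, grad, step=0.1):
    # T_{i+1} = T_i + step * (gradient of the objective w.r.t. T_i);
    # stepping along the gradient increases the discriminative objective.
    return [[t + step * g for t, g in zip(t_row, g_row)]
            for t_row, g_row in zip(T_i, grad)]

print(update_subspace([[1.0, 0.0]], [[0.5, -0.5]], step=0.2))
```

The `step` size here is an assumption of the sketch; in practice it would be tuned or derived from the auxiliary-function optimization.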
- the system executing the process 100 can determine whether to return to block 108 (or, in some embodiments, to block 110 ) to iteratively learn the final subspace matrix to be used. If the final subspace matrix T final has been learned, the process 100 may terminate at block 118 . In some embodiments, the system executing the process 100 can determine that the final subspace matrix T final has been learned when objective function calculations are converging (e.g., the output from the objective function is no longer changing, or is changing by less than a threshold amount, from iteration to iteration of the process 100 ). In some embodiments, the subspace matrix can be used on training data, and accuracy can be compared from iteration to iteration.
- T final can be determined (e.g., the last version of the matrix used before accuracy began to degrade can be identified as T final ).
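The convergence test described above can be as simple as comparing successive objective values. A minimal sketch (the function name and default threshold are assumptions, not from the patent):

```python
def has_converged(objective_history, tol=1e-4):
    # Stop iterating once the objective changes by less than `tol`
    # between successive iterations of the subspace-learning loop.
    if len(objective_history) < 2:
        return False
    return abs(objective_history[-1] - objective_history[-2]) < tol

print(has_converged([-3.2, -2.1, -2.1]))
```

The accuracy-based variant would track held-out classification accuracy per iteration instead, and keep the last T before accuracy began to degrade.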
- FIG. 2 illustrates an example statistical modeling environment 200 including a model training system 202 , a speech recognition system 204 , and various client devices 206 , 208 in communication via a communication network 210 .
- the model training system 202 may train models to discriminate between classes of input data, as described above.
- the trained models may then be used by the speech recognition system 204 to classify or otherwise process audio input obtained from user devices 206 and 208 .
- a user can issue spoken commands to a user device 206 in order to get directions, listen to music, query a data source, dictate a document or message, or the like, and the audio input may be processed by the speech recognition system 204 , using models trained by the model training system 202 , to classify the audio input or some portion thereof.
- the user devices 206 and 208 can correspond to a wide variety of electronic devices.
- a user device 206 or 208 may be a computing device that includes one or more processors and a memory which may contain software applications executed by the processors.
- the user device 206 or 208 may include a microphone or other audio input component for accepting speech input on which to perform speech recognition.
- the software of the user device 206 or 208 may include components for establishing communications over wireless communication networks or directly with other computing devices.
- the user device 206 or 208 may be a personal computing device, laptop computing device, hand held computing device, terminal computing device, mobile device (e.g., mobile phones or tablet computing devices), wearable device configured with network access and program execution capabilities (e.g., “smart eyewear” or “smart watches”), wireless device, electronic reader, media player, home entertainment system, gaming console, set-top box, television configured with network access and program execution capabilities (e.g., “smart TVs”), or some other electronic device or appliance.
- the model training system 202 can be any computing system that is configured to train or modify models using a learned discriminative subspace matrix.
- the model training system 202 may include server computing devices, desktop computing devices, mainframe computers, other computing devices, some combination thereof, etc.
- the model training system 202 can include several devices physically or logically grouped together, such as an application server and a database server.
- the application server may execute the process 100 described above, the process 300 described below, or other additional or alternative processes to train or modify models using a learned discriminative subspace matrix.
- the database server may store existing models for use by the application server, trained or modified models generated by the application server, or the like.
- the model training system 202 may be implemented on one or more of the server computing devices 500 as shown in FIG. 5 and described in greater detail below.
- the speech recognition system 204 can be any computing system that is configured to perform speech recognition or classification on audio input.
- the speech recognition system 204 may include server computing devices, desktop computing devices, mainframe computers, other computing devices, some combination thereof, etc.
- the speech recognition system 204 can include several devices physically or logically grouped together, such as an application server and a database server.
- the speech recognition system 204 can include an automatic speech recognition (“ASR”) module or system, a natural language understanding (“NLU”) module or system, one or more applications, other modules or systems, some combination thereof, etc.
- the speech recognition system 204 may be implemented on one or more of the server computing devices 500 as shown in FIG. 5 and described in greater detail below.
- the speech recognition system 204 may be physically or logically associated with the model training system 202 (or certain modules or components thereof) such that the speech recognition system 204 and model training system 202 are not separate systems, but rather parts of a single integrated system.
- such an integrated system may be implemented on one or more of the server computing devices 500 as shown in FIG. 5 and described in greater detail below.
- the features and services provided by the speech recognition system 204 and/or model training system 202 may be implemented as web services consumable via the communication network 210 .
- the speech recognition system 204 and/or model training system 202 are provided by one or more virtual machines implemented in a hosted computing environment.
- the hosted computing environment may include one or more rapidly provisioned and released computing resources, which computing resources may include computing, networking and/or storage devices.
- a hosted computing environment may also be referred to as a cloud computing environment.
- the speech recognition system 204 may be physically located on a user device 206 , 208 .
- a user device 206 may include an integrated ASR module or system such that no network access is required in order to use some or all of its features.
- the models trained or generated by the model training system 202 may be provided to or otherwise received by the user device 206 for use in speech recognition, input classification, etc.
- a portion of ASR processing or other speech recognition system 204 functionality may be implemented on a user device 206 or 208 , and other speech recognition system 204 components and features may be accessible via a communication network 210 .
- the network 210 may be a publicly accessible network of linked networks, possibly operated by various distinct parties, such as the Internet.
- the network 210 may include a private network, personal area network (“PAN”), local area network (“LAN”), wide area network (“WAN”), cable network, satellite network, etc. or some combination thereof, each with access to and/or from the Internet.
- the devices of the speech recognition system 204 and/or the model training system 202 may be located within a single data center, and may communicate via a private network as described above.
- the user devices 206 , 208 may communicate with speech recognition system 204 and/or the model training system 202 via the Internet.
- the user devices 206 , 208 may have access to the Internet via a wired or WiFi connection, or via a cellular telephone network (e.g., a Long Term Evolution or LTE network).
- the model training system 202 may obtain existing models at (A) and learn a discriminative subspace at (B).
- the model training system 202 may use the process 100 described above to learn a discriminative subspace for the models.
- the model training system 202 may generate data to use during the process, as described above with respect to FIG. 1 .
- One method of generating such data is shown in FIG. 3 .
- the model training system 202 may use the process 300 shown in FIG. 3 to generate data using points within the current subspace so that the discriminative effect of the current subspace can be evaluated and/or improved.
- the process 300 begins at block 302 .
- the process 300 may be embodied in a set of executable program instructions stored on a computer-readable medium, such as one or more disk drives, of a computing system, such as the model training system 202 .
- the executable program instructions can be loaded into memory, such as RAM, and executed by one or more processors of the computing system.
- the process may be part of the process 100 described above (e.g., block 108 ), or it may be a separate process that is invoked by the process 100 .
- the system executing the process 300 can generate a point q within the current subspace (e.g., T 0 , T 1 , etc.).
- the point may be generated randomly according to a predetermined or dynamically determined distribution.
- the point may be randomly generated from a Gaussian distribution having a zero mean and a unit variance such that over a statistically significant number of iterations, the distribution of generated subspace points corresponds to a simple Gaussian distribution.
- the system executing the process 300 can select a class j and access the model (e.g., the GMM) for the class.
- the class may be randomly selected.
- the class may be randomly selected according to a prior probability distribution such that classes with higher prior probabilities have a greater chance of being selected.
- the system executing the process 300 can select a component z from the class model (e.g., a particular Gaussian from the GMM) for the selected class.
- the component may be selected based on the component weights w such that components with a higher weight have a greater chance of being selected.
- a data segment may include one or more feature vectors.
- the data segment may be generated by randomly generating values for the feature vectors according to the selected Gaussian component, modified using the subspace matrix and the subspace point, such that over a statistically significant number of iterations, the distribution of feature vector values corresponds to the selected Gaussian modified using the subspace matrix and the subspace point.
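The generative steps above can be sketched end-to-end. This is a hedged illustration: the data layout (per-class GMMs as (weights, means, stdevs) tuples) and the simplification that T maps the subspace point directly to a mean offset for the selected component are this sketch's assumptions:

```python
import random

def generate_segment(class_gmms, priors, T, q_dim, n_vectors=3, rng=None):
    # One pass of the generative process:
    #  1. draw a subspace point q ~ N(0, I)
    #  2. pick a class j according to its prior probability
    #  3. pick a Gaussian component z by its mixture weight
    #  4. sample feature vectors from that component with its mean
    #     shifted by T @ q
    rng = rng or random.Random()
    q = [rng.gauss(0.0, 1.0) for _ in range(q_dim)]
    j = rng.choices(range(len(class_gmms)), weights=priors)[0]
    weights, means, stdevs = class_gmms[j]
    z = rng.choices(range(len(weights)), weights=weights)[0]
    shift = [sum(T[i][k] * q[k] for k in range(q_dim))
             for i in range(len(means[z]))]
    segment = [
        [rng.gauss(m + s, sd)
         for m, s, sd in zip(means[z], shift, stdevs[z])]
        for _ in range(n_vectors)
    ]
    return segment, q, j, z
```

Returning q, j, and z alongside the segment matters: they are exactly the quantities the learning process uses to compute the hidden variables of the auxiliary function.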
- the system executing the process 300 can determine whether to generate additional data. If so, the process 300 can return to blocks 304 and 306 . Otherwise, the process 300 can terminate at block 314 .
- the amount of data that the system generates using the process 300 may be application-specific. In some embodiments, a single data segment or a small number of data segments may be generated and used for an iteration of the process 100 . In other embodiments, a larger number of data segments may be generated for use during each iteration of the process 100 , or some subset thereof.
- the model training system 202 may determine the final subspace and subspace points with which to modify the model at (C), as described above, and may provide the models to the speech recognition system 204 or some other system or device at (D).
- the speech recognition system 204 may use the class models provided by the model training system 202 to classify audio received from user computing devices 206 , 208 .
- user computing device 206 may capture audio at ( 1 ), and the audio may be provided to the speech recognition system 204 at ( 2 ).
- the speech recognition system 204 may then classify the audio at ( 3 ) as non-speech, or as failing to include a keyword that indicates that the user is addressing the device and the subsequent speech is to be processed.
- the speech recognition system 204 may end processing at ( 4 ).
- One example process for making such classifications is shown in FIG. 4 .
- the process 400 shown in FIG. 4 begins at block 402 .
- the process 400 may be embodied in a set of executable program instructions stored on a computer-readable medium, such as one or more disk drives, of a computing system, such as the speech recognition system 204 .
- the executable program instructions can be loaded into memory, such as RAM, and executed by one or more processors of the computing system.
- the process 400 may be executed in response to the receipt of audio from a user computing device or some other source.
- the system executing the process 400 can extract features from the audio, as is known to those of skill in the art.
- the system executing the process 400 can process the features, extracted as feature vectors from the audio, using the discriminative models provided by the model training system 202 .
- the system may use equation [4], or some variant thereof, to determine the most probable true class for the feature vectors extracted from the audio input. For example, the probability that audio input (from which feature vectors X_s are extracted for each segment s of audio input) is of a particular class (j) can be computed using equation [4].
- the GMM parameters ( ⁇ ) for each class (j) can include weights, mean vectors, covariance matrices, and learned subspace matrices (T) for each of the individual component Gaussians of the GMM, or for some subset thereof.
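The class scoring described above can be sketched concretely. The snippet below is a minimal illustration, not the patent's equation [4]: it assumes diagonal-covariance GMMs, ignores the learned subspace terms, and all function and variable names are hypothetical.

```python
import numpy as np

def log_gmm_likelihood(X, weights, means, variances):
    """Total log-likelihood of feature vectors X under a diagonal-covariance GMM.

    X: (num_frames, dim); weights: (num_components,);
    means, variances: (num_components, dim).
    """
    diff = X[:, None, :] - means[None, :, :]                 # (frames, comps, dim)
    log_det = np.sum(np.log(variances), axis=1)              # (comps,)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (frames, comps)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * (means.shape[1] * np.log(2 * np.pi) + log_det)[None, :]
                - 0.5 * mahal)
    # Log-sum-exp over components for each frame, then sum over frames.
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.sum(m.squeeze(1)
                        + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def classify_segment(X, class_models):
    """Return the class whose GMM assigns the segment the highest score.

    class_models maps a class label j to (weights, means, variances)."""
    scores = {j: log_gmm_likelihood(X, *params)
              for j, params in class_models.items()}
    return max(scores, key=scores.get)
```

With equal class priors, the class maximizing this likelihood is also the most probable class, which is the decision the process 400 needs.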
- the system executing the process 400 can determine if the class identified above is “non-speech.” If so, the process 400 may terminate at block 416 . Otherwise, the process may proceed to block 414 , where speech recognition is performed.
- the system executing the process 400 can determine at decision block 410 whether the class identified above is “non-keyword.” If so (e.g., the audio may have included speech, but nevertheless failed to include a keyword that triggers speech processing), the process 400 may terminate at block 416 . Otherwise, the process 400 may proceed to block 414 , where speech recognition is performed.
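The two gating decisions above (blocks 408, 410, 414, and 416) amount to a simple control flow, sketched below with hypothetical callables standing in for the model-based classification and recognition steps.

```python
def handle_audio(features, classify, recognize):
    """Gate full speech recognition on the segment's class.

    classify(features) -> class label string
    recognize(features) -> recognition result
    """
    label = classify(features)
    if label in ("non-speech", "non-keyword"):
        # Terminate without performing speech recognition (block 416).
        return None
    # Otherwise perform speech recognition (block 414).
    return recognize(features)
```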
- user computing device 208 may capture audio at (I), and the audio may be provided to the speech recognition system 204 at (II).
- the speech recognition system 204 may then classify, at (III), the speaker of speech detected in the audio as a member of a particular class (e.g., speaker of a specific gender, speaker of a specific language, etc.).
- the speech recognition system 204 may process the speech using class-specific models at (IV) that provide greater accuracy than general models, and speech recognition results may be returned to the user device 208 at (V).
- One example process for making such classifications is shown in FIG. 4 .
- the system executing the process 400 can extract features from the audio, as is known to those of skill in the art.
- the system executing the process 400 can process the features, extracted as feature vectors from the audio, using the discriminative models provided by the model training system 202 .
- the model may use equation [4], or some variant thereof, to determine the most probable true class for the feature vectors extracted from the audio.
- the system executing the process 400 can access one or more class-specific models for performing speech recognition on the audio. For example, if the audio is classified as including speech from a male speaker, then male-specific models may be accessed. The process may then proceed to block 414 , where speech recognition is performed using the class-specific models.
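The class-to-model lookup described above can be as simple as a keyed registry with a general-model fallback. This is an illustrative sketch; the registry keys and the fallback behavior are assumptions, not taken from the patent.

```python
def select_models(speaker_class, model_registry, default_key="general"):
    """Pick class-specific recognition models when available.

    model_registry: mapping from a class label (e.g. "male", "female")
    to a model bundle; falls back to the general models when no
    class-specific bundle exists.
    """
    return model_registry.get(speaker_class, model_registry[default_key])
```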
- FIG. 5 illustrates an example server computing device 500 configured to execute the processes and implement the features described above.
- the computing device 500 can be a server or other computing device, and can comprise a processing unit 502 , a network interface 504 , a computer readable medium drive 506 , an input/output device interface 508 , and a memory 510 .
- the network interface 504 can provide connectivity to one or more networks or computing systems.
- the processing unit 502 can receive information and instructions from other computing systems or services via the network interface 504.
- the network interface 504 can also store data directly to memory 510 .
- the processing unit 502 can communicate to and from memory 510 , execute instructions and process data in memory 510 , etc.
- the memory 510 may include computer program instructions that the processing unit 502 executes in order to implement one or more embodiments.
- the memory 510 generally includes volatile memory, such as RAM, and/or other non-transitory computer-readable media.
- the memory 510 can store an operating system 512 that provides computer program instructions for use by the processing unit 502 in the general administration and operation of the computing device 500 .
- the memory 510 can further include computer program instructions and other information for implementing aspects of the present disclosure.
- the memory 510 includes a model modification module 514 that determines subspace matrices to be used in modifying class models, as described above with respect to process 200 .
- Memory 510 may also include an input classification module 516 that uses modified models to classify input, as described above with respect to process 400 .
- the computing device 500 may include additional or fewer components than are shown in FIG. 5.
- a computing device 500 may include more than one processing unit 502 and computer readable medium drive 506.
- multiple (e.g., two or more) computing devices 500 may together form a computer system for executing features of the present disclosure.
- multiple computing devices 500 may communicate with each other via their respective network interfaces 504 , and can implement load balancing of multiple tasks (e.g., each computing device 500 may execute one or more separate instances of the processes 200 and/or 400 ), parallel processing (e.g., each computing device 500 may execute a portion of a single instance of a process 200 and/or 400 ), etc.
- a machine such as a general purpose processor device, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
- a general purpose processor device can be a microprocessor, but in the alternative, the processor device can be a controller, microcontroller, or state machine, combinations of the same, or the like.
- a processor device can include electrical circuitry configured to process computer-executable instructions.
- a processor device includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions.
- a processor device can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- a processor device may also include primarily analog components. For example, some or all of the signal processing algorithms described herein may be implemented in analog circuitry or mixed analog and digital circuitry.
- a computing environment can include any type of computer system, including, but not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few.
- a software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of a non-transitory computer-readable storage medium.
- An exemplary storage medium can be coupled to the processor device such that the processor device can read information from, and write information to, the storage medium.
- the storage medium can be integral to the processor device.
- the processor device and the storage medium can reside in an ASIC.
- the ASIC can reside in a user terminal.
- the processor device and the storage medium can reside as discrete components in a user terminal.
- Disjunctive language such as the phrase “at least one of X, Y, Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present.
- phrases such as “a device configured to” are intended to include one or more recited devices. Such one or more recited devices can also be collectively configured to carry out the stated recitations.
- a processor configured to carry out recitations A, B and C can include a first processor configured to carry out recitation A working in conjunction with a second processor configured to carry out recitations B and C.
Abstract
Description
where C_s is a random variable representing the class of the s-th data segment determined using the model, y_s is the true class label of the s-th data segment, X_s is the set of feature vectors for the s-th segment, and Θ is the set of known parameters for the class GMM (weights, mean vectors and covariance matrices for the component Gaussians) and also the subspace matrix T currently being used (initially, T_0). As seen, the sample objective function may be constructed as the difference of two terms: the first term corresponds to the log probability of observing the feature vectors for the data segment and also determining the correct class for that data segment using the model with parameters Θ, and the second term corresponds to the log probability of observing the feature vectors for the data segment using the model with parameters Θ.
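The difference of the two log-probability terms described above admits a compact rendering. The following is one plausible formalization consistent with that description; the symbol F and the explicit summation over segments s are notational assumptions, not taken from the patent text:

```latex
\mathcal{F}(\Theta)
  = \sum_{s} \Big[ \log p\big(X_s,\, C_s = y_s \mid \Theta\big)
                 - \log p\big(X_s \mid \Theta\big) \Big]
  = \sum_{s} \log p\big(C_s = y_s \mid X_s,\, \Theta\big)
```

The second equality follows from the definition of conditional probability: maximizing this objective maximizes the posterior probability of the correct class labels, which is precisely a discriminative criterion.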
where q_s is the subspace point for the s-th data segment generated according to a distribution having a zero mean and a unit variance, and Z_s identifies the Gaussian components from which the feature vectors X_s were generated for the s-th data segment. The initial subspace T_0 may be part of the initial model parameters Θ_0 shown in equation [2]. In order to properly compute the gradient of the auxiliary function, one or more hidden variables (e.g., variables which cannot be directly observed from the data) may be computed. For example, the posterior distribution of the subspace points q_s may be computed, the posterior probability of the correct class y_s for a given data segment s may be computed, etc., using data generated below.
f_Θ(q_s | X_s, Z_s^j, C_s = j) ∼ N( · ; μ_{q_s}, Σ_{q_s})

where μ_{q_s} and Σ_{q_s} are, respectively, the mean vector and covariance matrix of the Gaussian posterior distribution of the subspace point q_s.
In some embodiments, additional or alternative hidden variables may be computed, or alternative equations or methods may be used to compute the hidden variables described above.
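For concreteness, the Gaussian posterior of a subspace point has a well-known closed form in subspace models of this family (as in i-vector extraction), given per-component zero-order and centered first-order statistics. The sketch below assumes a standard-normal prior on q_s, as stated above, and diagonal covariances; it is an illustration under those assumptions, not the patent's exact computation.

```python
import numpy as np

def subspace_posterior(N, F, T, Sigma_diag):
    """Closed-form Gaussian posterior of a subspace point q.

    N: (C,) zero-order statistics (per-component soft frame counts)
    F: (C, D) first-order statistics, centered on the component means
    T: (C, D, K) per-component subspace matrices
    Sigma_diag: (C, D) diagonal component covariances
    Returns posterior mean (K,) and posterior covariance (K, K).
    """
    C, D, K = T.shape
    precision = np.eye(K)   # prior precision from the unit-variance prior
    proj = np.zeros(K)
    for c in range(C):
        Tc_w = T[c] / Sigma_diag[c][:, None]   # Sigma_c^{-1} T_c
        precision += N[c] * T[c].T @ Tc_w      # add N_c T_c' Sigma_c^{-1} T_c
        proj += Tc_w.T @ F[c]                  # add T_c' Sigma_c^{-1} F_c
    cov = np.linalg.inv(precision)
    mean = cov @ proj
    return mean, cov
```

The posterior mean and covariance returned here play the role of μ_{q_s} and Σ_{q_s} in the preceding discussion.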
In some embodiments, an alternative equation or method may be used to update the subspace matrix. In further embodiments, an additional equation or method may be used to determine the updated subspace matrix, such as using a smoothing function.
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/574,239 US9892726B1 (en) | 2014-12-17 | 2014-12-17 | Class-based discriminative training of speech models |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US9892726B1 true US9892726B1 (en) | 2018-02-13 |
Family
ID=61148020
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US14/574,239 Active 2036-03-26 US9892726B1 (en) | 2014-12-17 | 2014-12-17 | Class-based discriminative training of speech models |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US9892726B1 (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20080021897A1 (en) * | 2006-07-19 | 2008-01-24 | International Business Machines Corporation | Techniques for detection of multi-dimensional clusters in arbitrary subspaces of high-dimensional data |
| US20100169094A1 (en) * | 2008-12-25 | 2010-07-01 | Kabushiki Kaisha Toshiba | Speaker adaptation apparatus and program thereof |
| US20100198598A1 (en) * | 2009-02-05 | 2010-08-05 | Nuance Communications, Inc. | Speaker Recognition in a Speech Recognition System |
| US20130253931A1 (en) * | 2010-12-10 | 2013-09-26 | Haifeng Shen | Modeling device and method for speaker recognition, and speaker recognition system |
| US20130262119A1 (en) * | 2012-03-30 | 2013-10-03 | Kabushiki Kaisha Toshiba | Text to speech system |
| US20160019883A1 (en) * | 2014-07-15 | 2016-01-21 | International Business Machines Corporation | Dataset shift compensation in machine learning |
| US20160078771A1 (en) * | 2014-09-15 | 2016-03-17 | Raytheon Bbn Technologies Corporation | Multi-view learning in detection of psychological states |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111027453A (en) * | 2019-12-06 | 2020-04-17 | 西北工业大学 | Automatic non-cooperative underwater target identification method based on Gaussian mixture model |
| US20210304734A1 (en) * | 2020-03-25 | 2021-09-30 | Qualcomm Incorporated | On-device self training in a two-stage wakeup system |
| US11664012B2 (en) * | 2020-03-25 | 2023-05-30 | Qualcomm Incorporated | On-device self training in a two-stage wakeup system comprising a system on chip which operates in a reduced-activity mode |
| CN111583966A (en) * | 2020-05-06 | 2020-08-25 | 东南大学 | Cross-database speech emotion recognition method and device based on joint distribution least square regression |
| CN111583966B (en) * | 2020-05-06 | 2022-06-28 | 东南大学 | Cross-database speech emotion recognition method and device based on joint distribution least square regression |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: AMAZON TECHNOLOGIES, INC., WASHINGTON. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GARIMELLA, SRI VENKATA SURYA SIVA RAMA KRISHNA;MATSOUKAS, SPYRIDON;RASTROW, ARIYA;AND OTHERS;SIGNING DATES FROM 20150320 TO 20150515;REEL/FRAME:035979/0117 |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | CC | Certificate of correction | |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4 |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |