US8775177B1 - Speech recognition process - Google Patents

Speech recognition process

Info

Publication number
US8775177B1
Authority
US
United States
Prior art keywords
audio
templates
path costs
similarity metrics
weights
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US13/665,245
Inventor
Georg Heigold
Patrick An Phu Nguyen
Mitchel Weintraub
Vincent O. Vanhoucke
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Priority to US13/665,245
Assigned to GOOGLE INC. Assignment of assignors interest (see document for details). Assignors: WEINTRAUB, MITCHEL; HEIGOLD, GEORG; NGUYEN, PATRICK AN; VANHOUCKE, VINCENT O.
Application granted
Publication of US8775177B1
Assigned to GOOGLE LLC. Change of name (see document for details). Assignor: GOOGLE INC.
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 2015/085 - Methods for reducing search complexity, pruning

Definitions

  • This disclosure relates generally to speech recognition.
  • Speech recognition includes processes for converting spoken words to text or other data.
  • speech recognition systems translate verbal utterances into a series of computer-readable sounds and compare those sounds to known words.
  • a microphone may accept an analog signal, which is converted into a digital form that is then divided into smaller segments. The digital segments can be compared to elements of a spoken language. Based on this comparison, and an analysis of the context in which those sounds were uttered, the system is able to recognize the speech.
  • a typical speech recognition system may include an acoustic model, a language model, and a dictionary.
  • an acoustic model includes digital representations of individual sounds that are combinable to produce a collection of words, phrases, etc.
  • a language model assigns a probability that a sequence of words will occur together in a particular sentence or phrase.
  • a dictionary transforms sound sequences into words that can be understood by the language model.
  • a speech recognition process may perform the following operations: performing a preliminary recognition process on first audio to identify candidates for the first audio; generating first templates corresponding to the first audio, where each first template includes a number of elements; selecting second templates corresponding to the candidates, where the second templates represent second audio, and where each second template includes elements that correspond to the elements in the first templates; comparing the first templates to the second templates, where comparing includes determining similarity metrics between the first templates and corresponding second templates; applying weights to the similarity metrics to produce weighted similarity metrics, where the weights are associated with corresponding second templates; and using the weighted similarity metrics to determine whether the first audio corresponds to the second audio.
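  • For orientation, the sketch below is a minimal skeleton of the operations listed above (preliminary recognition, template generation and selection, weighted comparison, and score combination). It is illustrative only: every function, argument, and data structure (first_pass, make_templates, template_db, similarity) is a hypothetical placeholder supplied by the caller, not an interface defined by the patent.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Audio = Sequence[float]
Template = Sequence[float]

def recognize(
    first_audio: Audio,
    first_pass: Callable[[Audio], Dict[str, float]],        # candidate -> first-pass score
    make_templates: Callable[[Audio], List[Template]],      # audio -> first templates
    template_db: Dict[str, List[Tuple[Template, float]]],   # candidate -> [(second template, weight), ...]
    similarity: Callable[[List[Template], Template], float],# larger = more similar
) -> str:
    """Two-pass skeleton: first-pass candidates are re-scored using
    weighted similarity metrics against stored (second) templates."""
    first_scores = first_pass(first_audio)          # preliminary recognition -> candidates + scores
    first_templates = make_templates(first_audio)   # templates for the input (first) audio
    final_scores = {}
    for candidate, first_score in first_scores.items():
        stored = [(t, w) for t, w in template_db.get(candidate, []) if w != 0.0]  # skip zero weights
        weighted = [w * similarity(first_templates, t) for t, w in stored]        # weighted metrics
        final_scores[candidate] = first_score + sum(weighted)                     # combine scores
    return max(final_scores, key=final_scores.get)  # best-scoring candidate
```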
  • the speech recognition systems may include one or more of the following features, either alone or in combination.
  • Selecting the second templates may include selecting templates associated with a non-zero weight.
  • Metadata may be associated with at least one of the first audio and the second audio.
  • the metadata may be used in obtaining at least the second templates.
  • the metadata may be indicative of the context of at least one of the first audio and the second audio.
  • the metadata may indicate at least one word that neighbors a word in at least one of the first audio and the second audio.
  • the preliminary recognition process may include a Hidden Markov Model (HMM) based process.
  • the preliminary recognition process may generate first scores associated with the candidates.
  • Using the weighted similarity metrics to determine whether the first audio corresponds to the second audio may include generating second scores for the first audio, where the second scores correspond to whether the first audio corresponds to the second audio.
  • the operations may include combining the first scores and the second scores using a conditional random field technique to generate a composite score indicative of an extent to which the first audio corresponds to the second audio.
  • Each element may be at least one of: a phoneme in context, a syllable, or a word.
  • the first templates may include vectors
  • the second templates may include vectors
  • the similarity metrics may include distances between vectors.
  • the second templates may include multiple groups of second templates, and each group of second templates may represent a different version of a same candidate word or phrase for at least one of the first and second audio.
  • the second templates may be selected from among a group of templates having associated weights. At least some of the weights may be negative. Weights may be determined using a conditional random field technique. At least some of the weights may be zero. Zero weights may be determined using a regularization technique.
  • Metadata may be associated with at least one of the first audio and the second audio.
  • the metadata may indicate at least one of: information about a speaker of at least one of the first audio or the second audio, and information about an acoustic condition of at least one of the first audio or the second audio.
  • the systems and techniques described herein, or portions thereof, may be implemented as a computer program product that includes instructions that are stored on one or more non-transitory machine-readable storage media, and that are executable on one or more processing devices.
  • the systems and techniques described herein, or portions thereof, may be implemented as an apparatus, method, or electronic system that may include one or more processing devices and memory to store executable instructions to implement the stated functions.
  • FIG. 1 shows, conceptually, an example of a speech recognition system.
  • FIG. 2 shows an example of an acoustic model of the speech recognition system.
  • FIG. 3 is an example of a network on which the speech recognition system may be implemented.
  • FIG. 4 is a flowchart showing an example training phase for use in the speech recognition system.
  • FIG. 5 is a flowchart showing an example process for recognizing speech.
  • FIG. 6 shows examples of computing devices on which the processes described herein, or portions thereof, may be implemented.
  • the processes include performing a preliminary (first pass) recognition process on audio and then performing an exemplar- (e.g., template- or vector-) based recognition process on the audio. Scores from the two processes are used to identify a recognition candidate for the input audio.
  • FIG. 1 shows a conceptual example of a system for performing speech recognition according to the processes described herein.
  • a user 100 of a mobile device 101 accesses a speech recognition system 104 .
  • the mobile device 101 is a cellular telephone having advanced computing capabilities, known as a smartphone.
  • Speech recognition system 104 may be hosted by one or more servers that are remote from mobile device 101 .
  • speech recognition system 104 may be part of another service available to users of the mobile device 101 (e.g., a help service, a search service, etc.).
  • mobile device 101 may include an application 107 (“app”) that receives input audio (e.g., speech) provided by user 100 and that transmits data 110 representing that input audio to the speech recognition system 104 .
  • App 107 may have any appropriate functionality, e.g., it may be a search app, a messaging app, an e-mail app, and so forth.
  • an app is used as an example in this case.
  • all or part of the functionality of the app 107 may be part of another program downloaded to mobile device 101 , part of another program provisioned on mobile device 101 , part of the operating system of the mobile device 101 , or part of a service available to mobile device 101 .
  • app 107 may ask user 100 to identify, beforehand, the languages that user 100 speaks.
  • the user 100 may select, e.g., via a touch-screen menu item or voice input, the languages that user 100 expects to speak or have recognized.
  • user 100 may also select among various accents or dialects.
  • the user's languages, accents, and/or dialects may be determined based on the audio input itself or based on prior audio or other appropriate input.
  • Speech recognition system 104 includes one or more of each of the following: an acoustic model 115 , a language model 116 , and a dictionary 117 .
  • acoustic model 115 includes digital representations of individual sounds that are combinable to produce a collection of words, phrases, etc.
  • Language model 116 assigns a probability that a sequence of words will occur together in a particular sentence or phrase.
  • Dictionary 117 transforms sound sequences into words that can be understood by language model 116 .
  • acoustic model 115 includes two stages: a “first pass” stage 115 a and a “second pass” stage 115 b .
  • first pass stage 115 a is implemented using a Hidden Markov Model (HMM)-based system, which identifies recognition candidates and assigns scores thereto.
  • Second pass stage 115 b uses templates, such as vectors, to represent input audio. These vectors are compared to other vectors that represent known words, phrases or other sound sequences. Distances between vectors for input audio and for known audio correspond to a likelihood that the input audio matches the known audio. The distances, which correspond to scores, are used in adjusting the score(s) from the first pass stage to identify a best recognition candidate for the input audio.
  • a conditional random field process may be used to combine the scores from the first pass stage and the second pass stage to identify the candidate. The first pass stage is described initially, followed by the second pass stage.
  • the HMM-based system uses one or more state machines to identify first pass recognition candidates.
  • a state machine may be used to recognize an unknown input.
  • the state machine determines a sequence of known states representing sounds that best matches the input speech. This best-matched sequence is deemed to be the state machine's hypothesis for the input speech.
  • the audio element recognized in the first pass stage may be a part of a word (e.g., a syllable), phoneme, etc.; a whole word, phoneme, etc.; a part of a sequence of words, phonemes, etc.; and so forth.
  • each state in the state machine receives the best incoming path to that state (e.g., the incoming path with the lowest cost), determines how good a match incoming audio is to itself, produces a result called the “state matching cost”, and outputs data corresponding to this result to successor state(s).
  • the combination of state matching costs with the lowest cost incoming path is referred to as the “path cost”.
  • the path with the lowest path cost may be selected as the best-matched sequence for the input speech.
  • a “path” includes a sequence of states through a state machine that are compared to input audio data.
  • a “path cost” includes a sum of matching costs (e.g., costs of matching a state to a segment of audio) and transition costs (costs to transition from a state_i to a state_j).
  • a “best path cost” includes the “path” with the lowest “path cost”.
  • a state in a state machine may have several different states that can transition to the current state. To determine the “best input path” leading into a state, the “path cost” for each path arriving at a current state should be known. If any of the incoming “path costs” are unknown at the current time, then the “best path cost” for this state cannot be determined until incoming path costs become known.
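  • As an illustration of the path-cost bookkeeping described above, the sketch below computes the lowest path cost with a Viterbi-style dynamic program over state matching costs and transition costs. It assumes a simple machine that starts in the first state and ends in the last, and the array names are illustrative rather than taken from the patent.

```python
import numpy as np

def best_path_cost(match_cost: np.ndarray, trans_cost: np.ndarray) -> float:
    """Lowest total path cost through a state machine.

    match_cost[t, s] : cost of matching audio segment t to state s
    trans_cost[i, j] : cost of transitioning from state i to state j
    """
    num_segments, num_states = match_cost.shape
    path = np.full(num_states, np.inf)
    path[0] = match_cost[0, 0]                      # assume paths start in the first state
    for t in range(1, num_segments):
        # best incoming path cost for each state, then add that state's matching cost
        incoming = (path[:, None] + trans_cost).min(axis=0)
        path = incoming + match_cost[t]
    return float(path[-1])                          # assume paths end in the last state
```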
  • user 100 utters input speech, e.g., the word "recognize", into mobile device 101 .
  • Mobile device 101 converts the input speech into audio data 110 .
  • the audio data is part of a continuous stream that is sent from a microphone to speech recognition system 104 .
  • the speech is received at acoustic model 115 at both the first and second pass stages.
  • the part of the speech recognition process performed by acoustic model 115 employs a state machine 200 that includes states 201 .
  • states may represent sub-phonemes in the English language.
  • a phoneme is the smallest piece of sound that provides meaningful distinctions between different words in one or more languages (e.g., every word has a sequence of phonemes).
  • the acoustic data of phonemes are further broken down into smaller components called sub-phonemes, which can facilitate more accurate speech recognition (since smaller units of sound are recognized).
  • state machine 200 determines best path cost 204 , which corresponds to a sequence of sub-phonemes that best matches the corresponding input audio element. The better the match is between an audio element and a sequence of sub-phonemes, the smaller the resulting path cost is. Therefore, in this example, the best path cost corresponds to the sequence of sub-phonemes which has the smallest path cost.
  • the speech recognition system includes a state machine 200 with M states, where M≥1.
  • The audio element "recognize" can be broken down into the following sequence of sub-phonemes: r-r-r-eh-eh-eh-k-k-k-ao-ao-ao-g-g-g-n-n-n-ay-ay-ay-z-z-z, which are labeled as follows: r1, r2, r3, eh1, eh2, eh3, k1, k2, k3, ao1, ao2, ao3, g1, g2, g3, n1, n2, n3, ay1, ay2, ay3, z1, z2, z3.
  • State machine 200 should ultimately find the best path to be the following sequence of sub-phoneme states: r1, r2, r3, eh1, eh2, eh3, k1, k2, k3, ao1, ao2, ao3, g1, g2, g3, n1, n2, n3, ay1, ay2, ay3, z1, z2, z3.
  • first pass stage 115 a of the acoustic model compares the input audio to its model for the word “recognize” and finds a candidate 208 with a best path cost.
  • the candidate corresponds to the sequence of sub-phonemes that has the lowest path cost. More than one best path cost may be obtained in some cases. For example, if determined best path costs are close (e.g., within a predefined tolerance of each other or another metric), several candidates may be selected. One or more words, phrases, etc. 208 thus may be identified and sent to second pass stage 115 b for further processing.
  • the input audio is broken down into segments of fixed time duration.
  • the segments may be, for example, 10 ms each or any other appropriate duration.
  • An average word is around 500 ms. So, in the 10 ms example, an average word contains about 50 segments. Other words, however, may have more or fewer segments.
  • the segments are represented by templates.
  • the templates include vectors 210 having a number of features (e.g., one feature per dimension of a vector). In an example implementation, there are 39 features per vector; however, other implementations may use different numbers of features.
  • the acoustic model is not a series of states as in the first pass stage, but rather a number of vectors for a sound sequence (e.g., a word or phrase).
  • vectors for input audio are generated by performing a Fast Fourier Transform (FFT) on the input audio to obtain its frequency components.
  • a cosine transformation is performed on the frequency components to obtain features for the vectors.
  • thirteen features are obtained per 10 ms segment.
  • First and second derivatives of those features are taken over time to obtain an additional 26 features to produce the full 39 features for a vector.
  • other types of features, e.g., PLP (perceptual linear prediction) or MFCC (Mel frequency cepstrum coefficient) features, may be used in other implementations.
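  • As a rough sketch of the front end described above (FFT, cosine transform, 13 features per segment, plus first and second derivatives for a 39-dimensional vector), the code below uses a plain log-spectrum followed by a DCT-II. It is a simplified stand-in for illustration; the patent's exact feature pipeline (or a PLP/MFCC front end) may differ.

```python
import numpy as np

def segment_features(segment: np.ndarray, n_features: int = 13) -> np.ndarray:
    """13 cepstrum-like features for one audio segment:
    FFT -> log magnitude spectrum -> cosine (DCT-II) transform."""
    spectrum = np.abs(np.fft.rfft(segment))                       # frequency components
    log_spec = np.log(spectrum + 1e-10)
    n = log_spec.size
    k = np.arange(n)
    basis = np.cos(np.pi / n * (k + 0.5)[None, :] * np.arange(n_features)[:, None])
    return basis @ log_spec                                       # first n_features DCT coefficients

def feature_vectors(audio: np.ndarray, seg_len: int) -> np.ndarray:
    """One 39-dimensional vector per segment: 13 features plus their first
    and second derivatives over time (assumes at least two segments)."""
    segments = [audio[i:i + seg_len] for i in range(0, len(audio) - seg_len + 1, seg_len)]
    feats = np.array([segment_features(s) for s in segments])     # (num_segments, 13)
    d1 = np.gradient(feats, axis=0)                               # first derivative over time
    d2 = np.gradient(d1, axis=0)                                  # second derivative over time
    return np.hstack([feats, d1, d2])                             # (num_segments, 39)
```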
  • In second pass stage 115 b , stored vectors 211 are identified that correspond to recognition candidate(s) identified in first pass stage 115 a .
  • the speech recognition system generates, identifies, and stores in a database, vectors for different speech elements.
  • the speech element is a word; however, vectors may be pre-stored in a database for syllables, phrases, word combinations, or other sound sequences as well, and used as described herein to recognize more or less than a single word.
  • audio is recognized and vectors are generated for the corresponding audio as described above.
  • the audio may be recognized using automatic and/or manual recognition processes.
  • the training data may be unsupervised or supervised.
  • input audio may be for the word “recognize”.
  • the input audio is recognized, e.g., using a standard HMM-based approach with, or without, manual (e.g., human) confirmation.
  • Vectors for that input audio may be generated and stored in memory. For example, if the audio is the word "recognize", and that word is 1000 ms in duration, then there are 100 vectors stored, one for each 10 ms of audio of the word "recognize".
  • the foregoing process may be performed, during the training phase, for various instances of the word “recognize”. For example, different groups of vectors may be generated for the word “recognize” spoken using different speech patterns, for different durations, in different accents, in different (e.g., noise or quiet) environments, in different word contexts, and so forth.
  • the result may be numerous groups of vectors, all of which represent different versions of the same word, e.g., "recognize".
  • the vectors may differ in content for reasons noted above, and may be used in the second pass stage to generate a recognition candidate for the input audio in the manner described herein.
  • the training phase may associate metadata with each vector identifying, e.g., the word that the vector represents.
  • each vector may also be assigned a weight, which may be represented by metadata.
  • the weights may be indicative of the likelihood (e.g., a confidence or relevance score) that the vector representation is accurate. For example, higher weights (indicative of more accuracy) may be assigned to manually-verified vector representations than to vector representations that are not manually verified.
  • vector representations for noisy audio, or other audio that is deemed generally unreliable for some reason may be assigned lower weights (indicative of less accuracy), since such noise may affect recognition accuracy.
  • vector representations for audio that exceeds a predefined noise threshold may be assigned weights of zero.
  • a regularization process may be used to obtain the weights of zero.
  • the weight assigned to the vector may be proportionate to the noise level of the associated audio, or to its reliability in general.
  • the weights may be negative, which indicates a negative correlation between a vector representation and audio.
  • the weights may be determined using a conditional random field technique.
  • the metadata may also identify other features associated with the input audio.
  • the metadata may identify one or more words that neighbor the word that is the subject of a vector.
  • a "neighbor" may include, but is not limited to, one or more words either before or after the word at issue.
  • the one or more words are directly before or after the word at issue; however, this need not always be the case.
  • the metadata may also identify other contextual aspects of the audio.
  • the metadata may specify a source of the audio, e.g., a television network, an online video service, a video device (e.g., a digital video camera), and so forth.
  • the metadata may also include, if available, information about the linguistic characteristics of the audio, e.g., the speaker's accent, location, and so forth.
  • the metadata may also identify the condition of the audio, e.g., whether the audio is noisy, the amount of noise, the type of noise, and so forth.
  • Vectors stored in the training phase are used in second pass stage 115 b in recognizing input audio. More specifically, vectors are identified, in storage, for a first pass stage recognition candidate. Vectors for the input audio (the "input audio vectors") are compared 215 to the stored vectors, and are scored against the stored vectors. In this example, the scores are based, at least in part, on a calculated distance between the input audio vectors and each stored vector. In some implementations, the calculated distance between two vectors is the Dynamic Time Warping (DTW) distance. In an example, the DTW distance is the summed Euclidean distances of the best warping of two vectors. The warping usually is subject to certain constraints, for example, monotonicity and bounded jump size. The DTW distance can be determined using dynamic programming techniques, with a complexity quadratic in the number of frames. The DTW distance may be length-normalized, if necessary, to make vectors of different length comparable.
  • the DTW distance is indicative of how closely the input audio corresponds to the word represented by stored vectors.
  • the DTW distances between the input audio vectors 210 and stored vectors 211 for “recognize” are indicative of how closely input audio corresponds to the word “recognize”.
  • These DTW distances may be determined for any number (e.g., all or a subset) of stored vectors for the same word.
  • the DTW distances for various vector comparisons may be considered together or combined mathematically to provide an indication of a likelihood that the input audio is a known word.
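  • A minimal sketch of the DTW distance described above: summed Euclidean frame distances over a monotonic warping path, computed by dynamic programming and optionally length-normalized. The bounded-jump constraint mentioned above is omitted for brevity, and the function name is illustrative.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray, normalize: bool = True) -> float:
    """DTW distance between two sequences of feature vectors (rows)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])      # Euclidean frame distance
            # monotonic warping: stay matched, or advance one frame in either sequence
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    total = float(cost[n, m])
    # length-normalize so vectors of different length are comparable
    return total / (n + m) if normalize else total

# e.g. dtw_distance(input_audio_vectors, stored_vectors)  # smaller = closer match
```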
  • scores 219 resulting from the first pass stage and score 220 resulting from the second pass stage are both used to produce an overall score indicative of how well the input audio matches a word.
  • a combiner module 211 , which implements a conditional random field technique, may be used to generate a final recognition score and thus an output recognition candidate 224 .
  • Factors other than the DTW distances and first pass stage scores may also affect the final recognition score.
  • weights applied to the stored vectors may affect the amount that those stored vectors contribute to the final recognition score.
  • the output of the second pass stage may be adjusted (e.g., multiplied by) weights for corresponding pre-stored vectors.
  • Vectors that are deemed reliable representations of audio (e.g., manually-confirmed vectors or vectors generated from audio having low levels of noise) may therefore contribute more to the final recognition score than less reliable vectors.
  • In some implementations, vector weights are examined prior to vector identification. Only those vectors having (e.g., positive) non-zero weights, or weights that exceed a predefined threshold, may be identified and compared against a vector for input audio. As a result, the number of vector comparisons that are performed can be reduced. In other implementations, the zero-weighted vectors may be identified; however, their zero weight effectively discounts their effect on the final score.
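  • As an illustration of the weighting described above, the sketch below scores input vectors against a candidate's stored templates, skipping zero-weight templates entirely and letting negative weights pull the score down. The names (weighted_template_score, dtw, template ids) are hypothetical.

```python
def weighted_template_score(input_vecs, stored_templates, weights, dtw):
    """Weighted similarity of input vectors against stored templates.

    stored_templates : dict mapping template id -> stored vectors
    weights          : dict mapping template id -> weight (may be negative or zero)
    dtw              : distance function; smaller distance = closer match
    """
    score = 0.0
    for template_id, template_vecs in stored_templates.items():
        w = weights.get(template_id, 0.0)
        if w == 0.0:                  # zero weight: no effect on the score, so skip the comparison
            continue
        similarity = -dtw(input_vecs, template_vecs)   # negate distance so larger = more similar
        score += w * similarity                        # negative weights discount the candidate
    return score
```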
  • neighboring words may be used to adjust scores resulting from DTW distances.
  • the input audio may include the word “to”, neighbored by “going”, as in “going to”.
  • metadata may be associated with the resulting recognition candidate indicating that the word “going” precedes the word “to”. This information may be used to adjust the weight applied to the DTW distance. For example, in some cases, if it is known what a predecessor word was, the weight may be adjusted so that the resulting score is downgraded or upgraded. For example, “thereto” is a word that ends in “to”.
  • a recognition result for the word “to” may be downgraded (e.g., by adjusting the weight for its corresponding vectors downward) to reflect the possibility that the word “to” is part of “thereto”, rather than the stand-alone word “to”.
  • more than one neighboring word or sound sequence may affect the determination.
  • succeeding neighbor words may affect applied weights.
  • neighboring words may affect which vectors are identified for comparison with the input audio. For example, if neighboring words are known, vectors reflecting a combination of those neighboring words with the word at issue may be identified and compared to the input audio. This may reduce the number of comparisons that occur, particularly where there are large numbers of vector examples for words (e.g., for prepositions, such as “to”).
  • Metadata, such as that described above for the vectors produced in the training phase, may be associated with vectors generated from the input audio, in cases where the appropriate information is available.
  • the metadata for the input audio vectors may be used in scoring stored vectors. For example, the metadata of input audio vectors may be matched to corresponding metadata of stored vectors and, where matches are/are not present, recognition scores may be adjusted.
  • the final recognition output constitutes recognized audio.
  • the recognized audio 119 may include, e.g., a textual transcription of the audio, language information associated with included recognition candidates, or other information representative of its content.
  • the recognized audio 119 may be provided as data to the mobile device 101 that provided the input audio.
  • a user may input audio to the speech recognition system through the mobile device 101 .
  • the recognized audio 119 may be provided to the mobile device 101 or another service and used to control one or more functions associated with the mobile device 101 .
  • an application on the mobile device 101 may execute an e-mail or messaging application in response to command(s) in the recognized audio 119 .
  • the recognized audio 119 may be used to populate an e-mail or other message. Processes may be implemented, either remote from, or local to, mobile device 101 , to identify commands in an application, such as “send e-mail” to cause actions to occur, such as executing an e-mail application, on mobile device 101 .
  • recognized audio 119 may be provided as data to a search engine.
  • recognized audio 119 may constitute a search query that is to be input to a search engine.
  • the search engine may identify content (e.g., Web pages, images, documents, and the like) that is relevant to the search query, and return that information to the computing device that provided the initial audio.
  • the recognized audio may be provided to the computing device prior to searching in order to confirm its accuracy.
  • recognized audio 119 may be used to determine advertisements related to the topic of the audio. Such advertisements may be provided in conjunction with output of the audio content.
  • FIG. 3 is a block diagram of an example of a system 300 on which the processes of FIGS. 1 and 2 may be implemented.
  • input speech may be provided through one or more of communication devices 302 .
  • Mobile device 101 of FIG. 1 is an example of a communication device 302 that may be used to perform the processes described herein.
  • Resulting audio data may be transmitted to one or more processing entities (e.g., processing entities 308 a and 308 b or more), which may be part of server(s) 304 , for speech recognition performed as described herein.
  • Network 306 can represent a mobile communications network that can allow devices (e.g., communication devices 302 ) to communicate wirelessly through a communication interface (not shown), which may include digital signal processing circuitry where appropriate.
  • Network 306 can include one or more networks.
  • the network(s) may provide for communications under various modes or protocols, e.g., Global System for Mobile communication (GSM) voice calls, Short Message Service (SMS), Enhanced Messaging Service (EMS), or Multimedia Messaging Service (MMS) messaging, Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Personal Digital Cellular (PDC), Wideband Code Division Multiple Access (WCDMA), CDMA2000, or General Packet Radio System (GPRS), among others.
  • Communication devices 302 can include various forms of client devices and personal computing devices. Communication devices 302 can include, but are not limited to, a cellular telephone 302 a , personal digital assistant (PDA) 302 b , and a smartphone 302 c . In other implementations, communication devices 302 may include (not shown), personal computing devices, e.g., a laptop computer, a handheld computer, a desktop computer, a tablet computer, a network appliance, a camera, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or a combination of any two or more of these data processing devices or other data processing devices. In some implementations, the personal computing device can be included as part of a motor vehicle (e.g., an automobile, an emergency vehicle (e.g., fire truck, ambulance), a bus).
  • Communication devices 302 may each include one or more processing devices 322 , memory 324 , and a storage system 326 .
  • Storage system 326 can include a speech conversion module 328 and a mobile operating system module 330 .
  • Each processing device 322 can run an operating system included in mobile operating system module 330 to execute software included in speech conversion module 328 .
  • speech conversion module 328 may receive input speech 106 and perform any processing necessary to convert the input speech into audio data 110 for recognition.
  • Server(s) 304 can include various forms of servers including, but not limited to, a web server, an application server, a proxy server, a network server, a gateway, or a server farm.
  • Server(s) 304 can include one or more processing entities 308 a , 308 b . Although only two processing entities are shown, any number may be included in system 300 .
  • each entity includes a memory 310 and a storage system 312 .
  • Processing entities 308 a , 308 b can be real (e.g., different computers, processors, programmed logic, a combination thereof, etc.) or virtual machines, which can be software implementations of machines that execute programs like physical machines.
  • Each storage system 312 can include a speech recognition module 314 , a speech recognition database 316 , and a server operating system module 318 .
  • Each processing entity 308 a , 308 b can run an operating system included in the server operating system module 318 to execute software included in the modules that make up speech recognition module 314 .
  • the operation of speech recognition module may be spread across various processing entities or performed in a single processing entity.
  • a speech recognition module 314 can process received audio data, or a portion thereof, from a communication device 302 (e.g., cellular telephone 302 a ) and use speech recognition database 316 to determine the spoken word content of the speech data.
  • Each speech recognition module may include an acoustic model 331 , a language model 332 , and a dictionary 333 .
  • acoustic model 331 includes digital representations of individual sounds that are combinable to produce a collection of words, phrases, etc.
  • Language model 332 assigns a probability that a sequence of words will occur together in a particular sentence or phrase.
  • Dictionary 333 transforms sound sequences into words that can be understood by the language model.
  • Speech recognition database 316 includes data for one or more state machines 334 for performing the first stage recognition process described herein and a vector database 335 that includes vectors for known words for performing the second stage recognition process described herein.
  • acoustic model 331 includes a first pass module 340 and a second pass module 341 , which implement the first pass and second pass recognition stages described herein.
  • First pass module 340 may be a discriminatively trained HMM model (e.g., of the type shown in FIG. 2 ) that uses Gaussian mixtures and PLPs as front-end features. The decoding performed by first pass module 340 may use a trigram language model.
  • Second pass module 341 may be an exemplar features-based recognition process, which uses vectors representing segments of audio to identify the content of input audio.
  • a combiner module (not shown in FIG. 3 ), which also may be part of the acoustic model, combines scores produced by the first pass module and the second pass module to identify one or more higher-rated recognition candidates for input audio.
  • FIG. 4 shows operations performed during a training phase process 400 .
  • Process 400 may be performed by speech recognition module 314 of FIG. 3 , either alone or in combination with one or more other appropriate computer programs.
  • the training phase includes, among other things, generating a database of vectors for segments of audio; identifying words, phrases or sounds sequences that are represented by groups of the vectors; and associating weights and metadata with the vectors.
  • the speech recognition system is trained on a corpus of audio.
  • the corpus need not be a single source of audio, but rather may contain multiple sources including, e.g., broadcast audio, audio from online sources, speech, music, other sounds, noise and so forth.
  • Training includes receiving (401) segments of the audio from the corpus.
  • the segments of audio may be of any appropriate length. In this example, the segments are 10 ms.
  • the received audio is identified ( 402 ).
  • the received audio may be identified using an HMM-based system having one or more state machines.
  • the identification process may be completely automatic, e.g., the HMM-based system may identify sounds in the audio; a language model may provide phonetic representations of words composed of those sounds; and a dictionary may transform sound sequences into words that can be understood by the language model.
  • the training phase may include making a manual determination about the identity of input audio. For example, a person may identify the audio or confirm the accuracy of the result produced by an HMM-based system. In other implementations, a person may identify the audio without the assistance of the HMM-based system.
  • the automatic portion of the recognition may be a system other than an HMM-based system.
  • Vectors are generated ( 403 ) for the audio.
  • vectors for input audio are generated by performing a Fast Fourier Transform (FFT) on the audio to obtain its frequency components.
  • a cosine transformation is performed on the frequency components to obtain features for the vectors.
  • thirteen features are obtained.
  • First and second derivatives of those features are taken over time to obtain an additional 26 features to produce the full 39 features for a vector.
  • other types of features, e.g., PLP (perceptual linear prediction) or MFCC (Mel frequency cepstrum coefficient) features, may be used in other implementations.
  • Information is associated ( 404 ) with the generated vectors.
  • the information may include weights and metadata, including, but not limited to, the weights and metadata described above.
  • the applied weights and metadata, if appropriate, are used to generate outputs for known audio. Accordingly, a testing phase may be part of the training. If the applied weights do not generate the appropriate output during testing, then the applied weights may be adjusted until the appropriate output is obtained.
  • the model weights may be estimated using a maximum mutual information (MMI) training criterion.
  • regularization may be used for feature selection.
  • regularization may be used to avoid overfitting. Processing may be performed using the general-purpose L-BFGS or Rprop techniques.
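  • The patent refers to MMI training with regularization, optimized with general-purpose L-BFGS or Rprop. As a much simpler stand-in, the sketch below fits weights for template features with an L1-regularized logistic objective using proximal gradient descent (ISTA), purely to illustrate how regularization drives some weights exactly to zero for feature selection; it is not the patent's training procedure, and all names and values are illustrative.

```python
import numpy as np

def train_sparse_weights(features, labels, lam=0.1, lr=0.01, steps=2000):
    """L1-regularized logistic training via proximal gradient descent (ISTA).

    features[i, j] : j-th template feature of training example i
    labels[i]      : 1 if the candidate was correct, 0 otherwise
    lam            : L1 penalty; larger values push more weights to exactly zero
    """
    n, d = features.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w)))     # predicted probability of "correct"
        grad = features.T @ (p - labels) / n          # gradient of the logistic loss
        w = w - lr * grad
        # soft-thresholding: the proximal step for the L1 penalty (yields exact zeros)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w
```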
  • the information associated ( 404 ) with the generated vectors may also identify a word or phrase associated with each vector.
  • a single vector will not typically represent an entire word; rather, a group of such vectors (e.g., about 50 for an average 500 ms word) represents the word.
  • the metadata associated with each vector may identify the word or phrase that the vector is part of, and what part of the word or phrase the vector represents.
  • the metadata may specify that the word that a vector is part of is “recognize” and it may specify what part of the word “recognize” that the vector represents (e.g., the first 10 ms, the tenth 10 ms, and so forth).
  • in some cases, a group of vectors is not representative of the audio (e.g., it is a negative representation) and may be indicated as such in metadata.
  • Vectors and associated metadata are stored (405) in a database.
  • the vectors may be indexed, e.g., by word or words, for retrieval.
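  • For illustration, a minimal in-memory stand-in for the vector database and word index described above; the class name, fields, and metadata keys below are assumptions, not the patent's storage schema.

```python
from collections import defaultdict

class TemplateDatabase:
    """Templates indexed by the word (or phrase) they represent, each with
    a weight and free-form metadata (e.g., neighboring words, noise level)."""

    def __init__(self):
        self._by_word = defaultdict(list)

    def add(self, word, vectors, weight=1.0, **metadata):
        self._by_word[word].append({"vectors": vectors, "weight": weight, "metadata": metadata})

    def lookup(self, word, min_weight=None):
        """All templates stored for `word`; optionally only those whose weight
        exceeds `min_weight` (e.g., 0.0 to skip zero-weighted templates)."""
        templates = self._by_word.get(word, [])
        if min_weight is not None:
            templates = [t for t in templates if t["weight"] > min_weight]
        return templates

# usage sketch with illustrative values
db = TemplateDatabase()
db.add("recognize", vectors=[[0.0] * 39] * 100, weight=0.8, noisy=False, prev_word="to")
reliable = db.lookup("recognize", min_weight=0.0)
```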
  • the training process continues ( 406 ) for all or part of the corpus of audio.
  • the training may be updated, as desired, using new audio or the same audio.
  • FIG. 5 is a flow diagram for an example process 500 for performing speech recognition.
  • Process 500 may be performed by speech recognition module 314 of FIG. 3 , either alone or in combination with one or more other appropriate computer programs.
  • audio is received (501).
  • speech recognition module 314 may receive audio from a computing device, such as mobile device 101 ( FIG. 1 ).
  • the input audio referred to herein may include all of the audio received between designated start and stop times, or a portion or snippet thereof.
  • the audio is input speech; however, any type of audio may be received.
  • the audio may be a recorded musical track, a recorded track associated with video, and so forth.
  • Phonemes (“phones”) are identified in the input audio and may be used, as described below, to identify the content of the audio.
  • a recognition process is performed ( 502 ) on the input audio.
  • the recognition process may be performed by first pass module 340 .
  • first pass module 340 is an HMM-based system (e.g., like first pass stage 115 a of FIGS. 1 and 2 ), as described above, which produces scored recognition candidates.
  • Candidates for recognition of the input audio are identified ( 503 ) by their scores. For example, one or more candidates with the highest recognition scores may be identified and selected. A predefined number of candidates may be selected, or those within a predefined tolerance of the candidate with the highest score may be selected. Selection criteria other than these may also be used.
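  • A small sketch of one possible candidate-selection rule from the scored first-pass output (a cap on the number of candidates and/or a tolerance relative to the best score). The patent leaves the exact criteria open, so the function and thresholds here are illustrative.

```python
def select_candidates(scored, tolerance=None, max_candidates=None):
    """Pick candidates from a dict of {candidate: first-pass score} (higher = better)."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    if tolerance is not None and ranked:
        best_score = ranked[0][1]
        ranked = [(c, s) for c, s in ranked if best_score - s <= tolerance]   # within tolerance of best
    if max_candidates is not None:
        ranked = ranked[:max_candidates]                                      # cap the candidate count
    return [c for c, _ in ranked]

# e.g. select_candidates({"recognize": 9.1, "recognized": 8.9, "ignition": 5.2}, tolerance=0.5)
# -> ["recognize", "recognized"]
```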
  • the candidates are provided to second pass module 341 . There, at least some of the following operations may be performed to generate final recognition candidates (e.g., a best recognition candidate).
  • Vectors are generated ( 504 ) for the input audio.
  • the vectors may be for 10 ms segments of the audio, as described above, and may include appropriate metadata.
  • Vectors that may correspond to the input audio are identified ( 505 ) in the database.
  • the vectors that are identified are vectors for the words, phrases, etc. of audio recognized in the first pass stage. For example, if the first pass stage has identified candidates of "recognize", "recognized", and "ignition", then vectors corresponding to those words are identified in the database based, e.g., on their associated metadata. For example, a search of an index may be performed to identify the vectors. In some implementations, all vectors corresponding to a recognition candidate are identified.
  • In other implementations, a subset of all vectors corresponding to a recognition candidate is identified. For example, vectors with weights that are at, or below, a predefined value (e.g., zero) may be excluded from consideration. In this case, it is possible to reduce the effects of noise or other artifacts on the recognition process. Furthermore, as a result, the amount of processing performed is reduced (since vectors with zero weights need not be processed). Thus, the metadata may be used to reduce the amount of processing performed, since it can result in consideration of fewer vectors than would otherwise be considered.
  • not all stored vectors may be accurately labeled.
  • vectors for “recognize” may be inaccurately labeled as being for “recognition”.
  • the effects of inaccurately-labeled vectors may be mitigated in some cases.
  • the vectors for the input audio are compared ( 506 ) to the identified vectors for the recognition candidates to determine similarity metrics between the vectors for the input audio and the identified vectors.
  • the similarity metric may be based on DTW distances between vectors, as noted above.
  • the similarity metrics may be such that they reduce the effects of noise and errors on recognition.
  • template features (f) may be based on a segmented word W (e.g., broken into segments of 10 ms) and frame features X associated with this word segment.
  • a template feature may be set to the average DTW distance between a recognition hypothesis X (e.g., the vectors for a recognition candidate from the first pass stage) and the k-nearest stored vectors Y associated with the template word W, i.e., Y ∈ KNN_W(X) (KNN meaning "k-nearest neighbor" vectors), if the word hypothesis W′ matches the template word W. Otherwise, the template feature is set to zero. This can be expressed as: f^average(X, W′) = (1/k) Σ_{Y ∈ KNN_W(X)} d(X, Y) if W′ = W, and 0 otherwise.
  • individual DTW distances are used as the template features.
  • the DTW distances may be exponentiated to achieve a more sparse representation and thus, in some cases, faster training.
  • this non-linearity enables modeling of arbitrary decision boundaries. This is expressed in the following equation:
  • f_Y^kernel(X, W) = exp(-γ d(X, Y)) if Y is a template of W, and 0 otherwise; γ, in the above equation, is a scaling factor.
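  • A small sketch of the two template-feature definitions above (the average k-nearest-neighbor DTW distance and the exponentiated "kernel" variant); the default values of k and gamma are illustrative, and the DTW distances are assumed to be computed beforehand.

```python
import numpy as np

def average_feature(dtw_dists, hypothesis_word, template_word, k=5):
    """Average DTW distance to the k nearest stored templates of `template_word`,
    or 0 if the hypothesis word does not match the template word."""
    if hypothesis_word != template_word or not dtw_dists:
        return 0.0
    nearest = sorted(dtw_dists)[:k]
    return float(np.mean(nearest))

def kernel_feature(dtw_dist, template_belongs_to_word, gamma=1.0):
    """exp(-gamma * d) if the stored template belongs to the hypothesized word,
    0 otherwise; exponentiation gives a sparser, non-linear feature."""
    return float(np.exp(-gamma * dtw_dist)) if template_belongs_to_word else 0.0
```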
  • the similarity metric may be adjusted by weights associated with the corresponding vectors.
  • the similarity metric may be adjusted in accordance with other metadata associated with the vectors (e.g., the identity(ies) of neighboring words, the context of the audio, and so forth).
  • the output of second pass module 341 includes one or more scores (e.g., one or more template features) that are indicative of how well the recognition candidate from first pass module matches vectors from database 335 .
  • the scores produced by first pass module 340 are re-scored ( 507 ) using the scores produced by second pass module 341 to identify ( 508 ) which of the recognition candidates best matches the input audio.
  • the combination of the template features from the second pass module with the first pass scores is performed using a segmental conditional random field.
  • a segmental conditional random field is a conditional random field defined on word lattices.
  • the features of the conditional random field are defined on the word arc level.
  • the baseline language and acoustic model scores are also used as features. As a result, the re-scoring result is no worse than the first-pass baseline result.
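  • As a rough illustration only (not the patent's segmental conditional random field), the per-word-arc combination can be viewed as a log-linear score over features that include the baseline acoustic and language model scores plus the weighted template features; because the baseline scores are included, zero template weights reduce the model to the first-pass result. All names and numbers below are hypothetical.

```python
import numpy as np

def rescore_arc(acoustic_score, lm_score, template_feats, template_weights,
                acoustic_lambda=1.0, lm_lambda=1.0):
    """Log-linear score for one word arc: weighted baseline scores plus
    the dot product of template features and their learned weights."""
    baseline = acoustic_lambda * acoustic_score + lm_lambda * lm_score
    return baseline + float(np.dot(template_weights, template_feats))

# usage sketch with illustrative numbers
score = rescore_arc(acoustic_score=-42.0, lm_score=-7.5,
                    template_feats=np.array([0.3, 0.0, 0.9]),
                    template_weights=np.array([0.5, -0.2, 0.1]))
```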
  • the resulting output ( 509 ) of speech recognition module 314 may be applied to language model 332 that generates a phonetic representation of the selected (e.g., best) recognition candidate, along with other appropriate information identifying the word or phrase.
  • Dictionary 333 may be used to transform sound sequences into words that can be understood by the language model.
  • Data corresponding to the selected recognition candidate is output as a recognized version of the audio.
  • speech recognition module may output the data to the appropriate device or process.
  • the output may be formatted as part of an XML file, a text transcription, a command or command sequence, a search query, and so forth.
  • the data may be presented to the user, either audibly or visually, or it may be used as part of a process either on the user's device or elsewhere.
  • a transcription of the input audio may be applied to a translation service, which may be programmed to generate an audio and/or textual translation of the input audio into another, different language (e.g., from English to French) for output to the user's computing device.
  • the user may be able to specify the accent or dialect of the target language for the output audio.
  • FIG. 6 shows examples of computing devices on which the processes described herein, or portions thereof, may be implemented.
  • FIG. 6 shows an example of a generic computing device 600 and a generic mobile computing device 650 , which may be used to implement the processes described herein, or portions thereof.
  • server(s) 304 may be implemented on computing device 600 .
  • Mobile computing device 650 may represent the mobile device 101 of FIG. 1 .
  • Computing device 600 is intended to represent various forms of digital computers, examples of which include laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
  • Computing device 650 is intended to represent various forms of mobile devices, examples of which include personal digital assistants, cellular telephones, smartphones, and other similar computing devices.
  • the components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit the implementations described and/or claimed in this document.
  • Computing device 600 includes a processor 602 , memory 604 , a storage device 606 , a high-speed interface 608 connecting to memory 604 and high-speed expansion ports 610 , and a low speed interface 612 connecting to low speed bus 614 and storage device 606 .
  • Components 602 , 604 , 606 , 608 , 610 , and 612 are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate.
  • the processor 602 may process instructions for execution within the computing device 600 , including instructions stored in the memory 604 or on the storage device 606 to display graphical information for a GUI on an external input/output device, for example, display 616 coupled to high speed interface 608 .
  • multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory.
  • multiple computing devices 600 may be connected, with each device providing a portion of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
  • the memory 604 stores information within the computing device 600 .
  • the memory 604 is a volatile memory unit or units.
  • the memory 604 is a non-volatile memory unit or units.
  • the memory 604 may also be another form of computer-readable medium, examples of which include a magnetic or optical disk.
  • the storage device 606 is capable of providing mass storage for the computing device 600 .
  • the storage device 606 may be or contain a computer-readable medium, examples of which include a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
  • a computer program product may be tangibly embodied in an information carrier.
  • the computer program product may also contain instructions that, when executed, perform one or more methods, including those described above.
  • the information carrier may be a non-transitory computer- or machine-readable medium, for example, the memory 604 , the storage device 606 , or memory on processor 602 .
  • the information carrier may be a non-transitory, machine-readable storage medium.
  • the high speed controller 608 manages bandwidth-intensive operations for the computing device 600 , while the low speed controller 612 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only.
  • the high-speed controller 608 is coupled to memory 604 , display 616 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 610 , which may accept various expansion cards (not shown).
  • low-speed controller 612 is coupled to storage device 606 and low-speed expansion port 614 .
  • the low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices (e.g., a keyboard, a pointing device, a scanner) or a networking device (e.g., a switch or router), e.g., through a network adapter.
  • the computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 620 , or multiple times in a group of such servers. It may also be implemented as part of a rack server system 624 . In addition, it may be implemented in a personal computer, e.g., a laptop computer 622 . Alternatively, components from computing device 600 may be combined with other components in a mobile device (not shown), e.g., device 650 . Such devices may contain one or more of computing devices 600 , 650 , and an entire system may be made up of multiple computing devices 600 , 650 communicating with one another.
  • Computing device 650 includes a processor 652 , memory 664 , an input/output device, e.g., a display 654 , a communication interface 666 , and a transceiver 668 , among other components.
  • the device 650 may also be provided with a storage device, e.g., a microdrive or other device, to provide additional storage.
  • the components 650 , 652 , 664 , 654 , 666 , and 668 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
  • the processor 652 may execute instructions within the computing device 650 , including instructions stored in the memory 664 .
  • the processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors.
  • the processor may provide, for example, for coordination of the other components of the device 650 , e.g., control of user interfaces, applications run by device 650 , and wireless communication by device 650 .
  • Processor 652 may communicate with a user through control interface 658 and display interface 656 coupled to a display 654 .
  • the display 654 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
  • the display interface 656 may include appropriate circuitry for driving the display 654 to present graphical and other information to a user.
  • the control interface 658 may receive commands from a user and convert them for submission to the processor 652 .
  • an external interface 662 may be provided in communication with processor 652 , so as to enable near area communication of device 650 with other devices. External interface 662 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
  • the memory 664 stores information within the computing device 650 .
  • the memory 664 may be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
  • Expansion memory 674 may also be provided and connected to device 650 through expansion interface 672 , which may include, for example, a SIMM (Single In Line Memory Module) card interface.
  • expansion memory 674 may provide extra storage space for device 650 , or may also store applications or other information for device 650 .
  • expansion memory 674 may include instructions to carry out or supplement the processes described above, and may include secure information also.
  • expansion memory 674 may be provided as a security module for device 650 , and may be programmed with instructions that permit secure use of device 650 .
  • secure applications may be provided by the SIMM cards, along with additional information, e.g., placing identifying information on the SIMM card in a non-hackable manner.
  • the memory may include, for example, flash memory and/or NVRAM memory, as discussed below.
  • a computer program product is tangibly embodied in an information carrier.
  • the computer program product contains instructions that, when executed, perform one or more methods, including those described above.
  • the information carrier is a computer- or machine-readable medium, e.g., the memory 664 , expansion memory 674 , memory on processor 652 , and so forth that may be received, for example, over transceiver 668 or external interface 662 .
  • Device 650 may communicate wirelessly through communication interface 666 , which may include digital signal processing circuitry where necessary. Communication interface 666 may provide for communications under various modes or protocols, examples of which include GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 668 . In addition, short-range communication may occur, e.g., using a Bluetooth, Wi-Fi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 670 may provide additional navigation- and location-related wireless data to device 650 , which may be used as appropriate by applications running on device 650 .
  • Device 650 may also communicate audibly using audio codec 660 , which may receive spoken information from a user and convert it to usable digital information. Audio codec 660 may likewise generate audible sound for a user, e.g., through a speaker, e.g., in a handset of device 650 . Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice electronic messages, music files, etc.) and may also include sound generated by applications operating on device 650 .
  • the computing device 650 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 680 . It may also be implemented as part of a smartphone 682 , personal digital assistant, or other similar mobile device.
  • implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof.
  • ASICs application specific integrated circuits
  • These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • the systems and techniques described here may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer.
  • a display device e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor
  • a keyboard and a pointing device e.g., a mouse or a trackball
  • Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be a form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in a form, including acoustic, speech, or tactile input.
  • the systems and techniques described here may be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the systems and techniques described here), or a combination of such back end, middleware, or front end components.
  • the components of the system may be interconnected by a form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
  • LAN local area network
  • WAN wide area network
  • the Internet the global information network
  • the computing system may include clients and servers.
  • a client and server are generally remote from one other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to one other.
  • the engines described herein may be separated, combined or incorporated into a single or combined engine.
  • the engines depicted in the figures are not intended to limit the systems described here to the software architectures shown in the figures.
  • the users may be provided with an opportunity to control whether programs or features collect personal information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current location), or to control whether and/or how to receive content from the content server that may be more relevant to the user.
  • personal information e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current location
  • certain data may be anonymized in one or more ways before it is stored or used, so that personally identifiable information is removed.
  • a user's identity may be anonymized so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined.
  • location information such as to a city, ZIP code, or state level
  • the user may have control over how information is collected about him or her and used by a content server.


Abstract

A speech recognition process may perform the following operations: performing a preliminary recognition process on first audio to identify candidates for the first audio; generating first templates corresponding to the first audio, where each first template includes a number of elements; selecting second templates corresponding to the candidates, where the second templates represent second audio, and where each second template includes elements that correspond to the elements in the first templates; comparing the first templates to the second templates, where comparing includes determining similarity metrics between the first templates and corresponding second templates; applying weights to the similarity metrics to produce weighted similarity metrics, where the weights are associated with corresponding second templates; and using the weighted similarity metrics to determine whether the first audio corresponds to the second audio.

Description

CROSS-REFERENCE TO RELATED APPLICATION
Priority is hereby claimed to U.S. Provisional Application No. 61/608,218, which was filed on Mar. 8, 2012. The contents of U.S. Provisional Application No. 61/608,218 are hereby incorporated by reference into this disclosure.
TECHNICAL FIELD
This disclosure relates generally to speech recognition.
BACKGROUND
Speech recognition includes processes for converting spoken words to text or other data. In general, speech recognition systems translate verbal utterances into a series of computer-readable sounds and compare those sounds to known words. For example, a microphone may accept an analog signal, which is converted into a digital form that is then divided into smaller segments. The digital segments can be compared to elements of a spoken language. Based on this comparison, and an analysis of the context in which those sounds were uttered, the system is able to recognize the speech.
A typical speech recognition system may include an acoustic model, a language model, and a dictionary. Briefly, an acoustic model includes digital representations of individual sounds that are combinable to produce a collection of words, phrases, etc. A language model assigns a probability that a sequence of words will occur together in a particular sentence or phrase. A dictionary transforms sound sequences into words that can be understood by the language model.
SUMMARY
Described herein is a speech recognition process that may perform the following operations: performing a preliminary recognition process on first audio to identify candidates for the first audio; generating first templates corresponding to the first audio, where each first template includes a number of elements; selecting second templates corresponding to the candidates, where the second templates represent second audio, and where each second template includes elements that correspond to the elements in the first templates; comparing the first templates to the second templates, where comparing includes determining similarity metrics between the first templates and corresponding second templates; applying weights to the similarity metrics to produce weighted similarity metrics, where the weights are associated with corresponding second templates; and using the weighted similarity metrics to determine whether the first audio corresponds to the second audio. The speech recognition systems may include one or more of the following features, either alone or in combination.
Selecting the second templates may include selecting templates associated with a non-zero weight.
Metadata may be associated with at least one of the first audio and the second audio. The metadata may be used in obtaining at least the second templates. The metadata may be indicative of the context of at least one of the first audio and the second audio. The metadata may indicate at least one word that neighbors a word in at least one of the first audio and the second audio.
The preliminary recognition process may include a Hidden Markov Model (HMM) based process. The preliminary recognition process may generate first scores associated with the candidates. Using the weighted similarity metrics to determine whether the first audio corresponds to the second audio may include generating second scores for the first audio, where the second scores correspond to whether the first audio corresponds to the second audio.
The operations may include combining the first scores and the second scores using a conditional random field technique to generate a composite score indicative of an extent to which the first audio corresponds to the second audio.
Each element may be at least one of: a phoneme in context, a syllable, or a word. The first templates may include vectors, the second templates may include vectors, and the similarity metrics may include distances between vectors. The second templates may include multiple groups of second templates, and each group of second templates may represent a different version of a same candidate word or phrase for at least one of the first and second audio.
The second templates may be selected from among a group of templates having associated weights. At least some of the weights may be negative. Weights may be determined using a conditional random field technique. At least some of the weights may be zero. Zero weights may be determined using a regularization technique.
Metadata may be associated with at least one of the first audio and the second audio. The metadata may indicate at least one of: information about a speaker of at least one of the first audio or the second audio, and information about an acoustic condition of at least one of the first audio or the second audio.
The systems and techniques described herein, or portions thereof, may be implemented as a computer program product that includes instructions that are stored on one or more non-transitory machine-readable storage media, and that are executable on one or more processing devices. The systems and techniques described herein, or portions thereof, may be implemented as an apparatus, method, or electronic system that may include one or more processing devices and memory to store executable instructions to implement the stated functions.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows, conceptually, an example of a speech recognition system.
FIG. 2 shows an example of an acoustic model of the speech recognition system.
FIG. 3 is an example of a network on which the speech recognition system may be implemented.
FIG. 4 is a flowchart showing an example training phase for use in the speech recognition system.
FIG. 5 is a flowchart showing an example process for recognizing speech.
FIG. 6 shows examples of computing devices on which the processes described herein, or portions thereof, may be implemented.
Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
Described herein are processes for performing speech recognition. The processes include performing a preliminary (first pass) recognition process on audio and then performing an exemplar- (e.g., template- or vector-) based recognition process on the audio. Scores from the two processes are used to identify a recognition candidate for the input audio.
FIG. 1 shows a conceptual example of a system for performing speech recognition according to the processes described herein. In the example of FIG. 1, a user 100 of a mobile device 101 accesses a speech recognition system 104. In this example, the mobile device 101 is a cellular telephone having advanced computing capabilities, known as a smartphone. Speech recognition system 104 may be hosted by one or more server(s) that is/are remote from mobile device 101. For example, speech recognition system 104 may be part of another service available to users of the mobile device 101 (e.g., a help service, a search service, etc.).
In this example, mobile device 101 may include an application 107 (“app”) that receives input audio (e.g., speech) provided by user 100 and that transmits data 110 representing that input audio to the speech recognition system 104. App 107 may have any appropriate functionality, e.g., it may be a search app, a messaging app, an e-mail app, and so forth. In this regard, an app is used as an example in this case. However, all or part of the functionality of the app 107 may be part of another program downloaded to mobile device 101, part of another program provisioned on mobile device 101, part of the operating system of the mobile device 101, or part of a service available to mobile device 101.
In an example, app 107 may ask user 100 to identify, beforehand, the languages that user 100 speaks. The user 100 may select, e.g., via a touch-screen menu item or voice input, the languages that user 100 expects to speak or have recognized. In some implementations, user 100 may also select among various accents or dialects. Alternatively, the user's languages, accents, and/or dialects may be determined based on the audio input itself or based on prior audio or other appropriate input.
To begin the speech recognition process, user 100 speaks in a language (e.g., English) into mobile device 101. App 107 generates audio data 110 that corresponds to the input speech, and forwards that audio data to speech recognition system 104. Speech recognition system 104 includes one or more of each of the following: an acoustic model 115, a language model 116, and a dictionary 117. In this example implementation, acoustic model 115 includes digital representations of individual sounds that are combinable to produce a collection of words, phrases, etc. Language model 116 assigns a probability that a sequence of words will occur together in a particular sentence or phrase. Dictionary 117 transforms sound sequences into words that can be understood by language model 116.
In an example implementation, acoustic model 115 includes two stages: a “first pass” stage 115 a and a “second pass” stage 115 b. In this example, first pass stage 115 a is implemented using a Hidden Markov Model (HMM)-based system, which identifies recognition candidates and assigns scores thereto. Second pass stage 115 b uses templates, such as vectors, to represent input audio. These vectors are compared to other vectors that represent known words, phrases or other sound sequences. Distances between vectors for input audio and for known audio correspond to a likelihood that the input audio matches the known audio. The distances, which correspond to scores, are used in adjusting the score(s) from the first pass stage to identify a best recognition candidate for the input audio. In an example, a conditional random field process may be used to combine the scores from the first pass stage and the second pass stage to identify the candidate. The first pass stage is described initially, followed by the second pass stage.
In an example, in the first pass stage, the HMM-based system uses one or more state machines to identify first pass recognition candidates. In general, a state machine may be used to recognize an unknown input. In this example, the state machine determines a sequence of known states representing sounds that best match the input speech. This best-matched sequence is deemed to be the state machine's hypothesis for the input speech. The audio element recognized in the first pass stage may be a part of a word (e.g., a syllable), phoneme, etc.; a whole word, phoneme, etc.; a part of a sequence of words, phonemes, etc.; and so forth.
During the speech recognition process, each state in the state machine receives the best incoming path to that state (e.g., the incoming path with the lowest cost), determines how good a match incoming audio is to itself, produces a result called the “state matching cost”, and outputs data corresponding to this result to successor state(s). The combination of state matching costs with the lowest cost incoming path is referred to as the “path cost”. The path with the lowest path cost may be selected as the best-matched sequence for the input speech.
Accordingly, in the context of the processes described herein, a “path” includes a sequence of states through a state machine that are compared to input audio data. A “path cost” includes a sum of matching costs (e.g., costs of matching a state to a segment of audio) and transition costs (costs to transition from a state_i to a state_j). A “best path cost” includes the “path” with the lowest “path cost”. A state in a state machine may have several different states that can transition to the current state. To determine the “best input path” leading into a state, the “path cost” for each path arriving at a current state should be known. If any of the incoming “path costs” are unknown at the current time, then the “best path cost” for this state cannot be determined until incoming path costs become known.
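By way of illustration only, the path-cost bookkeeping described above can be sketched as a simple dynamic program over matching and transition costs. The Python sketch below assumes dense cost tables (`match_cost`, `trans_cost`) and a fully connected state machine; the names and structure are illustrative assumptions, not part of this disclosure.

```python
# Illustrative sketch of "path cost" computation: a path cost is the sum of
# state matching costs plus transition costs, and the "best path cost" is the
# minimum over all paths. Assumes match_cost[t][s] (audio segment t vs. state s)
# and trans_cost[i][j] (transition from state i to state j) are given.
def best_path_cost(match_cost, trans_cost):
    num_frames = len(match_cost)
    num_states = len(match_cost[0])
    # path_cost[s]: lowest cost of any path that ends in state s so far
    path_cost = [match_cost[0][s] for s in range(num_states)]
    for t in range(1, num_frames):
        new_cost = []
        for s in range(num_states):
            # best incoming path: lowest predecessor path cost plus transition cost
            best_in = min(path_cost[i] + trans_cost[i][s] for i in range(num_states))
            new_cost.append(best_in + match_cost[t][s])
        path_cost = new_cost
    return min(path_cost)
```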
Referring to FIG. 2, user 100 utters input speech, e.g., the word “recognize”, into mobile device 101. Mobile device 101 converts the input speech into audio data 110. In this example, the audio data is part of a continuous stream that is sent from a microphone to speech recognition system 104. The speech is received at acoustic model 115 at both the first and second pass stages.
The part of the speech recognition process performed by acoustic model 115 employs state machine 200, which includes states 201. In this example, these states may represent sub-phonemes in the English language. In an example implementation, a phoneme is the smallest piece of sound that provides meaningful distinctions between different words in one or more languages (e.g., every word has a sequence of phonemes). In the example of FIG. 2, the acoustic data of phonemes are further broken down into smaller components called sub-phonemes, which can facilitate more accurate speech recognition (since smaller units of sound are recognized). At the end of the recognition process, state machine 200 determines best path cost 204, which corresponds to a sequence of sub-phonemes that best matches the corresponding input audio element. The better the match is between an audio element and a sequence of sub-phonemes, the smaller the resulting path cost is. Therefore, in this example, the best path cost corresponds to the sequence of sub-phonemes which has the smallest path cost.
In the example of FIG. 2, the speech recognition system includes a state machine 200 with M states, where M≧1. The audio element “recognize” can be broken down into the following set of sub-phonemes: r-r-r-eh-eh-eh-k-k-k-ao-ao-ao-g-g-g-n-n-n-ay-ay-ay-z-z-z, which are labeled as follows: r1, r2, r3, eh1, eh2, eh3, k1, k2, k3, ao1, ao2, ao3, g1, g2, g3, n1, n2, n3, ay1, ay2, ay3, z1, z2, z3. State machine 200, therefore, should ultimately find the best path to be the following sequence of sub-phoneme states: r1, r2, r3, eh1, eh2, eh3, k1, k2, k3, ao1, ao2, ao3, g1, g2, g3, n1, n2, n3, ay1, ay2, ay3, z1, z2, z3.
In this example, first pass stage 115 a of the acoustic model compares the input audio to its model for the word “recognize” and finds a candidate 208 with a best path cost. The candidate corresponds to the sequence of sub-phonemes that has the lowest path cost. More than one best path cost may be obtained in some cases. For example, if determined best path costs are close (e.g., within a predefined tolerance of each other or another metric), several candidates may be selected. One or more words, phrases, etc. 208 thus may be identified and sent to second pass stage 115 b for further processing.
In second pass stage 115 b, the input audio is broken down into segments of a given time duration. The segments may be, for example, 10 ms each or any other appropriate duration. An average word is around 500 ms. So, in the 10 ms example, an average word contains about 50 segments. Other words, however, may have more or fewer segments. The segments are represented by templates. In this example, the templates include vectors 210 having a number of features (e.g., one feature per dimension of a vector). In an example implementation, there are 39 features per vector; however, other implementations may use different numbers of features. So, in the second pass stage, the acoustic model is not a series of states as in the first pass stage, but rather a number of vectors for a sound sequence (e.g., a word or phrase). Although the implementations described herein use vectors, other types of templates may be used instead of vectors.
In an example implementation, vectors for input audio are generated by performing a Fast Fourier Transform (FFT) on the input audio to obtain its frequency components. A cosine transformation is performed on the frequency components to obtain features for the vectors. In this example, thirteen features are obtained per 10 ms segment. First and second derivatives of those features are taken over time to obtain an additional 26 features to produce the full 39 features for a vector. In some implementations, perceptual linear prediction (PLP) features or Mel frequency cepstrum coefficients (MFCC) may be used in the vectors.
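For illustration, a rough Python sketch of this 39-feature computation is shown below. The framing parameters, the absence of windowing and mel filtering, and the exact transforms are simplifications assumed for the sketch, not requirements of the process described above.

```python
# Rough sketch of 39-dimensional feature vectors: FFT of each 10 ms segment,
# a cosine transform of the (log) spectrum to get 13 features, then first and
# second time derivatives to reach 39 features per segment.
import numpy as np
from scipy.fftpack import dct

def frame_features(audio, sample_rate, frame_ms=10, n_ceps=13):
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    ceps = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame))                # frequency components (FFT)
        log_spec = np.log(spectrum + 1e-10)                  # compress dynamic range
        ceps.append(dct(log_spec, norm='ortho')[:n_ceps])    # cosine transform -> 13 features
    ceps = np.array(ceps)                                    # shape: (n_frames, 13)
    delta = np.gradient(ceps, axis=0)                        # first derivative over time
    delta2 = np.gradient(delta, axis=0)                      # second derivative over time
    return np.hstack([ceps, delta, delta2])                  # 39 features per segment
```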
In second pass stage 115 b, stored vectors 211 are identified that correspond to recognition candidate(s) identified in first pass stage 115 a. In this regard, during a training phase, the speech recognition system generates, identifies, and stores in a database, vectors for different speech elements. In this example, the speech element is a word; however, vectors may be pre-stored in a database for syllables, phrases, word combinations, or other sound sequences as well, and used as described herein to recognize more or less than a single word.
In the training phase, audio is recognized and vectors are generated for the corresponding audio as described above. The audio may be recognized using automatic and/or manual recognition processes. In other words, the training data may be unsupervised or supervised. For example, input audio may be for the word “recognize”. During training, the input audio is recognized, e.g., using a standard HMM-based approach with, or without, manual (e.g., human) confirmation. Vectors for that input audio may be generated and stored in memory. For example, if the audio is the word “recognize”, and that word is 1000 ms in duration, then there are 100 vectors stored, one for each 10 ms of the word “recognize”.
The foregoing process may be performed, during the training phase, for various instances of the word “recognize”. For example, different groups of vectors may be generated for the word “recognize” spoken using different speech patterns, for different durations, in different accents, in different (e.g., noisy or quiet) environments, in different word contexts, and so forth. The result may be numerous groups of vectors, all of which represent different versions of the same word, e.g., “recognize”. The vectors may differ in content for reasons noted above, and may be used in the second pass stage to generate a recognition candidate for the input audio in the manner described herein.
The training phase may associate metadata with each vector identifying, e.g., the word that the vector represents. For example, each vector may also be assigned a weight, which may be represented by metadata. The weights may be indicative of the likelihood (e.g., a confidence or relevance score) that the vector representation is accurate. For example, higher weights (indicative of more accuracy) may be assigned to manually-verified vector representations than to vector representations that are not manually verified. Likewise, vector representations for noisy audio, or other audio that is deemed generally unreliable for some reason, may be assigned lower weights (indicative of less accuracy), since such noise may affect recognition accuracy. In some implementations, vector representations for audio that exceeds a predefined noise threshold may be assigned weights of zero. A regularization process may be used to obtain the weights of zero. In this regard, the weight assigned to the vector may be inversely proportionate to the noise level of the associated audio, or proportionate to its reliability in general. In some implementations, the weights may be negative, which indicates a negative correlation between a vector representation and audio. In some implementations, the weights may be determined using a conditional random field technique.
The metadata may also identify other features associated with the input audio. For example, the metadata may identify one or more words that neighbor the word that is the subject of a vector. In this context, “neighbor” may include, but is not limited to, one or more words either before or after the word at issue. In some examples, the one or more words are directly before or after the word at issue; however, this need not be the case always.
The metadata may also identify other contextual aspects of the audio. For example, the metadata may specify a source of the audio, e.g., a television network, an online video service, a video device (e.g., a digital video camera), and so forth. The metadata may also include, if available, information about the linguistic characteristics of the audio, e.g., the speaker's accent, location, and so forth. The metadata may also identify the condition of the audio, e.g., whether the audio is noisy, the amount of noise, the type of noise, and so forth.
Vectors stored in the training phase are used in second pass stage 115 b in recognizing input audio. More specifically, vectors are identified, in storage, for a first pass stage recognition candidate. Vectors for the input audio (the “input audio vectors”) are compared 215 to the stored vectors, and are scored against the stored vectors. In this example, the scores are based, at least in part, on a calculated distance between the input audio vectors and each stored vector. In some implementations, the calculated distance between two vectors is the Dynamic Time Warping (DTW) distance. In an example, the DTW distance is the summed Euclidean distances of the best warping of two vectors. The warping usually is subject to certain constraints, for example, monotonicity and bounded jump size. The DTW distance can be determined using dynamic programming techniques, with a complexity quadratic in the number of frames. The DTW distance may be length-normalized, if necessary, to make vectors of different length comparable.
Generally, the DTW distance is indicative of how closely the input audio corresponds to the word represented by stored vectors. In the above example, the DTW distances between the input audio vectors 210 and stored vectors 211 for “recognize” are indicative of how closely the input audio corresponds to the word “recognize”. These DTW distances may be determined for any number (e.g., all or a subset) of stored vectors for the same word. The DTW distances for various vector comparisons may be considered together or combined mathematically to provide an indication of a likelihood that the input audio is a known word.
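A minimal sketch of such a DTW computation is shown below, assuming frame-level Euclidean distances, simple monotonic steps, and normalization by the combined sequence length; the constraints and normalization used in practice may differ.

```python
# Hedged sketch of a DTW distance between two sequences of feature vectors
# (frames x features). Dynamic programming, quadratic in the number of frames.
import numpy as np

def dtw_distance(X, Y):
    n, m = len(X), len(Y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])   # Euclidean frame distance
            # monotonic warping: diagonal match, or advance in one sequence only
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m] / (n + m)   # length-normalize so different lengths are comparable
```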
In some implementations, scores 219 resulting from the first pass stage and score 220 resulting from the second pass stage (e.g., the DTW distances or scores based thereon) are both used to produce an overall score indicative of how well the input audio matches a word. In some implementations, a combiner module 211, which implements a conditional random field technique, may be used to generate a final recognition score and thus an output recognition candidate 224.
Factors other than the DTW distances and first pass stage scores may also affect the final recognition score. For example, weights applied to the stored vectors may affect the amount that those stored vectors contribute to the final recognition score. For example, the output of the second pass stage may be adjusted (e.g., multiplied by) weights for corresponding pre-stored vectors. Vectors that are deemed reliable representations of audio (e.g., manually-confirmed vectors or vectors generated from audio having low levels of noise) may have a greater effect on the final recognition score than other, less-reliable vectors. Accordingly, such vectors may be associated with higher weights than other vectors.
In some implementations, vector weights are identified prior to vector identification. Only those vectors having (e.g., positive) non-zero weights, or weights that exceed a predefined threshold, may be identified and compared against a vector for input audio. As a result, the number of vector comparisons that are performed can be reduced. In other implementations, the zero-weighted vectors may be identified; however, their zero weight effectively discounts their effect on the final score.
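For example, filtering stored vectors by weight before comparison might be sketched as follows; the record layout (a per-word list of entries with a 'weight' field) is an assumption made only for illustration.

```python
# Illustrative filtering of stored templates by weight before comparison, so
# that zero-weighted (e.g., very noisy) templates are skipped entirely.
def candidate_templates(stored, word, min_weight=0.0):
    # stored: dict mapping a word to a list of {'vectors': ..., 'weight': ...} records
    return [t for t in stored.get(word, []) if t['weight'] > min_weight]
```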
In some implementations, neighboring words may be used to adjust scores resulting from DTW distances. For example, the input audio may include the word “to”, neighbored by “going”, as in “going to”. In recognizing “to” in the first pass stage, metadata may be associated with the resulting recognition candidate indicating that the word “going” precedes the word “to”. This information may be used to adjust the weight applied to the DTW distance. For example, in some cases, if it is known what a predecessor word was, the weight may be adjusted so that the resulting score is downgraded or upgraded. For example, “thereto” is a word that ends in “to”. If the first pass stage indicates that “there” precedes “to” in audio, a recognition result for the word “to” may be downgraded (e.g., by adjusting the weight for its corresponding vectors downward) to reflect the possibility that the word “to” is part of “thereto”, rather than the stand-alone word “to”. In other implementations, more than one neighboring word or sound sequence may affect the determination. In a similar manner, succeeding neighbor words may affect applied weights.
In some implementations, neighboring words may affect which vectors are identified for comparison with the input audio. For example, if neighboring words are known, vectors reflecting a combination of those neighboring words with the word at issue may be identified and compared to the input audio. This may reduce the number of comparisons that occur, particularly where there are large numbers of vector examples for words (e.g., for prepositions, such as “to”).
Metadata, such as that described above for the vectors produced in the training phase, may be associated with vectors generated from the input audio, in cases where the appropriate information is available. The metadata for the input audio vectors may be used in scoring stored vectors. For example, the metadata of input audio vectors may be matched to corresponding metadata of stored vectors and, where matches are/are not present, recognition scores may be adjusted.
Referring back to FIG. 1, the final recognition output constitutes recognized audio. The recognized audio 119 may include, e.g., a textual transcription of the audio, language information associated with included recognition candidates, or other information representative of its content.
The recognized audio 119 may be provided as data to the mobile device 101 that provided the input audio. For example, a user may input audio to the speech recognition system through the mobile device 101. The recognized audio 119 may be provided to the mobile device 101 or another service and used to control one or more functions associated with the mobile device 101. For example, an application on the mobile device 101 may execute an e-mail or messaging application in response to command(s) in the recognized audio 119. Likewise, the recognized audio 119 may be used to populate an e-mail or other message. Processes may be implemented, either remote from, or local to, mobile device 101, to identify commands in the recognized audio, such as “send e-mail”, and to cause corresponding actions, such as executing an e-mail application, to occur on mobile device 101.
In another example, recognized audio 119 may be provided as data to a search engine. For instance, recognized audio 119 may constitute a search query that is to be input to a search engine. The search engine may identify content (e.g., Web pages, images, documents, and the like) that is relevant to the search query, and return that information to the computing device that provided the initial audio. In some implementations, the recognized audio may be provided to the computing device prior to searching in order to confirm its accuracy.
In another example, recognized audio 119 may be used to determine advertisements related to the topic of the audio. Such advertisements may be provided in conjunction with output of the audio content.
FIG. 3 is a block diagram of an example of a system 300 on which the processes of FIGS. 1 and 2 may be implemented. For example, input speech may be provided through one or more of communication devices 302. Mobile device 101 of FIG. 1 is an example of a communication device 302 that may be used to perform the processes described herein. Resulting audio data may be transmitted to one or more processing entities (e.g., processing entities 308 a and 308 b or more), which may be part of server(s) 304, for speech recognition performed as described herein.
Communication devices 302 may communicate with server(s) 304 through network 306. Network 306 can represent a mobile communications network that can allow devices (e.g., communication devices 302) to communicate wirelessly through a communication interface (not shown), which may include digital signal processing circuitry where appropriate. Network 306 can include one or more networks. The network(s) may provide for communications under various modes or protocols, e.g., Global System for Mobile communication (GSM) voice calls, Short Message Service (SMS), Enhanced Messaging Service (EMS), or Multimedia Messaging Service (MMS) messaging, Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Personal Digital Cellular (PDC), Wideband Code Division Multiple Access (WCDMA), CDMA2000, or General Packet Radio System (GPRS), among others. For example, the communication may occur through a radio-frequency transceiver. In addition, short-range communication may occur, e.g., using a Bluetooth, WiFi, Near Field Communication, or other such transceiver.
Communication devices 302 can include various forms of client devices and personal computing devices. Communication devices 302 can include, but are not limited to, a cellular telephone 302 a, personal digital assistant (PDA) 302 b, and a smartphone 302 c. In other implementations, communication devices 302 may include (not shown) personal computing devices, e.g., a laptop computer, a handheld computer, a desktop computer, a tablet computer, a network appliance, a camera, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or a combination of any two or more of these data processing devices or other data processing devices. In some implementations, the personal computing device can be included as part of a motor vehicle (e.g., an automobile, an emergency vehicle (e.g., fire truck, ambulance), a bus).
Communication devices 302 may each include one or more processing devices 322, memory 324, and a storage system 326. Storage system 326 can include a speech conversion module 328 and a mobile operating system module 330. Each processing device 322 can run an operating system included in mobile operating system module 330 to execute software included in speech conversion module 328. Referring to FIGS. 1 to 3, speech conversion module 328 may receive input speech 106 and perform any processing necessary to convert the input speech into audio data 110 for recognition.
Server(s) 304 can include various forms of servers including, but not limited to, a web server, an application server, a proxy server, a network server, a gateway, or a server farm. Server(s) 304 can include one or more processing entities 308 a, 308 b. Although only two processing entities are shown, any number may be included in system 300. In this example, each entity includes a memory 310 and a storage system 312. Processing entities 308 a, 308 b can be real (e.g., different computers, processors, programmed logic, a combination thereof, etc.) or virtual machines, which can be software implementations of machines that execute programs like physical machines. Each storage system 312 can include a speech recognition module 314, a speech recognition database 316, and a server operating system module 318. Each processing entity 308 a, 308 b can run an operating system included in the server operating system module 318 to execute software included in the modules that make up speech recognition module 314. In this regard, the operation of the speech recognition module may be spread across various processing entities or performed in a single processing entity.
A speech recognition module 314 can process received audio data, or a portion thereof, from a communication device 302 (e.g., cellular telephone 302 a) and use speech recognition database 316 to determine the spoken word content of the speech data. Each speech recognition module may include an acoustic model 331, a language model 332, and a dictionary 333. As noted, acoustic model 331 includes digital representations of individual sounds that are combinable to produce a collection of words, phrases, etc. Language model 332 assigns a probability that a sequence of words will occur together in a particular sentence or phrase. Dictionary 333 transforms sound sequences into words that can be understood by the language model. Speech recognition database 316 includes data for one or more state machines 334 for performing the first stage recognition process described herein and a vector database 335 that includes vectors for known words for performing the second stage recognition process described herein.
In this implementation, acoustic model 331 includes a first pass module 340 and a second pass module 341, which implement the first pass and second pass recognition stages described herein. First pass module 340 may be a discriminatively trained HMM model (e.g., of the type shown in FIG. 2) that uses Gaussian mixtures and PLPs as front-end features. The decoding performed by first pass module 340 may use a trigram language model. Second pass module 341 may be an exemplar features-based recognition process, which uses vectors representing segments of audio to identify the content of input audio. A combiner module (not shown in FIG. 3), which also may be part of the acoustic model, combines scores produced by the first pass module and the second pass module to identify one or more higher-rated recognition candidates for input audio.
Further details about the speech recognition processes performed by the first and second pass modules will be described with respect to FIGS. 4 and 5. FIG. 4 shows operations performed during a training phase process 400. Process 400 may be performed by speech recognition module 314 of FIG. 3, either alone or in combination with one or more other appropriate computer programs.
In example implementations, the training phase includes, among other things, generating a database of vectors for segments of audio; identifying words, phrases, or sound sequences that are represented by groups of the vectors; and associating weights and metadata with the vectors.
More specifically, the speech recognition system is trained on a corpus of audio. The corpus need not be a single source of audio, but rather may contain multiple sources including, e.g., broadcast audio, audio from online sources, speech, music, other sounds, noise, and so forth. Training includes receiving (401) segments of the audio from the corpus. The segments of audio may be of any appropriate length. In this example, the segments are 10 ms. The received audio is identified (402). For example, the received audio may be identified using an HMM-based system having one or more state machines. The identification process may be completely automatic, e.g., the HMM-based system may identify sounds in the audio; a language model may provide phonetic representations of words composed of those sounds; and a dictionary may transform sound sequences into words that can be understood by the language model. In some implementations, the training phase may include making a manual determination about the identity of input audio. For example, a person may identify the audio or confirm the accuracy of the result produced by an HMM-based system. In other implementations, a person may identify the audio without the assistance of the HMM-based system. In still other implementations, the automatic portion of the recognition may be a system other than an HMM-based system.
Vectors are generated (403) for the audio. In this implementation, vectors for input audio are generated by performing a Fast Fourier Transform (FFT) on the audio to obtain its frequency components. A cosine transformation is performed on the frequency components to obtain features for the vectors. In this example, thirteen features are obtained. First and second derivatives of those features are taken over time to obtain an additional 26 features to produce the full 39 features for a vector. In some implementations, perceptual linear prediction (PLP) features or Mel frequency cepstrum coefficients (MFCC) may be used in the vectors.
Information is associated (404) with the generated vectors. For example, the information may include weights and metadata, including, but not limited to, the weights and metadata described above. During the training phase, the applied weights and metadata, if appropriate, are used to generate outputs for known audio. Accordingly, a testing phase may be part of the training. If the applied weights do not generate the appropriate output during testing, then the applied weights may be adjusted until the appropriate output is obtained.
In an example implementation, the model weights may be estimated using a maximum mutual information (MMI) training criterion. As there may be millions of features to consider, most of which are not expected to be relevant, regularization may be used for feature selection. In addition, regularization may be used to avoid overfitting. Processing may be performed using the general-purpose L-BFGS or Rprop techniques.
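As an illustration of the kind of general-purpose optimizer mentioned above, the sketch below implements a simple Rprop-style update (the iRprop− variant) applied to a placeholder regularized objective. The objective and its gradient are stand-ins assumed for the sketch; they are not the MMI criterion itself, which is not reproduced here.

```python
# Hedged sketch: an Rprop-style optimizer applied to a placeholder regularized
# objective. Step sizes adapt per weight based on the sign of the gradient.
import numpy as np

def rprop_minimize(grad_fn, w0, steps=100, eta_plus=1.2, eta_minus=0.5,
                   step_init=0.1, step_max=1.0, step_min=1e-6):
    w = w0.copy()
    step = np.full_like(w, step_init)
    prev_grad = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        sign_change = np.sign(g) * np.sign(prev_grad)
        step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
        g = np.where(sign_change < 0, 0.0, g)   # iRprop-: skip update after a sign flip
        w -= np.sign(g) * step                  # update uses only the gradient's sign
        prev_grad = g
    return w

# Toy example: quadratic loss plus an L2 penalty on the weights. A sparsity-
# inducing (L1-type) penalty could be substituted for feature selection.
lam = 0.01
grad = lambda w: 2 * (w - 1.0) + 2 * lam * w
weights = rprop_minimize(grad, w0=np.zeros(5))
```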
The information associated (404) with the generated vectors may also identify a word or phrase associated with each vector. In this regard, given that vectors in this example represent 10 ms of audio, a single vector will not typically represent an entire word. However, a group of such vectors (e.g., 50) may represent a word and several groups may represent a phrase or other sound sequence. The metadata associated with each vector may identify the word or phrase that the vector is part of, and what part of the word or phrase the vector represents. For example, the metadata may specify that the word that a vector is part of is “recognize” and it may specify what part of the word “recognize” that the vector represents (e.g., the first 10 ms, the tenth 10 ms, and so forth).
In some implementations, a group of vectors is not representative of audio (e.g., a negative representation) and may be indicated as such in metadata.
Vectors and associated metadata are stored (405) in a database. The vectors may be indexed, e.g., by word or words, for retrieval. The training process continues (406) for all or part of the corpus of audio. The training may be updated, as desired, using new audio or the same audio.
FIG. 5 is a flow diagram for an example process 500 for performing speech recognition. Process 500 may be performed by speech recognition module 314 of FIG. 3, either alone or in combination with one or more other appropriate computer programs.
In process 500, audio is received (501). For example, speech recognition module 314 may receive audio from a computing device, such as mobile device 101 (FIG. 1). The input audio referred to herein may include all of the audio received between designated start and stop times, or a portion or snippet thereof. In the example described here, the audio is input speech; however, any type of audio may be received. For example, the audio may be a recorded musical track, a recorded track associated with video, and so forth. Phonemes (“phones”) are identified in the input audio and may be used, as described below, to identify the content of the audio.
A recognition process is performed (502) on the input audio. For example, the recognition process may be performed by first pass module 340. In this example, first pass module 340 is an HMM-based system (e.g., like first pass stage 115 a of FIGS. 1 and 2), as described above, which produces scored recognition candidates. Candidates for recognition of the input audio are identified (503) by their scores. For example, one or more candidates with the highest recognition scores may be identified and selected. A predefined number of candidates may be selected, or those within a predefined tolerance of the candidate with the highest score may be selected. Selection criteria other than these may also be used. The candidates are provided to second pass module 341. There, at least some of the following operations may be performed to generate final recognition candidates (e.g., a best recognition candidate).
Vectors are generated (504) for the input audio. The vectors may be for 10 ms segments of the audio, as described above, and may include appropriate metadata. Vectors that may correspond to the input audio are identified (505) in the database. The vectors that are identified are vectors for the words, phrases, etc., of the audio recognized in the first pass stage. For example, if the first pass stage has identified candidates of “recognize”, “recognized”, and “ignition”, then vectors corresponding to those words are identified in the database based, e.g., on their associated metadata. For example, a search of an index may be performed to identify the vectors. In some implementations, all vectors corresponding to a recognition candidate are identified. In other implementations, a subset of all vectors corresponding to a recognition candidate is identified. For example, vectors with weights that are at, or below, a predefined value, e.g., zero, may be excluded from consideration. In this case, it is possible to reduce the effects of noise or other artifacts on the recognition process. Furthermore, as a result, the amount of processing performed is reduced (since vectors with zero weights need not be processed). Thus, the metadata may be used to reduce the amount of processing performed, since it can result in consideration of fewer vectors than would otherwise be considered.
In this regard, not all stored vectors may be accurately labeled. For example, vectors for “recognize” may be inaccurately labeled as being for “recognition”. By using a number of vectors from the database for comparison, the effects of inaccurately-labeled vectors may be mitigated in some cases.
The vectors for the input audio are compared (506) to the identified vectors for the recognition candidates to determine similarity metrics between the vectors for the input audio and the identified vectors. The similarity metric may be based on DTW distances between vectors, as noted above. The similarity metrics may be such that they reduce the effects of noise and errors on recognition.
In an example implementation, the similarity metric is referred to as a “template feature”. Template features (f) may be based on a segmented word W (e.g., broken into segments of 10 ms) and frame features X associated with this word segment. In an example implementation, a template feature is set to the average DTW distance between a recognition hypothesis X (e.g., the vectors for a recognition candidate from the first pass stage) and the k-nearest stored vectors of X associated with the hypothesis word W, where Y ∈ KNN_W(X) (KNN meaning the k-nearest neighbor vectors), if the word hypothesis W′ matches the template word W. Otherwise, the template feature is set to zero. This is expressed in the following equation:
$$
f_W^{\mathrm{tmpl}}(X, W') =
\begin{cases}
\dfrac{\sum_{Y \in \mathrm{KNN}_W(X)} d(X, Y)}{\lvert \mathrm{KNN}_W(X) \rvert} & \text{if } W' = W \\[1.5ex]
0 & \text{otherwise}
\end{cases}
$$
Accordingly, in this example, there is one template feature for each word.
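A small Python sketch of this per-word template feature is shown below. It assumes a DTW distance function such as the one sketched earlier (passed in as `dist_fn`) and stored templates grouped by word; both are illustrative assumptions.

```python
# Hedged sketch of the per-word template feature: the average DTW distance from
# the input segment X to its k nearest stored templates of word W, and 0 when
# the hypothesized word W_prime differs from W.
def template_feature(X, W_prime, W, templates_for_W, dist_fn, k=5):
    if W_prime != W:
        return 0.0
    dists = sorted(dist_fn(X, Y) for Y in templates_for_W)
    knn = dists[:k]                # distances to the k nearest templates of W
    return sum(knn) / len(knn)     # average DTW distance over those neighbors
```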
In another example implementation, individual DTW distances are used as the template features. The DTW distances may be exponentiated to achieve a more sparse representation and thus, in some cases, faster training. In addition, this non-linearity enables modeling of arbitrary decision boundaries. This is expressed in the following equation:
$$
f_Y^{\mathrm{kernel}}(X, W) =
\begin{cases}
\exp\!\left(-\beta\, d(X, Y)\right) & \text{if } Y \text{ is a template of } W \\
0 & \text{otherwise}
\end{cases}
$$
β, in the above equation, is a scaling factor.
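Similarly, the exponentiated per-template (“kernel”) feature might be sketched as follows, again assuming an external DTW distance function is supplied; beta corresponds to the scaling factor in the equation above.

```python
import math

# Hedged sketch of the kernel-style template feature: exp(-beta * d(X, Y)) when
# Y is a stored template of the hypothesized word W, and 0 otherwise.
def kernel_feature(X, W, Y, template_word, dist_fn, beta=1.0):
    if template_word != W:
        return 0.0
    return math.exp(-beta * dist_fn(X, Y))
```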
The similarity metric may be adjusted by weights associated with the corresponding vectors. In addition, the similarity metric may be adjusted in accordance with other metadata associated with the vectors (e.g., the identity(ies) of neighboring words, the context of the audio, and so forth).
Thus, the output of second pass module 341 includes one or more scores (e.g., one or more template features) that are indicative of how well the recognition candidate from first pass module matches vectors from database 335.
The scores produced by first pass module 340 are re-scored (507) using the scores produced by second pass module 341 to identify (508) which of the recognition candidates best matches the input audio. In an implementation, the combination of the template features from the second pass module with the first pass scores is performed using a segmental conditional random field.
A segmental conditional random field is a conditional random field defined on word lattices. In an implementation, the features of the conditional random field are defined on the word arc level. In addition to the template features, language and acoustic model scores are used as features. As a result, the re-scoring result is no worse than the first-pass baseline result.
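The linear scoring step inside such a re-scoring can be illustrated, in simplified form, as a weighted sum of per-arc features (first-pass acoustic and language-model scores plus template features). The feature names, values, and weights below are invented for illustration and do not reflect trained parameters.

```python
# Simplified, illustrative per-arc scoring: a weighted sum of acoustic,
# language-model, and template features, as used when re-scoring candidates.
def rescore_arc(features, weights):
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

candidates = {
    'recognize':  {'acoustic': -42.0, 'lm': -3.1, 'tmpl_recognize': 0.8},
    'recognized': {'acoustic': -44.5, 'lm': -3.4, 'tmpl_recognized': 0.2},
}
weights = {'acoustic': 1.0, 'lm': 1.0, 'tmpl_recognize': 2.0, 'tmpl_recognized': 2.0}
best = max(candidates, key=lambda w: rescore_arc(candidates[w], weights))
```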
The resulting output (509) of speech recognition module 314 may be applied to language model 332 that generates a phonetic representation of the selected (e.g., best) recognition candidate, along with other appropriate information identifying the word or phrase. Dictionary 333 may be used to transform sound sequences into words that can be understood by the language model.
Data corresponding to the selected recognition candidate is output as a recognized version of the audio. For example, speech recognition module may output the data to the appropriate device or process. In different examples, the output may be formatted as part of an XML file, a text transcription, a command or command sequence, a search query, and so forth. The data may be presented to the user, either audibly or visually, or it may be used as part of a process either on the user's device or elsewhere. For example, a transcription of the input audio may be applied to a translation service, which may be programmed to generate an audio and/or textual translation of the input audio into another, different language (e.g., from English to French) for output to the user's computing device. In some examples, the user may be able to specify the accent or dialect of the target language for the output audio.
FIG. 6 shows examples of computing devices on which the processes described herein, or portions thereof, may be implemented. In this regard, FIG. 6 shows an example of a generic computing device 600 and a generic mobile computing device 650, which may be used to implement the processes described herein, or portions thereof. For example, server(s) 304 may be implemented on computing device 600. Mobile computing device 650 may represent the mobile device 101 of FIG. 1.
Computing device 600 is intended to represent various forms of digital computers, examples of which include laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 650 is intended to represent various forms of mobile devices, examples of which include personal digital assistants, cellular telephones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the implementations described and/or claimed in this document.
Computing device 600 includes a processor 602, memory 604, a storage device 606, a high-speed interface 608 connecting to memory 604 and high-speed expansion ports 610, and a low speed interface 612 connecting to low speed bus 614 and storage device 606. Components 602, 604, 606, 608, 610, and 612, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 602 may process instructions for execution within the computing device 600, including instructions stored in the memory 604 or on the storage device 606 to display graphical information for a GUI on an external input/output device, for example, display 616 coupled to high speed interface 608. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 600 may be connected, with a device providing a portion of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 604 stores information within the computing device 600. In one implementation, the memory 604 is a volatile memory unit or units. In another implementation, the memory 604 is a non-volatile memory unit or units. The memory 604 may also be another form of computer-readable medium, examples of which include a magnetic or optical disk.
The storage device 606 is capable of providing mass storage for the computing device 600. In one implementation, the storage device 606 may be or contain a computer-readable medium, examples of which include a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product may be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, including those described above. The information carrier may be a non-transitory computer- or machine-readable medium, for example, the memory 604, the storage device 606, or memory on processor 602. For example, the information carrier may be a non-transitory, machine-readable storage medium.
The high speed controller 608 manages bandwidth-intensive operations for the computing device 600, while the low speed controller 612 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one implementation, the high-speed controller 608 is coupled to memory 604, display 616 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 610, which may accept various expansion cards (not shown). In the implementation, low-speed controller 612 is coupled to storage device 606 and low-speed expansion port 614. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, e.g., a keyboard, a pointing device, a scanner, or a networking device, e.g., a switch or router, e.g., through a network adapter.
The computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 620, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 624. In addition, it may be implemented in a personal computer, e.g., a laptop computer 622. Alternatively, components from computing device 600 may be combined with other components in a mobile device (not shown), e.g., device 650. Such devices may contain one or more of computing devices 600, 650, and an entire system may be made up of multiple computing devices 600, 650 communicating with one another.
Computing device 650 includes a processor 652, memory 664, an input/output device, e.g., a display 654, a communication interface 666, and a transceiver 668, among other components. The device 650 may also be provided with a storage device, e.g., a microdrive or other device, to provide additional storage. The components 650, 652, 664, 654, 666, and 668 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
The processor 652 may execute instructions within the computing device 650, including instructions stored in the memory 664. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 650, e.g., control of user interfaces, applications run by device 650, and wireless communication by device 650.
Processor 652 may communicate with a user through control interface 658 and display interface 656 coupled to a display 654. The display 654 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 656 may include appropriate circuitry for driving the display 654 to present graphical and other information to a user. The control interface 658 may receive commands from a user and convert them for submission to the processor 652. In addition, an external interface 662 may be provided in communication with processor 652, so as to enable near-area communication of device 650 with other devices. External interface 662 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
The memory 664 stores information within the computing device 650. The memory 664 may be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. Expansion memory 674 may also be provided and connected to device 650 through expansion interface 672, which may include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 674 may provide extra storage space for device 650, or may also store applications or other information for device 650. Specifically, expansion memory 674 may include instructions to carry out or supplement the processes described above, and may also include secure information. Thus, for example, expansion memory 674 may be provided as a security module for device 650, and may be programmed with instructions that permit secure use of device 650. In addition, secure applications may be provided via the SIMM cards, along with additional information, e.g., placing identifying information on the SIMM card in a non-hackable manner.
The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, including those described above. The information carrier is a computer- or machine-readable medium, e.g., the memory 664, expansion memory 674, or memory on processor 652, that may be received, for example, over transceiver 668 or external interface 662.
Device 650 may communicate wirelessly through communication interface 666, which may include digital signal processing circuitry where necessary. Communication interface 666 may provide for communications under various modes or protocols, examples of which include GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 668. In addition, short-range communication may occur, e.g., using a Bluetooth, Wi-Fi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 670 may provide additional navigation- and location-related wireless data to device 650, which may be used as appropriate by applications running on device 650.
Device 650 may also communicate audibly using audio codec 660, which may receive spoken information from a user and convert it to usable digital information. Audio codec 660 may likewise generate audible sound for a user, e.g., through a speaker, e.g., in a handset of device 650. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice electronic messages, music files, etc.) and may also include sound generated by applications operating on device 650.
The computing device 650 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 680. It may also be implemented as part of a smartphone 682, personal digital assistant, or other similar mobile device.
Various implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to a computer program product, apparatus and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to a signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here may be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the systems and techniques described here), or a combination of such back end, middleware, or front end components. The components of the system may be interconnected by a form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In some implementations, the engines described herein may be separated, combined or incorporated into a single or combined engine. The engines depicted in the figures are not intended to limit the systems described here to the software architectures shown in the figures.
For situations in which the systems discussed here collect personal information about users, or may make use of personal information, the users may be provided with an opportunity to control whether programs or features collect personal information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current location), or to control whether and/or how to receive content from the content server that may be more relevant to the user. In addition, certain data may be anonymized in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, a user's identity may be anonymized so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined. Thus, the user may have control over how information is collected about him or her and used by a content server.
Elements of different implementations described herein may be combined to form other implementations not specifically set forth above. Elements may be left out of the processes, computer programs, Web pages, etc., described herein without adversely affecting their operation. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Various separate elements may be combined into one or more individual elements to perform the functions described herein.
The features described herein may be combined in a single system, or used separately in one or more systems.
Other implementations not specifically described herein are also within the scope of the following claims.

Claims (16)

What is claimed is:
1. A method performed by one or more processing devices, comprising:
performing a preliminary recognition process on first audio, the preliminary recognition process comprising:
identifying one or more candidates for the first audio;
determining a plurality of path costs for the identified candidates, the plurality of path costs corresponding to sequences of sub-phonemes identified in the first audio;
determining a best path cost for each of the identified candidates based on the plurality of path costs;
associating the best path costs with the identified candidates; and
providing the identified candidates and associated best path costs;
generating first templates corresponding to the first audio, each first template comprising a number of elements corresponding to a sequence of sub-phonemes of the first audio;
selecting second templates corresponding to the identified candidates, the second templates representing second audio, each second template comprising elements that correspond to the elements in the first templates;
comparing the first templates to the second templates, wherein comparing comprises determining similarity metrics between the first templates and corresponding second templates, wherein the similarity metrics are based on
exponentiated and scaled dynamic time warping (DTW) distances between the selected ones of the first templates and selected ones of the second templates;
applying weights to the similarity metrics to produce weighted similarity metrics, the weights being associated with corresponding second templates;
applying the weighted similarity metrics to corresponding best path costs to produce re-scored path costs, the re-scored path costs being associated with corresponding identified candidates; and
using the re-scored path costs to determine which of the identified candidates corresponds to the first audio.
2. The method of claim 1, wherein selecting the second templates comprises selecting templates associated with a non-zero weight.
3. The method of claim 1, wherein metadata is associated with at least one of the first audio and the second audio, the metadata being used in obtaining at least the second templates.
4. The method of claim 3, wherein the metadata is indicative of the context of at least one of the first audio and the second audio.
5. The method of claim 4, wherein the metadata indicates at least one word that neighbors a word in at least one of the first audio and the second audio.
6. The method of claim 1, wherein the preliminary recognition process comprises a Hidden Markov Model (HMM) based process.
7. The method of claim 1, wherein applying the weighted similarity metrics to corresponding best path costs to produce re-scored path costs comprises using a conditional random field technique to generate a composite score indicative of an extent to which the first audio corresponds to the second audio.
8. The method of claim 1, wherein each element is at least one of: a phoneme in context, a syllable, or a word.
9. The method of claim 1, wherein the first templates comprise vectors, the second templates comprise vectors, and the similarity metrics comprise distances between vectors.
10. The method of claim 1, wherein the second templates comprise multiple groups of second templates, each group of second templates representing a different version of a same candidate word or phrase for at least one of the first and second audio.
11. The method of claim 1, wherein second templates are selected from among a group of templates having associated weights, at least some of the weights being negative.
12. The method of claim 1, wherein the weights are determined using a conditional random field technique.
13. The method of claim 11, wherein at least some of the weights are zero, the zero weights being determined using a regularization technique.
14. The method of claim 1, wherein metadata is associated with at least one of the first audio and the second audio, the metadata indicating at least one of: information about a speaker of at least one of the first audio or the second audio, and information about an acoustic condition of at least one of the first audio or the second audio.
15. One or more non-transitory machine-readable media storing instructions that are executable to perform operations comprising:
performing a preliminary recognition process on first audio, the preliminary recognition process comprising:
identifying one or more candidates for the first audio;
determining a plurality of path costs for the identified candidates, the plurality of path costs corresponding to sequences of sub-phonemes identified in the first audio;
determining a best path cost for each of the identified candidates based on the plurality of path costs;
associating the best path costs with the identified candidates; and
providing the identified candidates and associated best path costs;
generating first templates corresponding to the first audio, each first template comprising a number of elements corresponding to a sequence of sub-phonemes of the first audio;
selecting second templates corresponding to the identified candidates, the second templates representing second audio, each second template comprising elements that correspond to the elements in the first templates;
comparing the first templates to the second templates, wherein comparing comprises determining similarity metrics between the first templates and corresponding second templates, wherein the similarity metrics are based on
exponentiated and scaled dynamic time warping (DTW) distances between the selected ones of the first templates and selected ones of the second templates;
applying weights to the similarity metrics to produce weighted similarity metrics, the weights being associated with corresponding second templates;
applying the weighted similarity metrics to corresponding best path costs to produce re-scored path costs, the re-scored path costs being associated with corresponding identified candidates; and
using the re-scored path costs to determine which of the identified candidates corresponds to the first audio.
16. A system comprising:
memory to store an acoustic model; and
one or more processing devices to perform operations associated with the acoustic model, the acoustic model comprising:
a first pass module to perform a preliminary recognition process on first audio, the preliminary recognition process comprising:
identifying one or more candidates for the first audio;
determining a plurality of path costs for the identified candidates, the plurality of path costs corresponding to sequences of sub-phonemes identified in the first audio;
determining a best path cost for each of the identified candidates based on the plurality of path costs;
associating the best path costs with the identified candidates; and
providing the identified candidates and associated best path costs;
a second pass module to:
generate first templates corresponding to the first audio, each first template comprising a number of elements corresponding to a sequence of sub-phonemes of the first audio;
select second templates corresponding to the identified candidates, the second templates representing second audio, each second template comprising elements that correspond to the elements in the first templates;
compare the first templates to the second templates, wherein comparing comprises determining similarity metrics between the first templates and corresponding second templates, wherein the similarity metrics are based on
exponentiated and scaled dynamic time warping (DTW) distances between the selected ones of the first templates and selected ones of the second templates;
apply weights to the similarity metrics to produce weighted similarity metrics, the weights being associated with corresponding second templates;
apply the weighted similarity metrics to corresponding best path costs to produce re-scored path costs, the re-scored path costs being associated with corresponding identified candidates; and
use the re-scored path costs to determine which of the identified candidates corresponds to the first audio.
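
The claims above recite the two-pass, template-based re-scoring in prose. For readers who prefer a concrete rendering, the following Python sketch illustrates one way the steps of claim 1 could be carried out. It is an illustration only, not the patented implementation: the dtw_distance helper, the scaling constant gamma, the subtraction used in rescore to combine weighted similarities with first-pass path costs, and all data structures and names are assumptions introduced here for clarity. The weights passed to rescore stand in for template weights that, per claims 7 and 12, could be trained with a conditional random field technique; zero-weight templates are simply skipped, consistent with claim 2.

import math

def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping distance between two sequences of
    feature vectors (equal-length tuples of floats)."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],       # insertion
                                 cost[i][j - 1],       # deletion
                                 cost[i - 1][j - 1])   # match
    return cost[n][m]

def similarity(first_template, second_template, gamma=0.1):
    """Exponentiated and scaled DTW distance, as recited in claim 1.
    The scaling constant gamma is a hypothetical choice."""
    return math.exp(-gamma * dtw_distance(first_template, second_template))

def rescore(candidates, first_templates, template_store, weights, gamma=0.1):
    """Re-score first-pass best path costs with weighted template similarities.

    candidates      : dict mapping candidate word/phrase -> best path cost
                      from the preliminary (e.g., HMM-based) recognition pass
    first_templates : dict mapping candidate -> template built from the first audio
    template_store  : dict mapping candidate -> list of stored second templates
    weights         : dict mapping (candidate, template index) -> weight
                      (e.g., trained with a conditional random field; zero
                      weights are skipped, so those templates are never compared)
    Returns a dict of re-scored path costs (lower is better in this sketch).
    """
    rescored = {}
    for cand, best_path_cost in candidates.items():
        bonus = 0.0
        for k, second_template in enumerate(template_store.get(cand, [])):
            w = weights.get((cand, k), 0.0)
            if w == 0.0:
                continue  # only templates with non-zero weight are selected
            sim = similarity(first_templates[cand], second_template, gamma)
            bonus += w * sim  # weighted similarity metric
        # One plausible combination rule (an assumption): subtract the weighted
        # similarity mass so that well-matched candidates become cheaper.
        rescored[cand] = best_path_cost - bonus
    return rescored

# Toy usage: two candidates with first-pass costs; 2-D "sub-phoneme" features.
if __name__ == "__main__":
    candidates = {"call mom": 12.4, "call tom": 12.1}
    first_templates = {
        "call mom": [(0.1, 0.2), (0.4, 0.5), (0.9, 0.8)],
        "call tom": [(0.1, 0.2), (0.4, 0.5), (0.9, 0.8)],
    }
    template_store = {
        "call mom": [[(0.1, 0.25), (0.45, 0.5), (0.85, 0.8)]],
        "call tom": [[(0.7, 0.1), (0.2, 0.9), (0.3, 0.3)]],
    }
    weights = {("call mom", 0): 1.5, ("call tom", 0): 1.5}
    scores = rescore(candidates, first_templates, template_store, weights)
    print(scores, "->", min(scores, key=scores.get))

In this sketch lower re-scored costs are better, so the candidate with the minimum re-scored cost is taken as the recognition result for the first audio; an implementation could equally fold the weighted similarities into a log-linear score.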
US13/665,245 2012-03-08 2012-10-31 Speech recognition process Active US8775177B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/665,245 US8775177B1 (en) 2012-03-08 2012-10-31 Speech recognition process

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261608218P 2012-03-08 2012-03-08
US13/665,245 US8775177B1 (en) 2012-03-08 2012-10-31 Speech recognition process

Publications (1)

Publication Number Publication Date
US8775177B1 true US8775177B1 (en) 2014-07-08

Family

ID=51031892

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/665,245 Active US8775177B1 (en) 2012-03-08 2012-10-31 Speech recognition process

Country Status (1)

Country Link
US (1) US8775177B1 (en)

Patent Citations (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4571697A (en) 1981-12-29 1986-02-18 Nippon Electric Co., Ltd. Apparatus for calculating pattern dissimilarity between patterns
US5131043A (en) 1983-09-05 1992-07-14 Matsushita Electric Industrial Co., Ltd. Method of and apparatus for speech recognition wherein decisions are made based on phonemes
US4860358A (en) 1983-09-12 1989-08-22 American Telephone And Telegraph Company, At&T Bell Laboratories Speech recognition arrangement with preselection
US4783802A (en) 1984-10-02 1988-11-08 Kabushiki Kaisha Toshiba Learning system of dictionary for speech recognition
US4908865A (en) 1984-12-27 1990-03-13 Texas Instruments Incorporated Speaker independent speech recognition method and system
US5864810A (en) * 1995-01-20 1999-01-26 Sri International Method and apparatus for speech recognition adapted to an individual speaker
US5953699A (en) 1996-10-28 1999-09-14 Nec Corporation Speech recognition using distance between feature vector of one sequence and line segment connecting feature-variation-end-point vectors in another sequence
US5946653A (en) 1997-10-01 1999-08-31 Motorola, Inc. Speaker independent speech recognition system and method
US6336108B1 (en) * 1997-12-04 2002-01-01 Microsoft Corporation Speech recognition with mixtures of bayesian networks
US5983177A (en) * 1997-12-18 1999-11-09 Nortel Networks Corporation Method and apparatus for obtaining transcriptions from multiple training utterances
US6104989A (en) 1998-07-29 2000-08-15 International Business Machines Corporation Real time detection of topical changes and topic identification via likelihood based methods
US7761296B1 (en) * 1999-04-02 2010-07-20 International Business Machines Corporation System and method for rescoring N-best hypotheses of an automatic speech recognition system
US20060184360A1 (en) * 1999-04-20 2006-08-17 Hy Murveit Adaptive multi-pass speech recognition system
US6542866B1 (en) * 1999-09-22 2003-04-01 Microsoft Corporation Speech recognition method and apparatus utilizing multiple feature streams
US6529902B1 (en) 1999-11-08 2003-03-04 International Business Machines Corporation Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling
US20030055640A1 (en) * 2001-05-01 2003-03-20 Ramot University Authority For Applied Research & Industrial Development Ltd. System and method for parameter estimation for pattern recognition
US6983246B2 (en) 2002-05-21 2006-01-03 Thinkengine Networks, Inc. Dynamic time warping using frequency distributed distance measures
US8001066B2 (en) 2002-07-03 2011-08-16 Sean Colbath Systems and methods for improving recognition results via user-augmentation of a database
US20050119885A1 (en) * 2003-11-28 2005-06-02 Axelrod Scott E. Speech recognition utilizing multitude of speech features
US20050149326A1 (en) 2004-01-05 2005-07-07 Kabushiki Kaisha Toshiba Speech recognition system and technique
US20050203738A1 (en) * 2004-03-10 2005-09-15 Microsoft Corporation New-word pronunciation learning using a pronunciation graph
US7310601B2 (en) 2004-06-08 2007-12-18 Matsushita Electric Industrial Co., Ltd. Speech recognition apparatus and speech recognition method
US20060129392A1 (en) 2004-12-13 2006-06-15 Lg Electronics Inc Method for extracting feature vectors for speech recognition
US20060149710A1 (en) 2004-12-30 2006-07-06 Ross Koningstein Associating features with entities, such as categories of web page documents, and/or weighting such features
US20070037513A1 (en) 2005-08-15 2007-02-15 International Business Machines Corporation System and method for targeted message delivery and subscription
US20070100618A1 (en) 2005-11-02 2007-05-03 Samsung Electronics Co., Ltd. Apparatus, method, and medium for dialogue speech recognition using topic domain detection
US8301450B2 (en) 2005-11-02 2012-10-30 Samsung Electronics Co., Ltd. Apparatus, method, and medium for dialogue speech recognition using topic domain detection
US20070106685A1 (en) 2005-11-09 2007-05-10 Podzinger Corp. Method and apparatus for updating speech recognition databases and reindexing audio and video content using the same
US20070118372A1 (en) 2005-11-23 2007-05-24 General Electric Company System and method for generating closed captions
US20070265849A1 (en) * 2006-05-11 2007-11-15 General Motors Corporation Distinguishing out-of-vocabulary speech from in-vocabulary speech
US20110077943A1 (en) 2006-06-26 2011-03-31 Nec Corporation System for generating language model, method of generating language model, and program for language model generation
US7272558B1 (en) 2006-12-01 2007-09-18 Coveo Solutions Inc. Speech recognition training method for audio and video file indexing on a search engine
US20080195389A1 (en) * 2007-02-12 2008-08-14 Microsoft Corporation Text-dependent speaker verification
US20080201143A1 (en) 2007-02-15 2008-08-21 Forensic Intelligence Detection Organization System and method for multi-modal audio mining of telephone conversations
US20090030697A1 (en) 2007-03-07 2009-01-29 Cerra Joseph P Using contextual information for delivering results generated from a speech recognition facility using an unstructured language model
US20090055185A1 (en) 2007-04-16 2009-02-26 Motoki Nakade Voice chat system, information processing apparatus, speech recognition method, keyword data electrode detection method, and program
US7983902B2 (en) 2007-08-23 2011-07-19 Google Inc. Domain dictionary creation by detection of new topic words using divergence value comparison
US20090055381A1 (en) 2007-08-23 2009-02-26 Google Inc. Domain Dictionary Creation
US20100070268A1 (en) * 2008-09-10 2010-03-18 Jun Hyung Sung Multimodal unification of articulation for device interfacing
US20110296374A1 (en) 2008-11-05 2011-12-01 Google Inc. Custom language models
US20100161313A1 (en) 2008-12-18 2010-06-24 Palo Alto Research Center Incorporated Region-Matching Transducers for Natural Language Processing
US20120029910A1 (en) 2009-03-30 2012-02-02 Touchtype Ltd System and Method for Inputting Text into Electronic Devices
US20100305947A1 (en) * 2009-06-02 2010-12-02 Nuance Communications, Inc. Speech Recognition Method for Selecting a Combination of List Elements via a Speech Input
US20110004462A1 (en) 2009-07-01 2011-01-06 Comcast Interactive Media, Llc Generating Topic-Specific Language Models
US20110035210A1 (en) * 2009-08-10 2011-02-10 Benjamin Rosenfeld Conditional random fields (crf)-based relation extraction system
US20110131046A1 (en) * 2009-11-30 2011-06-02 Microsoft Corporation Features for utilization in speech recognition
US20120022873A1 (en) 2009-12-23 2012-01-26 Ballinger Brandon M Speech Recognition Language Models
US20120072215A1 (en) * 2010-09-21 2012-03-22 Microsoft Corporation Full-sequence training of deep structures for speech recognition
US20120150532A1 (en) 2010-12-08 2012-06-14 At&T Intellectual Property I, L.P. System and method for feature-rich continuous space language models

Non-Patent Citations (25)

* Cited by examiner, † Cited by third party
Title
‘CMUSphinx’ [online]. "Open Source Toolkit for Speech Recognition," Carnegie Mellon University, Aug. 2, 2011 [retrieved on Dec. 16, 2011]. Retrieved from the internet: URL <http://cmusphinx.sourceforge.net/wiki/tutoriallm>, 3 pages.
Aradilla et al. "Using Posterior-Based Features in Template Matching for Speech Recognition" 2006. *
Demange et al. "HEAR: An Hybrid Episodic-Abstract speech Recognizer" 2009. *
Demuynck et al. "Integrating Meta-Information Into Exemplar-Based Speech Recognition With Segmental Conditional Random Fields" 2011. *
Demuynck et al. "Progress in Example Based Automatic Speech Recognition" 2011. *
Gaudard et al. "Speech Recognition based on Template Matching and Phone Posterior Probabilities" 2007. *
Gemmeke et al. "Exemplar-based sparse representations for noise robust automatic speech recognition" 2010. *
Gildea et al. "Topic-based language models using EM." In Proceedings of Eurospeech, 1999, 1-4.
Heigold et al. "A Flat Direct Model for Speech Recognition" 2009. *
Heigold et al., "Investigations on Exemplar-Based Features for Speech Recognition Towards Thousands of Hours of Unsupervised, Noisy Data," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2012, 4437-4440.
Hoffmeister et al. "Cross-Site and Intra-Site ASR System Combination: Comparisons on Lattice and 1-Best Methods" 2007. *
Kanevsky et al. "An Analysis of Sparseness and Regularization in Exemplar-Based Methods for Speech Classification" 2010. *
Lane et al. "Dialogue speech recognition by combining hierarchical topic classification and language model switching." IEICE transactions on information and systems, 88(3):446-454, Mar. 2005.
Morris et al. "Conditional Random Fields for Integrating Local Discriminative Classifiers" 2008. *
Nederhof, "A General Technique to Train Language Models on Language Models," Association for Computational Linguistics, 31(2):173-185, 2005.
Nguyen et al. "Speech Recognition with Flat Direct Models" 2010. *
Sainath et al., "Exemplar-based sparse representation phone identification features," in Proc. ICASSP, 2011, 4492-4495.
Seppi et al. "Data Pruning for Template-based Automatic Speech Recognition" 2010. *
Seppi et al. "Template-based Automatic Speech Recognition meets Prosody" 2011. *
Wachter et al. "Outlier Correction for Local Distance Measures in Example Based Speech Recognition" 2007. *
Wachter et al. "Template Based Continuous Speech Recognition" 2007. *
Zweig et al. "SCARF: A Segmental Conditional Random Field Toolkit for Speech Recognition" 2010. *
Zweig et al. "Speech Recognition with Segmental Conditional Random Fields: Final Report from the 2010 JHU Summer Workshop" 2010. *
Zweig et al. "Speech Recognition with Segmental Conditional Random Fields: A Summary of the JHU CLSP 2010 Summer Workshop" 2011. *

Cited By (136)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11979836B2 (en) 2007-04-03 2024-05-07 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11900936B2 (en) 2008-10-02 2024-02-13 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US12087308B2 (en) 2010-01-18 2024-09-10 Apple Inc. Intelligent automated assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11321116B2 (en) 2012-05-15 2022-05-03 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US20140211669A1 (en) * 2013-01-28 2014-07-31 Pantech Co., Ltd. Terminal to communicate data using voice command, and method and system thereof
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US12009007B2 (en) 2013-02-07 2024-06-11 Apple Inc. Voice trigger for a digital assistant
US11636869B2 (en) 2013-02-07 2023-04-25 Apple Inc. Voice trigger for a digital assistant
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11557310B2 (en) 2013-02-07 2023-01-17 Apple Inc. Voice trigger for a digital assistant
US9886947B2 (en) * 2013-02-25 2018-02-06 Seiko Epson Corporation Speech recognition device and method, and semiconductor integrated circuit device
US20140244255A1 (en) * 2013-02-25 2014-08-28 Seiko Epson Corporation Speech recognition device and method, and semiconductor integrated circuit device
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US12073147B2 (en) 2013-06-09 2024-08-27 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
US12010262B2 (en) 2013-08-06 2024-06-11 Apple Inc. Auto-activating smart responses based on activities from remote devices
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US11699448B2 (en) 2014-05-30 2023-07-11 Apple Inc. Intelligent assistant for home automation
US11810562B2 (en) 2014-05-30 2023-11-07 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US12067990B2 (en) 2014-05-30 2024-08-20 Apple Inc. Intelligent assistant for home automation
US11670289B2 (en) 2014-05-30 2023-06-06 Apple Inc. Multi-command single utterance input method
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US12118999B2 (en) 2014-05-30 2024-10-15 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
US9502032B2 (en) 2014-10-08 2016-11-22 Google Inc. Dynamically biasing language models
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US11842734B2 (en) 2015-03-08 2023-12-12 Apple Inc. Virtual assistant activation
WO2016167779A1 (en) * 2015-04-16 2016-10-20 Mitsubishi Electric Corporation Speech recognition device and rescoring device
US12001933B2 (en) 2015-05-15 2024-06-04 Apple Inc. Virtual assistant in a communication session
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11954405B2 (en) 2015-09-08 2024-04-09 Apple Inc. Zero latency digital assistant
US11550542B2 (en) 2015-09-08 2023-01-10 Apple Inc. Zero latency digital assistant
US12051413B2 (en) 2015-09-30 2024-07-30 Apple Inc. Intelligent device identification
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US11853647B2 (en) 2015-12-23 2023-12-26 Apple Inc. Proactive assistance based on dialog communication between devices
US10896681B2 (en) 2015-12-29 2021-01-19 Google Llc Speech recognition with selective use of dynamic language models
US11810568B2 (en) 2015-12-29 2023-11-07 Google Llc Speech recognition with selective use of dynamic language models
US9978367B2 (en) 2016-03-16 2018-05-22 Google Llc Determining dialog states for language models
US10553214B2 (en) 2016-03-16 2020-02-04 Google Llc Determining dialog states for language models
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11657820B2 (en) 2016-06-10 2023-05-23 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US11749275B2 (en) 2016-06-11 2023-09-05 Apple Inc. Application integration with a digital assistant
CN107590135A (en) * 2016-07-07 2018-01-16 三星电子株式会社 Automatic translating method, equipment and system
US20180011843A1 (en) * 2016-07-07 2018-01-11 Samsung Electronics Co., Ltd. Automatic interpretation method and apparatus
EP3267328A1 (en) * 2016-07-07 2018-01-10 Samsung Electronics Co., Ltd Automated interpretation method and apparatus
CN107590135B (en) * 2016-07-07 2024-01-05 三星电子株式会社 Automatic translation method, device and system
US10867136B2 (en) * 2016-07-07 2020-12-15 Samsung Electronics Co., Ltd. Automatic interpretation method and apparatus
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
US11682383B2 (en) 2017-02-14 2023-06-20 Google Llc Language model biasing system
US10311860B2 (en) 2017-02-14 2019-06-04 Google Llc Language model biasing system
US11037551B2 (en) 2017-02-14 2021-06-15 Google Llc Language model biasing system
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US11837237B2 (en) 2017-05-12 2023-12-05 Apple Inc. User-specific acoustic models
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11538469B2 (en) 2017-05-12 2022-12-27 Apple Inc. Low-latency intelligent automated assistant
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US12014118B2 (en) 2017-05-15 2024-06-18 Apple Inc. Multi-modal interfaces having selection disambiguation and text modification capability
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US12026197B2 (en) 2017-05-16 2024-07-02 Apple Inc. Intelligent automated assistant for media exploration
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US11340925B2 (en) 2017-05-18 2022-05-24 Peloton Interactive Inc. Action recipes for a crowdsourced digital assistant system
US11682380B2 (en) 2017-05-18 2023-06-20 Peloton Interactive Inc. Systems and methods for crowdsourced actions and commands
US11520610B2 (en) * 2017-05-18 2022-12-06 Peloton Interactive Inc. Crowdsourced on-boarding of digital assistant operations
US12093707B2 (en) 2017-05-18 2024-09-17 Peloton Interactive Inc. Action recipes for a crowdsourced digital assistant system
US11862156B2 (en) 2017-05-18 2024-01-02 Peloton Interactive, Inc. Talk back from actions in applications
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US10909331B2 (en) * 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US20190303442A1 (en) * 2018-03-30 2019-10-03 Apple Inc. Implicit identification of translation payload with neural machine translation
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11487364B2 (en) 2018-05-07 2022-11-01 Apple Inc. Raise to speak
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US11900923B2 (en) 2018-05-07 2024-02-13 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US12080287B2 (en) 2018-06-01 2024-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11360577B2 (en) 2018-06-01 2022-06-14 Apple Inc. Attention aware virtual assistant dismissal
US11630525B2 (en) 2018-06-01 2023-04-18 Apple Inc. Attention aware virtual assistant dismissal
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US10984798B2 (en) 2018-06-01 2021-04-20 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US12061752B2 (en) 2018-06-01 2024-08-13 Apple Inc. Attention aware virtual assistant dismissal
US12067985B2 (en) 2018-06-01 2024-08-20 Apple Inc. Virtual assistant operations in multi-device environments
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11783815B2 (en) 2019-03-18 2023-10-10 Apple Inc. Multimodality in digital assistant systems
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11675491B2 (en) 2019-05-06 2023-06-13 Apple Inc. User configurable task triggers
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11651775B2 (en) * 2019-05-09 2023-05-16 Rovi Guides, Inc. Word correction using automatic speech recognition (ASR) incremental response
US20230252997A1 (en) * 2019-05-09 2023-08-10 Rovi Guides, Inc. Word correction using automatic speech recognition (asr) incremental response
US11107475B2 (en) * 2019-05-09 2021-08-31 Rovi Guides, Inc. Word correction using automatic speech recognition (ASR) incremental response
US20210350807A1 (en) * 2019-05-09 2021-11-11 Rovi Guides, Inc. Word correction using automatic speech recognition (asr) incremental response
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
CN111199730A (en) * 2020-01-08 2020-05-26 Beijing Pinecone Electronics Co., Ltd. (北京松果电子有限公司) Voice recognition method, device, terminal and storage medium
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US20220223157A1 (en) * 2021-01-11 2022-07-14 Bank Of America Corporation System and method for single-speaker identification in a multi-speaker environment on a low-frequency audio recording
US11521623B2 (en) * 2021-01-11 2022-12-06 Bank Of America Corporation System and method for single-speaker identification in a multi-speaker environment on a low-frequency audio recording

Similar Documents

Publication Title
US8775177B1 (en) Speech recognition process
US11854545B2 (en) Privacy mode based on speaker identifier
US11990127B2 (en) User recognition for speech processing systems
US11496582B2 (en) Generation of automated message responses
US11270685B2 (en) Speech based user recognition
US20240221737A1 (en) Recognizing speech in the presence of additional audio
US11061644B2 (en) Maintaining context for voice processes
US11594215B2 (en) Contextual voice user interface
US9972318B1 (en) Interpreting voice commands
US9293136B2 (en) Multiple recognizer speech recognition
US11830485B2 (en) Multiple speech processing system with synthesized speech styles
US10917758B1 (en) Voice-based messaging
US10448115B1 (en) Speech recognition for localized content
CN107967916B (en) Determining phonetic relationships
US10565989B1 (en) Ingesting device specific content
US9135912B1 (en) Updating phonetic dictionaries
US11810556B2 (en) Interactive content output
US9263033B2 (en) Utterance selection for automated speech recognizer training
US20210241760A1 (en) Speech-processing system
US11887583B1 (en) Updating models with trained model update objects
US11468897B2 (en) Systems and methods related to automated transcription of voice communications

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOGLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HEIGOLD, GEORG;NGUYEN, PATRICK AN;WEINTRAUB, MITCHEL;AND OTHERS;SIGNING DATES FROM 20121009 TO 20121016;REEL/FRAME:029885/0923

STCF Information on status: patent grant

Free format text: PATENTED CASE

CC Certificate of correction
AS Assignment

Owner name: GOOGLE LLC, CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044277/0001

Effective date: 20170929

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551)

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8