US8775177B1 - Speech recognition process - Google Patents
- Publication number
- US8775177B1 (application US13/665,245)
- Authority
- US
- United States
- Prior art keywords
- audio
- templates
- path costs
- similarity metrics
- weights
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/085—Methods for reducing search complexity, pruning
Definitions
- This disclosure relates generally to speech recognition.
- Speech recognition includes processes for converting spoken words to text or other data.
- speech recognition systems translate verbal utterances into a series of computer-readable sounds and compare those sounds to known words.
- a microphone may accept an analog signal, which is converted into a digital form that is then divided into smaller segments. The digital segments can be compared to elements of a spoken language. Based on this comparison, and an analysis of the context in which those sounds were uttered, the system is able to recognize the speech.
- a typical speech recognition system may include an acoustic model, a language model, and a dictionary.
- an acoustic model includes digital representations of individual sounds that are combinable to produce a collection of words, phrases, etc.
- a language model assigns a probability that a sequence of words will occur together in a particular sentence or phrase.
- a dictionary transforms sound sequences into words that can be understood by the language model.
- a speech recognition process may perform the following operations: performing a preliminary recognition process on first audio to identify candidates for the first audio; generating first templates corresponding to the first audio, where each first template includes a number of elements; selecting second templates corresponding to the candidates, where the second templates represent second audio, and where each second template includes elements that correspond to the elements in the first templates; comparing the first templates to the second templates, where comparing includes determining similarity metrics between the first templates and corresponding second templates; applying weights to the similarity metrics to produce weighted similarity metrics, where the weights are associated with corresponding second templates; and using the weighted similarity metrics to determine whether the first audio corresponds to the second audio.
- the speech recognition systems may include one or more of the following features, either alone or in combination.
- Selecting the second templates may include selecting templates associated with a non-zero weight.
- Metadata may be associated with at least one of the first audio and the second audio.
- the metadata may be used in obtaining at least the second templates.
- the metadata may be indicative of the context of at least one of the first audio and the second audio.
- the metadata may indicate at least one word that neighbors a word in at least one of the first audio and the second audio.
- the preliminary recognition process may include a Hidden Markov Model (HMM) based process.
- the preliminary recognition process may generate first scores associated with the candidates.
- Using the weighted similarity metrics to determine whether the first audio corresponds to the second audio may include generating second scores for the first audio, where the second scores correspond to whether the first audio corresponds to the second audio.
- the operations may include combining the first scores and the second scores using a conditional random field technique to generate a composite score indicative of an extent to which the first audio corresponds to the second audio.
- Each element may be at least one of: a phoneme in context, a syllable, or a word.
- the first templates may include vectors
- the second templates may include vectors
- the similarity metrics may include distances between vectors.
- the second templates may include multiple groups of second templates, and each group of second templates may represent a different version of a same candidate word or phrase for at least one of the first and second audio.
- the second templates may be selected from among a group of templates having associated weights. At least some of the weights may be negative. Weights may be determined using a conditional random field technique. At least some of the weights may be zero. Zero weights may be determined using a regularization technique.
- Metadata may be associated with at least one of the first audio and the second audio.
- the metadata may indicate at least one of: information about a speaker of at least one of the first audio or the second audio, and information about an acoustic condition of at least one of the first audio or the second audio.
- the systems and techniques described herein, or portions thereof, may be implemented as a computer program product that includes instructions that are stored on one or more non-transitory machine-readable storage media, and that are executable on one or more processing devices.
- the systems and techniques described herein, or portions thereof, may be implemented as an apparatus, method, or electronic system that may include one or more processing devices and memory to store executable instructions to implement the stated functions.
- FIG. 1 shows, conceptually, an example of a speech recognition system.
- FIG. 2 shows an example of an acoustic model of the speech recognition system.
- FIG. 3 is an example of a network on which the speech recognition system may be implemented.
- FIG. 4 is a flowchart showing an example training phase for use in the speech recognition system.
- FIG. 5 is a flowchart showing an example process for recognizing speech.
- FIG. 6 shows examples of computing devices on which the processes described herein, or portions thereof, may be implemented.
- the processes include performing a preliminary (first pass) recognition process on audio and then performing an exemplar- (e.g., template- or vector-) based recognition process on the audio. Scores from the two processes are used to identify a recognition candidate for the input audio.
- FIG. 1 shows a conceptual example of a system for performing speech recognition according to the processes described herein.
- a user 100 of a mobile device 101 accesses a speech recognition system 104 .
- the mobile device 101 is a cellular telephone having advanced computing capabilities, known as a smartphone.
- Speech recognition system 104 may be hosted by one or more server(s) that is/are remote from mobile device 101 .
- speech recognition system 104 may be part of another service available to users of the mobile device 101 (e.g., a help service, a search service, etc.).
- mobile device 101 may include an application 107 (“app”) that receives input audio (e.g., speech) provided by user 100 and that transmits data 110 representing that input audio to the speech recognition system 104 .
- App 107 may have any appropriate functionality, e.g., it may be a search app, a messaging app, an e-mail app, and so forth.
- an app is used as an example in this case.
- all or part of the functionality of the app 107 may be part of another program downloaded to mobile device 101 , part of another program provisioned on mobile device 101 , part of the operating system of the mobile device 101 , or part of a service available to mobile device 101 .
- app 107 may ask user 100 to identify, beforehand, the languages that user 100 speaks.
- the user 100 may select, e.g., via a touch-screen menu item or voice input, the languages that user 100 expects to speak or have recognized.
- user 100 may also select among various accents or dialects.
- the user's languages, accents, and/or dialects may be determined based on the audio input itself or based on prior audio or other appropriate input.
- Speech recognition system 104 includes one or more of each of the following: an acoustic model 115 , a language model 116 , and a dictionary 117 .
- acoustic model 115 includes digital representations of individual sounds that are combinable to produce a collection of words, phrases, etc.
- Language model 116 assigns a probability that a sequence of words will occur together in a particular sentence or phrase.
- Dictionary 117 transforms sound sequences into words that can be understood by language model 116 .
- acoustic model 115 includes two stages: a “first pass” stage 115 a and a “second pass” stage 115 b .
- first pass stage 115 a is implemented using a Hidden Markov Model (HMM)-based system, which identifies recognition candidates and assigns scores thereto.
- Second pass stage 115 b uses templates, such as vectors, to represent input audio. These vectors are compared to other vectors that represent known words, phrases or other sound sequences. Distances between vectors for input audio and for known audio correspond to a likelihood that the input audio matches the known audio. The distances, which correspond to scores, are used in adjusting the score(s) from the first pass stage to identify a best recognition candidate for the input audio.
- a conditional random field process may be used to combine the scores from the first pass stage and the second pass stage to identify the candidate. The first pass stage is described initially, followed by the second pass stage.
- the HMM-based system uses one or more state machines to identify first pass recognition candidates.
- a state machine may be used to recognize an unknown input.
- the state machine determines a sequence of known states representing sounds that best match input speech. This best-matched sequence is deemed to be the state machine's hypothesis for the input speech.
- the audio element recognized in the first pass stage may be a part of a word (e.g., a syllable), phoneme, etc.; a whole word, phoneme, etc.; a part of a sequence of words, phonemes, etc., and so forth.
- each state in the state machine receives the best incoming path to that state (e.g., the incoming path with the lowest cost), determines how good a match incoming audio is to itself, produces a result called the “state matching cost”, and outputs data corresponding to this result to successor state(s).
- the combination of state matching costs with the lowest cost incoming path is referred to as the “path cost”.
- the path with the lowest path cost may be selected as the best-matched sequence for the input speech.
- a “path” includes a sequence of states through a state machine that are compared to input audio data.
- a “path cost” includes a sum of matching costs (e.g., costs of matching a state to a segment of audio) and transition costs (costs to transition from a state_i to a state_j).
- a “best path cost” includes the “path” with the lowest “path cost”.
- a state in a state machine may have several different states that can transition to the current state. To determine the “best input path” leading into a state, the “path cost” for each path arriving at a current state should be known. If any of the incoming “path costs” are unknown at the current time, then the “best path cost” for this state cannot be determined until incoming path costs become known.
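The path cost definitions above can be sketched as a small dynamic program. This is a minimal illustration, not the patented implementation: the function name `best_path_cost`, the cost-table layout, and the use of `math.inf` to mark a disallowed transition are all assumptions made for the example.

```python
import math

def best_path_cost(matching_costs, transition_costs):
    """Return (cost, path) for the lowest-cost state sequence.

    matching_costs[t][j]: cost of matching state j to audio segment t.
    transition_costs[i][j]: cost to transition from state i to state j
    (math.inf marks a transition that is not allowed).
    Both tables are hypothetical inputs for illustration.
    """
    n_states = len(matching_costs[0])
    # Best cost of any path ending in each state after the first segment.
    cost = list(matching_costs[0])
    back = []  # back[t - 1][j] = best predecessor of state j at segment t
    for t in range(1, len(matching_costs)):
        new_cost, pointers = [], []
        for j in range(n_states):
            # Choose the lowest-cost incoming path into state j.
            i_best = min(range(n_states),
                         key=lambda i: cost[i] + transition_costs[i][j])
            new_cost.append(cost[i_best] + transition_costs[i_best][j]
                            + matching_costs[t][j])
            pointers.append(i_best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final state to recover the path.
    state = min(range(n_states), key=lambda j: cost[j])
    best = cost[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return best, path
```

Each state only needs the best incoming path cost, which is why the cost for a state cannot be finalized until all of its incoming path costs are known.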
- user 100 utters input speech, e.g., the word “recognize”, into mobile device 101 .
- Mobile device 101 converts the input speech into audio data 110 .
- the audio data is part of a continuous stream that is sent from a microphone to speech recognition system 104 .
- the speech is received at acoustic model 115 at both the first and second pass stages.
- the part of the speech recognition process performed by acoustic model 115 employs state machine 200 that includes states 201 .
- states may represent sub-phonemes in the English language.
- a phoneme is the smallest piece of sound that provides meaningful distinctions between different words in one or more languages (e.g., every word has a sequence of phonemes).
- the acoustic data of phonemes are further broken down into smaller components called sub-phonemes, which can facilitate more accurate speech recognition (since smaller units of sound are recognized).
- state machine 200 determines best path cost 204 , which corresponds to a sequence of sub-phonemes that best matches the corresponding input audio element. The better the match is between an audio element and a sequence of sub-phonemes, the smaller the resulting path cost is. Therefore, in this example, the best path cost corresponds to the sequence of sub-phonemes which has the smallest path cost.
- the speech recognition system includes a state machine 200 with M states, where M ≥ 1.
- The audio element “recognize” can be broken down into the following set of sub-phonemes: r-r-r-eh-eh-eh-k-k-k-ao-ao-ao-g-g-g-n-n-n-ay-ay-ay-z-z-z, which are labeled as follows: r1, r2, r3, eh1, eh2, eh3, k1, k2, k3, ao1, ao2, ao3, g1, g2, g3, n1, n2, n3, ay1, ay2, ay3, z1, z2, z3.
- State machine 200 should ultimately find the best path to be the following sequence of sub-phoneme states: r1, r2, r3, eh1, eh2, eh3, k1, k2, k3, ao1, ao2, ao3, g1, g2, g3, n1, n2, n3, ay1, ay2, ay3, z1, z2, z3.
- first pass stage 115 a of the acoustic model compares the input audio to its model for the word “recognize” and finds a candidate 208 with a best path cost.
- the candidate corresponds to the sequence of sub-phonemes that has the lowest path cost. More than one best path cost may be obtained in some cases. For example, if determined best path costs are close (e.g., within a predefined tolerance of each other or another metric), several candidates may be selected. One or more words, phrases, etc. 208 thus may be identified and sent to second pass stage 115 b for further processing.
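Selecting several candidates whose best path costs are close can be sketched as follows. The function name `select_candidates`, the dictionary input, and the example costs are hypothetical; the source does not specify how the tolerance is applied.

```python
def select_candidates(path_costs, tolerance):
    """Return candidates whose path cost is within `tolerance` of the best.

    path_costs: dict mapping a candidate word or phrase to its best path
    cost (lower is better). Results are ordered from best to worst.
    """
    best = min(path_costs.values())
    close = [cand for cand, cost in path_costs.items()
             if cost - best <= tolerance]
    return sorted(close, key=path_costs.get)


# Hypothetical first pass costs for three candidates.
costs = {"recognize": 10.0, "recognise": 10.4, "wreck a nice": 14.0}
shortlist = select_candidates(costs, tolerance=0.5)
```

Here the first two candidates fall within the tolerance and would both be passed to the second pass stage.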
- the input audio is broken down into time duration segments.
- the segments may be, for example, 10 ms each or any other appropriate duration.
- An average word is around 500 ms. So, in the 10 ms example, an average word contains about 50 segments. Other words, however, may have more or fewer segments.
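The segment arithmetic above can be checked with a short sketch. The 16 kHz sample rate and the function names are assumptions for illustration only; the source does not state a sample rate.

```python
import math

def segment_count(duration_ms, segment_ms=10):
    """Number of fixed-duration segments needed to cover the audio."""
    return math.ceil(duration_ms / segment_ms)

def split_into_segments(samples, sample_rate_hz, segment_ms=10):
    """Split a list of samples into consecutive segments of segment_ms."""
    step = sample_rate_hz * segment_ms // 1000  # samples per segment
    return [samples[i:i + step] for i in range(0, len(samples), step)]


# 500 ms of (silent) audio at an assumed 16 kHz sample rate.
word = [0] * 8000
segments = split_into_segments(word, sample_rate_hz=16000)
```

With these assumptions, a 500 ms word yields 50 segments of 160 samples each.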
- the segments are represented by templates.
- the templates include vectors 210 having a number of features (e.g., one feature per dimension of a vector). In an example implementation, there are 39 features per vector; however, other implementations may use different numbers of features.
- the acoustic model is not a series of states as in the first pass stage, but rather a number of vectors for a sound sequence (e.g., a word or phrase).
- vectors for input audio are generated by performing a Fast Fourier Transform (FFT) on the input audio to obtain its frequency components.
- a cosine transformation is performed on the frequency components to obtain features for the vectors.
- thirteen features are obtained per 10 ms segment.
- First and second derivatives of those features are taken over time to obtain an additional 26 features to produce the full 39 features for a vector.
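The step from thirteen features to 39 can be sketched as follows. The source does not say how the derivatives are computed; this example assumes a simple central difference over neighboring segments, and the name `add_derivatives` is hypothetical.

```python
def add_derivatives(frames):
    """Append first and second time derivatives to each feature vector.

    frames: list of per-segment feature lists (e.g., 13 values each).
    Returns vectors three times as long (e.g., 13 + 13 + 13 = 39).
    """
    def delta(seq):
        # Central difference over time; edge frames reuse the nearest frame.
        out = []
        for t in range(len(seq)):
            prev = seq[max(t - 1, 0)]
            nxt = seq[min(t + 1, len(seq) - 1)]
            out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
        return out

    first = delta(frames)    # first derivative of the static features
    second = delta(first)    # second derivative
    return [f + d1 + d2 for f, d1, d2 in zip(frames, first, second)]


# Five segments of 13 features each, with values that rise linearly in time.
static = [[float(t)] * 13 for t in range(5)]
vectors = add_derivatives(static)
```

Other derivative estimates (for example, regression over a wider window) would fit the description equally well.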
- PLP perceptual linear prediction
- MFCC Mel frequency cepstrum coefficients
- In second pass stage 115 b, stored vectors 211 are identified that correspond to recognition candidate(s) identified in first pass stage 115 a.
- the speech recognition system generates, identifies, and stores in a database, vectors for different speech elements.
- the speech element is a word; however, vectors may be pre-stored in a database for syllables, phrases, word combinations, or other sound sequences as well, and used as described herein to recognize more or less than a single word.
- audio is recognized and vectors are generated for the corresponding audio as described above.
- the audio may be recognized using automatic and/or manual recognition processes.
- the training data may be unsupervised or supervised.
- input audio may be for the word “recognize”.
- the input audio is recognized, e.g., using a standard HMM-based approach with, or without, manual (e.g., human) confirmation.
- Vectors for that input audio may be generated and stored in memory. For example, if the audio is the word “recognize”, and that word is 1000 ms in duration, then there are 100 vectors stored, one for each 10 ms of audio on the word “recognize”.
- the foregoing process may be performed, during the training phase, for various instances of the word “recognize”. For example, different groups of vectors may be generated for the word “recognize” spoken using different speech patterns, for different durations, in different accents, in different (e.g., noisy or quiet) environments, in different word contexts, and so forth.
- the result may be numerous groups of vectors, all of which represent different versions of the same word, e.g., “recognize”.
- the vectors may differ in content for reasons noted above, and may be used in the second pass stage to generate a recognition candidate for the input audio in the manner described herein.
- the training phase may associate metadata with each vector identifying, e.g., the word that the vector represents.
- each vector may also be assigned a weight, which may be represented by metadata.
- the weights may be indicative of the likelihood (e.g., a confidence or relevance score) that the vector representation is accurate. For example, higher weights (indicative of more accuracy) may be assigned to manually-verified vector representations than to vector representations that are not manually verified.
- vector representations for noisy audio, or other audio that is deemed generally unreliable for some reason, may be assigned lower weights (indicative of less accuracy), since such noise may affect recognition accuracy.
- vector representations for audio that exceeds a predefined noise threshold may be assigned weights of zero.
- a regularization process may be used to obtain the weights of zero.
- the weight assigned to the vector may be proportionate to the noise level of the associated audio, or to its reliability in general.
- the weights may be negative, which indicates a negative correlation between a vector representation and audio.
- the weights may be determined using a conditional random field technique.
- the metadata may also identify other features associated with the input audio.
- the metadata may identify one or more words that neighbor the word that is the subject of a vector.
- neighbor may include, but is not limited to, one or more words either before or after the word at issue.
- the one or more words are directly before or after the word at issue; however, this need not be the case always.
- the metadata may also identify other contextual aspects of the audio.
- the metadata may specify a source of the audio, e.g., a television network, an online video service, a video device (e.g., a digital video camera), and so forth.
- the metadata may also include, if available, information about the linguistic characteristics of the audio, e.g., the speaker's accent, location, and so forth.
- the metadata may also identify the condition of the audio, e.g., whether the audio is noisy, the amount of noise, the type of noise, and so forth.
- Vectors stored in the training phase are used in second pass stage 115 b in recognizing input audio. More specifically, vectors are identified, in storage, for a first pass stage recognition candidate. Vectors for the input audio (the “input audio vectors”) are compared 215 to the stored vectors, and are scored against the stored vectors. In this example, the scores are based, at least in part, on a calculated distance between the input audio vectors and each stored vector. In some implementations, the calculated distance between two vectors is the Dynamic Time Warping (DTW) distance. In an example, the DTW distance is the summed Euclidean distances of the best warping of two vectors. The warping usually is subject to certain constraints, for example, monotonicity and bounded jump size. The DTW distance can be determined using dynamic programming techniques, with a complexity quadratic in the number of frames. The DTW distance may be length-normalized, if necessary, to make vectors of different length comparable.
- the DTW distance is indicative of how closely the input audio corresponds to the word represented by stored vectors.
- the DTW distances between the input audio vectors 210 and stored vectors 211 for “recognize” are indicative of how closely input audio corresponds to the word “recognize”.
- These DTW distances may be determined for any number (e.g., all or a subset) of stored vectors for the same word.
- the DTW distances for various vector comparisons may be considered together or combined mathematically to provide an indication of a likelihood that the input audio is a known word.
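A DTW distance of the kind described above can be sketched with a standard dynamic programming table. This is a minimal sketch: the step pattern (advance by at most one frame in either sequence) is one common way to enforce monotonicity and bounded jump size, and dividing by the combined length is only one possible length normalization. Neither choice is confirmed by the source.

```python
import math

def dtw_distance(a, b, normalize=True):
    """Dynamic Time Warping distance between two sequences of vectors.

    a, b: lists of equal-dimension feature vectors (one per segment).
    Runs in time proportional to len(a) * len(b).
    """
    n, m = len(a), len(b)
    # cost[i][j] = lowest summed distance aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance
            # Monotonic steps only: advance in a, in b, or in both.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    total = cost[n][m]
    return total / (n + m) if normalize else total
```

Two utterances of the same word at different speaking rates can then score a small distance even though their vector counts differ.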
- scores 219 resulting from the first pass stage and score 220 resulting from the second pass stage are both used to produce an overall score indicative of how well the input audio matches a word.
- A combiner module 211, which implements a conditional random field technique, may be used to generate a final recognition score and thus an output recognition candidate 224 .
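Combining the two stages' scores can be illustrated with a simple log-linear model, which is the local form that conditional random fields build on. This is a stand-in, not a conditional random field implementation: the function name `combine_scores` and the parameters `w_first` and `w_second` are hypothetical, and in practice such parameters would be learned during training.

```python
import math

def combine_scores(first_pass, second_pass, w_first=1.0, w_second=1.0):
    """Combine per-candidate scores into normalized overall scores.

    first_pass, second_pass: dicts mapping candidate -> score, where a
    higher score means a better match (e.g., a negated cost or distance).
    """
    raw = {c: w_first * first_pass[c] + w_second * second_pass[c]
           for c in first_pass}
    peak = max(raw.values())  # subtract the maximum for numerical stability
    expd = {c: math.exp(v - peak) for c, v in raw.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}


# Hypothetical scores for two candidates from each stage.
overall = combine_scores({"to": -1.0, "two": -1.2},
                         {"to": -0.5, "two": -2.0})
best_candidate = max(overall, key=overall.get)
```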
- Factors other than the DTW distances and first pass stage scores may also affect the final recognition score.
- weights applied to the stored vectors may affect the amount that those stored vectors contribute to the final recognition score.
- the output of the second pass stage may be adjusted by (e.g., multiplied by) weights for corresponding pre-stored vectors.
- Vectors that are deemed reliable representations of audio e.g., manually-confirmed vectors or vectors generated from audio having low levels of noise
- Vector weights are identified prior to vector identification. Only those vectors having (e.g., positive) non-zero weights, or weights that exceed a predefined threshold, may be identified and compared against a vector for input audio. As a result, the number of vector comparisons that are performed can be reduced. In other implementations, the zero-weighted vectors may be identified; however, their zero weight effectively discounts their effect on the final score.
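Pruning by weight before comparison, and then weighting the comparison results, might look like the sketch below. The template layout (a dict with `weight` and `vectors` keys), the function names, and the choice to treat a negated distance as the similarity are assumptions for illustration.

```python
def select_templates(templates, threshold=None):
    """Keep only stored templates worth comparing against the input.

    With no threshold, drop zero-weighted templates; with a threshold,
    keep only templates whose weight exceeds it.
    """
    if threshold is None:
        return [t for t in templates if t["weight"] != 0]
    return [t for t in templates if t["weight"] > threshold]

def weighted_score(templates, distance_fn):
    """Sum of weight * similarity, where similarity is a negated distance."""
    return sum(t["weight"] * -distance_fn(t["vectors"]) for t in templates)


# Hypothetical stored templates; "vectors" holds a placeholder label here.
stored = [
    {"weight": 0.0, "vectors": "a"},   # e.g., audio over the noise threshold
    {"weight": 2.0, "vectors": "b"},   # e.g., manually verified
    {"weight": -1.0, "vectors": "c"},  # negatively correlated example
]
kept = select_templates(stored)
distances = {"b": 1.0, "c": 3.0}  # stand-in for DTW distances to the input
score = weighted_score(kept, lambda v: distances[v])
```

Only two of the three templates are compared, so one distance computation is avoided.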
- neighboring words may be used to adjust scores resulting from DTW distances.
- the input audio may include the word “to”, neighbored by “going”, as in “going to”.
- metadata may be associated with the resulting recognition candidate indicating that the word “going” precedes the word “to”. This information may be used to adjust the weight applied to the DTW distance. For example, in some cases, if it is known what a predecessor word was, the weight may be adjusted so that the resulting score is downgraded or upgraded. For example, “thereto” is a word that ends in “to”.
- a recognition result for the word “to” may be downgraded (e.g., by adjusting the weight for its corresponding vectors downward) to reflect the possibility that the word “to” is part of “thereto”, rather than the stand-alone word “to”.
- more than one neighboring word or sound sequence may affect the determination.
- succeeding neighbor words may affect applied weights.
- neighboring words may affect which vectors are identified for comparison with the input audio. For example, if neighboring words are known, vectors reflecting a combination of those neighboring words with the word at issue may be identified and compared to the input audio. This may reduce the number of comparisons that occur, particularly where there are large numbers of vector examples for words (e.g., for prepositions, such as “to”).
- Metadata, such as that described above for the vectors produced in the training phase, may be associated with vectors generated from the input audio, in cases where the appropriate information is available.
- the metadata for the input audio vectors may be used in scoring stored vectors. For example, the metadata of input audio vectors may be matched to corresponding metadata of stored vectors and, where matches are/are not present, recognition scores may be adjusted.
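Adjusting a recognition score according to metadata matches can be sketched as below. The metadata keys, the additive bonus and penalty, and the function name `adjust_for_metadata` are hypothetical; the source says only that scores may be adjusted where matches are or are not present.

```python
def adjust_for_metadata(score, input_meta, stored_meta, bonus, penalty):
    """Raise or lower a score based on metadata shared by both vectors.

    Only keys present in both metadata dicts are compared: a matching
    value adds `bonus`, a differing value subtracts `penalty`.
    """
    for key in input_meta.keys() & stored_meta.keys():
        if input_meta[key] == stored_meta[key]:
            score += bonus
        else:
            score -= penalty
    return score


# Hypothetical metadata for the input audio and for one stored template.
input_meta = {"previous_word": "going", "accent": "en-US"}
stored_meta = {"previous_word": "going", "accent": "en-GB", "source": "tv"}
adjusted = adjust_for_metadata(1.0, input_meta, stored_meta,
                               bonus=0.5, penalty=0.25)
```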
- the final recognition output constitutes recognized audio.
- the recognized audio 119 may include, e.g., a textual transcription of the audio, language information associated with included recognition candidates, or other information representative of its content.
- the recognized audio 119 may be provided as data to the mobile device 101 that provided the input audio.
- a user may input audio to the speech recognition system through the mobile device 101 .
- the recognized audio 119 may be provided to the mobile device 101 or another service and used to control one or more functions associated with the mobile device 101 .
- an application on the mobile device 101 may execute an e-mail or messaging application in response to command(s) in the recognized audio 119 .
- the recognized audio 119 may be used to populate an e-mail or other message. Processes may be implemented, either remote from, or local to, mobile device 101 , to identify commands in an application, such as “send e-mail” to cause actions to occur, such as executing an e-mail application, on mobile device 101 .
- recognized audio 119 may be provided as data to a search engine.
- recognized audio 119 may constitute a search query that is to be input to a search engine.
- the search engine may identify content (e.g., Web pages, images, documents, and the like) that is relevant to the search query, and return that information to the computing device that provided the initial audio.
- the recognized audio may be provided to the computing device prior to searching in order to confirm its accuracy.
- recognized audio 119 may be used to determine advertisements related to the topic of the audio. Such advertisements may be provided in conjunction with output of the audio content.
- FIG. 3 is a block diagram of an example of a system 300 on which the processes of FIGS. 1 and 2 may be implemented.
- input speech may be provided through one or more of communication devices 302 .
- Mobile device 101 of FIG. 1 is an example of a communication device 302 that may be used to perform the processes described herein.
- Resulting audio data may be transmitted to one or more processing entities (e.g., processing entities 308 a and 308 b or more), which may be part of server(s) 304 , for speech recognition performed as described herein.
- Network 306 can represent a mobile communications network that can allow devices (e.g., communication devices 302 ) to communicate wirelessly through a communication interface (not shown), which may include digital signal processing circuitry where appropriate.
- Network 306 can include one or more networks.
- the network(s) may provide for communications under various modes or protocols, e.g., Global System for Mobile communication (GSM) voice calls, Short Message Service (SMS), Enhanced Messaging Service (EMS), or Multimedia Messaging Service (MMS) messaging, Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Personal Digital Cellular (PDC), Wideband Code Division Multiple Access (WCDMA), CDMA2000, or General Packet Radio Service (GPRS), among others.
- Communication devices 302 can include various forms of client devices and personal computing devices. Communication devices 302 can include, but are not limited to, a cellular telephone 302 a , personal digital assistant (PDA) 302 b , and a smartphone 302 c . In other implementations, communication devices 302 may include personal computing devices (not shown), e.g., a laptop computer, a handheld computer, a desktop computer, a tablet computer, a network appliance, a camera, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or a combination of any two or more of these data processing devices or other data processing devices. In some implementations, the personal computing device can be included as part of a motor vehicle (e.g., an automobile, an emergency vehicle (e.g., fire truck, ambulance), a bus).
- Communication devices 302 may each include one or more processing devices 322 , memory 324 , and a storage system 326 .
- Storage system 326 can include a speech conversion module 328 and a mobile operating system module 330 .
- Each processing device 322 can run an operating system included in mobile operating system module 330 to execute software included in speech conversion module 328 .
- speech conversion module 328 may receive input speech 106 and perform any processing necessary to convert the input speech into audio data 110 for recognition.
- Server(s) 304 can include various forms of servers including, but not limited to, a web server, an application server, a proxy server, a network server, a gateway, or a server farm.
- Server(s) 304 can include one or more processing entities 308 a , 308 b . Although only two processing entities are shown, any number may be included in system 300 .
- each entity includes a memory 310 and a storage system 312 .
- Processing entities 308 a , 308 b can be real (e.g., different computers, processors, programmed logic, a combination thereof, etc.) or virtual machines, which can be software implementations of machines that execute programs like physical machines.
- Each storage system 312 can include a speech recognition module 314 , a speech recognition database 316 , and a server operating system module 318 .
- Each processing entity 308 a , 308 b can run an operating system included in the server operating system module 318 to execute software included in the modules that make up speech recognition module 314 .
- the operation of speech recognition module may be spread across various processing entities or performed in a single processing entity.
- a speech recognition module 314 can process received audio data, or a portion thereof, from a communication device 302 (e.g., cellular telephone 302 a ) and use speech recognition database 316 to determine the spoken word content of the speech data.
- Each speech recognition module may include an acoustic model 331 , a language model 332 , and a dictionary 333 .
- acoustic model 331 includes digital representations of individual sounds that are combinable to produce a collection of words, phrases, etc.
- Language model 332 assigns a probability that a sequence of words will occur together in a particular sentence or phrase.
- Dictionary 333 transforms sound sequences into words that can be understood by the language model.
- Speech recognition database 316 includes data for one or more state machines 334 for performing the first stage recognition process described herein and a vector database 335 that includes vectors for known words for performing the second stage recognition process described herein.
- acoustic model 331 includes a first pass module 340 and a second pass module 341 , which implement the first pass and second pass recognition stages described herein.
- First pass module 340 may be a discriminatively trained HMM model (e.g., of the type shown in FIG. 2 ) that uses Gaussian mixtures and PLPs as front-end features. The decoding performed by first pass module 340 may use a trigram language model.
- Second pass module 341 may be an exemplar features-based recognition process, which uses vectors representing segments of audio to identify the content of input audio.
- a combiner module (not shown in FIG. 3 ), which also may be part of the acoustic model, combines scores produced by the first pass module and the second pass module to identify one or more higher-rated recognition candidates for input audio.
- FIG. 4 shows operations performed during a training phase process 400 .
- Process 400 may be performed by speech recognition module 314 of FIG. 3 , either alone or in combination with one or more other appropriate computer programs.
- the training phase includes, among other things, generating a database of vectors for segments of audio; identifying words, phrases or sound sequences that are represented by groups of the vectors; and associating weights and metadata with the vectors.
- the speech recognition system is trained on a corpus of audio.
- the corpus need not be a single source of audio, but rather may contain multiple sources including, e.g., broadcast audio, audio from online sources, speech, music, other sounds, noise and so forth.
- Training includes receiving (401) segments of the audio from the corpus.
- the segments of audio may be of any appropriate length. In this example, the segments are 10 ms.
- the received audio is identified ( 402 ).
- the received audio may be identified using an HMM-based system having one or more state machines.
- the identification process may be completely automatic, e.g., the HMM-based system may identify sounds in the audio; a language model may provide phonetic representations of words composed of those sounds; and a dictionary may transform sound sequences into words that can be understood by the language model.
- the training phase may include making a manual determination about the identity of input audio. For example, a person may identify the audio or confirm the accuracy of the result produced by an HMM-based system. In other implementations, a person may identify the audio without the assistance of the HMM-based system.
- the automatic portion of the recognition may be a system other than an HMM-based system.
- Vectors are generated ( 403 ) for the audio.
- vectors for input audio are generated by performing a Fast Fourier Transform (FFT) on the audio to obtain its frequency components.
- a cosine transformation is performed on the frequency components to obtain features for the vectors.
- thirteen features are obtained.
- First and second derivatives of those features are taken over time to obtain an additional 26 features to produce the full 39 features for a vector.
- PLP perceptual linear prediction
- MFCC Mel frequency cepstrum coefficients
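The vector generation described above (frequency analysis, cosine transformation, thirteen features, and first and second time derivatives for a total of 39) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by the described system: a naive discrete Fourier transform stands in for the FFT (it computes the same values, only more slowly), a logarithm is taken before the cosine transformation, and the derivatives are approximated by central differences. All function names are hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum; an FFT yields the same values faster."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def cepstral_features(frame, num_features=13):
    """Cosine transform (DCT-II) of the log power spectrum (log is an assumption)."""
    log_spec = [math.log(p + 1e-10) for p in power_spectrum(frame)]
    m = len(log_spec)
    return [
        sum(log_spec[j] * math.cos(math.pi * i * (j + 0.5) / m) for j in range(m))
        for i in range(num_features)
    ]


def deltas(seq):
    """Central-difference time derivative of a sequence of feature lists."""
    out = []
    for t in range(len(seq)):
        prev = seq[max(t - 1, 0)]            # clamp at the first frame
        nxt = seq[min(t + 1, len(seq) - 1)]  # clamp at the last frame
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def feature_vectors(frames):
    """13 static features plus first and second derivatives: 39 per frame."""
    static = [cepstral_features(f) for f in frames]
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]
```

Each element of `frames` is assumed to hold the samples of one audio segment (e.g., one 10 ms segment), and each returned vector has 39 entries.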
- Information is associated ( 404 ) with the generated vectors.
- the information may include weights and metadata, including, but not limited to, the weights and metadata described above.
- the applied weights and metadata if appropriate, are used to generate outputs for known audio. Accordingly, a testing phase may be part of the training. If the applied weights do not generate the appropriate output during testing, then the applied weights may be adjusted until the appropriate output is obtained.
- the model weights may be estimated using a maximum mutual information (MMI) training criterion.
- regularization may be used for feature selection.
- regularization may be used to avoid overfitting. Processing may be performed using the general-purpose L-BFGS or Rprop techniques.
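As an illustration of the weight estimation described above, the following sketch fits the weights of a small log-linear model with Rprop. It rests on several assumptions that are not taken from the source: the model scores each candidate as a weighted sum of its features, the training criterion is the conditional log-likelihood of the correct candidate (used here as a stand-in for the MMI criterion), the regularizer is an L2 penalty, and the Rprop variant skips the update after a gradient sign change. The names `objective_and_grad` and `rprop` are hypothetical.

```python
import math


def objective_and_grad(weights, examples, l2=0.1):
    """Negative conditional log-likelihood plus an L2 penalty, with its gradient.

    examples: list of (candidate_feature_lists, index_of_correct_candidate).
    """
    loss = 0.5 * l2 * sum(w * w for w in weights)
    grad = [l2 * w for w in weights]
    for cands, correct in examples:
        scores = [sum(w * f for w, f in zip(weights, feats)) for feats in cands]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        loss -= scores[correct] - top - math.log(z)
        for feats, e in zip(cands, exps):
            p = e / z  # posterior of this candidate under the current weights
            for i, f in enumerate(feats):
                grad[i] += p * f
        for i, f in enumerate(cands[correct]):
            grad[i] -= f
    return loss, grad


def rprop(examples, dim, steps=50, eta_plus=1.2, eta_minus=0.5):
    """Rprop: only the sign of each partial derivative drives the update."""
    weights = [0.0] * dim
    step = [0.1] * dim
    prev = [0.0] * dim
    for _ in range(steps):
        _, grad = objective_and_grad(weights, examples)
        for i in range(dim):
            if prev[i] * grad[i] > 0:
                step[i] = min(step[i] * eta_plus, 1.0)    # same sign: grow the step
            elif prev[i] * grad[i] < 0:
                step[i] = max(step[i] * eta_minus, 1e-6)  # sign flip: shrink the step
                grad[i] = 0.0                             # and skip this update
            if grad[i] > 0:
                weights[i] -= step[i]
            elif grad[i] < 0:
                weights[i] += step[i]
            prev[i] = grad[i]
    return weights
```

The L2 penalty addresses overfitting only; selecting features through regularization would typically call for a sparsity-inducing penalty instead, which this sketch does not attempt.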
- the information associated ( 404 ) with the generated vectors may also identify a word or phrase associated with each vector.
- a single vector will not typically represent an entire word.
- a group of such vectors (e.g., 50) may together represent a word.
- the metadata associated with each vector may identify the word or phrase that the vector is part of, and what part of the word or phrase the vector represents.
- the metadata may specify that the word that a vector is part of is “recognize” and it may specify what part of the word “recognize” that the vector represents (e.g., the first 10 ms, the tenth 10 ms, and so forth).
- a group of vectors is not representative of audio (e.g., a negative representation) and may be indicated as such in metadata.
- Vectors and associated metadata are stored (405) in a database.
- the vectors may be indexed, e.g., by word or words, for retrieval.
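One way to picture the stored vectors, their metadata, and the word index is the sketch below. The class and field names (`StoredVector`, `VectorDatabase`, `position`, `negative`) are hypothetical and chosen only to mirror the metadata mentioned above; an in-memory dictionary is assumed in place of a real database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StoredVector:
    features: list          # e.g., 39 feature values for one 10 ms segment
    word: str               # word or phrase that the vector is part of
    position: int           # which 10 ms segment of the word this is (0-based)
    weight: float = 1.0     # weight associated with the vector during training
    negative: bool = False  # True if the vector is a negative representation


class VectorDatabase:
    """In-memory store of vectors, indexed by word for retrieval."""

    def __init__(self):
        self._by_word = defaultdict(list)

    def store(self, vector):
        self._by_word[vector.word].append(vector)

    def lookup(self, word):
        """Return the vectors for a word, ordered by position within the word."""
        return sorted(self._by_word.get(word, []), key=lambda v: v.position)
```

Looking up a word that was never stored returns an empty list rather than raising an error, which keeps the later retrieval step simple.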
- the training process continues (406) for all or part of the corpus of audio.
- the training may be updated, as desired, using new audio or the same audio.
- FIG. 5 is a flow diagram for an example process 500 for performing speech recognition.
- Process 500 may be performed by speech recognition module 314 of FIG. 3 , either alone or in combination with one or more other appropriate computer programs.
- audio is received (501).
- speech recognition module 314 may receive audio from a computing device, such as mobile device 101 ( FIG. 1 ).
- the input audio referred to herein may include all of the audio received between designated start and stop times, or a portion or snippet thereof.
- the audio is input speech; however, any type of audio may be received.
- the audio may be a recorded musical track, a recorded track associated with video, and so forth.
- Phonemes (“phones”) are identified in the input audio and may be used, as described below, to identify the content of the audio.
- a recognition process is performed ( 502 ) on the input audio.
- the recognition process may be performed by first pass module 340 .
- first pass module 340 is an HMM-based system (e.g., like first pass stage 115 a of FIGS. 1 and 2 ), as described above, which produces scored recognition candidates.
- Candidates for recognition of the input audio are identified ( 503 ) by their scores. For example, one or more candidates with the highest recognition scores may be identified and selected. A predefined number of candidates may be selected, or those within a predefined tolerance of the candidate with the highest score may be selected. Selection criteria other than these may also be used.
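The two selection criteria mentioned above (a predefined number of candidates, or those within a predefined tolerance of the best score) can be sketched as follows. The function name, its parameters, and the convention that higher scores are better are assumptions made for illustration.

```python
def select_candidates(scored, max_candidates=None, tolerance=None):
    """Select first-pass candidates by score.

    scored: list of (candidate, score) pairs, where a higher score is better.
    max_candidates: keep at most this many of the best candidates.
    tolerance: keep only candidates whose score is within this distance
        of the best score.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if tolerance is not None and ranked:
        best = ranked[0][1]
        ranked = [pair for pair in ranked if best - pair[1] <= tolerance]
    if max_candidates is not None:
        ranked = ranked[:max_candidates]
    return ranked
```

Both criteria can be combined: the tolerance filter is applied first, and the count limit then truncates whatever remains.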
- the candidates are provided to second pass module 341 . There, at least some of the following operations may be performed to generate final recognition candidates (e.g., a best recognition candidate).
- Vectors are generated ( 504 ) for the input audio.
- the vectors may be for 10 ms segments of the audio, as described above, and may include appropriate metadata.
- Vectors that may correspond to the input audio are identified ( 505 ) in the database.
- the vectors that are identified are vectors for the words, phrase, etc. of audio recognized in the first pass stage. For example, if the first pass stage has identified candidates of “recognize”, “recognized”, and “ignition”, then vectors corresponding to those words are identified in the database based, e.g., on their associated metadata. For example, a search of an index may be performed to identify the vectors. In some implementations, all vectors corresponding to a recognition candidate are identified.
- a subset of all vectors corresponding to a recognition candidate is identified. For example, vectors with weights that are at, or below, a predefined value, e.g., zero, may be excluded from consideration. In this case, it is possible to reduce the effects of noise or other artifacts on the recognition process. Furthermore, as a result, the amount of processing performed is reduced (since vectors with zero weights need not be processed). Thus, the metadata may be used to reduce the amount of processing performed, since it can result in consideration of fewer numbers of vectors than would otherwise be considered.
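Excluding vectors whose weights are at or below a predefined value amounts to a simple filter, sketched here under the assumption that each stored vector is available as a `(vector, weight)` pair. The function name is hypothetical.

```python
def prune_by_weight(weighted_vectors, threshold=0.0):
    """Keep only (vector, weight) pairs whose weight exceeds the threshold."""
    return [(vec, w) for vec, w in weighted_vectors if w > threshold]
```

With the default threshold of zero, vectors with zero or negative weights are dropped before any distances are computed, which is where the processing savings come from.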
- not all stored vectors may be accurately labeled.
- vectors for “recognize” may be inaccurately labeled as being for “recognition”.
- the effects of inaccurately-labeled vectors may be mitigated in some cases.
- the vectors for the input audio are compared ( 506 ) to the identified vectors for the recognition candidates to determine similarity metrics between the vectors for the input audio and the identified vectors.
- the similarity metric may be based on DTW distances between vectors, as noted above.
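A DTW (dynamic time warping) distance between two sequences of vectors can be computed with the classic dynamic program sketched below. The Euclidean frame distance and the unconstrained step pattern (match, insertion, deletion) are assumptions; the source does not specify which local distance or path constraints are used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b, frame_dist=euclidean):
    """Cost of the cheapest monotonic alignment between two vector sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because a frame in one sequence may align with several frames in the other, two utterances of the same word spoken at different speeds can still produce a small distance.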
- the similarity metrics may be such that they reduce the effects of noise and errors on recognition.
- template features (f) may be based on a segmented word W (e.g., broken into segments of 10 ms) and frame features X associated with this word segment.
- a template feature is set to the average DTW distance between a recognition hypothesis X (e.g., the vector for a recognition candidate from the first pass stage) and the k-nearest vectors Y of X associated with the hypothesis word W, where Y ∈ KNN_W(X) (KNN meaning “k-nearest neighbor” vectors), if the word hypothesis W′ matches the template word W. Otherwise, the template feature is set to zero. This is expressed in the following equation:
  $$f_W(X, W') = \begin{cases} \dfrac{1}{k} \sum_{Y \in \mathrm{KNN}_W(X)} d(X, Y) & \text{if } W' = W \\ 0 & \text{otherwise} \end{cases}$$
- individual DTW distances are used as the template features.
- the DTW distances may be exponentiated to achieve a more sparse representation and thus, in some cases, faster training.
- this non-linearity enables modeling of arbitrary decision boundaries. This is expressed in the following equation:
- $$f_Y^{\text{kernel}}(X, W) = \begin{cases} \exp(-\beta\, d(X, Y)) & \text{if } Y \text{ is a template of } W \\ 0 & \text{otherwise} \end{cases}$$
  β, in the above equation, is a scaling factor, and d(X, Y) is the DTW distance between X and Y.
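The two kinds of template features described above can be sketched as small functions operating on DTW distances that have already been computed. The function names and argument layout are hypothetical; `beta` plays the role of the scaling factor in the kernel equation.

```python
import math


def average_knn_feature(distances, k, hypothesis_word, template_word):
    """Average of the k smallest DTW distances, or 0 if the words do not match.

    distances: DTW distances between the hypothesis and the stored
        templates of template_word.
    """
    if hypothesis_word != template_word or not distances:
        return 0.0
    nearest = sorted(distances)[:k]
    return sum(nearest) / len(nearest)


def kernel_feature(distance, is_template_of_word, beta=1.0):
    """exp(-beta * d(X, Y)) if Y is a template of the word, otherwise 0."""
    return math.exp(-beta * distance) if is_template_of_word else 0.0
```

With the averaged variant there is one feature per word; with the kernel variant there is one feature per stored template, and exponentiation pushes the features of distant templates toward zero.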
- the similarity metric may be adjusted by weights associated with the corresponding vectors.
- the similarity metric may be adjusted in accordance with other metadata associated with the vectors (e.g., the identity(ies) of neighboring words, the context of the audio, and so forth).
- the output of second pass module 341 includes one or more scores (e.g., one or more template features) that are indicative of how well the recognition candidate from first pass module matches vectors from database 335 .
- the scores produced by first pass module 340 are re-scored ( 507 ) using the scores produced by second pass module 341 to identify ( 508 ) which of the recognition candidates best matches the input audio.
- the combination of the template features from the second pass module with the first pass scores is performed using a segmental conditional random field.
- a segmental conditional random field is a conditional random field defined on word lattices.
- the features of the conditional random field are defined on the word arc level.
- language and acoustic model scores are used as features. As a result, the re-scoring result is no worse than the first-pass baseline result.
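The re-scoring step can be illustrated with a simple weighted sum of features per candidate. This is a deliberate simplification, not a segmental conditional random field: it scores isolated candidates rather than arcs in a word lattice, and the feature names (`first_pass`, `template`) and weights are hypothetical.

```python
def rescore(candidates, weights):
    """Combine first-pass scores and template features linearly.

    candidates: {word: {feature_name: value}}
    weights: {feature_name: weight}; missing features get weight 0.
    Returns the best-scoring word and the combined score of every word.
    """
    scores = {
        word: sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for word, feats in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Setting the template weight to zero reproduces the first-pass ranking exactly. Because the first-pass scores are themselves features, the weights can always fall back to that ranking when the template features do not help.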
- the resulting output ( 509 ) of speech recognition module 314 may be applied to language model 332 that generates a phonetic representation of the selected (e.g., best) recognition candidate, along with other appropriate information identifying the word or phrase.
- Dictionary 333 may be used to transform sound sequences into words that can be understood by the language model.
- Data corresponding to the selected recognition candidate is output as a recognized version of the audio.
- speech recognition module may output the data to the appropriate device or process.
- the output may be formatted as part of an XML file, a text transcription, a command or command sequence, a search query, and so forth.
- the data may be presented to the user, either audibly or visually, or it may be used as part of a process either on the user's device or elsewhere.
- a transcription of the input audio may be applied to a translation service, which may be programmed to generate an audio and/or textual translation of the input audio into another, different language (e.g., from English to French) for output to the user's computing device.
- the user may be able to specify the accent or dialect of the target language for the output audio.
- FIG. 6 shows examples of computing devices on which the processes described herein, or portions thereof, may be implemented.
- FIG. 6 shows an example of a generic computing device 600 and a generic mobile computing device 650 , which may be used to implement the processes described herein, or portions thereof.
- server(s) 304 may be implemented on computing device 600 .
- Mobile computing device 650 may represent the mobile device 101 of FIG. 1 .
- Computing device 600 is intended to represent various forms of digital computers, examples of which include laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
- Computing device 650 is intended to represent various forms of mobile devices, examples of which include personal digital assistants, cellular telephones, smartphones, and other similar computing devices.
- the components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit the implementations described and/or claimed in this document.
- Computing device 600 includes a processor 602 , memory 604 , a storage device 606 , a high-speed interface 608 connecting to memory 604 and high-speed expansion ports 610 , and a low speed interface 612 connecting to low speed bus 614 and storage device 606 .
- Components 602 , 604 , 606 , 608 , 610 , and 612 are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate.
- the processor 602 may process instructions for execution within the computing device 600 , including instructions stored in the memory 604 or on the storage device 606 to display graphical information for a GUI on an external input/output device, for example, display 616 coupled to high speed interface 608 .
- multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory.
- multiple computing devices 600 may be connected, with a device providing a portion of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
- the memory 604 stores information within the computing device 600 .
- in some implementations, the memory 604 is a volatile memory unit or units.
- in other implementations, the memory 604 is a non-volatile memory unit or units.
- the memory 604 may also be another form of computer-readable medium, examples of which include a magnetic or optical disk.
- the storage device 606 is capable of providing mass storage for the computing device 600 .
- the storage device 606 may be or contain a computer-readable medium, examples of which include a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
- a computer program product may be tangibly embodied in an information carrier.
- the computer program product may also contain instructions that, when executed, perform one or more methods, including those described above.
- the information carrier may be a non-transitory computer- or machine-readable medium, for example, the memory 604 , the storage device 606 , or memory on processor 602 .
- the information carrier may be a non-transitory, machine-readable storage medium.
- the high speed controller 608 manages bandwidth-intensive operations for the computing device 600 , while the low speed controller 612 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only.
- the high-speed controller 608 is coupled to memory 604 , display 616 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 610 , which may accept various expansion cards (not shown).
- low-speed controller 612 is coupled to storage device 606 and low-speed expansion port 614 .
- the low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices (e.g., a keyboard, a pointing device, or a scanner) or to a networking device (e.g., a switch or router), for example through a network adapter.
- the computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 620 , or multiple times in a group of such servers. It may also be implemented as part of a rack server system 624 . In addition, it may be implemented in a personal computer, e.g., a laptop computer 622 . Alternatively, components from computing device 600 may be combined with other components in a mobile device (not shown), e.g., device 650 . Such devices may contain one or more of computing device 600 , 650 , and an entire system may be made up of multiple computing devices 600 , 650 communicating with one another.
- Computing device 650 includes a processor 652 , memory 664 , an input/output device, e.g., a display 654 , a communication interface 666 , and a transceiver 668 , among other components.
- the device 650 may also be provided with a storage device, e.g., a microdrive or other device, to provide additional storage.
- the components 650 , 652 , 664 , 654 , 666 , and 668 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
- the processor 652 may execute instructions within the computing device 650 , including instructions stored in the memory 664 .
- the processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors.
- the processor may provide, for example, for coordination of the other components of the device 650 , e.g., control of user interfaces, applications run by device 650 , and wireless communication by device 650 .
- Processor 652 may communicate with a user through control interface 658 and display interface 656 coupled to a display 654 .
- the display 654 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
- the display interface 656 may include appropriate circuitry for driving the display 654 to present graphical and other information to a user.
- the control interface 658 may receive commands from a user and convert them for submission to the processor 652 .
- an external interface 662 may be provided in communication with processor 652 , so as to enable near area communication of device 650 with other devices. External interface 662 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
- the memory 664 stores information within the computing device 650 .
- the memory 664 may be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
- Expansion memory 674 may also be provided and connected to device 650 through expansion interface 672 , which may include, for example, a SIMM (Single In Line Memory Module) card interface.
- expansion memory 674 may provide extra storage space for device 650 , or may also store applications or other information for device 650 .
- expansion memory 674 may include instructions to carry out or supplement the processes described above, and may include secure information also.
- expansion memory 674 may be provided as a security module for device 650 , and may be programmed with instructions that permit secure use of device 650 .
- secure applications may be provided by the SIMM cards, along with additional information, e.g., placing identifying information on the SIMM card in a non-hackable manner.
- the memory may include, for example, flash memory and/or NVRAM memory, as discussed below.
- a computer program product is tangibly embodied in an information carrier.
- the computer program product contains instructions that, when executed, perform one or more methods, including those described above.
- the information carrier is a computer- or machine-readable medium, e.g., the memory 664 , expansion memory 674 , memory on processor 652 , and so forth that may be received, for example, over transceiver 668 or external interface 662 .
- Device 650 may communicate wirelessly through communication interface 666 , which may include digital signal processing circuitry where necessary. Communication interface 666 may provide for communications under various modes or protocols, examples of which include GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 668 . In addition, short-range communication may occur, e.g., using a Bluetooth, Wi-Fi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 670 may provide additional navigation- and location-related wireless data to device 650 , which may be used as appropriate by applications running on device 650 .
- Device 650 may also communicate audibly using audio codec 660 , which may receive spoken information from a user and convert it to usable digital information. Audio codec 660 may likewise generate audible sound for a user, e.g., through a speaker, e.g., in a handset of device 650 . Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice electronic messages, music files, etc.) and may also include sound generated by applications operating on device 650 .
- the computing device 650 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 680 . It may also be implemented as part of a smartphone 682 , personal digital assistant, or other similar mobile device.
- implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof.
- These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
- the systems and techniques described here may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer.
- Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be a form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in a form, including acoustic, speech, or tactile input.
- the systems and techniques described here may be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the systems and techniques described here), or a combination of such back end, middleware, or front end components.
- the components of the system may be interconnected by a form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
- the computing system may include clients and servers.
- a client and server are generally remote from one another and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to one another.
- the engines described herein may be separated, combined or incorporated into a single or combined engine.
- the engines depicted in the figures are not intended to limit the systems described here to the software architectures shown in the figures.
- the users may be provided with an opportunity to control whether programs or features collect personal information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current location), or to control whether and/or how to receive content from the content server that may be more relevant to the user.
- certain data may be anonymized in one or more ways before it is stored or used, so that personally identifiable information is removed.
- a user's identity may be anonymized so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined.
- the user may have control over how information is collected about him or her and used by a content server.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- User Interface Of Digital Computer (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Accordingly, in this example, there is one template feature for each word.
β, in the above equation, is a scaling factor.
Claims (16)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/665,245 US8775177B1 (en) | 2012-03-08 | 2012-10-31 | Speech recognition process |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201261608218P | 2012-03-08 | 2012-03-08 | |
US13/665,245 US8775177B1 (en) | 2012-03-08 | 2012-10-31 | Speech recognition process |
Publications (1)
Publication Number | Publication Date |
---|---|
US8775177B1 true US8775177B1 (en) | 2014-07-08 |
Family
ID=51031892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/665,245 Active US8775177B1 (en) | 2012-03-08 | 2012-10-31 | Speech recognition process |
Country Status (1)
Country | Link |
---|---|
US (1) | US8775177B1 (en) |
Cited By (83)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140211669A1 (en) * | 2013-01-28 | 2014-07-31 | Pantech Co., Ltd. | Terminal to communicate data using voice command, and method and system thereof |
US20140244255A1 (en) * | 2013-02-25 | 2014-08-28 | Seiko Epson Corporation | Speech recognition device and method, and semiconductor integrated circuit device |
WO2016167779A1 (en) * | 2015-04-16 | 2016-10-20 | Mitsubishi Electric Corporation | Speech recognition device and rescoring device |
US9502032B2 (en) | 2014-10-08 | 2016-11-22 | Google Inc. | Dynamically biasing language models |
EP3267328A1 (en) * | 2016-07-07 | 2018-01-10 | Samsung Electronics Co., Ltd | Automated interpretation method and apparatus |
US9978367B2 (en) | 2016-03-16 | 2018-05-22 | Google Llc | Determining dialog states for language models |
US10311860B2 (en) | 2017-02-14 | 2019-06-04 | Google Llc | Language model biasing system |
US20190303442A1 (en) * | 2018-03-30 | 2019-10-03 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
CN111199730A (en) * | 2020-01-08 | 2020-05-26 | 北京松果电子有限公司 | Voice recognition method, device, terminal and storage medium |
US10720160B2 (en) | 2018-06-01 | 2020-07-21 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10878809B2 (en) | 2014-05-30 | 2020-12-29 | Apple Inc. | Multi-command single utterance input method |
US10896681B2 (en) | 2015-12-29 | 2021-01-19 | Google Llc | Speech recognition with selective use of dynamic language models |
US10978090B2 (en) | 2013-02-07 | 2021-04-13 | Apple Inc. | Voice trigger for a digital assistant |
US11009970B2 (en) | 2018-06-01 | 2021-05-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US11037565B2 (en) | 2016-06-10 | 2021-06-15 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11070949B2 (en) | 2015-05-27 | 2021-07-20 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display |
US11087759B2 (en) | 2015-03-08 | 2021-08-10 | Apple Inc. | Virtual assistant activation |
US11107475B2 (en) * | 2019-05-09 | 2021-08-31 | Rovi Guides, Inc. | Word correction using automatic speech recognition (ASR) incremental response |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11152002B2 (en) | 2016-06-11 | 2021-10-19 | Apple Inc. | Application integration with a digital assistant |
US11169616B2 (en) | 2018-05-07 | 2021-11-09 | Apple Inc. | Raise to speak |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11257504B2 (en) | 2014-05-30 | 2022-02-22 | Apple Inc. | Intelligent assistant for home automation |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11340925B2 (en) | 2017-05-18 | 2022-05-24 | Peloton Interactive Inc. | Action recipes for a crowdsourced digital assistant system |
US11348582B2 (en) | 2008-10-02 | 2022-05-31 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11380310B2 (en) | 2017-05-12 | 2022-07-05 | Apple Inc. | Low-latency intelligent automated assistant |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US20220223157A1 (en) * | 2021-01-11 | 2022-07-14 | Bank Of America Corporation | System and method for single-speaker identification in a multi-speaker environment on a low-frequency audio recording |
US11405466B2 (en) | 2017-05-12 | 2022-08-02 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US11431642B2 (en) | 2018-06-01 | 2022-08-30 | Apple Inc. | Variable latency device coordination |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US11516537B2 (en) | 2014-06-30 | 2022-11-29 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11520610B2 (en) * | 2017-05-18 | 2022-12-06 | Peloton Interactive Inc. | Crowdsourced on-boarding of digital assistant operations |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US11580990B2 (en) | 2017-05-12 | 2023-02-14 | Apple Inc. | User-specific acoustic models |
US11599331B2 (en) | 2017-05-11 | 2023-03-07 | Apple Inc. | Maintaining privacy of personal information |
US11656884B2 (en) | 2017-01-09 | 2023-05-23 | Apple Inc. | Application integration with a digital assistant |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11671920B2 (en) | 2007-04-03 | 2023-06-06 | Apple Inc. | Method and system for operating a multifunction portable electronic device using voice-activation |
US11675829B2 (en) | 2017-05-16 | 2023-06-13 | Apple Inc. | Intelligent automated assistant for media exploration |
US11682380B2 (en) | 2017-05-18 | 2023-06-20 | Peloton Interactive Inc. | Systems and methods for crowdsourced actions and commands |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
US11710482B2 (en) | 2018-03-26 | 2023-07-25 | Apple Inc. | Natural assistant interaction |
US11727219B2 (en) | 2013-06-09 | 2023-08-15 | Apple Inc. | System and method for inferring user intent from speech inputs |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11765209B2 (en) | 2020-05-11 | 2023-09-19 | Apple Inc. | Digital assistant hardware abstraction |
US11783815B2 (en) | 2019-03-18 | 2023-10-10 | Apple Inc. | Multimodality in digital assistant systems |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US11809483B2 (en) | 2015-09-08 | 2023-11-07 | Apple Inc. | Intelligent automated assistant for media search and playback |
US11809783B2 (en) | 2016-06-11 | 2023-11-07 | Apple Inc. | Intelligent device arbitration and control |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11854539B2 (en) | 2018-05-07 | 2023-12-26 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11853647B2 (en) | 2015-12-23 | 2023-12-26 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US11853536B2 (en) | 2015-09-08 | 2023-12-26 | Apple Inc. | Intelligent automated assistant in a media environment |
US11862156B2 (en) | 2017-05-18 | 2024-01-02 | Peloton Interactive, Inc. | Talk back from actions in applications |
US11886805B2 (en) | 2015-11-09 | 2024-01-30 | Apple Inc. | Unconventional virtual assistant interactions |
US11893992B2 (en) | 2018-09-28 | 2024-02-06 | Apple Inc. | Multi-modal inputs for voice commands |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11947873B2 (en) | 2015-06-29 | 2024-04-02 | Apple Inc. | Virtual assistant for media playback |
US12010262B2 (en) | 2013-08-06 | 2024-06-11 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US12014118B2 (en) | 2017-05-15 | 2024-06-18 | Apple Inc. | Multi-modal interfaces having selection disambiguation and text modification capability |
US12051413B2 (en) | 2015-09-30 | 2024-07-30 | Apple Inc. | Intelligent device identification |
US12067985B2 (en) | 2018-06-01 | 2024-08-20 | Apple Inc. | Virtual assistant operations in multi-device environments |
US12073147B2 (en) | 2013-06-09 | 2024-08-27 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US12087308B2 (en) | 2010-01-18 | 2024-09-10 | Apple Inc. | Intelligent automated assistant |
Citations (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4571697A (en) | 1981-12-29 | 1986-02-18 | Nippon Electric Co., Ltd. | Apparatus for calculating pattern dissimilarity between patterns |
US4783802A (en) | 1984-10-02 | 1988-11-08 | Kabushiki Kaisha Toshiba | Learning system of dictionary for speech recognition |
US4860358A (en) | 1983-09-12 | 1989-08-22 | American Telephone And Telegraph Company, At&T Bell Laboratories | Speech recognition arrangement with preselection |
US4908865A (en) | 1984-12-27 | 1990-03-13 | Texas Instruments Incorporated | Speaker independent speech recognition method and system |
US5131043A (en) | 1983-09-05 | 1992-07-14 | Matsushita Electric Industrial Co., Ltd. | Method of and apparatus for speech recognition wherein decisions are made based on phonemes |
US5864810A (en) * | 1995-01-20 | 1999-01-26 | Sri International | Method and apparatus for speech recognition adapted to an individual speaker |
US5946653A (en) | 1997-10-01 | 1999-08-31 | Motorola, Inc. | Speaker independent speech recognition system and method |
US5953699A (en) | 1996-10-28 | 1999-09-14 | Nec Corporation | Speech recognition using distance between feature vector of one sequence and line segment connecting feature-variation-end-point vectors in another sequence |
US5983177A (en) * | 1997-12-18 | 1999-11-09 | Nortel Networks Corporation | Method and apparatus for obtaining transcriptions from multiple training utterances |
US6104989A (en) | 1998-07-29 | 2000-08-15 | International Business Machines Corporation | Real time detection of topical changes and topic identification via likelihood based methods |
US6336108B1 (en) * | 1997-12-04 | 2002-01-01 | Microsoft Corporation | Speech recognition with mixtures of bayesian networks |
US6529902B1 (en) | 1999-11-08 | 2003-03-04 | International Business Machines Corporation | Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling |
US20030055640A1 (en) * | 2001-05-01 | 2003-03-20 | Ramot University Authority For Applied Research & Industrial Development Ltd. | System and method for parameter estimation for pattern recognition |
US6542866B1 (en) * | 1999-09-22 | 2003-04-01 | Microsoft Corporation | Speech recognition method and apparatus utilizing multiple feature streams |
US20050119885A1 (en) * | 2003-11-28 | 2005-06-02 | Axelrod Scott E. | Speech recognition utilizing multitude of speech features |
US20050149326A1 (en) | 2004-01-05 | 2005-07-07 | Kabushiki Kaisha Toshiba | Speech recognition system and technique |
US20050203738A1 (en) * | 2004-03-10 | 2005-09-15 | Microsoft Corporation | New-word pronunciation learning using a pronunciation graph |
US6983246B2 (en) | 2002-05-21 | 2006-01-03 | Thinkengine Networks, Inc. | Dynamic time warping using frequency distributed distance measures |
US20060129392A1 (en) | 2004-12-13 | 2006-06-15 | Lg Electronics Inc | Method for extracting feature vectors for speech recognition |
US20060149710A1 (en) | 2004-12-30 | 2006-07-06 | Ross Koningstein | Associating features with entities, such as categories of web page documents, and/or weighting such features |
US20060184360A1 (en) * | 1999-04-20 | 2006-08-17 | Hy Murveit | Adaptive multi-pass speech recognition system |
US20070037513A1 (en) | 2005-08-15 | 2007-02-15 | International Business Machines Corporation | System and method for targeted message delivery and subscription |
US20070100618A1 (en) | 2005-11-02 | 2007-05-03 | Samsung Electronics Co., Ltd. | Apparatus, method, and medium for dialogue speech recognition using topic domain detection |
US20070106685A1 (en) | 2005-11-09 | 2007-05-10 | Podzinger Corp. | Method and apparatus for updating speech recognition databases and reindexing audio and video content using the same |
US20070118372A1 (en) | 2005-11-23 | 2007-05-24 | General Electric Company | System and method for generating closed captions |
US7272558B1 (en) | 2006-12-01 | 2007-09-18 | Coveo Solutions Inc. | Speech recognition training method for audio and video file indexing on a search engine |
US20070265849A1 (en) * | 2006-05-11 | 2007-11-15 | General Motors Corporation | Distinguishing out-of-vocabulary speech from in-vocabulary speech |
US7310601B2 (en) | 2004-06-08 | 2007-12-18 | Matsushita Electric Industrial Co., Ltd. | Speech recognition apparatus and speech recognition method |
US20080195389A1 (en) * | 2007-02-12 | 2008-08-14 | Microsoft Corporation | Text-dependent speaker verification |
US20080201143A1 (en) | 2007-02-15 | 2008-08-21 | Forensic Intelligence Detection Organization | System and method for multi-modal audio mining of telephone conversations |
US20090030697A1 (en) | 2007-03-07 | 2009-01-29 | Cerra Joseph P | Using contextual information for delivering results generated from a speech recognition facility using an unstructured language model |
US20090055185A1 (en) | 2007-04-16 | 2009-02-26 | Motoki Nakade | Voice chat system, information processing apparatus, speech recognition method, keyword data electrode detection method, and program |
US20090055381A1 (en) | 2007-08-23 | 2009-02-26 | Google Inc. | Domain Dictionary Creation |
US20100070268A1 (en) * | 2008-09-10 | 2010-03-18 | Jun Hyung Sung | Multimodal unification of articulation for device interfacing |
US20100161313A1 (en) | 2008-12-18 | 2010-06-24 | Palo Alto Research Center Incorporated | Region-Matching Transducers for Natural Language Processing |
US7761296B1 (en) * | 1999-04-02 | 2010-07-20 | International Business Machines Corporation | System and method for rescoring N-best hypotheses of an automatic speech recognition system |
US20100305947A1 (en) * | 2009-06-02 | 2010-12-02 | Nuance Communications, Inc. | Speech Recognition Method for Selecting a Combination of List Elements via a Speech Input |
US20110004462A1 (en) | 2009-07-01 | 2011-01-06 | Comcast Interactive Media, Llc | Generating Topic-Specific Language Models |
US20110035210A1 (en) * | 2009-08-10 | 2011-02-10 | Benjamin Rosenfeld | Conditional random fields (crf)-based relation extraction system |
US20110077943A1 (en) | 2006-06-26 | 2011-03-31 | Nec Corporation | System for generating language model, method of generating language model, and program for language model generation |
US20110131046A1 (en) * | 2009-11-30 | 2011-06-02 | Microsoft Corporation | Features for utilization in speech recognition |
US8001066B2 (en) | 2002-07-03 | 2011-08-16 | Sean Colbath | Systems and methods for improving recognition results via user-augmentation of a database |
US20110296374A1 (en) | 2008-11-05 | 2011-12-01 | Google Inc. | Custom language models |
US20120022873A1 (en) | 2009-12-23 | 2012-01-26 | Ballinger Brandon M | Speech Recognition Language Models |
US20120029910A1 (en) | 2009-03-30 | 2012-02-02 | Touchtype Ltd | System and Method for Inputting Text into Electronic Devices |
US20120072215A1 (en) * | 2010-09-21 | 2012-03-22 | Microsoft Corporation | Full-sequence training of deep structures for speech recognition |
US20120150532A1 (en) | 2010-12-08 | 2012-06-14 | At&T Intellectual Property I, L.P. | System and method for feature-rich continuous space language models |
Patent Citations (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4571697A (en) | 1981-12-29 | 1986-02-18 | Nippon Electric Co., Ltd. | Apparatus for calculating pattern dissimilarity between patterns |
US5131043A (en) | 1983-09-05 | 1992-07-14 | Matsushita Electric Industrial Co., Ltd. | Method of and apparatus for speech recognition wherein decisions are made based on phonemes |
US4860358A (en) | 1983-09-12 | 1989-08-22 | American Telephone And Telegraph Company, At&T Bell Laboratories | Speech recognition arrangement with preselection |
US4783802A (en) | 1984-10-02 | 1988-11-08 | Kabushiki Kaisha Toshiba | Learning system of dictionary for speech recognition |
US4908865A (en) | 1984-12-27 | 1990-03-13 | Texas Instruments Incorporated | Speaker independent speech recognition method and system |
US5864810A (en) * | 1995-01-20 | 1999-01-26 | Sri International | Method and apparatus for speech recognition adapted to an individual speaker |
US5953699A (en) | 1996-10-28 | 1999-09-14 | Nec Corporation | Speech recognition using distance between feature vector of one sequence and line segment connecting feature-variation-end-point vectors in another sequence |
US5946653A (en) | 1997-10-01 | 1999-08-31 | Motorola, Inc. | Speaker independent speech recognition system and method |
US6336108B1 (en) * | 1997-12-04 | 2002-01-01 | Microsoft Corporation | Speech recognition with mixtures of bayesian networks |
US5983177A (en) * | 1997-12-18 | 1999-11-09 | Nortel Networks Corporation | Method and apparatus for obtaining transcriptions from multiple training utterances |
US6104989A (en) | 1998-07-29 | 2000-08-15 | International Business Machines Corporation | Real time detection of topical changes and topic identification via likelihood based methods |
US7761296B1 (en) * | 1999-04-02 | 2010-07-20 | International Business Machines Corporation | System and method for rescoring N-best hypotheses of an automatic speech recognition system |
US20060184360A1 (en) * | 1999-04-20 | 2006-08-17 | Hy Murveit | Adaptive multi-pass speech recognition system |
US6542866B1 (en) * | 1999-09-22 | 2003-04-01 | Microsoft Corporation | Speech recognition method and apparatus utilizing multiple feature streams |
US6529902B1 (en) | 1999-11-08 | 2003-03-04 | International Business Machines Corporation | Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling |
US20030055640A1 (en) * | 2001-05-01 | 2003-03-20 | Ramot University Authority For Applied Research & Industrial Development Ltd. | System and method for parameter estimation for pattern recognition |
US6983246B2 (en) | 2002-05-21 | 2006-01-03 | Thinkengine Networks, Inc. | Dynamic time warping using frequency distributed distance measures |
US8001066B2 (en) | 2002-07-03 | 2011-08-16 | Sean Colbath | Systems and methods for improving recognition results via user-augmentation of a database |
US20050119885A1 (en) * | 2003-11-28 | 2005-06-02 | Axelrod Scott E. | Speech recognition utilizing multitude of speech features |
US20050149326A1 (en) | 2004-01-05 | 2005-07-07 | Kabushiki Kaisha Toshiba | Speech recognition system and technique |
US20050203738A1 (en) * | 2004-03-10 | 2005-09-15 | Microsoft Corporation | New-word pronunciation learning using a pronunciation graph |
US7310601B2 (en) | 2004-06-08 | 2007-12-18 | Matsushita Electric Industrial Co., Ltd. | Speech recognition apparatus and speech recognition method |
US20060129392A1 (en) | 2004-12-13 | 2006-06-15 | Lg Electronics Inc | Method for extracting feature vectors for speech recognition |
US20060149710A1 (en) | 2004-12-30 | 2006-07-06 | Ross Koningstein | Associating features with entities, such as categories of web page documents, and/or weighting such features |
US20070037513A1 (en) | 2005-08-15 | 2007-02-15 | International Business Machines Corporation | System and method for targeted message delivery and subscription |
US20070100618A1 (en) | 2005-11-02 | 2007-05-03 | Samsung Electronics Co., Ltd. | Apparatus, method, and medium for dialogue speech recognition using topic domain detection |
US8301450B2 (en) | 2005-11-02 | 2012-10-30 | Samsung Electronics Co., Ltd. | Apparatus, method, and medium for dialogue speech recognition using topic domain detection |
US20070106685A1 (en) | 2005-11-09 | 2007-05-10 | Podzinger Corp. | Method and apparatus for updating speech recognition databases and reindexing audio and video content using the same |
US20070118372A1 (en) | 2005-11-23 | 2007-05-24 | General Electric Company | System and method for generating closed captions |
US20070265849A1 (en) * | 2006-05-11 | 2007-11-15 | General Motors Corporation | Distinguishing out-of-vocabulary speech from in-vocabulary speech |
US20110077943A1 (en) | 2006-06-26 | 2011-03-31 | Nec Corporation | System for generating language model, method of generating language model, and program for language model generation |
US7272558B1 (en) | 2006-12-01 | 2007-09-18 | Coveo Solutions Inc. | Speech recognition training method for audio and video file indexing on a search engine |
US20080195389A1 (en) * | 2007-02-12 | 2008-08-14 | Microsoft Corporation | Text-dependent speaker verification |
US20080201143A1 (en) | 2007-02-15 | 2008-08-21 | Forensic Intelligence Detection Organization | System and method for multi-modal audio mining of telephone conversations |
US20090030697A1 (en) | 2007-03-07 | 2009-01-29 | Cerra Joseph P | Using contextual information for delivering results generated from a speech recognition facility using an unstructured language model |
US20090055185A1 (en) | 2007-04-16 | 2009-02-26 | Motoki Nakade | Voice chat system, information processing apparatus, speech recognition method, keyword data electrode detection method, and program |
US7983902B2 (en) | 2007-08-23 | 2011-07-19 | Google Inc. | Domain dictionary creation by detection of new topic words using divergence value comparison |
US20090055381A1 (en) | 2007-08-23 | 2009-02-26 | Google Inc. | Domain Dictionary Creation |
US20100070268A1 (en) * | 2008-09-10 | 2010-03-18 | Jun Hyung Sung | Multimodal unification of articulation for device interfacing |
US20110296374A1 (en) | 2008-11-05 | 2011-12-01 | Google Inc. | Custom language models |
US20100161313A1 (en) | 2008-12-18 | 2010-06-24 | Palo Alto Research Center Incorporated | Region-Matching Transducers for Natural Language Processing |
US20120029910A1 (en) | 2009-03-30 | 2012-02-02 | Touchtype Ltd | System and Method for Inputting Text into Electronic Devices |
US20100305947A1 (en) * | 2009-06-02 | 2010-12-02 | Nuance Communications, Inc. | Speech Recognition Method for Selecting a Combination of List Elements via a Speech Input |
US20110004462A1 (en) | 2009-07-01 | 2011-01-06 | Comcast Interactive Media, Llc | Generating Topic-Specific Language Models |
US20110035210A1 (en) * | 2009-08-10 | 2011-02-10 | Benjamin Rosenfeld | Conditional random fields (crf)-based relation extraction system |
US20110131046A1 (en) * | 2009-11-30 | 2011-06-02 | Microsoft Corporation | Features for utilization in speech recognition |
US20120022873A1 (en) | 2009-12-23 | 2012-01-26 | Ballinger Brandon M | Speech Recognition Language Models |
US20120072215A1 (en) * | 2010-09-21 | 2012-03-22 | Microsoft Corporation | Full-sequence training of deep structures for speech recognition |
US20120150532A1 (en) | 2010-12-08 | 2012-06-14 | At&T Intellectual Property I, L.P. | System and method for feature-rich continuous space language models |
Non-Patent Citations (25)
Cited By (136)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11979836B2 (en) | 2007-04-03 | 2024-05-07 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US11671920B2 (en) | 2007-04-03 | 2023-06-06 | Apple Inc. | Method and system for operating a multifunction portable electronic device using voice-activation |
US11348582B2 (en) | 2008-10-02 | 2022-05-31 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11900936B2 (en) | 2008-10-02 | 2024-02-13 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US12087308B2 (en) | 2010-01-18 | 2024-09-10 | Apple Inc. | Intelligent automated assistant |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11321116B2 (en) | 2012-05-15 | 2022-05-03 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US20140211669A1 (en) * | 2013-01-28 | 2014-07-31 | Pantech Co., Ltd. | Terminal to communicate data using voice command, and method and system thereof |
US11862186B2 (en) | 2013-02-07 | 2024-01-02 | Apple Inc. | Voice trigger for a digital assistant |
US12009007B2 (en) | 2013-02-07 | 2024-06-11 | Apple Inc. | Voice trigger for a digital assistant |
US11636869B2 (en) | 2013-02-07 | 2023-04-25 | Apple Inc. | Voice trigger for a digital assistant |
US10978090B2 (en) | 2013-02-07 | 2021-04-13 | Apple Inc. | Voice trigger for a digital assistant |
US11557310B2 (en) | 2013-02-07 | 2023-01-17 | Apple Inc. | Voice trigger for a digital assistant |
US9886947B2 (en) * | 2013-02-25 | 2018-02-06 | Seiko Epson Corporation | Speech recognition device and method, and semiconductor integrated circuit device |
US20140244255A1 (en) * | 2013-02-25 | 2014-08-28 | Seiko Epson Corporation | Speech recognition device and method, and semiconductor integrated circuit device |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US12073147B2 (en) | 2013-06-09 | 2024-08-27 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US11727219B2 (en) | 2013-06-09 | 2023-08-15 | Apple Inc. | System and method for inferring user intent from speech inputs |
US12010262B2 (en) | 2013-08-06 | 2024-06-11 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US11257504B2 (en) | 2014-05-30 | 2022-02-22 | Apple Inc. | Intelligent assistant for home automation |
US11699448B2 (en) | 2014-05-30 | 2023-07-11 | Apple Inc. | Intelligent assistant for home automation |
US11810562B2 (en) | 2014-05-30 | 2023-11-07 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10878809B2 (en) | 2014-05-30 | 2020-12-29 | Apple Inc. | Multi-command single utterance input method |
US12067990B2 (en) | 2014-05-30 | 2024-08-20 | Apple Inc. | Intelligent assistant for home automation |
US11670289B2 (en) | 2014-05-30 | 2023-06-06 | Apple Inc. | Multi-command single utterance input method |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US12118999B2 (en) | 2014-05-30 | 2024-10-15 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11516537B2 (en) | 2014-06-30 | 2022-11-29 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11838579B2 (en) | 2014-06-30 | 2023-12-05 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9502032B2 (en) | 2014-10-08 | 2016-11-22 | Google Inc. | Dynamically biasing language models |
US11087759B2 (en) | 2015-03-08 | 2021-08-10 | Apple Inc. | Virtual assistant activation |
US11842734B2 (en) | 2015-03-08 | 2023-12-12 | Apple Inc. | Virtual assistant activation |
WO2016167779A1 (en) * | 2015-04-16 | 2016-10-20 | Mitsubishi Electric Corporation | Speech recognition device and rescoring device |
US12001933B2 (en) | 2015-05-15 | 2024-06-04 | Apple Inc. | Virtual assistant in a communication session |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US11070949B2 (en) | 2015-05-27 | 2021-07-20 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display |
US11947873B2 (en) | 2015-06-29 | 2024-04-02 | Apple Inc. | Virtual assistant for media playback |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US11809483B2 (en) | 2015-09-08 | 2023-11-07 | Apple Inc. | Intelligent automated assistant for media search and playback |
US11853536B2 (en) | 2015-09-08 | 2023-12-26 | Apple Inc. | Intelligent automated assistant in a media environment |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US11954405B2 (en) | 2015-09-08 | 2024-04-09 | Apple Inc. | Zero latency digital assistant |
US11550542B2 (en) | 2015-09-08 | 2023-01-10 | Apple Inc. | Zero latency digital assistant |
US12051413B2 (en) | 2015-09-30 | 2024-07-30 | Apple Inc. | Intelligent device identification |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11809886B2 (en) | 2015-11-06 | 2023-11-07 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11886805B2 (en) | 2015-11-09 | 2024-01-30 | Apple Inc. | Unconventional virtual assistant interactions |
US11853647B2 (en) | 2015-12-23 | 2023-12-26 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10896681B2 (en) | 2015-12-29 | 2021-01-19 | Google Llc | Speech recognition with selective use of dynamic language models |
US11810568B2 (en) | 2015-12-29 | 2023-11-07 | Google Llc | Speech recognition with selective use of dynamic language models |
US9978367B2 (en) | 2016-03-16 | 2018-05-22 | Google Llc | Determining dialog states for language models |
US10553214B2 (en) | 2016-03-16 | 2020-02-04 | Google Llc | Determining dialog states for language models |
US11037565B2 (en) | 2016-06-10 | 2021-06-15 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11657820B2 (en) | 2016-06-10 | 2023-05-23 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11809783B2 (en) | 2016-06-11 | 2023-11-07 | Apple Inc. | Intelligent device arbitration and control |
US11152002B2 (en) | 2016-06-11 | 2021-10-19 | Apple Inc. | Application integration with a digital assistant |
US11749275B2 (en) | 2016-06-11 | 2023-09-05 | Apple Inc. | Application integration with a digital assistant |
CN107590135A (en) * | 2016-07-07 | 2018-01-16 | 三星电子株式会社 | Automatic translating method, equipment and system |
US20180011843A1 (en) * | 2016-07-07 | 2018-01-11 | Samsung Electronics Co., Ltd. | Automatic interpretation method and apparatus |
EP3267328A1 (en) * | 2016-07-07 | 2018-01-10 | Samsung Electronics Co., Ltd | Automated interpretation method and apparatus |
CN107590135B (en) * | 2016-07-07 | 2024-01-05 | 三星电子株式会社 | Automatic translation method, device and system |
US10867136B2 (en) * | 2016-07-07 | 2020-12-15 | Samsung Electronics Co., Ltd. | Automatic interpretation method and apparatus |
US11656884B2 (en) | 2017-01-09 | 2023-05-23 | Apple Inc. | Application integration with a digital assistant |
US11682383B2 (en) | 2017-02-14 | 2023-06-20 | Google LLC | Language model biasing system |
US10311860B2 (en) | 2017-02-14 | 2019-06-04 | Google LLC | Language model biasing system |
US11037551B2 (en) | 2017-02-14 | 2021-06-15 | Google LLC | Language model biasing system |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US11599331B2 (en) | 2017-05-11 | 2023-03-07 | Apple Inc. | Maintaining privacy of personal information |
US11837237B2 (en) | 2017-05-12 | 2023-12-05 | Apple Inc. | User-specific acoustic models |
US11862151B2 (en) | 2017-05-12 | 2024-01-02 | Apple Inc. | Low-latency intelligent automated assistant |
US11405466B2 (en) | 2017-05-12 | 2022-08-02 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US11580990B2 (en) | 2017-05-12 | 2023-02-14 | Apple Inc. | User-specific acoustic models |
US11538469B2 (en) | 2017-05-12 | 2022-12-27 | Apple Inc. | Low-latency intelligent automated assistant |
US11380310B2 (en) | 2017-05-12 | 2022-07-05 | Apple Inc. | Low-latency intelligent automated assistant |
US12014118B2 (en) | 2017-05-15 | 2024-06-18 | Apple Inc. | Multi-modal interfaces having selection disambiguation and text modification capability |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US12026197B2 (en) | 2017-05-16 | 2024-07-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US11675829B2 (en) | 2017-05-16 | 2023-06-13 | Apple Inc. | Intelligent automated assistant for media exploration |
US11340925B2 (en) | 2017-05-18 | 2022-05-24 | Peloton Interactive Inc. | Action recipes for a crowdsourced digital assistant system |
US11682380B2 (en) | 2017-05-18 | 2023-06-20 | Peloton Interactive Inc. | Systems and methods for crowdsourced actions and commands |
US11520610B2 (en) * | 2017-05-18 | 2022-12-06 | Peloton Interactive Inc. | Crowdsourced on-boarding of digital assistant operations |
US12093707B2 (en) | 2017-05-18 | 2024-09-17 | Peloton Interactive Inc. | Action recipes for a crowdsourced digital assistant system |
US11862156B2 (en) | 2017-05-18 | 2024-01-02 | Peloton Interactive, Inc. | Talk back from actions in applications |
US11710482B2 (en) | 2018-03-26 | 2023-07-25 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) * | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US20190303442A1 (en) * | 2018-03-30 | 2019-10-03 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US11169616B2 (en) | 2018-05-07 | 2021-11-09 | Apple Inc. | Raise to speak |
US11854539B2 (en) | 2018-05-07 | 2023-12-26 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11487364B2 (en) | 2018-05-07 | 2022-11-01 | Apple Inc. | Raise to speak |
US11907436B2 (en) | 2018-05-07 | 2024-02-20 | Apple Inc. | Raise to speak |
US11900923B2 (en) | 2018-05-07 | 2024-02-13 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US12080287B2 (en) | 2018-06-01 | 2024-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11360577B2 (en) | 2018-06-01 | 2022-06-14 | Apple Inc. | Attention aware virtual assistant dismissal |
US11630525B2 (en) | 2018-06-01 | 2023-04-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US11431642B2 (en) | 2018-06-01 | 2022-08-30 | Apple Inc. | Variable latency device coordination |
US10984798B2 (en) | 2018-06-01 | 2021-04-20 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11009970B2 (en) | 2018-06-01 | 2021-05-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US12061752B2 (en) | 2018-06-01 | 2024-08-13 | Apple Inc. | Attention aware virtual assistant dismissal |
US12067985B2 (en) | 2018-06-01 | 2024-08-20 | Apple Inc. | Virtual assistant operations in multi-device environments |
US10720160B2 (en) | 2018-06-01 | 2020-07-21 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11893992B2 (en) | 2018-09-28 | 2024-02-06 | Apple Inc. | Multi-modal inputs for voice commands |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11783815B2 (en) | 2019-03-18 | 2023-10-10 | Apple Inc. | Multimodality in digital assistant systems |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11705130B2 (en) | 2019-05-06 | 2023-07-18 | Apple Inc. | Spoken notifications |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11675491B2 (en) | 2019-05-06 | 2023-06-13 | Apple Inc. | User configurable task triggers |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11651775B2 (en) * | 2019-05-09 | 2023-05-16 | Rovi Guides, Inc. | Word correction using automatic speech recognition (ASR) incremental response |
US20230252997A1 (en) * | 2019-05-09 | 2023-08-10 | Rovi Guides, Inc. | Word correction using automatic speech recognition (asr) incremental response |
US11107475B2 (en) * | 2019-05-09 | 2021-08-31 | Rovi Guides, Inc. | Word correction using automatic speech recognition (ASR) incremental response |
US20210350807A1 (en) * | 2019-05-09 | 2021-11-11 | Rovi Guides, Inc. | Word correction using automatic speech recognition (asr) incremental response |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11888791B2 (en) | 2019-05-21 | 2024-01-30 | Apple Inc. | Providing message response suggestions |
US11360739B2 (en) | 2019-05-31 | 2022-06-14 | Apple Inc. | User activity shortcut suggestions |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
CN111199730A (en) * | 2020-01-08 | 2020-05-26 | 北京松果电子有限公司 | Voice recognition method, device, terminal and storage medium |
US11765209B2 (en) | 2020-05-11 | 2023-09-19 | Apple Inc. | Digital assistant hardware abstraction |
US11924254B2 (en) | 2020-05-11 | 2024-03-05 | Apple Inc. | Digital assistant hardware abstraction |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11750962B2 (en) | 2020-07-21 | 2023-09-05 | Apple Inc. | User identification using headphones |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
US20220223157A1 (en) * | 2021-01-11 | 2022-07-14 | Bank Of America Corporation | System and method for single-speaker identification in a multi-speaker environment on a low-frequency audio recording |
US11521623B2 (en) * | 2021-01-11 | 2022-12-06 | Bank Of America Corporation | System and method for single-speaker identification in a multi-speaker environment on a low-frequency audio recording |
Similar Documents
Publication | Title |
---|---|
US8775177B1 (en) | Speech recognition process |
US11854545B2 (en) | Privacy mode based on speaker identifier |
US11990127B2 (en) | User recognition for speech processing systems |
US11496582B2 (en) | Generation of automated message responses |
US11270685B2 (en) | Speech based user recognition |
US20240221737A1 (en) | Recognizing speech in the presence of additional audio |
US11061644B2 (en) | Maintaining context for voice processes |
US11594215B2 (en) | Contextual voice user interface |
US9972318B1 (en) | Interpreting voice commands |
US9293136B2 (en) | Multiple recognizer speech recognition |
US11830485B2 (en) | Multiple speech processing system with synthesized speech styles |
US10917758B1 (en) | Voice-based messaging |
US10448115B1 (en) | Speech recognition for localized content |
CN107967916B (en) | Determining phonetic relationships |
US10565989B1 (en) | Ingesting device specific content |
US9135912B1 (en) | Updating phonetic dictionaries |
US11810556B2 (en) | Interactive content output |
US9263033B2 (en) | Utterance selection for automated speech recognizer training |
US20210241760A1 (en) | Speech-processing system |
US11887583B1 (en) | Updating models with trained model update objects |
US11468897B2 (en) | Systems and methods related to automated transcription of voice communications |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: GOOGLE INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: HEIGOLD, GEORG; NGUYEN, PATRICK AN; WEINTRAUB, MITCHEL; AND OTHERS; SIGNING DATES FROM 20121009 TO 20121016; REEL/FRAME: 029885/0923 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
CC | Certificate of correction | |
AS | Assignment | Owner name: GOOGLE LLC, CALIFORNIA. Free format text: CHANGE OF NAME; ASSIGNOR: GOOGLE INC.; REEL/FRAME: 044277/0001. Effective date: 20170929 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551). Year of fee payment: 4 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |