WO2018151772A1 - Server side hotwording - Google Patents

Server side hotwording

Info

Publication number
WO2018151772A1
WO2018151772A1
Authority
WO
WIPO (PCT)
Prior art keywords
key phrase
threshold
utterances
audio signal
client device
Application number
PCT/US2017/058944
Other languages
French (fr)
Inventor
Alexander H. Gruenstein
Petar Aleksic
Johan Schalkwyk
Pedro J. Moreno Mengibar
Original Assignee
Google Llc
Application filed by Google Llc
Priority to EP20194706.6A (EP3767623A1)
Priority to CN202310534112.4A (CN116504238A)
Priority to CN201780086256.0A (CN110268469B)
Priority to EP17804349.3A (EP3559944B1)
Priority to JP2019543379A (JP6855588B2)
Priority to KR1020197025555A (KR102332944B1)
Publication of WO2018151772A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/32 - Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L2015/088 - Word spotting
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command

Definitions

  • Automatic speech recognition is one technology that is used in mobile devices among other types of devices.
  • One task that is a common goal for this technology is to be able to use voice commands to wake up a device and have basic spoken interactions with the device. For example, it may be desirable for the device to recognize a "hotword" that signals that the device should activate when the device is in a sleep state.
  • a system may use two thresholds to determine whether a user spoke a key phrase.
  • a client device included in the system uses the first, lower threshold to determine whether a portion of the words spoken by the user is the same as a portion of the key phrase. For instance, when the key phrase is "okay google," the client device may use the first, lower threshold to determine whether the user spoke "okay," "okay g," or "okay google."
  • When the client device determines that the portion of the words spoken by the user is the same as a portion of the key phrase, the client device sends data for the words to a server.
  • the server uses a second, higher threshold to determine whether the words spoken by the user are the same as the key phrase.
  • the server analyzes the entire key phrase to determine whether the user spoke the key phrase.
  • the server may parse other words spoken by the user to generate data for an action that the client device should perform.
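  • As a purely illustrative, non-authoritative sketch of this two-threshold split (not part of the original disclosure), the control flow might look like the following. The scoring functions, thresholds, and the key phrase "okay google" are stand-ins: a real client would run a small on-device detector and a real server would run full speech recognition models.

```python
from typing import Optional

FIRST_THRESHOLD = 0.50   # client side: permissive, cheap to evaluate
SECOND_THRESHOLD = 0.90  # server side: restrictive, uses larger models
KEY_PHRASE = "okay google"

def client_score(prefix: str) -> float:
    """Stand-in for the on-device detector, which looks only at the first words."""
    return 1.0 if KEY_PHRASE.startswith(prefix) else 0.0

def server_score(utterances: str) -> float:
    """Stand-in for the server-side recognizer, which checks the full key phrase."""
    return 1.0 if utterances.startswith(KEY_PHRASE) else 0.0

def handle(utterances: str, prefix_words: int = 1) -> Optional[str]:
    prefix = " ".join(utterances.split()[:prefix_words])
    if client_score(prefix) < FIRST_THRESHOLD:
        return None                                   # client discards the audio
    if server_score(utterances) < SECOND_THRESHOLD:
        return None                                   # server: key phrase not spoken
    return utterances[len(KEY_PHRASE):].strip()       # words to parse into an action

print(handle("okay google play some music"))   # -> "play some music"
print(handle("okay computer play the radio"))  # -> None (rejected at the server)
```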
  • a client device may receive an audio signal that encodes one or more utterances.
  • An utterance is the vocalization (a user speaking) of a word or words that represent a single meaning to the computer.
  • An utterance can be a single word, a few words, a sentence, or even multiple sentences.
  • An utterance may thereby comprise an n-gram, i.e., a contiguous sequence of n items (with n equal to or greater than 1) from a given sequence of text or speech.
  • the items can be phonemes, syllables, letters, words or base pairs, to name a few examples.
  • the client device uses a first threshold to determine whether one or more first utterances encoded at the beginning of the audio signal satisfy a first threshold of being a key phrase.
  • the client device may analyze a portion of an utterance, a single utterance from the one or more first utterances when the key phrase includes multiple words, or both.
  • an utterance corresponding to a key phrase, for which it is determined whether it satisfies the first threshold, will usually consist of speech items such as at least a plurality of phonemes, at least a plurality of syllables, or one or more words, in order to make the key phrase in some sense unique and distinguishable from accidentally and generally frequently spoken utterances such as single letters or single phonemes.
  • When the client device determines that the one or more first utterances satisfy the first threshold of being a key phrase, the client device sends the audio signal to a speech recognition system, e.g., included on a server separate from the client device, for additional analysis.
  • the speech recognition system receives the audio signal.
  • the speech recognition system analyzes the one or more first utterances to determine whether the one or more first utterances satisfy a second threshold of being the key phrase.
  • The second threshold is more restrictive than the first threshold, e.g., the first threshold (e.g., fifty percent) is less accurate, or lower, than the second threshold (e.g., seventy-five or ninety percent).
  • a corresponding system may determine that the likelihood of the one or more first utterances being the key phrase is greater than, or greater than or equal to, the respective threshold.
  • the speech recognition system receives, from the client device, data for the entire audio signal including the one or more first utterances so that the speech recognition system can analyze all of the data included in the audio signal.
  • the entire audio signal may include multiple n-grams uttered by the user after the one or more first utterances, at least as long as the n-grams fall within a certain time window or, by some other metric, are within a maximum distance of the one or more first utterances.
  • the speech recognition system receives the entire audio signal from the client device when the client device determines that at least a portion of the one or more first utterances satisfy the first threshold of being the key phrase.
  • This may allow the server to improve automated speech analysis of the audio signal because of the greater amount of resources available to the server compared to the client device, the more robust analysis models available at the server (e.g., trained on a larger number of speech items) than at the client, or both, thereby improving recognition.
  • the client device may analyze a prefix or a portion of one of the first utterances. For instance, when the key phrase is "Okay Google", the client device may determine that the one or more first utterances encode "Okay G" or "Okay", without analyzing all of the second utterance and, in response, send the audio signal to the speech recognition system.
  • the one or more first utterances for which it is determined whether they meet the first threshold may consist of only a portion of the key phrase, e.g., the beginning portion of the key phrase.
  • “other utterances” or “second utterances”, which comprise speech items after the first utterances, e.g., within a threshold distance of the first utterances, may be sent to the server together with the first utterances to be analyzed by the server as to whether the combination of the first and second utterances together meets the second threshold, i.e., whether there is a match with the key phrase.
  • the client device may send, with the data for the audio signal and to the speech recognition system, data for the key phrase.
  • the data for the key phrase may be text representing the key phrase, or an identifier, e.g., for the client device, which the speech recognition system may use to determine the key phrase.
  • the speech recognition system may use the data for the key phrase to determine whether the one or more first utterances included in the audio signal satisfy the second threshold of being the key phrase.
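  • As an illustration only, the data sent with the audio signal might resemble the request body below. The field names and encoding are hypothetical; the description above only requires that either text for the key phrase or an identifier usable to look the key phrase up accompanies the audio signal.

```python
import base64
import json

def build_request(audio_bytes: bytes, key_phrase: str = None, device_id: str = None) -> str:
    """Hypothetical request body sent from the client to the speech recognition
    system: the audio signal plus either the key-phrase text or an identifier the
    server can use to look the key phrase up in a database."""
    body = {"audio": base64.b64encode(audio_bytes).decode("ascii")}
    if key_phrase is not None:
        body["key_phrase"] = key_phrase      # e.g., "okay google"
    elif device_id is not None:
        body["device_id"] = device_id        # resolved to a key phrase server-side
    return json.dumps(body)

print(build_request(b"\x00\x01\x02", key_phrase="okay google"))
print(build_request(b"\x00\x01\x02", device_id="client-102"))
```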
  • one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving an audio signal encoding one or more utterances including a first utterance; determining whether at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase; in response to determining that at least the portion of the first utterance satisfies the first threshold of being at least a portion of a key phrase, sending the audio signal to a server system that determines whether the first utterance satisfies a second threshold of being the key phrase, the second threshold being more restrictive than the first threshold; and receiving, from the server system, tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
  • one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving, from a client device, an audio signal encoding one or more utterances including one or more first utterances for which the client device determined that at least a portion of the one or more first utterances satisfies a first threshold of being at least a portion of a key phrase;
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
  • the method may include performing an action using the tagged text data subsequent to receiving, from the server system, the tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.
  • the one or more utterances may include two or more utterances, the first utterance encoded prior to the other utterances from the two or more utterances in the audio signal.
  • Performing the action using the tagged text data may include performing an action using the tagged text data for the one or more utterances encoded in the audio signal after the first utterance.
  • Determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase may include determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase that includes two or more words.
  • the method may include receiving a second audio signal encoding one or more second utterances including a second utterance;
  • the method may include determining to not perform an action using data from the second audio signal in response to determining that at least the portion of the second utterance does not satisfy the first threshold of being at least a portion of a key phrase.
  • Determining whether at least a portion of the first utterance satisfies the first threshold of being a key phrase may include determining whether at least a portion of the first utterance satisfies a first likelihood of being at least a portion of a key phrase.
  • sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase may include sending, to the client device, data indicating that the key phrase is not likely included in the audio signal in response to determining that the one or more first utterances do not satisfy the second threshold of being the key phrase.
  • Sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase may include sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase.
  • Sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase may include sending, to the client device, tagged text data representing the one or more utterances encoded in the audio signal.
  • the method may include analyzing the entire audio signal to determine first data for each of the one or more utterances.
  • Sending, to the client device, the data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase may include sending, to the client device, the first data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase.
  • determining whether the one or more first utterances satisfy the second threshold of being the key phrase may include determining, using a language model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
  • a language model may define a probability distribution over a sequence of speech items, e.g., phonemes, syllables, or words, that indicates a likelihood that the sequence of speech items occurs in the order specified by the sequence.
  • one or more computers match sounds with speech item sequences defined by the language model and determine corresponding probabilities for the speech item sequences. Using the probabilities in the language model, the one or more computers can distinguish between speech items, e.g., words and phrases, that sound similar given the context of the individual speech items in the sequence, e.g., given the order in which the speech items occur in the sequence.
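  • As a toy illustration of such a language model (with invented counts, not data from this disclosure), a bigram model assigns a word sequence a probability equal to the product of each word's probability given the previous word, so frequently observed orderings score higher than rare ones:

```python
from collections import defaultdict

# Invented bigram counts, purely for illustration.
bigram_counts = {
    ("<s>", "ok"): 90, ("<s>", "okay"): 80, ("ok", "google"): 70,
    ("okay", "google"): 60, ("ok", "cool"): 10, ("google", "play"): 40,
}
unigram_counts = defaultdict(int)
for (first, _), count in bigram_counts.items():
    unigram_counts[first] += count

def sequence_probability(words):
    """P(sequence) as a product of bigram probabilities P(word | previous word)."""
    prob = 1.0
    prev = "<s>"                       # sentence-start symbol
    for word in words:
        pair = (prev, word)
        if unigram_counts[prev] == 0 or pair not in bigram_counts:
            return 0.0                 # unseen transition (no smoothing here)
        prob *= bigram_counts[pair] / unigram_counts[prev]
        prev = word
    return prob

print(sequence_probability(["ok", "google"]))  # higher: a likely word order
print(sequence_probability(["google", "ok"]))  # zero: an unlikely word order
```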
  • the method may include customizing the language model for the key phrase prior to determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
  • the method may include receiving text identifying the key phrase.
  • Customizing the language model for the key phrase may include customizing the language model for the key phrase using the text identifying the key phrase.
  • the method may include receiving an identifier; and determining, using the identifier, key phrase data for the key phrase.
  • Customizing the language model for the key phrase may include customizing the language model for the key phrase using the key phrase data.
  • Determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase may include determining, using the language model and an acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
  • An acoustic model may define a mapping between a phoneme, or another speech item, and a vocalization of the phoneme, or the corresponding other speech item.
  • a computer may use an acoustic model in an automatic speech recognition process to determine the relationship between a vocalization of a speech item encoded in an audio signal and the corresponding speech item.
  • Determining, using the language model and the acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase may include providing data for the one or more first utterances to the language model to cause the language model to generate a first output; providing data for the one or more first utterances to the acoustic model to cause the acoustic model to generate a second output; combining the first output and the second output to generate a combined output; and determining, using the combined output, whether the one or more first utterances satisfy the second threshold of being the key phrase.
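  • One possible combination rule, sketched below with hypothetical interpolation weights, is a weighted log-linear (geometric-mean) combination of the two model outputs compared against the second threshold; the description above does not prescribe a particular combination rule.

```python
import math

SECOND_THRESHOLD = 0.90
LM_WEIGHT = 0.4   # hypothetical interpolation weights
AM_WEIGHT = 0.6

def combined_key_phrase_score(lm_probability: float, am_probability: float) -> float:
    """Weighted geometric mean of the language-model and acoustic-model outputs."""
    log_score = (LM_WEIGHT * math.log(max(lm_probability, 1e-12)) +
                 AM_WEIGHT * math.log(max(am_probability, 1e-12)))
    return math.exp(log_score)

def satisfies_second_threshold(lm_probability: float, am_probability: float) -> bool:
    return combined_key_phrase_score(lm_probability, am_probability) >= SECOND_THRESHOLD

print(satisfies_second_threshold(0.95, 0.97))  # True: both models agree strongly
print(satisfies_second_threshold(0.95, 0.40))  # False: acoustic evidence is weak
```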
  • the method may include selecting the language model for a default key phrase.
  • the method may include determining whether to use the default key phrase.
  • the systems and methods described in this document may reduce resources used by a client device during hotword analysis with a first, lower threshold, improve an accuracy of hotword analysis by using a second, more restrictive threshold at a speech recognition system, or both.
  • the systems and methods described below may more accurately parse, segment, or both, text in an audio signal, e.g., may more accurately identify a key phrase encoded in the audio signal separate from other utterances encoded in the audio signal, by sending an entire audio signal, that includes the key phrase, to a speech recognition system for analysis.
  • the systems and methods described below may reduce client processing time, send an audio signal to a speech recognition system more quickly, or both, compared to other systems, when the client uses a lower hotword analysis threshold than a more restrictive hotword analysis threshold used by the speech recognition system.
  • the systems and methods described below may reduce bandwidth usage when the client device sends fewer audio signals to a server system for analysis when an utterance does not satisfy a first, lower threshold.
  • FIG. 1 is an example of an environment in which a client device analyzes an audio signal using a first threshold and a speech recognition system analyzes the audio signal using a second threshold that is more restrictive than the first threshold.
  • FIG. 2 is a flow diagram of a process for determining whether to perform an action.
  • FIG. 3 is a flow diagram of a process for generating tagged text data for an audio signal.
  • FIG. 4 is a block diagram of a computing system that can be used in connection with computer-implemented methods described in this document.
  • FIG. 1 is an example of an environment 100 in which a client device 102 analyzes an audio signal using a first threshold and a speech recognition system 112 analyzes the audio signal using a second threshold that is more restrictive than the first threshold.
  • the client device 102 uses the first threshold to determine whether the audio signal encodes at least a portion of a key phrase.
  • When the client device 102 determines that the audio signal satisfies the first threshold of being the key phrase, the client device 102 sends the audio signal to the speech recognition system 112 that uses the second threshold to determine whether the audio signal encodes the entire key phrase.
  • the client device 102 may send not only the audio signal representing a portion of the key phrase, which has been recognized, but the entire audio signal, or at least parts of the audio signal lying within a certain range after the part representing the recognized portion of the key phrase, to the speech recognition system 112.
  • the speech recognition system 112 may provide the client device 102 with tagged text data of the speech recognized utterances encoded in the audio signal to allow the client device 102 to perform an action based on the audio signal.
  • the tagged text data may comprise the speech recognized utterances and "tags", which may represent actions to be performed or which otherwise identify a category of text within the recognized utterances, such that the client device 102 can identify the tag and the speech recognized utterances that correspond to the tag.
  • the client device 102 may use the tagged text data to determine an action to perform, e.g., instructions to execute, to determine which portions of the speech recognized utterances to analyze when determining whether to perform an action, or both.
  • the client device 102 includes a microphone 104 that captures the audio signal.
  • the client device 102 may be in a lower powered state, e.g., standby, while the microphone 104 captures at least part of the audio signal.
  • the at least part of the audio signal may be the entire audio signal, one or more first utterances included in the audio signal, or a different part of the beginning of the audio signal.
  • One example of utterances encoded in an audio signal is "ok google play some music.” In this example, the first utterances may be "ok" or "ok google.”
  • the microphone 104 provides the audio signal, or some of the audio signal as the audio signal is captured, to a client hotword detection module 106.
  • the microphone 104 or a combination of components in the client device 102, may provide portions of the audio signal to the client hotword detection module 106 as the audio signal is captured by the microphone 104.
  • the client hotword detection module 106 determines whether the audio signal satisfies a first threshold 108. For instance, the client hotword detection module 106 may analyze at least a portion of the one or more first utterances, included at the beginning of the audio signal, to determine whether the portion of the one or more first utterances satisfies the first threshold 108 of being a key phrase. The portion of the first utterances may be "ok" or "ok google." One example of a key phrase may be "ok google." In some examples, the client hotword detection module 106 is configured to detect occurrence of only one key phrase. In some implementations, the client hotword detection module 106 is configured to detect occurrence of any of multiple different key phrases, e.g., ten key phrases. The multiple different key phrases include a limited number of different key phrases for which the client hotword detection module 106 is trained.
  • the client hotword detection module 106 may determine a likelihood that at least a portion of the first utterances are the same as at least a portion of the key phrase. For that purpose, the client hotword detection module 106 may apply any known automated speech recognition approach, which segments the at least a portion of the first utterances into phonemes or other linguistic units and uses an acoustic model and/or language model to obtain a likelihood whether the first utterances match a key phrase or a portion of a key phrase.
  • the portion of the key phrase may be the beginning portion of the key phrase, e.g., that includes the speech items at the beginning of the key phrase.
  • the client hotword detection module 106 may compare the likelihood with the first threshold 108.
  • the client device 102 may send the audio signal to the speech recognition system 112, e.g., located on one or more servers.
  • the client device 102 may take no further action based on the utterances included in the audio signal, e.g., and may discard the audio signal.
  • the client hotword detection module 106 may determine that the key phrase is "ok google" and that the utterance "ok", as one of the first utterances in the audio signal, satisfies the first threshold 108 of being part of the key phrase. In some examples, the client hotword detection module 106 may determine that the utterance "ok google" from the audio signal satisfies the first threshold 108 of being part of the key phrase, e.g., the entire key phrase.
  • the client hotword detection module 106 may determine whether a total length of the first utterances matches a length for the key phrases. For instance, the client hotword detection module 106 may determine that a time during which the one or more first utterances were spoken matches an average time for the key phrase to be spoken. The average time may be for a user of the client device 102 or for multiple different people, e.g., including the user of the client device 102.
  • the client hotword detection module 106 may determine that the total length of the first utterances and a total number of n-grams, e.g., words, included in the first utterances matches a total length of the key phrase and a number of n-grams included in the key phrase, e.g., when only analyzing a portion of a first utterance or of the first utterances. For instance, the client hotword detection module 106 may determine a number of silences between the first utterances that indicates the number of first utterances.
  • the client hotword detection module 106 may compare the number of first utterances, the spoken length of the first utterances, or both, with a total number of words in the key phrase, the spoken length of the key phrase, or both. When the client hotword detection module 106 determines that the total number of first utterances and the total number of words in the key phrase are the same, that the spoken length of the first utterances is within a threshold amount from the spoken length of the key phrase, or both, the client hotword detection module 106 may determine that the first utterances in the audio signal satisfy the first threshold 108 of being the key phrase, e.g., when at least a portion of the first utterances satisfy the first threshold 108 of being a portion of the key phrase.
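  • A minimal sketch of this length and word-count check, with invented durations and tolerances (not values from the disclosure), might look as follows:

```python
# Hypothetical values; a real module would estimate these from the audio signal
# and from per-user or population statistics for the key phrase.
KEY_PHRASE_WORD_COUNT = 2        # e.g., "ok google"
AVERAGE_SPOKEN_SECONDS = 0.9     # average time for the key phrase to be spoken
DURATION_TOLERANCE = 0.3         # allowed deviation, in seconds

def matches_key_phrase_shape(num_first_utterances: int, spoken_seconds: float) -> bool:
    """True when the first utterances resemble the key phrase by word count,
    spoken length, or both."""
    same_count = num_first_utterances == KEY_PHRASE_WORD_COUNT
    similar_length = abs(spoken_seconds - AVERAGE_SPOKEN_SECONDS) <= DURATION_TOLERANCE
    return same_count or similar_length

print(matches_key_phrase_shape(2, 1.0))  # True: matching word count and duration
print(matches_key_phrase_shape(3, 2.5))  # False: too many words, spoken too long
```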
  • the first utterances may satisfy the first threshold 108 of being a key phrase when the likelihood is greater than the first threshold 108.
  • the first utterances may satisfy the first threshold 108 of being a key phrase when the likelihood is greater than or equal to the first threshold 108.
  • the first utterances do not satisfy the first threshold 108 of being a key phrase when the likelihood is less than the first threshold 108.
  • the first utterances might not satisfy the first threshold 108 of being a key phrase when the likelihood is less than or equal to the first threshold 108.
  • In response to determining that at least a portion of the first utterances satisfies the first threshold 108 of being at least a portion of a key phrase, the client device 102, at time TB, sends the audio signal to the speech recognition system 112.
  • the speech recognition system 112 receives the audio signal and uses a server hotword detection module 114 to determine, at time TC, whether the audio signal satisfies a second threshold 116 of being the key phrase. For instance, the speech recognition system 112 uses the server hotword detection module 114 to determine whether the audio signal satisfies the second threshold 116 of being a key phrase.
  • the second threshold 116 is more restrictive than the first threshold 108.
  • the server hotword detection module 114, using the second threshold 116, is less likely to incorrectly determine that the first utterances represent the same text as a key phrase, e.g., are a false positive, compared to the client hotword detection module 106 using the first threshold 108.
  • When the thresholds are likelihoods, the first threshold 108 has a lower numerical value than the second threshold 116.
  • the server hotword detection module 114 may use a language model 118, an acoustic model 120, or both, to determine whether the one or more first utterances satisfy the second threshold 116 of being a key phrase.
  • the language model 118, and the acoustic model 120 are each trained using a large amount of training data, e.g., compared to the client hotword detection module 106.
  • the language model 118, the acoustic model 120, or both may be trained using 30,000 hours of training data.
  • the client hotword detection module 106 may be trained using 100 hours of training data.
  • the server hotword detection module 114 may create a hotword biasing model that includes the language model 118, the acoustic model 120, or both, on the fly, for use in analyzing the audio signal.
  • a hotword biasing model may be a combination of a language model, which defines a probability distribution over a sequence of speech items, and an acoustic model, which defines a mapping between speech items and corresponding vocalizations of the speech items, that is specific to a few key phrases or hotwords.
  • the speech recognition system 112 may create a hotword biasing model for the client device 102 that is specific to the key phrase or the key phrases for which the client device 102 analyzed the one or more first utterances.
  • the server hotword detection module 114 may receive data from the client device 102 that identifies a key phrase for which the server hotword detection module 114 will analyze the audio signal to determine whether the client device 102 should wake up, perform an action, or both.
  • the data that identifies the key phrase may be text data for the key phrase, e.g., a text string, or an identifier for the client device 102, e.g., either of which may be included in the request to analyze the audio signal received from the client device 102.
  • the server hotword detection module 114 may use the identifier for the client device 102 to access a database and determine the key phrase for the client device 102 and the audio signal.
  • the server hotword detection module 114 may use the determined key phrase or key phrases for the client device 102 to create a hotword biasing model for the client device 102 using an existing language model 118, an existing acoustic model 120, or both, already stored in a memory of the speech recognition system 112.
  • the server hotword detection module 114 may use a pre-built hotword biasing model. For instance, the server hotword detection module 114 may analyze multiple audio signals from the client device 102 or from multiple different client devices, all of which are for the same key phrase, using the same hotword biasing model.
  • the hotword biasing model may identify one or more n-grams for which the hotword biasing model performs analysis. For instance, when the key phrase is "ok google,” the hotword biasing model may generate scores for one or more of the n-grams " ⁇ S> ok google,” " ⁇ S> ok,” or "ok google,” where ⁇ S> denotes silence at the beginning of a sentence.
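  • A hypothetical helper that builds such an n-gram list from a key phrase, mirroring the "ok google" example above, might be:

```python
def biasing_ngrams(key_phrase: str, silence_token: str = "<S>"):
    """Construct the n-grams a hotword biasing model might score for a key phrase,
    e.g. "ok google" -> ["<S> ok google", "<S> ok", "ok google"]."""
    words = key_phrase.split()
    ngrams = [f"{silence_token} {key_phrase}"]                    # full phrase after silence
    for i in range(1, len(words)):
        ngrams.append(f"{silence_token} {' '.join(words[:i])}")   # prefixes after silence
    ngrams.append(key_phrase)                                     # full phrase, no silence
    return ngrams

print(biasing_ngrams("ok google"))
# ['<S> ok google', '<S> ok', 'ok google']
```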
  • One or both of the language model 118 or the acoustic model 120 may use the n-grams for the hotword biasing model to determine whether the audio signal includes the key phrase.
  • the language model 118 may use one or more of the n-grams to generate a score that indicates a likelihood that the audio signal includes the key phrase.
  • the language model 118 may use the n-grams or some of the n-grams to increase a likelihood that the key phrase is correctly identified in the audio signal when the one or more first utterances are the same as the key phrase.
  • the language model 118 may add the key phrase, e.g., "ok google," to the language model 118 to increase the likelihood that the key phrase is identified, e.g., compared to when the language model 118 does not already include the key phrase.
  • the acoustic model 120 may use one or more of the n-grams to generate a score that indicates a likelihood that the audio signal includes the key phrase. For example, the acoustic model 120 may generate multiple scores for different phrases, including the key phrase, and select the score for the key phrase as output.
  • the server hotword detection module 114 may receive the two scores from the language model 118 and the acoustic model 120. The server hotword detection module 114 may combine the two scores to determine an overall score for the audio signal. The server hotword detection module 114 may compare the overall score with the second threshold 116. When the overall score satisfies the second threshold 116, the server hotword detection module 114 determines that the audio signal likely encodes the key phrase. When the overall score does not satisfy the second threshold 116, the server hotword detection module 114 determines that the audio signal likely does not encode the key phrase.
  • the speech recognition system 112 may send a message to the client device 102 indicating that the audio signal does not likely encode the key phrase. In some examples, the speech recognition system 112 might not send the client device 102 a message upon determining that the audio signal likely does not encode the key phrase.
  • a tagged text generator 122 When the server hotword detection module 114 determines that the audio signal likely encodes the key phrase, a tagged text generator 122 generates tagged text for the audio signal.
  • the tagged text generator 122 may receive data from the language model 118, the acoustic model 120, or both, that indicates the n-grams encoded in the audio signal. For instance, the tagged text generator 122 may receive data from the acoustic model 120 that indicates scores for n-grams that are likely encoded in the audio signal, data representing the n-grams that are encoded in the audio signal, or other appropriate data.
  • the tagged text generator 122 uses the data from the language model 118, the acoustic model 120, or both, to generate tags for the n-grams encoded in the audio signal. For example, when the audio signal encodes "ok google play some music," the tagged text generator 122 may generate data representing the string " ⁇ hotword biasing> ok google ⁇ /hotword biasing> play some music". The tag " ⁇ hotword biasing>” thereby identifies the first string "ok google” as a hotword.
  • the tag " ⁇ /hotword biasing>" identifies both the end of the hotword and indicates that the following string likely includes an instruction for the client device 102 a) which has been recognized by an automated speech recognition process and b) which the client device 102 should analyze to determine whether the client device 102 can execute a corresponding instruction.
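  • A hypothetical tagging step matching the example above might simply wrap the recognized key phrase in the tags so the client can separate it from the rest of the transcript; the exact tag format is an implementation choice, not specified here beyond the example strings:

```python
def tag_transcript(transcript: str, key_phrase: str) -> str:
    """Wrap the key phrase in hotword-biasing tags, following the example above."""
    if not transcript.startswith(key_phrase):
        return transcript                          # key phrase not at the start
    remainder = transcript[len(key_phrase):].strip()
    return f"<hotword biasing> {key_phrase} </hotword biasing> {remainder}".strip()

print(tag_transcript("ok google play some music", "ok google"))
# -> <hotword biasing> ok google </hotword biasing> play some music
```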
  • the speech recognition system 112 provides the tagged text for the audio signal to the client device 102 at time TD.
  • the client device 102 receives the tagged text and analyzes the tagged text to determine an action to perform. For instance, the client device 102 may use the tags included in the text to determine which portion of the text corresponds to the key phrase, e.g., the one or more first utterances, and which portion of the text corresponds to an action for the client device 102 to perform. For example, the client device 102 may determine, using the text "play some music," to launch a music player application and play music. The client device 102 may provide a user prompt requesting input of a music genre, a music station, an artist, or another type of music for playback using the music player application.
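  • On the client side, one hypothetical way to use the tags is to strip out the tagged key phrase and map the remaining text to an action; the dispatch rule below is invented for illustration and only echoes the "play some music" example above.

```python
import re

def extract_command(tagged_text: str) -> str:
    """Remove the tagged key phrase and keep the portion describing the action."""
    return re.sub(r"<hotword biasing>.*?</hotword biasing>", "", tagged_text).strip()

def dispatch(command: str) -> str:
    # Toy mapping; a real client would resolve the command to application instructions.
    if command.startswith("play"):
        return "launch music player and prompt for a genre, station, or artist"
    return "no matching action"

tagged = "<hotword biasing> ok google </hotword biasing> play some music"
print(dispatch(extract_command(tagged)))
```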
  • the client device 102 may be configured to detect any of multiple different key phrases encoded in an audio signal. For example, the client device 102 may receive input representing a user specified hotword, such as "hey indigo" or "hey gennie.” The client device 102 may provide the speech recognition system 112 with data representing the user specified hotword. For instance, the client device 102 may send a text representation of the user specified hotword with the audio signal. In some examples, the client device 102 may provide the speech recognition system 112 with data for the user specified hotword that the speech recognition system 112 associates with an identifier for the client device 102, e.g., with a user account for the client device 102.
  • the client device 102 may have different key phrases for different physical geographic locations. For instance, the client device 102 may have a first key phrase for a user's home and a second, different key phrase for the user's office. The client device 102 may use one or more location devices 110 to determine a current physical geographic location for the client device 102 and select a corresponding key phrase. The client device 102 may send data to the speech recognition system 112 with the audio signal that identifies the key phrase based on the physical geographic location of the client device 102.
  • the location devices 110 may include one or more of a global positioning system, a wireless device that detects a wireless signature, e.g., of a wireless hotspot or another device that broadcasts a signature, or a cellular antenna that detects information of cellular base stations.
  • the client device 102 may send data to the speech recognition system 112 that indicates the physical geographic location of the client device 102.
  • the client hotword detection module 106 may be configured for multiple, e.g., five, different key phrases each of which begin with the same n-gram prefix, e.g., "ok,” and each of which is for use in a different physical geographic location.
  • the client device 102 may have a key phrase of "ok google" in a first location and "ok indigo" in a second location that is a different location from the first location.
  • the client hotword detection module 106 may determine that an audio signal includes the n-gram prefix without determining which of the multiple different key phrases may be encoded in the audio signal.
  • the client device 102, upon a determination by the client hotword detection module 106 that utterances in the audio signal satisfy the first threshold 108 of being a key phrase, may send the audio signal and location data for the client device 102 to the speech recognition system 112.
  • the speech recognition system 112 receives the audio signal and the location data and uses the location data to determine a key phrase from the multiple different key phrases to use for analysis.
  • the server hotword detection module 114 uses the determined key phrase to analyze the audio signal and determines whether the audio signal satisfies the second threshold 116 of being the determined key phrase.
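  • A minimal sketch of selecting a key phrase by location is shown below; the location labels and phrases are illustrative, echoing the "ok google"/"ok indigo" example above, and the lookup structure is an assumption rather than part of the disclosure.

```python
# Hypothetical mapping from a coarse location label to the key phrase used there.
KEY_PHRASES_BY_LOCATION = {
    "home": "ok google",
    "office": "ok indigo",
}

def key_phrase_for_location(location: str, default: str = "ok google") -> str:
    """Pick the key phrase the client (or server) should check for at this location."""
    return KEY_PHRASES_BY_LOCATION.get(location, default)

print(key_phrase_for_location("office"))   # ok indigo
print(key_phrase_for_location("car"))      # ok google (fallback to the default)
```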
  • the client device 102 is asleep, e.g., in a low power mode, when the client device 102 captures the audio signal, e.g., using the microphone 104.
  • the client device 102 may not have full functionality. For instance, some features of the client device 102 may be disabled to reduce battery usage.
  • the client device 102 may begin to wake up upon determining that the first utterances satisfy the first threshold 108 of being a key phrase. For example, the client device 102 may enable one or more network connectivity devices, one or more of the location devices 110, or both, to allow the client device 102 to communicate with the speech recognition system 112.
  • the client device 102 When the client device 102 receives the tagged text data from the speech recognition system 112, the client device 102 exits the sleep mode. For instance, the client device 102 enables more functionality of the client device 102 to determine an action to perform using the tagged text, to perform an action determined using the tagged text, or both.
  • the speech recognition system 112 is an example of a system
  • the client device 102 may include a personal computer, a mobile communication device, or another device that can send and receive data over a network 124.
  • the network 124, such as a local area network (LAN), wide area network (WAN), the Internet, or a combination thereof, connects the client device 102 and the speech recognition system 112.
  • the speech recognition system 112 may use a single server computer or multiple server computers operating in conjunction with one another, including, for example, a set of remote computers deployed as a cloud computing service.
  • FIG. 2 is a flow diagram of a process 200 for determining whether to perform an action.
  • the process 200 can be used by the client device 102 from the environment 100.
  • a client device receives an audio signal encoding one or more utterances including a first utterance (202).
  • the client device may use any appropriate type of device to capture the audio signal.
  • the client device may receive the audio signal from another device, e.g., a smart watch.
  • the client device determines whether at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase (204).
  • the client device may include data for one or more key phrases.
  • the client device may determine whether at least the portion of the first utterance has at least a predetermined likelihood, defined by the first threshold, of being a portion of one of the key phrases.
  • the portion of the first utterance may include one or more n-grams from the first utterance or another appropriate type of segment from the first utterance.
  • the portion may include a single word from two or more first utterances.
  • the client device may determine whether multiple first utterances, e.g., one or more first utterances, satisfy the first threshold of being one of the key phrases.
  • In response to determining that at least a portion of the first utterance satisfies the first threshold of being at least a portion of a key phrase, the client device sends the audio signal to a server system that determines whether the first utterance satisfies a second threshold of being the key phrase (206).
  • the second threshold is more restrictive than the first threshold.
  • the client device may send the audio signal, or a portion of the audio signal, to the server, e.g., a speech recognition system, to cause the server to determine whether the first utterance satisfies the second threshold of being the key phrase.
  • the server always analyzes all of the first utterances to determine whether the first utterances satisfy the second threshold of being the entire key phrase.
  • the portion of the audio signal the client device sends to the server system may include the first utterances that satisfy the first threshold and one or more other utterances. For instance, the client device may continue to receive the audio signal while analyzing the first utterances such that the additional portion of the received audio signal includes the one or more other utterances. The client device may send the portion of the audio signal that includes the first utterances and the other utterances to the server.
  • the client device determines whether response data, received from the server system, includes tagged text data representing the one or more utterances encoded in the audio signal (208). For example, the client device may receive the response data from the server in response to sending the audio signal to the server. The client device may analyze the response data to determine whether the response data includes tagged text data.
  • In response to determining that the response data includes tagged text data representing the one or more utterances encoded in the audio signal, the client device performs an action using the tagged text data (210). For instance, the client device uses the tags in the data to determine the action to perform.
  • the tags may indicate which portion of the tagged data, and the respective portion of the audio signal, correspond to the first utterances for the key phrase.
  • the tags may indicate which portion of the tagged data corresponds to an action for the client device to perform, e.g., "play some music."
  • the client device determines to not perform an action using data from the audio signal (212). For instance, when none of the first utterances satisfies the first threshold of being the key phrase, the client device does not perform any action using the audio signal. In some examples, when the client device receives a message from the server that indicates that the audio signal did not encode the key phrase, e.g., the response data does not include tagged text data, the client device does not perform any action using the audio signal.
  • In response to determining that at least a portion of the first utterance does not satisfy the first threshold of being at least a portion of a key phrase, or in response to determining that the response data does not include tagged text data, the client device discards the audio signal (214). For instance, when none of the first utterances satisfies the first threshold of being the key phrase, the client device may discard the audio signal. In some examples, when the client device receives a message from the server that indicates that the audio signal did not encode the key phrase, e.g., the response data does not include tagged text data, the client device may discard the audio signal. In some implementations, the client device may discard the audio signal after a predetermined period of time when one of these conditions occurs.
  • the order of steps in the process 200 described above is illustrative only, and determining whether to perform an action can be performed in different orders.
  • the client device may discard the audio signal and then not perform an action using data from the audio signal or may perform these two steps concurrently.
  • the process 200 can include additional steps, fewer steps, or some of the steps can be divided into multiple steps.
  • the client device may either discard the audio signal or not perform an action using data from the audio signal, instead of performing both steps.
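  • Putting the steps of process 200 together, a hypothetical client-side routine might look like the following, where the scoring value and the server call are stubs rather than parts of the described system:

```python
from typing import Callable, Optional

FIRST_THRESHOLD = 0.5

def process_200(audio: bytes,
                first_utterance_score: float,
                send_to_server: Callable[[bytes], Optional[dict]]) -> Optional[str]:
    """Hypothetical walk through steps 202-214: receive audio, apply the first
    threshold, send the audio to the server, and act only on tagged text data."""
    # (204) does the first utterance satisfy the first threshold?
    if first_utterance_score < FIRST_THRESHOLD:
        return None                                  # (212)/(214) no action; discard

    # (206) send the audio signal to the server system
    response = send_to_server(audio)

    # (208) does the response include tagged text data?
    if not response or "tagged_text" not in response:
        return None                                  # (212)/(214) no action; discard

    # (210) perform an action using the tagged text data
    return response["tagged_text"]

fake_server = lambda audio: {
    "tagged_text": "<hotword biasing> ok google </hotword biasing> play some music"}
print(process_200(b"...", 0.8, fake_server))
```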
  • FIG. 3 is a flow diagram of a process 300 for generating tagged text data for an audio signal.
  • the process 300 can be used by the speech recognition system 112 from the environment 100.
  • a speech recognition system receives, from a client device, an audio signal encoding one or more utterances including one or more first utterances for which the client device determined that at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase (302).
  • the speech recognition system may receive the audio signal from the client device across a network.
  • the client device may have sent the audio signal to the speech recognition system as part of a process that includes performing steps 202 through 206 described above with reference to FIG. 2.
  • the speech recognition system customizes a language model for the key phrase (304). For instance, the speech recognition system may increase a likelihood that the language model, which is not specific to any particular key phrase, will accurately identify an occurrence of the key phrase encoded in the audio signal. In some examples, the speech recognition system may adjust weights for the language model specific to the key phrase.
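  • One hypothetical form of this customization is to multiply the weights of the key phrase's n-grams by a boost factor before recognition; the weight representation and boost value below are invented for illustration only.

```python
BOOST = 3.0  # invented boost factor

def customize(lm_weights: dict, key_phrase: str, boost: float = BOOST) -> dict:
    """Return a copy of the language-model weights with the key phrase's bigrams
    boosted, making the recognizer more likely to output the key phrase."""
    words = key_phrase.split()
    biased = dict(lm_weights)
    for left, right in zip(words, words[1:]):
        biased[(left, right)] = biased.get((left, right), 1.0) * boost
    return biased

base_weights = {("play", "some"): 2.0}
print(customize(base_weights, "ok google"))
# {('play', 'some'): 2.0, ('ok', 'google'): 3.0}
```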
  • the speech recognition system may determine whether to use a default key phrase. For instance, the speech recognition system may determine whether a message received from the client device that includes the audio signal also includes data identifying a key phrase, e.g., text for the key phrase or an identifier that can be used to look up a key phrase in a database. The speech recognition system may determine to use a default key phrase when the message does not include data identifying a key phrase. For example, the speech recognition system may determine that the client device, or a corresponding user account, does not have a customized key phrase and to use a default key phrase.
  • the speech recognition system determines whether the one or more first utterances satisfy the second threshold of being a key phrase based on output from the language model, an acoustic model, or both (306). For instance, the speech recognition system provides the audio signal to the language model, the acoustic model, or both. The speech recognition system receives a score from the language model, the acoustic model, or both, that each indicate a likelihood that the one or more first utterances are the key phrase. The speech recognition system may combine the separate scores from the language model and the acoustic model to determine whether the combined score for the audio signal satisfies the second threshold of being the key phrase.
  • the speech recognition system analyzes the entire audio signal to determine data for each of the one or more utterances (308). For example, an acoustic model generates output indicating a text string for the words likely encoded in the audio signal.
  • a tagged text generator may apply tags to the text string that indicate one or more attributes of n-grams, e.g., words, included in the text string. For instance, the tagged text generator may apply tags that identify a key phrase, an action word, e.g., "play," an application, e.g., music player, or a combination of two or more of these, to the text string.
  • the speech recognition system sends, to the client device, tagged text data representing the one or more utterances encoded in the audio signal generated using the data for each of the one or more utterances (310).
  • the speech recognition system may send the tagged text data to the client device to cause the client device to perform an action using the tagged text data.
  • In response to determining that the first utterance does not satisfy the second threshold of being a key phrase based on output from the language model, the acoustic model, or both, the speech recognition system sends, to the client device, data indicating that the key phrase is not likely encoded in the audio signal (312). For instance, the speech recognition system may provide the client device with a message that indicates that the client device should not perform any action using data for the audio signal.
  • the process 300 can include additional steps, fewer steps, or some of the steps can be divided into multiple steps.
  • the speech recognition system might not customize the language model.
  • the speech recognition system may determine whether the first utterance satisfies the second threshold of being a key phrase using data or systems other than the language model, the acoustic model, or both.
  • Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.
  • the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • data processing apparatus refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can also be or further include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a computer program which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • Computers suitable for the execution of a computer program include, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit.
  • a central processing unit will receive instructions and data from a read-only memory or a random access memory or both.
  • the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a smart phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • a computer having a display device, e.g., LCD (liquid crystal display), OLED (organic light emitting diode) or other monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on the user's device in response to requests received from the web browser.
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network.
  • Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a server transmits data, e.g., a Hypertext Markup Language (HTML) page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the user device, which acts as a client.
  • Data generated at the user device, e.g., a result of the user interaction, can be received from the user device at the server.
  • FIG. 4 is a block diagram of computing devices 400, 450 that may be used to implement the systems and methods described in this document, as either a client or as a server or plurality of servers.
  • Computing device 400 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
  • Computing device 450 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, smartwatches, head-worn devices, and other similar computing devices.
  • the components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations described and/or claimed in this document.
  • Computing device 400 includes a processor 402, memory 404, a storage device 406, a high-speed interface 408 connecting to memory 404 and high-speed expansion ports 410, and a low speed interface 412 connecting to low speed bus 414 and storage device 406.
  • Each of the components 402, 404, 406, 408, 410, and 412, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate.
  • the processor 402 can process instructions for execution within the computing device 400, including instructions stored in the memory 404 or on the storage device 406 to display graphical information for a GUI on an external input/output device, such as display 416 coupled to high speed interface 408.
  • multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory.
  • multiple computing devices 400 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
  • the memory 404 stores information within the computing device 400.
  • the memory 404 is a computer-readable medium.
  • the memory 404 is a volatile memory unit or units.
  • the memory 404 is a non-volatile memory unit or units.
  • the storage device 406 is capable of providing mass storage for the computing device 400.
  • the storage device 406 is a computer- readable medium.
  • the storage device 406 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
  • a computer program product is tangibly embodied in an information carrier.
  • the computer program product contains instructions that, when executed, perform one or more methods, such as those described above.
  • the information carrier is a computer- or machine- readable medium, such as the memory 404, the storage device 406, or memory on processor 402.
  • the high speed controller 408 manages bandwidth-intensive operations for the computing device 400, while the low speed controller 412 manages lower bandwidth-intensive operations.
  • Such allocation of duties is exemplary only.
  • the high-speed controller 408 is coupled to memory 404, display 416 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 410, which may accept various expansion cards (not shown).
  • low-speed controller 412 is coupled to storage device 406 and low-speed expansion port 414.
  • the low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
  • the computing device 400 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 420, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 424. In addition, it may be implemented in a personal computer such as a laptop computer 422. Alternatively, components from computing device 400 may be combined with other components in a mobile device (not shown), such as device 450. Each of such devices may contain one or more of computing device 400, 450, and an entire system may be made up of multiple computing devices 400, 450 communicating with each other.
  • Computing device 450 includes a processor 452, memory 464, an input/output device such as a display 454, a communication interface 466, and a transceiver 468, among other components.
  • the device 450 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage.
  • Each of the components 450, 452, 464, 454, 466, and 468, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
  • the processor 452 can process instructions for execution within the computing device 450, including instructions stored in the memory 464.
  • the processor may also include separate analog and digital processors.
  • the processor may provide, for example, for coordination of the other components of the device 450, such as control of user interfaces, applications run by device 450, and wireless communication by device 450.
  • Processor 452 may communicate with a user through control interface 458 and display interface 456 coupled to a display 454.
  • the display 454 may be, for example, a TFT LCD display or an OLED display, or other appropriate display technology.
  • the display interface 456 may comprise appropriate circuitry for driving the display 454 to present graphical and other information to a user.
  • the control interface 458 may receive commands from a user and convert them for submission to the processor 452.
  • an external interface 462 may be provided in communication with processor 452, so as to enable near area communication of device 450 with other devices.
  • External interface 462 may provide, for example, for wired communication (e.g., via a docking procedure) or for wireless communication (e.g., via Bluetooth or other such technologies).
  • the memory 464 stores information within the computing device 450.
  • the memory 464 is a computer-readable medium.
  • the memory 464 is a volatile memory unit or units.
  • the memory 464 is a non-volatile memory unit or units.
  • Expansion memory 474 may also be provided and connected to device 450 through expansion interface 472, which may include, for example, a SIMM card interface. Such expansion memory 474 may provide extra storage space for device 450, or may also store applications or other information for device 450.
  • expansion memory 474 may include instructions to carry out or supplement the processes described above, and may include secure information also.
  • expansion memory 474 may be provided as a security module for device 450, and may be programmed with instructions that permit secure use of device 450.
  • secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
  • the memory may include for example, flash memory and/or MRAM memory, as discussed below.
  • a computer program product is tangibly embodied in an information carrier.
  • the computer program product contains instructions that, when executed, perform one or more methods, such as those described above.
  • the information carrier is a computer- or machine-readable medium, such as the memory 464, expansion memory 474, or memory on processor 452.
  • Device 450 may communicate wirelessly through communication interface 466.
  • Communication interface 466 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 468. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, GPS receiver module 470 may provide additional wireless data to device 450, which may be used as appropriate by applications running on device 450.
  • Device 450 may also communicate audibly using audio codec 460, which may receive spoken information from a user and convert it to usable digital information. Audio codec 460 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 450. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 450.
  • the computing device 450 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 480. It may also be implemented as part of a smartphone 482, personal digital assistant, or other similar mobile device.
  • Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof.
  • These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • the systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • Example 1 A non-transitory computer storage medium encoded with instructions that, when executed by a computer, cause the computer to perform operations comprising:
  • Example 2 The computer storage medium of example 1, the operations comprising performing an action using the tagged text data subsequent to receiving, from the server system, the tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.
  • Example 3 The computer storage medium of example 1 or 2, wherein:
  • the one or more utterances comprises two or more utterances, the first utterance encoded prior to the other utterances from the two or more utterances in the audio signal;
  • performing the action using the tagged text data comprises performing an action using the tagged text data for the one or more utterances encoded in the audio signal after the first utterance.
  • Example 4 The computer storage medium of one of examples 1 to 3, wherein determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase comprises determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase that includes two or more words.
  • Example 5 The computer storage medium of one of examples 1 to 4, the operations comprising:
  • Example 6 The computer storage medium of example 5, the operations comprising determining to not perform an action using data from the second audio signal in response to determining that at least the portion of the second utterance does not satisfy the first threshold of being at least a portion of a key phrase.
  • Example 7 The computer storage medium of one of examples 1 to 6, wherein determining whether at least a portion of the first utterance satisfies the first threshold of being a key phrase comprises determining whether at least a portion of the first utterance satisfies a first likelihood of being at least a portion of a key phrase.
  • Example 8 A system comprising one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  • an audio signal encoding one or more utterances including one or more first utterances for which the client device determined that at least a portion of the one or more first utterances satisfies a first threshold of being at least a portion of a key phrase;
  • Example 9 The system of example 8, wherein sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, data indicating that the key phrase is not likely included in the audio signal in response to determining that the one or more first utterances do not satisfy the second threshold of being the key phrase.
  • Example 10 The system of example 8 or 9, wherein sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase.
  • Example 11 The system of one of examples 8 to 10, wherein sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, tagged text data representing the one or more utterances encoded in the audio signal.
  • Example 12 The system of one of examples 8 to 11, the operations comprising analyzing the entire audio signal to determine first data for each of the one or more utterances, wherein sending, to the client device, the data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, the first data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase.
  • Example 13 The system of one of examples 8 to 12, wherein determining whether the one or more first utterances satisfy the second threshold of being the key phrase comprises determining, using a language model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
  • Example 14 The system of one of examples 8 to 13, the operations comprising customizing the language model for the key phrase prior to determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
  • Example 15 The system of one of examples 8 to 14, the operations comprising receiving text identifying the key phrase, wherein customizing the language model for the key phrase comprises customizing the language model for the key phrase using the text identifying the key phrase.
  • Example 16 The system of one of examples 8 to 15, the operations comprising:
  • customizing the language model for the key phrase comprises customizing the language model for the key phrase using the key phrase data.
  • Example 17 The system of one of examples 8 to 16, wherein determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase comprises determining, using the language model and an acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
  • Example 18 The system of one of examples 8 to 17, wherein determining, using the language model and the acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase comprises:
  • Example 19 The system of one of examples 8 to 18, the operations comprising selecting the language model for a default key phrase.
  • Example 20 The system of one of examples 8 to 19, the operations comprising determining whether to use the default key phrase.
  • Example 21 A computer-implemented method comprising:

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for detecting hotwords using a server. One of the methods includes receiving an audio signal encoding one or more utterances including a first utterance; determining whether at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase; in response to determining that at least the portion of the first utterance satisfies the first threshold of being at least a portion of a key phrase, sending the audio signal to a server system that determines whether the first utterance satisfies a second threshold of being the key phrase, the second threshold being more restrictive than the first threshold; and receiving tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.

Description

SERVER SIDE HOTWORDING
BACKGROUND
[0001] Automatic speech recognition is one technology that is used in mobile devices among other types of devices. One task that is a common goal for this technology is to be able to use voice commands to wake up a device and have basic spoken interactions with the device. For example, it may be desirable for the device to recognize a "hotword" that signals that the device should activate when the device is in a sleep state.
SUMMARY
[0002] A system may use two thresholds to determine whether a user spoke a key phrase. A client device, included in the system, uses the first, lower threshold to determine whether a portion of words spoken by the user are the same as a portion of the key phrase. For instance, when the key phrase is "okay google," the client device may use the first, lower threshold to determine whether the user spoke "okay" or "okay g" or "okay google." When the client device determines that the portion of the words spoken by the user are the same as a portion of the key phrase, the client device sends data for the words to a server. The server uses a second, higher threshold to determine whether the words spoken by the user are the same as the key phrase. The server analyzes the entire key phrase to determine whether the user spoke the key phrase. When the server determines that the key phrase is included in the words, the server may parse other words spoken by the user to generate data for an action that the client device should perform.
[0003] In some implementations, a client device may receive an audio signal that encodes one or more utterances. An utterance is the vocalization (a user speaking) of a word or words that represent a single meaning to the computer. An utterance can be a single word, a few words, a sentence, or even multiple sentences. An utterance may thereby comprise an n-gram being a contiguous sequence of n items (n being equal or greater than 1) from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs, to name a few examples. The client device uses a first threshold to determine whether one or more first utterances encoded at the beginning of the audio signal satisfy a first threshold of being a key phrase. The client device may analyze a portion of an utterance, a single utterance from the one or more first utterances when the key phrase includes multiple words, or both. In practice an utterance corresponding to a key phrase, for which it is determined whether it satisfies the first threshold, will usually consist of speech items such as at least a plurality of phonemes, at least a plurality of syllables, or one or more words in order to make the key phrase in some sense unique and distinguishable from accidentally and generally frequently spoken utterances like single letters, single phonemes, etcetera.
[0004] When the client device determines that the one or more first utterances satisfy the first threshold of being a key phrase, the client device sends the audio signal to a speech recognition system, e.g., included on a server separate from the client device, for additional analysis. The speech recognition system receives the audio signal. The speech recognition system analyzes the one or more first utterances to determine whether the one or more first utterances satisfy a second threshold of being the key phrase. The second threshold is more restrictive than the first threshold, e.g., the first threshold is less accurate or lower than the second threshold. For instance, when the first threshold and the second threshold are both likelihoods, the first threshold, e.g., fifty percent, is a lower likelihood than the second threshold, e.g., seventy-five or ninety percent. For the one or more first utterances to satisfy the first threshold or the second threshold of being a key phrase, a corresponding system may determine that the likelihood of the one or more first utterances being the key phrase is greater than, or greater than or equal to, the respective threshold.
[0005] The speech recognition system receives, from the client device, data for the entire audio signal including the one or more first utterances so that the speech recognition system can analyze all of the data included in the audio signal. The entire audio signal may include multiple n-grams uttered by the user after the one or more first utterances, at least as long as the n-grams fall within a certain time window or by some other metric are within a maximum distance apart from the one or more first utterances. For example, to reduce the possibility of the speech recognition system receiving data for an audio signal that includes a partial utterance at the beginning of the audio signal, to improve the speech recognition analysis by the speech recognition system, or both, the speech recognition system receives the entire audio signal from the client device when the client device determines that at least a portion of the one or more first utterances satisfy the first threshold of being the key phrase. This may allow the server to improve automated speech analysis of the audio signal because of the greater amount of resources available to the server compared to the client device, a larger number of speech items at the server than at the client, e.g., more robust analysis models at the server, or both, thereby improving recognition.
[0006] In some implementations, the client device may analyze a prefix or a portion of one of the first utterances. For instance, when the key phrase is "Okay Google", the client device may determine that the one or more first utterances encode "Okay G" or "Okay", without analyzing all of the second utterance and, in response, send the audio signal to the speech recognition system. In other words, the one or more first utterances for which it is determined whether they meet the first threshold may consist of only a portion of the key phrase, e.g., the beginning portion of the key phrase. When the first threshold is met, "other utterances" or "second utterances", which comprise speech items after the first utterances, e.g., within a threshold distance of the first utterances, may be sent to the server together with the first utterances to be analyzed by the server as to whether the combination of the first and second utterances together meet the second threshold as to whether there exists a match with the key phrase.
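As a minimal, non-limiting sketch of the prefix check described above (the text normalization and the function name are assumptions for illustration only):

    def is_key_phrase_prefix(recognized_text, key_phrase="okay google"):
        # True when the recognized first utterances form at least a beginning
        # portion of the key phrase, e.g., "okay" or "okay g".
        recognized = recognized_text.strip().lower()
        return bool(recognized) and key_phrase.startswith(recognized)

    assert is_key_phrase_prefix("Okay G")
    assert is_key_phrase_prefix("okay google")
    assert not is_key_phrase_prefix("play some music")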
[0007] In some implementations, the client device may send, with the data for the audio signal and to the speech recognition system, data for the key phrase. The data for the key phrase may be text representing the key phrase, or an identifier, e.g., for the client device, which the speech recognition system may use to determine the key phrase. The speech recognition system may use the data for the key phrase to determine whether the one or more first utterances included in the audio signal satisfy the second threshold of being the key phrase.
[0008] In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving an audio signal encoding one or more utterances including a first utterance; determining whether at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase; in response to determining that at least the portion of the first utterance satisfies the first threshold of being at least a portion of a key phrase, sending the audio signal to a server system that determines whether the first utterance satisfies a second threshold of being the key phrase, the second threshold being more restrictive than the first threshold; and receiving, from the server system, tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
[0009] In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving, from a client device, an audio signal encoding one or more utterances including one or more first utterances for which the client device determined that at least a portion of the one or more first utterances satisfies a first threshold of being at least a portion of a key phrase;
determining whether the one or more first utterances satisfy a second threshold of being at least a portion of the key phrase, the second threshold more restrictive than the first threshold; and sending, to the client device, a result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
[0010] The foregoing and other embodiments can each optionally include one or more of the following features, alone or in combination. The method may include performing an action using the tagged text data subsequent to receiving, from the server system, the tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold. The one or more utterances may include two or more utterances, the first utterance encoded prior to the other utterances from the two or more utterances in the audio signal. Performing the action using the tagged text data may include performing an action using the tagged text data for the one or more utterances encoded in the audio signal after the first utterance. Determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase may include determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase that includes two or more words.
[0011] In some implementations, the method may include receiving a second audio signal encoding one or more second utterances including a second utterance;
determining whether at least a portion of the second utterance satisfies the first threshold of being at least a portion of a key phrase; and in response to determining that at least the portion of the second utterance does not satisfy the first threshold of being at least a portion of a key phrase, discarding the second audio signal. The method may include determining to not perform an action using data from the second audio signal in response to determining that at least the portion of the second utterance does not satisfy the first threshold of being at least a portion of a key phrase. Determining whether at least a portion of the first utterance satisfies the first threshold of being a key phrase may include determining whether at least a portion of the first utterance satisfies a first likelihood of being at least a portion of a key phrase.
[0012] In some implementations, sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase may include sending, to the client device, data indicating that the key phrase is not likely included in the audio signal in response to determining that the one or more first utterances do not satisfy the second threshold of being the key phrase.
Sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase may include sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase. Sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase may include sending, to the client device, tagged text data representing the one or more utterances encoded in the audio signal. The method may include analyzing the entire audio signal to determine first data for each of the one or more utterances. Sending, to the client device, the data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase may include sending, to the client device, the first data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase. [0013] In some implementations, determining whether the one or more first utterances satisfy the second threshold of being the key phrase may include determining, using a language model, whether the one or more first utterances satisfy the second threshold of being the key phrase. A language model may define a probability distribution over a sequence of speech items, e.g., phonemes, syllables, or words, that indicates a likelihood that the sequence of speech items occurs in the order specified by the sequence. In automated speech recognition, one or more computers match sounds with speech item sequences defined by the language model and determine corresponding probabilities for the speech item sequences. Using the probabilities in the language model, the one or more computers can distinguish between speech items, e.g., words and phrases, that sound similar given the context of the individual speech items in the sequence, e.g., given the order in which the speech items occur in the sequence. The method may include customizing the language model for the key phrase prior to determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase. The method may include receiving text identifying the key phrase. Customizing the language model for the key phrase may include customizing the language model for the key phrase using the text identifying the key phrase. The method may include receiving an identifier; and determining, using the identifier, key phrase data for the key phrase. Customizing the language model for the key phrase may include customizing the language model for the key phrase using the key phrase data. Determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase may include determining, using the language model and an acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase. An acoustic model may define a mapping between a phoneme, or another speech item, and a vocalization of the phoneme, or the corresponding other speech item. A computer may use an acoustic model in an automatic speech recognition process to determine the relationship between a vocalization of a speech item encoded in an audio signal and the corresponding speech item.
Determining, using the language model and the acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase may include providing data for the one or more first utterances to the language model to cause the language model to generate a first output; providing data for the one or more first utterances to the acoustic model to cause the acoustic model to generate a second output; combining the first output and the second output to generate a combined output; and determining, using the combined output, whether the one or more first utterances satisfy the second threshold of being the key phrase. The method may include selecting the language model for a default key phrase. The method may include determining whether to use the default key phrase.
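Purely as a toy illustration of the probability distribution that a language model defines over a sequence of speech items, consider the following sketch; the bigram probabilities are invented for the example and are not taken from this disclosure.

    # Hypothetical bigram probabilities P(next word | previous word).
    BIGRAM_PROBS = {
        ("<S>", "ok"): 0.2,
        ("ok", "google"): 0.6,
        ("google", "play"): 0.3,
        ("play", "some"): 0.4,
        ("some", "music"): 0.5,
    }

    def sequence_probability(words):
        # Multiplies bigram probabilities along the sequence, starting from the
        # sentence-start marker <S>; unseen bigrams get a small floor value.
        probability = 1.0
        previous = "<S>"
        for word in words:
            probability *= BIGRAM_PROBS.get((previous, word), 1e-6)
            previous = word
        return probability

    # A sequence in the expected order receives a higher probability than the
    # same words in an unlikely order.
    assert sequence_probability("ok google".split()) > sequence_probability("google ok".split())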
[0014] The subject matter described in this specification can be implemented in particular embodiments and may result in one or more of the following advantages. In some implementations, the systems and methods described in this document may reduce resources used by a client device during hotword analysis with a first, lower threshold, improve an accuracy of hotword analysis by using a second, more restrictive threshold at a speech recognition system, or both. In some implementations, the systems and methods described below may more accurately parse, segment, or both, text in an audio signal, e.g., may more accurately identify a key phrase encoded in the audio signal separate from other utterances encoded in the audio signal, by sending an entire audio signal, that includes the key phrase, to a speech recognition system for analysis. In some
implementations, the systems and methods described below may reduce client processing time, send an audio signal to a speech recognition system more quickly, or both, compared to other systems, when the client uses a lower hotword analysis threshold than a more restrictive hotword analysis threshold used by the speech recognition system. In some implementations, the systems and methods described below may reduce bandwidth usage when the client device sends fewer audio signals to a server system for analysis when an utterance does not satisfy a first, lower threshold.
[0015] The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] FIG. 1 is an example of an environment in which a client device analyzes an audio signal using a first threshold and a speech recognition system analyzes the audio signal using a second threshold that is more restrictive than the first threshold.
[0017] FIG. 2 is a flow diagram of a process for determining whether to perform an action.
[0018] FIG. 3 is a flow diagram of a process for generating tagged text data for an audio signal. [0019] FIG. 4 is a block diagram of a computing system that can be used in connection with computer-implemented methods described in this document.
[0020] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[0021] FIG. 1 is an example of an environment 100 in which a client device 102 analyzes an audio signal using a first threshold and a speech recognition system 112 analyzes the audio signal using a second threshold that is more restrictive than the first threshold. The client device 102 uses the first threshold to determine whether the audio signal encodes at least a portion of a key phrase. When the client device 102 determines that the audio signal satisfies the first threshold of being the key phrase, the client device 102 sends the audio signal to the speech recognition system 112 that uses the second threshold to determine whether the audio signal encodes the entire key phrase. For that purpose, the client device 102 may send not only the audio signal representing a portion of the key phrase, which has been recognized, but the entire audio signal, or at least parts of the audio signal lying within a certain range after the part representing the recognized portion of the key phrase, to the speech recognition system 112.
[0022] If the speech recognition system 112 determines that the audio signal encodes the entire key phrase, the speech recognition system 112 may provide the client device 102 with tagged text data of the speech recognized utterances encoded in the audio signal to allow the client device 102 to perform an action based on the audio signal. Thereby the tagged text data may comprise the speech recognized utterances and "tags", which may represent actions to be performed or which otherwise identify a category of text within the recognized utterances, such that the client device 102 can identify the tag and the speech recognized utterances that correspond to the tag. The client device 102 may use the tagged text data to determine an action to perform, e.g., instructions to execute, to determine which portions of the speech recognized utterances to analyze when determining whether to perform an action, or both.
[0023] The client device 102 includes a microphone 104 that captures the audio signal. For instance, the client device 102 may be in a lower powered state, e.g., standby, while the microphone 104 captures at least part of the audio signal. The at least part of the audio signal may be the entire audio signal, one or more first utterances included in the audio signal, or a different part of the beginning of the audio signal. One example of utterances encoded in an audio signal is "ok google play some music." In this example, the first utterances may be "ok" or "ok google."
[0024] The microphone 104 provides the audio signal, or some of the audio signal as the audio signal is captured, to a client hotword detection module 106. For example, the microphone 104, or a combination of components in the client device 102, may provide portions of the audio signal to the client hotword detection module 106 as the audio signal is captured by the microphone 104.
[0025] The client hotword detection module 106, at time TA, determines whether the audio signal satisfies a first threshold 108. For instance, the client hotword detection module 106 may analyze at least a portion of the one or more first utterances, included at the beginning of the audio signal, to determine whether the portion of the one or more first utterances satisfy the first threshold 108 of being a key phrase. The portion of the first utterances may be "ok" or "ok google." One example of a key phrase may be "ok google." In some examples, the client hotword detection module 106 is configured to detect occurrence of only one key phrase. In some implementations, the client hotword detection module is configured to detect occurrence of any of multiple different key phrases, e.g., ten key phrases. The multiple different key phrases include a limited number of different key phrases for which the client hotword detection module 106 is trained.
[0026] The client hotword detection module 106 may determine a likelihood that at least a portion of the first utterances are the same as at least a portion of the key phrase. For that purpose, the client hotword detection module 106 may apply any known automated speech recognition approach, which segments the at least a portion of the first utterances into phonemes or other linguistic units and uses an acoustic model and/or language model to obtain a likelihood whether the first utterances match a key phrase or a portion of a key phrase. The portion of the key phrase may be the beginning portion of the key phrase, e.g., that includes the speech items at the beginning of the key phrase. The client hotword detection module 106 may compare the likelihood with the first threshold 108. When the likelihood satisfies the first threshold 108, the client device 102 may send the audio signal to the speech recognition system 112, e.g., located on one or more servers. When the likelihood does not satisfy the first threshold 108, the client device 102 may take no further action based on the utterances included in the audio signal, e.g., and may discard the audio signal. [0027] The client hotword detection module 106 may determine that the key phrase is "ok google" and that the utterance "ok", as one of the first utterances in the audio signal, satisfies the first threshold 108 of being part of the key phrase. In some examples, the client hotword detection module 106 may determine that the utterance "ok google" from the audio signal satisfies the first threshold 108 of being part of the key phrase, e.g., the entire key phrase.
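A minimal sketch of the client-side decision described in the preceding paragraphs is given below; estimate_likelihood and send_to_server stand in for the client hotword detection model and the network call, and the threshold value is an assumption for illustration.

    FIRST_THRESHOLD = 0.5  # illustrative value only

    def handle_captured_audio(audio_signal, estimate_likelihood, send_to_server):
        # estimate_likelihood returns the likelihood that the beginning of the
        # audio signal encodes at least a portion of the key phrase.
        likelihood = estimate_likelihood(audio_signal)
        if likelihood >= FIRST_THRESHOLD:
            # Forward the entire audio signal for the more restrictive
            # server-side analysis.
            return send_to_server(audio_signal)
        # Otherwise take no further action and discard the audio signal.
        return None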
[0028] In some implementations, when the client hotword detection module 106 determines that one or a portion of one of the first utterances satisfies the first threshold 108 of being a portion of the key phrase, the client hotword detection module 106 may determine whether a total length of the first utterances matches a length for the key phrases. For instance, the client hotword detection module 106 may determine that a time during which the one or more first utterances were spoken matches an average time for the key phrase to be spoken. The average time may be for a user of the client device 102 or for multiple different people, e.g., including the user of the client device 102.
[0029] In some implementations, the client hotword detection module 106 may determine that the total length of the first utterances and a total number of n-grams, e.g., words, included in the first utterances matches a total length of the key phrase and a number of n-grams included in the key phrase, e.g., when only analyzing a portion of a first utterance or of the first utterances. For instance, the client hotword detection module 106 may determine a number of silences between the first utterances that indicates the number of first utterances. The client hotword detection module 106 may compare the number of first utterances, the spoken length of the first utterances, or both, with a total number of words in the key phrase, the spoken length of the key phrase, or both. When the client hotword detection module 106 determines that the total number of first utterances and the total number of words in the key phrase are the same, that the spoken length of the first utterances is within a threshold amount from the spoken length of the key phrase, or both, the client hotword detection module 106 may determine that the first utterances in the audio signal satisfy the first threshold 108 of being the key phrase, e.g., when at least a portion of the first utterances satisfy the first threshold 108 of being a portion of the key phrase.
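The length comparison described above might, purely for illustration, be sketched as follows; the word count, average duration, and tolerance values are assumptions and not taken from this disclosure.

    def plausibly_matches_key_phrase(num_first_utterances,
                                     spoken_length_seconds,
                                     key_phrase_word_count=2,
                                     key_phrase_length_seconds=0.8,
                                     tolerance_seconds=0.3):
        # Compares the number of first utterances and their spoken length with
        # the number of words in the key phrase and its average spoken length.
        same_word_count = num_first_utterances == key_phrase_word_count
        similar_length = abs(spoken_length_seconds - key_phrase_length_seconds) <= tolerance_seconds
        return same_word_count and similar_length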
[0030] The first utterances may satisfy the first threshold 108 of being a key phrase when the likelihood is greater than the first threshold 108. The first utterances may satisfy the first threshold 108 of being a key phrase when the likelihood is greater than or equal to the first threshold 108. In some examples, the first utterances do not satisfy the first threshold 108 of being a key phrase when the likelihood is less than the first threshold 108. The first utterances might not satisfy the first threshold 108 of being a key phrase when the likelihood is less than or equal to the first threshold 108.
[0031] In response to determining that at least a portion of the first utterances satisfy the first threshold 108 of being at least a portion of a key phrase, the client device 102, at time TB, sends the audio signal to the speech recognition system 112. The speech recognition system 112 receives the audio signal and uses a server hotword detection module 114 to determine, at time TC, whether the audio signal satisfies a second threshold 116 of being the key phrase. For instance, the speech recognition system 112 uses the server hotword detection module 114 to determine whether the audio signal satisfies the second threshold 116 of being a key phrase.
[0032] The second threshold 116 is more restrictive than the first threshold 108.
For example, the server hotword detection module 114, using the second threshold 116, is less likely to incorrectly determine that the first utterances represent the same text as a key phrase, e.g., are a false positive, compared to the client hotword detection module 106, using the first threshold 108. In some examples, when the thresholds are likelihoods, the first threshold 108 has a lower numerical value than the second threshold 116.
[0033] The server hotword detection module 114 may use a language model 118, an acoustic model 120, or both, to determine whether the one or more first utterances satisfy the second threshold 116 of being a key phrase. The language model 118, and the acoustic model 120, are each trained using a large amount of training data, e.g., compared to the client hotword detection module 106. For example, the language model 118, the acoustic model 120, or both, may be trained using 30,000 hours of training data. The client hotword detection module 106 may be trained using 100 hours of training data.
[0034] In some examples, the server hotword detection module 114 may create a hotword biasing model that includes the language model 118, the acoustic model 120, or both, on the fly, for use analyzing the audio signal. A hotword biasing model may be a combination of a language model, which defines a probability distribution over a sequence of speech items, and an acoustic model, which defines a mapping between speech items and corresponding vocalizations of the speech items, that is specific to a few key phrases or hotwords. The speech recognition system 112 may create a hotword biasing model for the client device 102 that is specific to the key phrase or the key phrases for which the client device 102 analyzed the one or more first utterances. [0035] For instance, the server hotword detection module 114 may receive data from the client device 102 that identifies a key phrase for which the server hotword detection module 114 will analyze the audio signal to determine whether the client device 102 should wake up, perform an action, or both. The data that identifies the key phrase may be text data for the key phrase, e.g., a text string, or an identifier for the client device 102, either of which may be included in the request to analyze the audio signal received from the client device 102. The server hotword detection module 114 may use the identifier for the client device 102 to access a database and determine the key phrase for the client device 102 and the audio signal. The server hotword detection module 114 may use the determined key phrase or key phrases for the client device 102 to create a hotword biasing model for the client device 102 using an existing language model 118, an existing acoustic model 120, or both, already stored in a memory of the speech recognition system 112.
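For illustration, the server-side step of determining the key phrase from the received data might be sketched as below; the in-memory mapping and the field names follow the hypothetical request format sketched earlier and are not part of the disclosure.

    # Hypothetical mapping from client identifiers to their configured key phrases.
    KEY_PHRASES_BY_CLIENT = {"client-device-102": "ok google"}

    def resolve_key_phrase(request):
        # Prefer key phrase text included in the request; otherwise look the
        # key phrase up in a store keyed by the client identifier.
        if "key_phrase_text" in request:
            return request["key_phrase_text"]
        return KEY_PHRASES_BY_CLIENT.get(request.get("client_id"))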
[0036] In some examples, the server hotword detection module 114 may use a pre-built hotword biasing model. For instance, the server hotword detection module 114 may analyze multiple audio signals from the client device 102 or from multiple different client devices, all of which are for the same key phrase, using the same hotword biasing model.
[0037] The hotword biasing model may identify one or more n-grams for which the hotword biasing model performs analysis. For instance, when the key phrase is "ok google," the hotword biasing model may generate scores for one or more of the n-grams "<S> ok google," "<S> ok," or "ok google," where <S> denotes silence at the beginning of a sentence.
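A simple sketch of deriving such biasing n-grams from a key phrase, using the <S> sentence-start marker from the example above (the function name and the exact set of n-grams produced are assumptions):

    def biasing_ngrams(key_phrase):
        # For "ok google" this yields "<S> ok google", "<S> ok", and "ok google".
        words = key_phrase.split()
        ngrams = ["<S> " + key_phrase]
        if len(words) > 1:
            ngrams.append("<S> " + words[0])
            ngrams.append(key_phrase)
        return ngrams

    assert biasing_ngrams("ok google") == ["<S> ok google", "<S> ok", "ok google"]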
[0038] One or both of the language model 118 or the acoustic model 120 may use the n-grams for the hotword biasing model to determine whether the audio signal includes the key phrase. For instance, the language model 118 may use one or more of the n-grams to generate a score that indicates a likelihood that the audio signal includes the key phrase. The language model 118 may use the n-grams or some of the n-grams to increase a likelihood that the key phrase is correctly identified in the audio signal when the one or more first utterances are the same as the key phrase. For example, when the key phrase includes two or more words, the language model 118 may add the key phrase, e.g., "ok google," to the language model 118 to increase the likelihood that the key phrase is identified, e.g., compared to when the language model 118 does not already include the key phrase. [0039] The acoustic model 120 may use one or more of the n-grams to generate a score that indicates a likelihood that the audio signal includes the key phrase. For example, the acoustic model 120 may generate multiple scores for different phrases, including the key phrase, and select the score for the key phrase as output.
[0040] The server hotword detection module 114 may receive the two scores from the language model 118 and the acoustic model 120. The server hotword detection module 114 may combine the two scores to determine an overall score for the audio signal. The server hotword detection module 114 may compare the overall score with the second threshold 116. When the overall score satisfies the second threshold 116, the server hotword detection module 114 determines that the audio signal likely encodes the key phrase. When the overall score does not satisfy the second threshold 116, the server hotword detection module 114 determines that the audio signal likely does not encode the key phrase.
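By way of illustration only, the following Python sketch shows a combination of the two model scores into an overall score that is compared with the second threshold. The equal weighting, the 0.0-1.0 score range, and the threshold value are assumptions; the specification does not fix how the scores are combined.

```python
SECOND_THRESHOLD = 0.85  # hypothetical value; more restrictive than the client's threshold

def combine_scores(lm_score: float, am_score: float, lm_weight: float = 0.5) -> float:
    """Combine the language-model and acoustic-model scores into an overall score."""
    return lm_weight * lm_score + (1.0 - lm_weight) * am_score

def likely_encodes_key_phrase(lm_score: float, am_score: float) -> bool:
    """Return True when the overall score satisfies the second threshold."""
    return combine_scores(lm_score, am_score) >= SECOND_THRESHOLD

print(likely_encodes_key_phrase(lm_score=0.93, am_score=0.88))  # True
print(likely_encodes_key_phrase(lm_score=0.70, am_score=0.60))  # False
```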
[0041] In response to determining that the audio signal likely does not encode the key phrase, the speech recognition system 112 may send a message to the client device 102 indicating that the audio signal does not likely encode the key phrase. In some examples, the speech recognition system 112 might not send the client device 102 a message upon determining that the audio signal likely does not encode the key phrase.
[0042] When the server hotword detection module 114 determines that the audio signal likely encodes the key phrase, a tagged text generator 122 generates tagged text for the audio signal. The tagged text generator 122 may receive data from the language model 118, the acoustic model 120, or both, that indicates the n-grams encoded in the audio signal. For instance, the tagged text generator 122 may receive data from the acoustic model 120 that indicates scores for n-grams that are likely encoded in the audio signal, data representing the n-grams that are encoded in the audio signal, or other appropriate data.
[0043] The tagged text generator 122 uses the data from the language model 118, the acoustic model 120, or both, to generate tags for the n-grams encoded in the audio signal. For example, when the audio signal encodes "ok google play some music," the tagged text generator 122 may generate data representing the string "<hotword biasing> ok google </hotword biasing> play some music". The tag "<hotword biasing>" identifies the string "ok google" as a hotword. The tag "</hotword biasing>" identifies the end of the hotword and indicates that the following string likely includes an instruction for the client device 102, a) which has been recognized by an automated speech recognition process and b) which the client device 102 should analyze to determine whether the client device 102 can execute a corresponding instruction.
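By way of illustration only, the Python sketch below wraps a recognized key phrase at the start of a transcript in the "<hotword biasing>" tags from the example above. The function and its simple string handling are hypothetical; a production tagged text generator would operate on richer recognition output.

```python
def tag_transcript(transcript: str, key_phrase: str) -> str:
    """Tag a key phrase that appears at the start of a recognized transcript."""
    if transcript.lower().startswith(key_phrase.lower()):
        remainder = transcript[len(key_phrase):].strip()
        tagged = f"<hotword biasing> {transcript[:len(key_phrase)]} </hotword biasing>"
        return f"{tagged} {remainder}".strip()
    # No key phrase at the start: return the transcript untagged.
    return transcript

print(tag_transcript("ok google play some music", "ok google"))
# <hotword biasing> ok google </hotword biasing> play some music
```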
[0044] The speech recognition system 112 provides the tagged text for the audio signal to the client device 102 at time TD. The client device 102 receives the tagged text and analyzes the tagged text to determine an action to perform. For instance, the client device 102 may use the tags included in the text to determine which portion of the text corresponds to the key phrase, e.g., the one or more first utterances, and which portion of the text corresponds to an action for the client device 102 to perform. For example, the client device 102 may determine, using the text "play some music," to launch a music player application and play music. The client device 102 may provide a user prompt requesting input of a music genre, a music station, an artist, or another type of music for playback using the music player application.
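By way of illustration only, the following Python sketch shows how a client might split the tagged text into the key phrase portion and the command portion before selecting an action. The tag names follow the example above; the regular expression and the simple action check are assumptions made for the sketch.

```python
import re

TAGGED = "<hotword biasing> ok google </hotword biasing> play some music"

def parse_tagged_text(tagged_text: str) -> tuple[str, str]:
    """Return (key_phrase, command) extracted from tagged text."""
    match = re.match(
        r"<hotword biasing>\s*(.*?)\s*</hotword biasing>\s*(.*)", tagged_text)
    if match:
        return match.group(1), match.group(2)
    return "", tagged_text

key_phrase, command = parse_tagged_text(TAGGED)
if command.startswith("play"):
    # e.g., launch a music player application and prompt for a genre or artist.
    print(f"Launching music player for command: {command!r}")
```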
[0045] In some implementations, the client device 102 may be configured to detect any of multiple different key phrases encoded in an audio signal. For example, the client device 102 may receive input representing a user specified hotword, such as "hey indigo" or "hey gennie." The client device 102 may provide the speech recognition system 112 with data representing the user specified hotword. For instance, the client device 102 may send a text representation of the user specified hotword with the audio signal. In some examples, the client device 102 may provide the speech recognition system 112 with data for the user specified hotword that the speech recognition system 112 associates with an identifier for the client device 102, e.g., with a user account for the client device 102.
[0046] The client device 102 may have different key phrases for different physical geographic locations. For instance, the client device 102 may have a first key phrase for a user's home and a second, different key phrase for the user's office. The client device 102 may use one or more location devices 110 to determine a current physical geographic location for the client device 102 and select a corresponding key phrase. The client device 102 may send data to the speech recognition system 112 with the audio signal that identifies the key phrase based on the physical geographic location of the client device 102. The location devices 110 may include one or more of a global positioning system, a wireless device that detects a wireless signature, e.g., of a wireless hotspot or another device that broadcasts a signature, or a cellular antenna that detects information of cellular base stations. [0047] In some examples, the client device 102 may send data to the speech recognition system 112 that indicates the physical geographic location of the client device 102. For instance, the client hotword detection module 106 may be configured for multiple, e.g., five, different key phrases, each of which begins with the same n-gram prefix, e.g., "ok," and each of which is for use in a different physical geographic location. For example, the client device 102 may have a key phrase of "ok google" in a first location and "ok indigo" in a second location that is a different location from the first location. The client hotword detection module 106 may determine that an audio signal includes the n-gram prefix without determining which of the multiple different key phrases may be encoded in the audio signal. The client device 102, upon a determination by the client hotword detection module 106 that utterances in the audio signal satisfy the first threshold 108 of being a key phrase, may send the audio signal and location data for the client device 102 to the speech recognition system 112. The speech recognition system 112 receives the audio signal and the location data and uses the location data to determine a key phrase from the multiple different key phrases to use for analysis. The server hotword detection module 114 uses the determined key phrase to analyze the audio signal and determines whether the audio signal satisfies the second threshold 116 of being the determined key phrase.
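By way of illustration only, the Python sketch below selects a key phrase based on the device's current physical geographic location, as in the home/office example above. The coordinates, the distance check, and the select_key_phrase helper are hypothetical.

```python
import math

# Hypothetical per-location key phrases: (latitude, longitude, key phrase).
LOCATION_KEY_PHRASES = [
    (37.4220, -122.0841, "ok google"),  # e.g., the user's home
    (37.7749, -122.4194, "ok indigo"),  # e.g., the user's office
]

def select_key_phrase(lat: float, lng: float, max_km: float = 1.0) -> str:
    """Return the key phrase configured for the nearest known location."""
    def distance_km(a_lat, a_lng, b_lat, b_lng):
        # Rough equirectangular approximation; adequate over short distances.
        x = math.radians(b_lng - a_lng) * math.cos(math.radians((a_lat + b_lat) / 2))
        y = math.radians(b_lat - a_lat)
        return 6371.0 * math.hypot(x, y)

    best = min(LOCATION_KEY_PHRASES, key=lambda p: distance_km(lat, lng, p[0], p[1]))
    if distance_km(lat, lng, best[0], best[1]) <= max_km:
        return best[2]
    return "ok google"  # hypothetical default when no known location is nearby

print(select_key_phrase(37.4221, -122.0840))  # ok google
```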
[0048] In some implementations, the client device 102 is asleep, e.g., in a low power mode, when the client device 102 captures the audio signal, e.g., using the microphone 104. In the sleep mode, the client device 102 may not have full functionality. For instance, some features of the client device 102 may be disabled to reduce battery usage.
[0049] The client device 102 may begin to wake up upon determining that the first utterances satisfy the first threshold 108 of being a key phrase. For example, the client device 102 may enable one or more network connectivity devices, one or more of the location devices 110, or both, to allow the client device 102 to communicate with the speech recognition system 112.
[0050] When the client device 102 receives the tagged text data from the speech recognition system 112, the client device 102 exits the sleep mode. For instance, the client device 102 enables more functionality of the client device 102 to determine an action to perform using the tagged text, to perform an action determined using the tagged text, or both. [0051] The speech recognition system 112 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described in this document are implemented. The client device 102 may include a personal computer, a mobile communication device, or another device that can send and receive data over a network 124. The network 124, such as a local area network (LAN), wide area network (WAN), the Internet, or a combination thereof, connects the client device 102 and the speech recognition system 112. The speech recognition system 112 may use a single server computer or multiple server computers operating in conjunction with one another, including, for example, a set of remote computers deployed as a cloud computing service.
[0052] FIG. 2 is a flow diagram of a process 200 for determining whether to perform an action. For example, the process 200 can be used by the client device 102 from the environment 100.
[0053] A client device receives an audio signal encoding one or more utterances including a first utterance (202). The client device may use any appropriate type of device to capture the audio signal. In some examples, the client device may receive the audio signal from another device, e.g., a smart watch.
[0054] The client device determines whether at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase (204). The client device may include data for one or more key phrases. The client device may determine whether at least the portion of the first utterance has at least a predetermined likelihood, defined by the first threshold, of being a portion of one of the key phrases. The portion of the first utterance may include one or more n-grams from the first utterance or another appropriate type of segment from the first utterance. In some examples, when the key phrase includes two or more words, the portion may include a single word from two or more first utterances. In some examples, the client device may determine whether multiple first utterances, e.g., one or more first utterances, satisfy the first threshold of being one of the key phrases.
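By way of illustration only, the Python sketch below gates on the first threshold with a deliberately permissive stand-in scoring function; an actual on-device hotword detector, which the specification does not prescribe, would score acoustic features rather than words.

```python
FIRST_THRESHOLD = 0.5  # hypothetical; intentionally permissive on the client

def detector_score(utterance_portion: str, key_phrase: str) -> float:
    """Stand-in score: fraction of key-phrase words present in the portion."""
    key_words = key_phrase.lower().split()
    heard = set(utterance_portion.lower().split())
    return sum(1 for w in key_words if w in heard) / len(key_words)

def satisfies_first_threshold(utterance_portion: str, key_phrase: str) -> bool:
    """Return True when the portion satisfies the first threshold (step 204)."""
    return detector_score(utterance_portion, key_phrase) >= FIRST_THRESHOLD

print(satisfies_first_threshold("ok", "ok google"))           # True (0.5 >= 0.5)
print(satisfies_first_threshold("hello there", "ok google"))  # False
```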
[0055] In response to determining that at least a portion of the first utterance satisfies the first threshold of being at least a portion of a key phrase, the client device sends the audio signal to a server system that determines whether the first utterance satisfies a second threshold of being the key phrase (206). The second threshold is more restrictive than the first threshold. For instance, the client device may send the audio signal, or a portion of the audio signal, to the server, e.g., a speech recognition system, to cause the server to determine whether the first utterance satisfies the second threshold of being the key phrase. The server always analyzes all of the first utterances to determine whether the first utterances satisfy the second threshold of being the entire key phrase.
[0056] In some implementations, the portion of the audio signal the client device sends to the server system may include the first utterances that satisfy the first threshold and one or more other utterances. For instance, the client device may continue to receive the audio signal while analyzing the first utterances such that the additional portion of the received audio signal includes the one or more other utterances. The client device may send the portion of the audio signal that includes the first utterances and the other utterances to the server.
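By way of illustration only, the short Python sketch below keeps appending captured audio frames while the hotword check runs, so the portion sent to the server includes both the first utterances and the utterances that follow them. The frame-based buffer is an assumption; a real client would stream frames from a microphone.

```python
from collections import deque

class AudioBuffer:
    """Accumulates audio frames captured before and during hotword analysis."""

    def __init__(self, max_frames: int = 500):
        self.frames: deque[bytes] = deque(maxlen=max_frames)

    def append(self, frame: bytes) -> None:
        self.frames.append(frame)

    def snapshot(self) -> bytes:
        """Return everything captured so far as one audio signal portion."""
        return b"".join(self.frames)

buffer = AudioBuffer()
buffer.append(b"<frames for 'ok google'>")        # the first utterances
buffer.append(b"<frames for 'play some music'>")  # captured during analysis
payload = buffer.snapshot()  # sent to the server with the analysis request
print(len(payload))
```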
[0057] The client device determines whether response data, received from the server system, includes tagged text data representing the one or more utterances encoded in the audio signal (208). For example, the client device may receive the response data from the server in response to sending the audio signal to the server. The client device may analyze the response data to determine whether the response data includes tagged text data.
[0058] In response to determining that the response data includes tagged text data representing the one or more utterances encoded in the audio signal, the client device performs an action using the tagged text data (210). For instance, the client device uses the tags in the data to determine the action to perform. The tags may indicate which portion of the tagged data, and the respective portion of the audio signal, correspond to the first utterances for the key phrase. The tags may indicate which portion of the tagged data corresponds to an action for the client device to perform, e.g., "play some music."
[0059] In response to determining that at least a portion of the first utterance does not satisfy the first threshold of being at least a portion of a key phrase or in response to determining that the response data does not include tagged text data, the client device determines to not perform an action using data from the audio signal (212). For instance, when no portion of the first utterance satisfies the first threshold of being the key phrase, the client device does not perform any action using the audio signal. In some examples, when the client device receives a message from the server that indicates that the audio signal did not encode the key phrase, e.g., the response data does not include tagged text data, the client device does not perform any action using the audio signal.
[0060] In response to determining that at least a portion of the first utterance does not satisfy the first threshold of being at least a portion of a key phrase or in response to determining that the response data does not include tagged text data, the client device discards the audio signal (214). For instance, when no portion of the first utterance satisfies the first threshold of being the key phrase, the client device may discard the audio signal. In some examples, when the client device receives a message from the server that indicates that the audio signal did not encode the key phrase, e.g., the response data does not include tagged text data, the client device may discard the audio signal. In some implementations, the client device may discard the audio signal after a predetermined period of time when one of these conditions occurs.
[0061] The order of steps in the process 200 described above is illustrative only, and determining whether to perform an action can be performed in different orders. For example, the client device may discard the audio signal and then not perform an action using data from the audio signal or may perform these two steps concurrently.
[0062] In some implementations, the process 200 can include additional steps, fewer steps, or some of the steps can be divided into multiple steps. For example, the client device may either discard the audio signal or not perform an action using data from the audio signal, instead of performing both steps.
[0063] FIG. 3 is a flow diagram of a process 300 for generating tagged text data for an audio signal. For example, the process 300 can be used by the speech recognition system 112 from the environment 100.
[0064] A speech recognition system receives, from a client device, an audio signal encoding one or more utterances including one or more first utterances for which the client device determined that at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase (302). The speech recognition system may receive the audio signal from the client device across a network. The client device may have sent the audio signal to the speech recognition system as part of a process that includes performing steps 202 through 206 described above with reference to FIG. 2.
[0065] The speech recognition system customizes a language model for the key phrase (304). For instance, the speech recognition system may increase a likelihood that the language model, which is not specific to any particular key phrase, will accurately identify an occurrence of the key phrase encoded in the audio signal. In some examples, the speech recognition system may adjust weights for the language model specific to the key phrase.
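By way of illustration only, the Python sketch below biases a simple n-gram log-probability table toward a key phrase by boosting, or adding, the key phrase's n-grams. The table, the boost value, and the bias_language_model helper are hypothetical; a production system would bias a far richer language model.

```python
import math

def bias_language_model(log_probs: dict[str, float],
                        key_phrase: str,
                        boost: float = math.log(10.0)) -> dict[str, float]:
    """Return a copy of the model with the key phrase's n-grams boosted."""
    biased = dict(log_probs)
    words = key_phrase.split()
    ngrams = [key_phrase] + [" ".join(words[:i]) for i in range(1, len(words))]
    for ngram in ngrams:
        # Add the n-gram if the base model lacks it, otherwise raise its weight.
        biased[ngram] = biased.get(ngram, math.log(1e-6)) + boost
    return biased

base_model = {"ok": math.log(0.01), "play some music": math.log(0.002)}
biased_model = bias_language_model(base_model, "ok google")
print(biased_model["ok google"] > math.log(1e-6))  # True: the key phrase was boosted
```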
[0066] In some implementations, the speech recognition system may determine whether to use a default key phrase. For instance, the speech recognition system may determine whether a message received from the client device that includes the audio signal also includes data identifying a key phrase, e.g., text for the key phrase or an identifier that can be used to look up a key phrase in a database. The speech recognition system may determine to use a default key phrase when the message does not include data identifying a key phrase. For example, the speech recognition system may determine that the client device, or a corresponding user account, does not have a customized key phrase and to use a default key phrase.
[0067] The speech recognition system determines whether the one or more first utterances satisfy the second threshold of being a key phrase based on output from the language model, an acoustic model, or both (306). For instance, the speech recognition system provides the audio signal to the language model, the acoustic model, or both. The speech recognition system receives a score from the language model, the acoustic model, or both, that each indicate a likelihood that the one or more first utterances are the key phrase. The speech recognition system may combine the separate scores from the language model and the acoustic model to determine whether the combined score for the audio signal satisfies the second threshold of being the key phrase.
[0068] In response to determining that the first utterance satisfies the second threshold of being a key phrase based on output from the language model, the acoustic model, or both, the speech recognition system analyzes the entire audio signal to determine data for each of the one or more utterances (308). For example, an acoustic model generates output indicating a text string for the words likely encoded in the audio signal. A tagged text generator may apply tags to the text string that indicate one or more attributes of n-grams, e.g., words, included in the text string. For instance, the tagged text generator may apply tags that identify a key phrase, an action word, e.g., "play," an application, e.g., music player, or a combination of two or more of these, to the text string.
[0069] The speech recognition system sends, to the client device, tagged text data representing the one or more utterances encoded in the audio signal generated using the data for each of the one or more utterances (310). The speech recognition system may send the tagged text data to the client device to cause the client device to perform an action using the tagged text data.
[0070] In response to determining that the first utterance does not satisfy the second threshold of being a key phrase based on output from the language model, the acoustic model, or both, the speech recognition system sends, to the client device, data indicating that the key phrase is not likely encoded in the audio signal (312). For instance, the speech recognition system may provide the client device with a message that indicates that the client device should not perform any action using data for the audio signal.
[0071] In some implementations, the process 300 can include additional steps, fewer steps, or some of the steps can be divided into multiple steps. For example, the speech recognition system might not customize the language model. In some examples, the speech recognition system may determine whether the first utterance satisfies the second threshold of being a key phrase using data or systems other than the language model, the acoustic model, or both.
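By way of illustration only, the following Python sketch wires the steps of process 300 together: resolve a key phrase, score the audio signal against the second threshold, and return either tagged text or a rejection. The scoring inputs and field names are stand-ins; in practice the scores would come from language and acoustic models applied to the audio signal.

```python
SECOND_THRESHOLD = 0.85  # hypothetical value for the server-side threshold

def handle_request(request: dict) -> dict:
    """Stand-in for the server-side handling of one audio analysis request."""
    key_phrase = request.get("key_phrase_text", "ok google")
    # Stand-ins for language-model and acoustic-model scores for the audio signal.
    overall = 0.5 * request["lm_score"] + 0.5 * request["am_score"]
    if overall < SECOND_THRESHOLD:
        # Step 312: indicate the key phrase is likely not encoded in the audio signal.
        return {"key_phrase_detected": False}
    # Steps 308-310: recognize the full audio signal and tag the key phrase.
    transcript = request["transcript"]  # stand-in for full recognition output
    command = transcript[len(key_phrase):].strip()
    tagged = f"<hotword biasing> {key_phrase} </hotword biasing> {command}"
    return {"key_phrase_detected": True, "tagged_text": tagged}

print(handle_request({"key_phrase_text": "ok google",
                      "lm_score": 0.95, "am_score": 0.90,
                      "transcript": "ok google play some music"}))
```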
[0072] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
[0073] The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be or further include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC
(application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. [0074] A computer program, which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and
interconnected by a communication network.
[0075] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
[0076] Computers suitable for the execution of a computer program include, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a smart phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
[0077] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[0078] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., LCD (liquid crystal display), OLED (organic light emitting diode) or other monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser.
[0079] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be
interconnected by any form or medium of digital data communication, e.g., a
communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
[0080] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., a Hypertext Markup Language (HTML) page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the user device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received from the user device at the server.
[0081] FIG. 4 is a block diagram of computing devices 400, 450 that may be used to implement the systems and methods described in this document, as either a client or as a server or plurality of servers. Computing device 400 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
Computing device 450 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, smartwatches, head-worn devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations described and/or claimed in this document.
[0082] Computing device 400 includes a processor 402, memory 404, a storage device 406, a high-speed interface 408 connecting to memory 404 and high-speed expansion ports 410, and a low speed interface 412 connecting to low speed bus 414 and storage device 406. Each of the components 402, 404, 406, 408, 410, and 412, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 402 can process instructions for execution within the computing device 400, including instructions stored in the memory 404 or on the storage device 406 to display graphical information for a GUI on an external input/output device, such as display 416 coupled to high speed interface 408. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 400 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
[0083] The memory 404 stores information within the computing device 400. In one implementation, the memory 404 is a computer-readable medium. In one implementation, the memory 404 is a volatile memory unit or units. In another implementation, the memory 404 is a non-volatile memory unit or units.
[0084] The storage device 406 is capable of providing mass storage for the computing device 400. In one implementation, the storage device 406 is a computer-readable medium. In various different implementations, the storage device 406 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 404, the storage device 406, or memory on processor 402.
[0085] The high speed controller 408 manages bandwidth-intensive operations for the computing device 400, while the low speed controller 412 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In one implementation, the high-speed controller 408 is coupled to memory 404, display 416 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 410, which may accept various expansion cards (not shown). In the implementation, low-speed controller 412 is coupled to storage device 406 and low-speed expansion port 414. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
[0086] The computing device 400 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 420, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 424. In addition, it may be implemented in a personal computer such as a laptop computer 422. Alternatively, components from computing device 400 may be combined with other components in a mobile device (not shown), such as device 450. Each of such devices may contain one or more of computing device 400, 450, and an entire system may be made up of multiple computing devices 400, 450 communicating with each other.
[0087] Computing device 450 includes a processor 452, memory 464, an input/output device such as a display 454, a communication interface 466, and a transceiver 468, among other components. The device 450 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 450, 452, 464, 454, 466, and 468, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate. [0088] The processor 452 can process instructions for execution within the computing device 450, including instructions stored in the memory 464. The processor may also include separate analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 450, such as control of user interfaces, applications run by device 450, and wireless communication by device 450.
[0089] Processor 452 may communicate with a user through control interface 458 and display interface 456 coupled to a display 454. The display 454 may be, for example, a TFT LCD display or an OLED display, or other appropriate display technology. The display interface 456 may comprise appropriate circuitry for driving the display 454 to present graphical and other information to a user. The control interface 458 may receive commands from a user and convert them for submission to the processor 452. In addition, an external interface 462 may be provided in communication with processor 452, so as to enable near area communication of device 450 with other devices. External interface 462 may provide, for example, for wired communication (e.g., via a docking procedure) or for wireless communication (e.g., via Bluetooth or other such
technologies).
[0090] The memory 464 stores information within the computing device 450. In one implementation, the memory 464 is a computer-readable medium. In one implementation, the memory 464 is a volatile memory unit or units. In another implementation, the memory 464 is a non-volatile memory unit or units. Expansion memory 474 may also be provided and connected to device 450 through expansion interface 472, which may include, for example, a SIMM card interface. Such expansion memory 474 may provide extra storage space for device 450, or may also store applications or other information for device 450. Specifically, expansion memory 474 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, expansion memory 474 may be provided as a security module for device 450, and may be programmed with instructions that permit secure use of device 450. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
[0091] The memory may include for example, flash memory and/or MRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 464, expansion memory 474, or memory on processor 452.
[0092] Device 450 may communicate wirelessly through communication interface
466, which may include digital signal processing circuitry where necessary.
Communication interface 466 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 468. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, GPS receiver module 470 may provide additional wireless data to device 450, which may be used as appropriate by applications running on device 450.
[0093] Device 450 may also communicate audibly using audio codec 460, which may receive spoken information from a user and convert it to usable digital information. Audio codec 460 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 450. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 450.
[0094] The computing device 450 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 480. It may also be implemented as part of a smartphone 482, personal digital assistant, or other similar mobile device.
[0095] Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
[0096] These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a
programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
[0097] To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
[0098] The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.
[0099] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
[0100] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
[0101] Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0102] Further implementations are summarized in the following examples:
[0103] Example 1: A non-transitory computer storage medium encoded with instructions that, when executed by a computer, cause the computer to perform operations comprising:
receiving an audio signal encoding one or more utterances including a first utterance;
determining whether at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase;
in response to determining that at least the portion of the first utterance satisfies the first threshold of being at least a portion of a key phrase, sending the audio signal to a server system that determines whether the first utterance satisfies a second threshold of being the key phrase, the second threshold being more restrictive than the first threshold; and
receiving, from the server system, tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.
[0104] Example 2: The computer storage medium of example 1, the operations comprising performing an action using the tagged text data subsequent to receiving, from the server system, the tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.
[0105] Example 3: The computer storage medium of example 1 or 2, wherein:
the one or more utterances comprises two or more utterances, the first utterance encoded prior to the other utterances from the two or more utterances in the audio signal; and
performing the action using the tagged text data comprises performing an action using the tagged text data for the one or more utterances encoded in the audio signal after the first utterance.
[0106] Example 4: The computer storage medium of one of examples 1 to 3, wherein determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase comprises determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase that includes two or more words.
[0107] Example 5: The computer storage medium of one of examples 1 to 4, the operations comprising:
receiving a second audio signal encoding one or more second utterances including a second utterance;
determining whether at least a portion of the second utterance satisfies the first threshold of being at least a portion of a key phrase; and
in response to determining that at least the portion of the second utterance does not satisfy the first threshold of being at least a portion of a key phrase, discarding the second audio signal.
[0108] Example 6: The computer storage medium of example 5, the operations comprising determining to not perform an action using data from the second audio signal in response to determining that at least the portion of the second utterance does not satisfy the first threshold of being at least a portion of a key phrase.
[0109] Example 7: The computer storage medium of one of examples 1 to 6, wherein determining whether at least a portion of the first utterance satisfies the first threshold of being a key phrase comprises determining whether at least a portion of the first utterance satisfies a first likelihood of being at least a portion of a key phrase.
[0110] Example 8: A system comprising one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, from a client device, an audio signal encoding one or more utterances including one or more first utterances for which the client device determined that at least a portion of the one or more first utterances satisfies a first threshold of being at least a portion of a key phrase;
determining whether the one or more first utterances satisfy a second threshold of being at least a portion of the key phrase, the second threshold more restrictive than the first threshold; and
sending, to the client device, a result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase.
[0111] Example 9: The system of example 8, wherein sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, data indicating that the key phrase is not likely included in the audio signal in response to determining that the one or more first utterances do not satisfy the second threshold of being the key phrase.
[0112] Example 10: The system of example 8 or 9, wherein sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase.
[0113] Example 11: The system of one of examples 8 to 10, wherein sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, tagged text data representing the one or more utterances encoded in the audio signal.
[0114] Example 12: The system of one of examples 8 to 11, the operations comprising analyzing the entire audio signal to determine first data for each of the one or more utterances, wherein sending, to the client device, the data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, the first data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase. [0115] Example 13: The system of one of examples 8 to 12, wherein determining whether the one or more first utterances satisfy the second threshold of being the key phrase comprises determining, using a language model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
[0116] Example 14: The system of one of examples 8 to 13, the operations comprising customizing the language model for the key phrase prior to determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
[0117] Example 15: The system of one of examples 8 to 14, the operations comprising receiving text identifying the key phrase, wherein customizing the language model for the key phrase comprises customizing the language model for the key phrase using the text identifying the key phrase.
[0118] Example 16: The system of one of examples 8 to 15, the operations comprising:
receiving an identifier; and
determining, using the identifier, key phrase data for the key phrase, wherein customizing the language model for the key phrase comprises customizing the language model for the key phrase using the key phrase data.
[0119] Example 17: The system of one of examples 8 to 16, wherein determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase comprises determining, using the language model and an acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
[0120] Example 18: The system of one of examples 8 to 17, wherein
determining, using the language model and the acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase comprises:
providing data for the one or more first utterances to the language model to cause the language model to generate a first output;
providing data for the one or more first utterances to the acoustic model to cause the acoustic model to generate a second output;
combining the first output and the second output to generate a combined output; and
determining, using the combined output, whether the one or more first utterances satisfy the second threshold of being the key phrase. [0121] Example 19: The system of one of examples 8 to 18, the operations comprising selecting the language model for a default key phrase.
[0122] Example 20: The system of one of examples 8 to 19, the operations comprising determining whether to use the default key phrase.
[0123] Example 21: A computer-implemented method comprising:
receiving an audio signal encoding one or more utterances including a first utterance;
determining whether at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase;
in response to determining that at least the portion of the first utterance satisfies the first threshold of being at least a portion of a key phrase, sending the audio signal to a server system that determines whether the first utterance satisfies a second threshold of being the key phrase, the second threshold being more restrictive than the first threshold; and
receiving, from the server system, tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.
[0124] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims

1. A non-transitory computer storage medium encoded with instructions that, when executed by a computer, cause the computer to perform operations comprising:
receiving an audio signal encoding one or more utterances including a first utterance;
determining whether at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase;
in response to determining that at least the portion of the first utterance satisfies the first threshold of being at least a portion of a key phrase, sending the audio signal to a server system that determines whether the first utterance satisfies a second threshold of being the key phrase, the second threshold being more restrictive than the first threshold; and
receiving, from the server system, tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.
2. The computer storage medium of claim 1, the operations comprising performing an action using the tagged text data subsequent to receiving, from the server system, the tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.
3. The computer storage medium of claim 2, wherein:
the one or more utterances comprises two or more utterances, the first utterance encoded prior to the other utterances from the two or more utterances in the audio signal; and
performing the action using the tagged text data comprises performing an action using the tagged text data for the one or more utterances encoded in the audio signal after the first utterance.
4. The computer storage medium of claim 1, wherein determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase comprises determining whether at least a portion of the first utterance satisfies the first threshold of being at least a portion of the key phrase that includes two or more words.
5. The computer storage medium of claim 1, the operations comprising:
receiving a second audio signal encoding one or more second utterances including a second utterance;
determining whether at least a portion of the second utterance satisfies the first threshold of being at least a portion of a key phrase; and
in response to determining that at least the portion of the second utterance does not satisfy the first threshold of being at least a portion of a key phrase, discarding the second audio signal.
6. The computer storage medium of claim 5, the operations comprising determining to not perform an action using data from the second audio signal in response to determining that at least the portion of the second utterance does not satisfy the first threshold of being at least a portion of a key phrase.
7. The computer storage medium of claim 1, wherein determining whether at least a portion of the first utterance satisfies the first threshold of being a key phrase comprises determining whether at least a portion of the first utterance satisfies a first likelihood of being at least a portion of a key phrase.
8. A system comprising one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, from a client device, an audio signal encoding one or more utterances including one or more first utterances for which the client device determined that at least a portion of the one or more first utterances satisfies a first threshold of being at least a portion of a key phrase;
determining whether the one or more first utterances satisfy a second threshold of being at least a portion of the key phrase, the second threshold more restrictive than the first threshold; and
sending, to the client device, a result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase.
9. The system of claim 8, wherein sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, data indicating that the key phrase is not likely included in the audio signal in response to determining that the one or more first utterances do not satisfy the second threshold of being the key phrase.
10. The system of claim 8, wherein sending, to the client device, the result of determining whether the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase.
11. The system of claim 10, wherein sending, to the client device, data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, tagged text data representing the one or more utterances encoded in the audio signal.
12. The system of claim 10, the operations comprising analyzing the entire audio signal to determine first data for each of the one or more utterances, wherein sending, to the client device, the data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase comprises sending, to the client device, the first data for the audio signal in response to determining that the one or more first utterances satisfy the second threshold of being the key phrase.
13. The system of claim 8, wherein determining whether the one or more first utterances satisfy the second threshold of being the key phrase comprises determining, using a language model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
14. The system of claim 13, the operations comprising customizing the language model for the key phrase prior to determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
15. The system of claim 14, the operations comprising receiving text identifying the key phrase, wherein customizing the language model for the key phrase comprises customizing the language model for the key phrase using the text identifying the key phrase.
16. The system of claim 14, the operations comprising:
receiving an identifier; and
determining, using the identifier, key phrase data for the key phrase, wherein customizing the language model for the key phrase comprises customizing the language model for the key phrase using the key phrase data.
17. The system of claim 13, wherein determining, using the language model, whether the one or more first utterances satisfy the second threshold of being the key phrase comprises determining, using the language model and an acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase.
18. The system of claim 17, wherein determining, using the language model and the acoustic model, whether the one or more first utterances satisfy the second threshold of being the key phrase comprises:
providing data for the one or more first utterances to the language model to cause the language model to generate a first output;
providing data for the one or more first utterances to the acoustic model to cause the acoustic model to generate a second output;
combining the first output and the second output to generate a combined output; and
determining, using the combined output, whether the one or more first utterances satisfy the second threshold of being the key phrase.
19. The system of claim 13, the operations comprising selecting the language model for a default key phrase.
20. The system of claim 19, the operations comprising determining whether to use the default key phrase.
21. A computer-implemented method comprising:
receiving an audio signal encoding one or more utterances including a first utterance;
determining whether at least a portion of the first utterance satisfies a first threshold of being at least a portion of a key phrase;
in response to determining that at least the portion of the first utterance satisfies the first threshold of being at least a portion of a key phrase, sending the audio signal to a server system that determines whether the first utterance satisfies a second threshold of being the key phrase, the second threshold being more restrictive than the first threshold; and
receiving, from the server system, tagged text data representing the one or more utterances encoded in the audio signal when the server system determines that the first utterance satisfies the second threshold.
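The following is a minimal, hypothetical sketch of the two-stage keyword check recited in claims 8, 17-18, and 21: a client device applies a lenient first threshold before sending the audio signal, and the server system combines a language model output with an acoustic model output and tests the result against a stricter second threshold. All names, score values, and the equal-weight score combination are illustrative assumptions, not the claimed implementation.

```python
"""Illustrative sketch only: hypothetical thresholds, scores, and a simple
averaged score combination stand in for the models described in the claims."""
from dataclasses import dataclass
from typing import Optional

FIRST_THRESHOLD = 0.4   # lenient check applied on the client device
SECOND_THRESHOLD = 0.8  # more restrictive check applied by the server system


@dataclass
class ServerResult:
    key_phrase_detected: bool
    tagged_text: Optional[str] = None


def client_hotword_check(partial_score: float) -> bool:
    """Claim 21: does the first utterance satisfy the first threshold of
    being at least a portion of the key phrase?"""
    return partial_score >= FIRST_THRESHOLD


def server_verify(language_model_score: float,
                  acoustic_model_score: float,
                  transcript: str) -> ServerResult:
    """Claims 8 and 17-18: combine a language model output with an acoustic
    model output and test the combined score against the second threshold."""
    combined = 0.5 * language_model_score + 0.5 * acoustic_model_score
    if combined >= SECOND_THRESHOLD:
        # Claim 11: key phrase confirmed, return tagged text for the utterances.
        return ServerResult(True, tagged_text=f"<key_phrase> {transcript}")
    # Claim 9: key phrase not likely included in the audio signal.
    return ServerResult(False)


if __name__ == "__main__":
    # The client only uploads audio that passes the lenient first check;
    # the server either returns tagged text or reports no key phrase.
    if client_hotword_check(partial_score=0.55):
        print(server_verify(language_model_score=0.9,
                            acoustic_model_score=0.85,
                            transcript="ok assistant what is the weather"))
```

A weighted sum is only one possible way to combine the two model outputs; the claims leave the combination method open, so the averaging above is purely an assumption for illustration.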
PCT/US2017/058944 2017-02-14 2017-10-30 Server side hotwording WO2018151772A1 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
EP20194706.6A EP3767623A1 (en) 2017-02-14 2017-10-30 Server side hotwording
CN202310534112.4A CN116504238A (en) 2017-02-14 2017-10-30 Server side hotword
CN201780086256.0A CN110268469B (en) 2017-02-14 2017-10-30 Server side hotword
EP17804349.3A EP3559944B1 (en) 2017-02-14 2017-10-30 Server side hotwording
JP2019543379A JP6855588B2 (en) 2017-02-14 2017-10-30 Server side hotwording
KR1020197025555A KR102332944B1 (en) 2017-02-14 2017-10-30 server side hotwording

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/432,358 2017-02-14
US15/432,358 US10311876B2 (en) 2017-02-14 2017-02-14 Server side hotwording

Publications (1)

Publication Number Publication Date
WO2018151772A1 true WO2018151772A1 (en) 2018-08-23

Family

ID=60452744

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/058944 WO2018151772A1 (en) 2017-02-14 2017-10-30 Server side hotwording

Country Status (7)

Country Link
US (5) US10311876B2 (en)
EP (2) EP3767623A1 (en)
JP (2) JP6855588B2 (en)
KR (1) KR102332944B1 (en)
CN (2) CN110268469B (en)
DE (1) DE202017106606U1 (en)
WO (1) WO2018151772A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020073839A1 (en) * 2018-10-11 2020-04-16 阿里巴巴集团控股有限公司 Voice wake-up method, apparatus and system, and electronic device
WO2020204907A1 (en) * 2019-04-01 2020-10-08 Google Llc Adaptive management of casting requests and/or user inputs at a rechargeable device
US11699443B2 (en) 2017-02-14 2023-07-11 Google Llc Server side hotwording

Families Citing this family (71)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10095470B2 (en) 2016-02-22 2018-10-09 Sonos, Inc. Audio response playback
US10743101B2 (en) 2016-02-22 2020-08-11 Sonos, Inc. Content mixing
US9947316B2 (en) 2016-02-22 2018-04-17 Sonos, Inc. Voice control of a media playback system
US10264030B2 (en) 2016-02-22 2019-04-16 Sonos, Inc. Networked microphone device control
US10509626B2 (en) 2016-02-22 2019-12-17 Sonos, Inc. Handling of loss of pairing between networked devices
US9965247B2 (en) 2016-02-22 2018-05-08 Sonos, Inc. Voice controlled media playback system based on user profile
US9978390B2 (en) 2016-06-09 2018-05-22 Sonos, Inc. Dynamic player selection for audio signal processing
US10134399B2 (en) 2016-07-15 2018-11-20 Sonos, Inc. Contextualization of voice inputs
DE102016114265A1 (en) * 2016-08-02 2018-02-08 Claas Selbstfahrende Erntemaschinen Gmbh Method for at least partially machine transferring a word sequence written in a source language into a word sequence of a target language
US10115400B2 (en) 2016-08-05 2018-10-30 Sonos, Inc. Multiple voice services
US9942678B1 (en) 2016-09-27 2018-04-10 Sonos, Inc. Audio playback settings for voice interaction
US10181323B2 (en) 2016-10-19 2019-01-15 Sonos, Inc. Arbitration-based voice recognition
KR20180118461A (en) * 2017-04-21 2018-10-31 엘지전자 주식회사 Voice recognition module and and voice recognition method
US10475449B2 (en) 2017-08-07 2019-11-12 Sonos, Inc. Wake-word detection suppression
CN107591151B (en) * 2017-08-22 2021-03-16 百度在线网络技术(北京)有限公司 Far-field voice awakening method and device and terminal equipment
US10048930B1 (en) 2017-09-08 2018-08-14 Sonos, Inc. Dynamic computation of system response volume
US10446165B2 (en) 2017-09-27 2019-10-15 Sonos, Inc. Robust short-time fourier transform acoustic echo cancellation during audio playback
US10621981B2 (en) 2017-09-28 2020-04-14 Sonos, Inc. Tone interference cancellation
US10051366B1 (en) 2017-09-28 2018-08-14 Sonos, Inc. Three-dimensional beam forming with a microphone array
US10482868B2 (en) 2017-09-28 2019-11-19 Sonos, Inc. Multi-channel acoustic echo cancellation
US10466962B2 (en) 2017-09-29 2019-11-05 Sonos, Inc. Media playback system with voice assistance
WO2019079962A1 (en) * 2017-10-24 2019-05-02 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for speech recognition with decoupling awakening phrase
TWI661319B (en) * 2017-11-30 2019-06-01 財團法人資訊工業策進會 Apparatus, method, and computer program product thereof for generatiing control instructions based on text
US10672380B2 (en) * 2017-12-27 2020-06-02 Intel IP Corporation Dynamic enrollment of user-defined wake-up key-phrase for speech enabled computer system
US11343614B2 (en) 2018-01-31 2022-05-24 Sonos, Inc. Device designation of playback and network microphone device arrangements
CN108665900B (en) * 2018-04-23 2020-03-03 百度在线网络技术(北京)有限公司 Cloud wake-up method and system, terminal and computer readable storage medium
US11175880B2 (en) 2018-05-10 2021-11-16 Sonos, Inc. Systems and methods for voice-assisted media content selection
US10959029B2 (en) 2018-05-25 2021-03-23 Sonos, Inc. Determining and adapting to changes in microphone performance of playback devices
US10681460B2 (en) 2018-06-28 2020-06-09 Sonos, Inc. Systems and methods for associating playback devices with voice assistant services
US11076035B2 (en) 2018-08-28 2021-07-27 Sonos, Inc. Do not disturb feature for audio notifications
US10461710B1 (en) 2018-08-28 2019-10-29 Sonos, Inc. Media playback system with maximum volume setting
US10587430B1 (en) 2018-09-14 2020-03-10 Sonos, Inc. Networked devices, systems, and methods for associating playback devices based on sound codes
US11024331B2 (en) 2018-09-21 2021-06-01 Sonos, Inc. Voice detection optimization using sound metadata
US11308939B1 (en) * 2018-09-25 2022-04-19 Amazon Technologies, Inc. Wakeword detection using multi-word model
US11100923B2 (en) 2018-09-28 2021-08-24 Sonos, Inc. Systems and methods for selective wake word detection using neural network models
US10692518B2 (en) 2018-09-29 2020-06-23 Sonos, Inc. Linear filtering for noise-suppressed speech detection via multiple network microphone devices
US11899519B2 (en) 2018-10-23 2024-02-13 Sonos, Inc. Multiple stage network microphone device with reduced power consumption and processing load
EP3654249A1 (en) 2018-11-15 2020-05-20 Snips Dilated convolutions and gating for efficient keyword spotting
US11183183B2 (en) 2018-12-07 2021-11-23 Sonos, Inc. Systems and methods of operating media playback systems having multiple voice assistant services
US11232788B2 (en) * 2018-12-10 2022-01-25 Amazon Technologies, Inc. Wakeword detection
US11132989B2 (en) 2018-12-13 2021-09-28 Sonos, Inc. Networked microphone devices, systems, and methods of localized arbitration
US10602268B1 (en) 2018-12-20 2020-03-24 Sonos, Inc. Optimization of network microphone devices using noise classification
US10867604B2 (en) 2019-02-08 2020-12-15 Sonos, Inc. Devices, systems, and methods for distributed voice processing
US11315556B2 (en) 2019-02-08 2022-04-26 Sonos, Inc. Devices, systems, and methods for distributed voice processing by transmitting sound data associated with a wake word to an appropriate device for identification
US11093720B2 (en) * 2019-03-28 2021-08-17 Lenovo (Singapore) Pte. Ltd. Apparatus, method, and program product for converting multiple language variations
US11120794B2 (en) 2019-05-03 2021-09-14 Sonos, Inc. Voice assistant persistence across multiple network microphone devices
CN113692616B (en) * 2019-05-03 2024-01-05 谷歌有限责任公司 Phoneme-based contextualization for cross-language speech recognition in an end-to-end model
KR20200141860A (en) * 2019-06-11 2020-12-21 삼성전자주식회사 Electronic apparatus and the control method thereof
US11361756B2 (en) 2019-06-12 2022-06-14 Sonos, Inc. Conditional wake word eventing based on environment
US11200894B2 (en) 2019-06-12 2021-12-14 Sonos, Inc. Network microphone device with command keyword eventing
US10586540B1 (en) 2019-06-12 2020-03-10 Sonos, Inc. Network microphone device with command keyword conditioning
US11282500B2 (en) * 2019-07-19 2022-03-22 Cisco Technology, Inc. Generating and training new wake words
US10871943B1 (en) 2019-07-31 2020-12-22 Sonos, Inc. Noise classification for event detection
US11138975B2 (en) 2019-07-31 2021-10-05 Sonos, Inc. Locally distributed keyword detection
US20210050003A1 (en) * 2019-08-15 2021-02-18 Sameer Syed Zaheer Custom Wake Phrase Training
WO2021071115A1 (en) * 2019-10-07 2021-04-15 Samsung Electronics Co., Ltd. Electronic device for processing user utterance and method of operating same
US11189286B2 (en) 2019-10-22 2021-11-30 Sonos, Inc. VAS toggle based on device orientation
US11200900B2 (en) 2019-12-20 2021-12-14 Sonos, Inc. Offline voice control
US11562740B2 (en) 2020-01-07 2023-01-24 Sonos, Inc. Voice verification for media playback
US11556307B2 (en) 2020-01-31 2023-01-17 Sonos, Inc. Local voice data processing
US11308958B2 (en) 2020-02-07 2022-04-19 Sonos, Inc. Localized wakeword verification
US11308962B2 (en) 2020-05-20 2022-04-19 Sonos, Inc. Input detection windowing
US11482224B2 (en) 2020-05-20 2022-10-25 Sonos, Inc. Command keywords with input detection windowing
US11610578B2 (en) 2020-06-10 2023-03-21 Google Llc Automatic hotword threshold tuning
US11698771B2 (en) 2020-08-25 2023-07-11 Sonos, Inc. Vocal guidance engines for playback devices
US11984123B2 (en) 2020-11-12 2024-05-14 Sonos, Inc. Network device interaction by range
US11749267B2 (en) * 2020-11-20 2023-09-05 Google Llc Adapting hotword recognition based on personalized negatives
US12014727B2 (en) * 2021-07-14 2024-06-18 Google Llc Hotwording by degree
DE102021005206B3 (en) 2021-10-19 2022-11-03 Mercedes-Benz Group AG Method and device for determining a multi-part keyword
JP7267636B1 (en) 2021-10-21 2023-05-02 株式会社アートクリフ Information processing device, information processing system, information processing method and program
US20230267155A1 (en) * 2022-02-23 2023-08-24 The Knot Worldwide Inc. Matching online accounts with overlapping characteristics based on non-homogenous data types

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154954A1 (en) * 2013-12-04 2015-06-04 Google Inc. Initiating actions based on partial hotwords
US20150340032A1 (en) * 2014-05-23 2015-11-26 Google Inc. Training multiple neural networks with different accuracy
US20160171975A1 (en) * 2014-12-11 2016-06-16 Mediatek Inc. Voice wakeup detecting device and method
US20160379635A1 (en) * 2013-12-18 2016-12-29 Cirrus Logic International Semiconductor Ltd. Activating speech process

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1207517B1 (en) * 2000-11-16 2007-01-03 Sony Deutschland GmbH Method for recognizing speech
US8838449B2 (en) * 2010-12-23 2014-09-16 Microsoft Corporation Word-dependent language model
JP5596649B2 (en) * 2011-09-26 2014-09-24 株式会社東芝 Document markup support apparatus, method, and program
WO2014039106A1 (en) * 2012-09-10 2014-03-13 Google Inc. Answering questions using environmental context
US8468023B1 (en) * 2012-10-01 2013-06-18 Google Inc. Handsfree device with countinuous keyword recognition
US9704486B2 (en) * 2012-12-11 2017-07-11 Amazon Technologies, Inc. Speech recognition power management
US20150279351A1 (en) 2012-12-19 2015-10-01 Google Inc. Keyword detection based on acoustic alignment
US9842489B2 (en) 2013-02-14 2017-12-12 Google Llc Waking other devices for additional data
US9361885B2 (en) 2013-03-12 2016-06-07 Nuance Communications, Inc. Methods and apparatus for detecting a voice command
JP2015011170A (en) * 2013-06-28 2015-01-19 株式会社ATR−Trek Voice recognition client device performing local voice recognition
US9202462B2 (en) 2013-09-30 2015-12-01 Google Inc. Key phrase detection
US20160055847A1 (en) * 2014-08-19 2016-02-25 Nuance Communications, Inc. System and method for speech validation
US9418656B2 (en) * 2014-10-29 2016-08-16 Google Inc. Multi-stage hotword detection
US9508340B2 (en) 2014-12-22 2016-11-29 Google Inc. User specified keyword spotting using long short term memory neural network feature extractor
EP3067884B1 (en) 2015-03-13 2019-05-08 Samsung Electronics Co., Ltd. Speech recognition system and speech recognition method thereof
US10311876B2 (en) * 2017-02-14 2019-06-04 Google Llc Server side hotwording
US10762903B1 (en) * 2017-11-07 2020-09-01 Amazon Technologies, Inc. Conversational recovery for voice user interface
US11017778B1 (en) * 2018-12-04 2021-05-25 Sorenson Ip Holdings, Llc Switching between speech recognition systems
KR20210110666A (en) 2019-04-01 2021-09-08 구글 엘엘씨 Adaptive management of casting requests and/or user input on rechargeable devices

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154954A1 (en) * 2013-12-04 2015-06-04 Google Inc. Initiating actions based on partial hotwords
US20160379635A1 (en) * 2013-12-18 2016-12-29 Cirrus Logic International Semiconductor Ltd. Activating speech process
US20150340032A1 (en) * 2014-05-23 2015-11-26 Google Inc. Training multiple neural networks with different accuracy
US20160171975A1 (en) * 2014-12-11 2016-06-16 Mediatek Inc. Voice wakeup detecting device and method

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11699443B2 (en) 2017-02-14 2023-07-11 Google Llc Server side hotwording
US12094472B2 (en) 2017-02-14 2024-09-17 Google Llc Server side hotwording
WO2020073839A1 (en) * 2018-10-11 2020-04-16 阿里巴巴集团控股有限公司 Voice wake-up method, apparatus and system, and electronic device
WO2020204907A1 (en) * 2019-04-01 2020-10-08 Google Llc Adaptive management of casting requests and/or user inputs at a rechargeable device
US11120804B2 (en) 2019-04-01 2021-09-14 Google Llc Adaptive management of casting requests and/or user inputs at a rechargeable device
JP2022519344A (en) * 2019-04-01 2022-03-23 グーグル エルエルシー Adaptive management of casting requests and / or user input on rechargeable devices
JP7081054B2 (en) 2019-04-01 2022-06-06 グーグル エルエルシー Adaptive management of casting requests and / or user input on rechargeable devices

Also Published As

Publication number Publication date
US12094472B2 (en) 2024-09-17
CN110268469B (en) 2023-05-23
JP2020507815A (en) 2020-03-12
EP3559944A1 (en) 2019-10-30
EP3559944B1 (en) 2020-09-09
CN116504238A (en) 2023-07-28
US11699443B2 (en) 2023-07-11
JP7189248B2 (en) 2022-12-13
CN110268469A (en) 2019-09-20
US20190304465A1 (en) 2019-10-03
KR102332944B1 (en) 2021-11-30
JP2021107927A (en) 2021-07-29
US20230343340A1 (en) 2023-10-26
US20210287678A1 (en) 2021-09-16
US10311876B2 (en) 2019-06-04
US10706851B2 (en) 2020-07-07
EP3767623A1 (en) 2021-01-20
DE202017106606U1 (en) 2018-02-14
KR20190109532A (en) 2019-09-25
US20200365158A1 (en) 2020-11-19
US20180233150A1 (en) 2018-08-16
JP6855588B2 (en) 2021-04-07
US11049504B2 (en) 2021-06-29

Similar Documents

Publication Publication Date Title
US12094472B2 (en) Server side hotwording
JP6630765B2 (en) Individualized hotword detection model
US10269346B2 (en) Multiple speech locale-specific hotword classifiers for selection of a speech locale
US9293136B2 (en) Multiple recognizer speech recognition
US20150127345A1 (en) Name Based Initiation of Speech Recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17804349

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017804349

Country of ref document: EP

Effective date: 20190723

ENP Entry into the national phase

Ref document number: 2019543379

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20197025555

Country of ref document: KR

Kind code of ref document: A