US20170194006A1 - Individualized hotword detection models - Google Patents
Individualized hotword detection models
- Publication number
- US20170194006A1 (US 15/462,160)
- Authority
- US
- United States
- Prior art keywords
- user
- acoustic data
- hotword
- predefined hotword
- candidate acoustic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/683—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/086—Detection of language
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/075—Adaptation to the speaker supervised, i.e. under machine guidance
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0638—Interactive procedures
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
Definitions
- This disclosure generally relates to controlling computers using voice commands.
- a computer may analyze a user's utterance and may perform an action in response. For example, a user may say “DRIVE HOME” and a computer may respond with directions for the user to drive home from their current location.
- an aspect of the subject matter described in this specification may involve a process for generating an individualized hotword detection model.
- a “hotword” may refer to a term that wakes a device up from a sleep state or hibernation state, or a term that triggers semantic interpretation on the term or on one or more terms that follow the term, e.g., on voice commands that follow the hotword.
- the term “OK COMPUTER,” in the utterance “OK COMPUTER, DRIVE HOME,” may be a hotword that triggers semantic interpretation on the following term “DRIVE HOME,” and the term “DRIVE HOME” may correspond to a voice command for providing directions to the user's home.
- the system may determine that the utterance begins with the hotword “OK COMPUTER” and, in response, transcribe the sound, perform semantic interpretation on the transcription of the voice command “DRIVE HOME,” and output directions for the user to drive home.
- Hotwords may be useful for “always on” systems that may potentially pick up sounds that are not directed to the system. For example, the use of hotwords may help the system discern when a given utterance is directed at the system, as opposed to an utterance that is directed to another individual present in the environment. In doing so, the system may avoid computationally expensive processing, e.g., semantic interpretation, on sounds or utterances that do not include a hotword.
- a system may detect that an utterance includes a hotword based on a hotword detection model. However, different users may pronounce the same hotword in different ways. Accordingly, the system may not detect when some users speak the hotword.
- the system may increase detection of hotwords based on generating individualized hotword detection models. However, generating a hotword detection model may use thousands of utterances and a user may not desire to provide thousands of enrollment utterances. Accordingly, after receiving one or more enrollment utterances by a user, the system may identify other utterances of the hotword by other users, select the utterances that are similar to the enrollment utterances by the user, and generate the individualized hotword detection model using the selected utterances and the enrollment utterances.
- the subject matter described in this specification may be embodied in methods that may include the actions of obtaining enrollment acoustic data representing an enrollment utterance spoken by a user, obtaining a set of candidate acoustic data representing utterances spoken by other users, determining, for each candidate acoustic data of the set of candidate acoustic data, a similarity score that represents a similarity between the enrollment acoustic data and the candidate acoustic data, selecting a subset of candidate acoustic data from the set of candidate acoustic data based at least on the similarity scores, generating a detection model based on the subset of candidate acoustic data, and providing the detection model for use in detecting an utterance spoken by the user.
- obtaining enrollment acoustic data representing an enrollment utterance spoken by a user includes obtaining enrollment acoustic data for multiple utterances of a predetermined phrase spoken by the user.
- obtaining a set of candidate acoustic data representing utterances spoken by other users includes determining the enrollment utterance is of a predetermined phrase and identifying candidate acoustic data representing utterances of the predetermined phrase spoken by other users.
- determining, for each candidate acoustic data of the set of candidate acoustic data, the similarity score includes determining a distance between the enrollment acoustic data and the candidate acoustic data and determining the similarity score based on the distance.
- determining, for each candidate acoustic data of the set of candidate acoustic data, the similarity score includes determining the similarity scores based on demographic information of the other user that spoke the utterance represented by the candidate acoustic data and demographic information of the user that spoke the enrollment utterance.
- selecting a subset of candidate acoustic data from the set of candidate acoustic data, based at least on similarity scores of the candidate acoustic data that represent a similarity between the enrollment acoustic data and the candidate acoustic data, includes selecting a predetermined number of candidate acoustic data.
- generating a detection model based on the subset of candidate acoustic data includes training a neural network using the subset of candidate acoustic data.
- additional actions include detecting an utterance of a predetermined phrase using the detection model.
- FIGS. 1 and 2 are illustrations of block diagrams of example systems for generating an individualized hotword detection model.
- FIG. 3 is a flowchart of an example process for generating an individualized hotword detection model.
- FIG. 4 is a diagram of exemplary computing devices.
- FIG. 1 is a block diagram of an example system 100 for generating an individualized hotword detection model.
- the system 100 may include a client device 120 and a server 130 that includes a candidate acoustic data scorer 134, a candidate acoustic data selector 136, and a hotword detection model generator 140.
- the client device 120 may be a smart phone, a laptop computer, a tablet computer, a desktop computer, or some other computing device that is configured to detect when a user 110 says a hotword.
- the client device 120 may be configured to detect when the user 110 says “OK COMPUTER.”
- the client device 120 may detect when the user 110 speaks a hotword using a hotword detection model. For example, the client device 120 may detect a user is speaking “OK COMPUTER” using a hotword detection model that has been trained to detect sounds corresponding to when the hotword, “OK COMPUTER,” is spoken.
- the client device 120 may increase detection of hotwords spoken by the user 110 based on a personalized hotword detection model 152 that is trained to detect when the user 110 says the hotword.
- the personalized hotword detection model 152 may be trained to detect “OK COM-UT-ER” as the user 110's pronunciation of the hotword “OK COMPUTER.”
- the client device 120 may prompt the user to provide enrollment utterances. For example, for obtaining a personalized hotword detection model for detecting the hotword “OK COMPUTER,” the client device 120 may provide the prompt “NOW PERSONALIZING HOTWORD DETECTION, SAY ‘OK COMPUTER’ THREE TIMES” to the user 110 .
- the client device 120 may include an acoustic data generator that captures sound as acoustic data.
- the client device 120 may include a microphone that captures the user 110 speaking “OK COMPUTER” as “OK COM-UT-ER” as signals, and encodes the signals as enrollment acoustic data 122 represented by mel-frequency cepstral coefficients.
- the client device 120 may provide the enrollment acoustic data 122 to a server 130 and in response receive the personalized hotword detection model 152 .
- the client device 120 may provide enrollment acoustic data 122 representing the user 110 speaking “OK COMPUTER” as “OK COM-UT-ER” to the server 130 , and in response, receive the personalized hotword detection model 152 trained based at least on the enrollment acoustic data.
- the client device 120 may then detect when the user speaks the hotword using the personalized hotword detection model 152 .
- the client device 120 may detect the user 110 is saying the hotword “OK COMPUTER” when the user says “OK COM-UT-ER.”
- the server 130 may be configured to generate a personalized hotword detection model based on enrollment acoustic data. For example, the server 130 may receive the enrollment acoustic data 122 representing the user 110 speaking “OK COMPUTER” as “OK COM-UT-ER” and train the personalized hotword detection model 152 based at least on the enrollment acoustic data.
- generating a hotword detection model may use thousands of utterances and a user may not want to personally provide thousands of enrollment utterances. Accordingly, after receiving one or more enrollment utterances by a user, the server 130 may identify other utterances of the hotword by other users, select the utterances that are similar to the enrollment utterances by the user, and generate the personalized hotword detection model 152 using the selected utterances and the enrollment utterances.
- the candidate acoustic database 132 of the server 130 may store acoustic data representing utterances of various users.
- the candidate acoustic database 132 of the server 130 may store acoustic data representing hundreds of thousands of utterances of different users.
- the candidate acoustic database 132 may store each acoustic data with data that indicates the hotword that was uttered.
- the candidate acoustic database 132 may store fifty thousand sets of acoustic data labeled as being an utterance of the hotword “OK COMPUTER” and fifty thousand sets of acoustic data labeled as being an utterance of a different hotword “MY BUTLER.”
- the candidate acoustic database 132 may associate the acoustic data with demographic data that describes a user. For example, the candidate acoustic database 132 may associate the acoustic data with a location that the user was in when the hotword was spoken by the user. In another example, the candidate acoustic database 132 may associate the acoustic data with a gender of the user, an age range of the user, or some other information describing the user.
- the candidate acoustic data scorer 134 of the server 130 may be configured to obtain the enrollment acoustic data 122 and the candidate acoustic data from the candidate acoustic database 132 and generate a similarity score that represents a similarity between the enrollment acoustic data 122 and the candidate acoustic data.
- the candidate acoustic data scorer 134 may receive enrollment acoustic data 122 of the user saying “OK COMPUTER” and candidate acoustic data representing another user saying “OK COMPUTER,” determine a 90% similarity, and associate a score of 0.9 with the candidate acoustic data.
- the candidate acoustic data scorer 134 may then obtain a second set of candidate acoustic data representing yet another user saying “OK COMPUTER,” determine a 30% similarity with the enrollment acoustic data 122 , and associate a score of 0.3 with the second set of candidate acoustic data.
- the similarity score of a candidate acoustic data representing a particular utterance may reflect an acoustic similarity between the particular utterance and an enrollment utterance.
- the similarity score may range from 0 to 1 where higher similarity scores reflect greater acoustic similarity and lower scores reflect lower acoustic similarity.
- other types of scores and ranges may be used, e.g., 1-5, A-F, or 0%-100%.
- the candidate acoustic data scorer 134 may generate the score based on a distance between the enrollment acoustic data and the candidate acoustic data. For example, the candidate acoustic data scorer 134 may aggregate a difference between mel-frequency cepstral coefficients of the enrollment acoustic data and the candidate acoustic data across multiple frames, and determine a similarity score where greater aggregate distances result in scores that reflect less similarity and lower aggregate distances result in scores that reflect more similarity.
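The distance-based scoring described above can be sketched as follows. This is a minimal illustration, not the scorer's actual method: the function names, the frame-by-frame alignment, the Euclidean per-frame distance, and the `1 / (1 + distance)` mapping are all assumptions chosen only so that larger aggregate distances yield lower scores in the 0 to 1 range.

```python
import math

def aggregate_distance(enrollment_frames, candidate_frames):
    """Sum per-frame Euclidean distances between two MFCC sequences.

    Assumption: frames are paired in order and any extra frames in the
    longer sequence are ignored; the source does not specify an alignment.
    """
    total = 0.0
    for e_frame, c_frame in zip(enrollment_frames, candidate_frames):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(e_frame, c_frame)))
    return total

def similarity_score(enrollment_frames, candidate_frames):
    """Map an aggregate distance to a score in (0, 1]; identical data scores 1.0."""
    return 1.0 / (1.0 + aggregate_distance(enrollment_frames, candidate_frames))

# Hypothetical two-frame, two-coefficient feature sequences.
enrollment = [[1.0, 2.0], [0.5, 0.0]]
near = [[1.0, 2.0], [0.5, 0.1]]
far = [[4.0, 6.0], [3.5, 4.0]]

print(similarity_score(enrollment, near) > similarity_score(enrollment, far))
```

Any monotonically decreasing mapping from distance to score would serve the same purpose.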
- the candidate acoustic data scorer 134 may determine the score based on demographic information of the other user. For example, instead of selecting candidate acoustic data representing utterances of a user with the same gender, the candidate acoustic data scorer 134 may obtain candidate acoustic data representing utterances of users of different genders, determine whether the gender of a user speaking the utterance represented by the candidate acoustic data matches the gender of the user 110, and in response to determining a match, assign a higher similarity score to candidate acoustic data representing utterances of users of the same gender as the user 110.
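One possible way to fold a demographic match into the score is sketched below. The function name, the additive `bonus` of 0.1, and the cap at 1.0 are hypothetical; the source says only that a match leads to a higher similarity score.

```python
def demographic_adjusted_score(acoustic_score, candidate_gender, user_gender, bonus=0.1):
    """Raise the score when the candidate speaker's gender matches the user's.

    Assumption: the boost is a fixed additive bonus, capped so the
    result stays within the 0-to-1 score range.
    """
    if candidate_gender == user_gender:
        return min(1.0, acoustic_score + bonus)
    return acoustic_score

matched = demographic_adjusted_score(0.5, "female", "female")
unmatched = demographic_adjusted_score(0.5, "male", "female")
print(matched > unmatched)
```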
- the candidate acoustic data scorer 134 may select candidate acoustic data from among more candidate acoustic data stored in the candidate acoustic database 132 .
- the candidate acoustic data scorer 134 may select, from the candidate acoustic database 132, the acoustic data in which the hotword “OK COMPUTER” is spoken.
- the candidate acoustic data scorer 134 may obtain, with the enrollment acoustic data, one or more of an indication of the hotword spoken or an indication of the type of user saying the hotword, and query the candidate acoustic database 132 for acoustic data of users saying the same hotword or of users of a similar type to the user saying the hotword.
- the candidate acoustic data scorer 134 may obtain an indication that the hotword “OK COMPUTER” was spoken by a female user, and in response, query the candidate acoustic database 132 for acoustic data representing the hotword “OK COMPUTER” being spoken by a female user.
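The labeled database and the query step might look like the sketch below. The `CandidateRecord` fields and the `query_candidates` helper are illustrative assumptions; the source states only that each acoustic data is stored with the hotword uttered and may be associated with demographic data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateRecord:
    """Hypothetical database row: one stored utterance and its labels."""
    speaker_id: str
    hotword: str
    gender: str
    features: tuple  # e.g. flattened MFCC values

def query_candidates(database, hotword, gender=None):
    """Return records for the given hotword, optionally filtered by gender."""
    return [
        record for record in database
        if record.hotword == hotword and (gender is None or record.gender == gender)
    ]

database = [
    CandidateRecord("user_a", "OK COMPUTER", "female", (0.1, 0.2)),
    CandidateRecord("user_b", "OK COMPUTER", "male", (0.3, 0.1)),
    CandidateRecord("user_c", "MY BUTLER", "female", (0.5, 0.4)),
]

matches = query_candidates(database, "OK COMPUTER", gender="female")
print([record.speaker_id for record in matches])  # ['user_a']
```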
- the candidate acoustic data selector 136 may obtain the scored candidate acoustic data from the candidate acoustic data scorer 134 and the enrollment acoustic data 122 , and generate a training set 138 of acoustic data for training the personalized hotword detection model 152 .
- the candidate acoustic data selector 136 may obtain enrollment acoustic data representing the user 110 speaking “OK COMPUTER” and obtain fifty thousand candidate acoustic data representing different other users saying “OK COMPUTER,” where each of the candidate acoustic data is associated with a similarity score reflecting a similarity between the candidate acoustic data and the enrollment acoustic data 122, and generate a training set of acoustic data including ten thousand of the fifty thousand candidate acoustic data and the enrollment acoustic data 122.
- the candidate acoustic data selector 136 may generate the training set 138 based on selecting a subset of the candidate acoustic data based at least on the similarity scores. For example, the candidate acoustic data selector 136 may obtain a set of fifty thousand candidate acoustic data and select a subset of ten thousand candidate acoustic data of the set with similarity scores that reflect higher similarities between the candidate acoustic data and the enrollment acoustic data 122 than the other candidate acoustic data.
- the candidate acoustic data selector 136 may select the subset of candidate acoustic data based on selecting a predetermined number, e.g., one thousand, three thousand, ten thousand, fifty thousand, of candidate acoustic data. For example, the candidate acoustic data selector 136 may obtain enrollment acoustic data representing a single utterance of “OK COMPUTER,” and select a subset of three thousand candidate acoustic data with similarity scores that reflect a higher similarity between the candidate acoustic data and the enrollment acoustic data.
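Selecting a predetermined number of the most similar candidates amounts to a top-N selection, sketched here with the standard-library `heapq.nlargest`. The `select_top_n` name and the `(candidate_id, score)` pair layout are assumptions; the four scores reuse the values given for Users A through D in the FIG. 2 description.

```python
import heapq

def select_top_n(scored_candidates, n):
    """Return the n (candidate_id, score) pairs with the highest scores."""
    return heapq.nlargest(n, scored_candidates, key=lambda pair: pair[1])

scores = [("A", 0.6), ("B", 0.5), ("C", 0.3), ("D", 0.9)]
print(select_top_n(scores, 2))  # [('D', 0.9), ('A', 0.6)]
```

For fifty thousand candidates and a selection of a few thousand, this avoids fully sorting the candidate set, although a plain sort would also work.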
- the candidate acoustic data selector 136 may select a subset of candidate acoustic data based on selecting candidate acoustic data that satisfies a threshold similarity score. For example, the candidate acoustic data selector 136 may select candidate acoustic data with similarity scores above a threshold similarity score, e.g., 0.8, 0.85, or 0.9 on a score range of 0.0-1.0, and include the selected candidate acoustic data in the training set 138.
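Threshold-based selection can be sketched as a simple filter. The function name and data layout are assumptions; the strict "greater than" comparison follows the "above a threshold" wording.

```python
def select_above_threshold(scored_candidates, threshold=0.8):
    """Keep (candidate_id, score) pairs whose score is strictly above the threshold."""
    return [(cid, score) for cid, score in scored_candidates if score > threshold]

scores = [("A", 0.6), ("B", 0.5), ("C", 0.3), ("D", 0.9)]
print(select_above_threshold(scores, threshold=0.8))  # [('D', 0.9)]
```

Unlike the fixed-count approach, the size of the resulting subset depends on how many candidates happen to resemble the enrollment utterance.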
- the candidate acoustic data selector 136 may weight the acoustic data in the training set 138 .
- the candidate acoustic data selector 136 may include an enrollment acoustic data multiple times in the training set 138 or associate the enrollment acoustic data in the training set 138 with a greater weight than candidate acoustic data.
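Both weighting options mentioned above, repeating the enrollment data or attaching a larger weight to it, are sketched below. The helper names, the weight of 5.0, and the repeat count of 5 are hypothetical values chosen for illustration.

```python
def build_weighted_training_set(enrollment, candidates, enrollment_weight=5.0):
    """Pair every sample with a weight; enrollment samples get the larger weight."""
    weighted = [(sample, enrollment_weight) for sample in enrollment]
    weighted += [(sample, 1.0) for sample in candidates]
    return weighted

def build_repeated_training_set(enrollment, candidates, repeats=5):
    """Alternative: include each enrollment sample several times, unweighted."""
    return list(enrollment) * repeats + list(candidates)

enrollment = ["enroll_1"]
candidates = ["cand_1", "cand_2"]
print(build_weighted_training_set(enrollment, candidates))
print(build_repeated_training_set(enrollment, candidates, repeats=3))
```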
- the candidate acoustic data selector 136 may select the subset of candidate acoustic data based on multiple enrollment acoustic data. For example, the candidate acoustic data selector 136 may receive enrollment acoustic data for three utterances of “OK COMPUTER” by the user 110 , and for each enrollment acoustic data, select three thousand of the candidate acoustic data with similarity scores that reflect the most similarity to include in the training set 138 . Accordingly, some candidate acoustic data may appear in the training set 138 multiple times if the candidate acoustic data is selected for multiple enrollment acoustic data. In some implementations, the candidate acoustic data selector 136 may remove duplicate candidate acoustic data from the training set 138 or prevent duplicate candidate acoustic data from being included in the training set 138 .
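Per-enrollment selection with optional duplicate removal could be sketched as follows. The function name and the dictionary layout (one `candidate_id -> score` mapping per enrollment utterance) are assumptions.

```python
def select_for_multiple_enrollments(scores_by_enrollment, per_enrollment, deduplicate=True):
    """Pick the top candidates for each enrollment utterance in turn.

    scores_by_enrollment: list of dicts mapping candidate_id -> score,
    one dict per enrollment utterance (hypothetical layout).
    """
    selected = []
    seen = set()
    for scores in scores_by_enrollment:
        top = sorted(scores, key=scores.get, reverse=True)[:per_enrollment]
        for candidate_id in top:
            if deduplicate and candidate_id in seen:
                continue  # already chosen for an earlier enrollment utterance
            seen.add(candidate_id)
            selected.append(candidate_id)
    return selected

scores_by_enrollment = [
    {"A": 0.9, "B": 0.8, "C": 0.1},
    {"A": 0.7, "B": 0.2, "C": 0.95},
]
print(select_for_multiple_enrollments(scores_by_enrollment, 2))         # ['A', 'B', 'C']
print(select_for_multiple_enrollments(scores_by_enrollment, 2, False))  # ['A', 'B', 'C', 'A']
```

Leaving duplicates in effectively gives extra weight to candidates that resemble several enrollment utterances.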
- the candidate acoustic data selector 136 may determine the number of candidate acoustic data to select for an enrollment acoustic data based on a number of enrollment acoustic data received by the candidate acoustic data selector 136 .
- the candidate acoustic data selector 136 may receive five enrollment acoustic data, determine that the hotword detection model generator should receive a training set of at least ten thousand acoustic data, and in response, for each enrollment acoustic data received, select at least one thousand nine hundred ninety-nine candidate acoustic data to include in the training set with the enrollment acoustic data.
- the candidate acoustic data selector 136 may receive ten enrollment acoustic data, determine that the hotword detection model generator should receive a training set of at least ten thousand acoustic data, and in response, for each enrollment acoustic data received, select at least nine hundred ninety-nine candidate acoustic data to include in the training set with the enrollment acoustic data.
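The arithmetic behind these two examples can be checked with a short sketch. The function name is an assumption; the calculation subtracts the enrollment data from the target size and divides the remainder evenly, rounding up, across the enrollment data.

```python
def candidates_per_enrollment(target_size, num_enrollments):
    """Minimum candidates to select per enrollment utterance so that
    candidates plus enrollment data reach at least target_size."""
    remaining = max(0, target_size - num_enrollments)
    # Ceiling division using integers only.
    return -(-remaining // num_enrollments)

print(candidates_per_enrollment(10_000, 5))   # 1999
print(candidates_per_enrollment(10_000, 10))  # 999
```

With five enrollment data, 5 x 1,999 + 5 = 10,000; with ten, 10 x 999 + 10 = 10,000.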
- the candidate acoustic data selector 136 may determine a similarity score for the candidate acoustic data based on determining sub-similarity scores for each of multiple enrollment acoustic data. For example, the candidate acoustic data selector 136 may receive three enrollment acoustic data, and for each candidate acoustic data, determine three sub-similarity scores each corresponding to one of the enrollment acoustic data, and determine the similarity score based on averaging the sub-similarity scores. In another example, the candidate acoustic data selector 136 may take a median, floor, or ceiling of sub-similarity scores for a candidate acoustic data as the similarity score.
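Combining sub-similarity scores might be sketched as below. The function and dictionary names are hypothetical, and reading "floor" and "ceiling" as the minimum and maximum of the sub-scores is an interpretation rather than something the source states.

```python
import statistics

# Assumption: "floor" and "ceiling" are read as the lowest and highest sub-score.
AGGREGATORS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "floor": min,
    "ceiling": max,
}

def combine_sub_scores(sub_scores, method="mean"):
    """Reduce one candidate's per-enrollment sub-scores to a single score."""
    return AGGREGATORS[method](sub_scores)

sub_scores = [0.9, 0.6, 0.3]  # one sub-score per enrollment utterance
for method in AGGREGATORS:
    print(method, combine_sub_scores(sub_scores, method))
```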
- the hotword detection model generator 140 may receive the training set 138 from the candidate acoustic data selector 136 and generate a personalized hotword detection model 152 .
- the hotword detection model generator 140 may receive a training set including nine thousand nine hundred and ninety-seven selected candidate acoustic data and three enrollment acoustic data, and generate a personalized hotword detection model 152 based on the training set.
- the hotword detection model generator 140 may generate the personalized hotword detection model 152 based on training a neural network to detect the acoustic data in the training set 138 as representing utterances of the hotword. For example, the hotword detection model generator 140 may generate the personalized hotword detection model 152 that detects the hotword “OK COMPUTER” based on the acoustic data in the training set 138 .
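As a toy illustration of the training step, the sketch below fits a single sigmoid unit, the smallest possible stand-in for a neural network, by gradient descent. Everything here is an assumption made for illustration: the two-value feature vectors, the use of explicit negative (non-hotword) examples, the hyperparameters, and the function names. A practical detector would use a much larger network trained on real acoustic features.

```python
import math

def train_detector(positives, negatives, epochs=200, learning_rate=0.5):
    """Train one sigmoid unit with per-sample gradient descent on log loss."""
    weights = [0.0] * len(positives[0])
    bias = 0.0
    samples = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    for _ in range(epochs):
        for features, label in samples:
            z = bias + sum(w * f for w, f in zip(weights, features))
            prediction = 1.0 / (1.0 + math.exp(-z))
            error = prediction - label
            weights = [w - learning_rate * error * f for w, f in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def detect(model, features, threshold=0.5):
    """Return True when the unit's output reaches the detection threshold."""
    weights, bias = model
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z)) >= threshold

# Hypothetical feature vectors: the training set (hotword) and other sounds.
training_set = [[1.0, 0.9], [0.9, 1.0], [0.8, 0.8]]
other_sounds = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0]]

model = train_detector(training_set, other_sounds)
print(detect(model, [0.95, 0.85]), detect(model, [0.05, 0.1]))
```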
- the functionality of the client device 120 and of the server 130, which includes the candidate acoustic data scorer 134, the candidate acoustic data selector 136, and the hotword detection model generator 140, may be combined, further separated, distributed, or interchanged.
- the system 100 may be implemented in a single device or distributed across multiple devices.
- FIG. 2 is a block diagram of an example server 130 for generating an individualized hotword detection model.
- the server 130 may be the server described in FIG. 1 .
- the server 130 may include the candidate acoustic database 132 , the candidate acoustic data scorer 134 , the candidate acoustic data selector 136 , and the hotword detection model generator 140 .
- the candidate acoustic database 132 may include multiple candidate acoustic data of various users saying the hotword “OK COMPUTER.”
- the candidate acoustic database 132 may include a candidate acoustic data of “User A” saying “OK COMPUTER” as “OK COM-PU-TER,” a candidate acoustic data of “User B” saying “OK COMPUTER” as “OOK COM-PU-TER”, a candidate acoustic data of “User C” saying “OK COMPUTER” as “OK COP-TER,” a candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,” a candidate acoustic data of “User E” saying “OK COMPUTER” as “OK COM-MUT-ER,” a candidate acoustic data of “User F” saying “OK COMPUTER” as “OK COM-PUT-EW,” and other candidate acous
- the candidate acoustic data scorer 134 may receive enrollment acoustic data 202 of a user and obtain a set of candidate acoustic data from the candidate acoustic database 132 .
- the candidate acoustic data scorer 134 may receive enrollment acoustic data 202 of the user saying “OK COMPUTER” as “OK COM-UT-ER,” and in response, obtain a set of candidate acoustic data from the candidate acoustic database 132 including the candidate acoustic data of “User A” saying “OK COMPUTER” as “OK COM-PU-TER,” the candidate acoustic data of “User B” saying “OK COMPUTER” as “OOK COM-PU-TER,” the candidate acoustic data of “User C” saying “OK COMPUTER” as “OK COP-TER,” the candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,” the
- the candidate acoustic data scorer 134 may generate similarity scores for each of the set of candidate acoustic data. For example, for an enrollment acoustic data of the user 110 saying “OK COMPUTER” as “OK COM-UT-ER,” the candidate acoustic data scorer 134 may generate a similarity score of 0.6 reflecting a moderate similarity for candidate acoustic data of “User A” saying “OK COMPUTER” as “OK COM-PU-TER,” a similarity score of 0.5 reflecting a moderate similarity for candidate acoustic data of “User B” saying “OK COMPUTER” as “OOK COM-PU-TER”, a similarity score of 0.3 reflecting a low similarity for candidate acoustic data of “User C” saying “OK COMPUTER” as “OK COP-TER,” a similarity score of 0.9 reflecting a high similarity for candidate acoustic data of “User D” saying “OK COMPUTER” as “OK
- the candidate acoustic data selector 136 may receive the scored candidate acoustic data 204 from the candidate acoustic data scorer 134 and generate the training set 138 of acoustic data. For example, the candidate acoustic data selector 136 may receive a similarity score of 0.6 reflecting a moderate similarity for candidate acoustic data of “User A” saying “OK COMPUTER” as “OK COM-PU-TER,” a similarity score of 0.5 reflecting a moderate similarity for candidate acoustic data of “User B” saying “OK COMPUTER” as “OOK COM-PU-TER”, a similarity score of 0.3 reflecting a low similarity for candidate acoustic data of “User C” saying “OK COMPUTER” as “OK COP-TER,” a similarity score of 0.9 reflecting a high similarity for candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,”
- the candidate acoustic data selector 136 may generate the training set by selecting a subset of the set of candidate acoustic data based on the similarity scores. For example, the candidate acoustic data selector 136 may determine that the hotword detection model generator should receive a training set of three acoustic data, determine there is one enrollment acoustic data, determine to select two candidate acoustic data to obtain three total acoustic data, and select the candidate acoustic data with the similarity scores of 0.9 and 0.8 that reflect the greatest similarity with the enrollment acoustic data out of all of the candidate acoustic data.
- the hotword detection model generator 140 may receive the training set 138 and generate a personalized hotword detection model 152 .
- the hotword detection model generator 140 may receive a training set including the candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,” the candidate acoustic data of “User E” saying “OK COMPUTER” as “OK COM-MUT-ER,” and the enrollment acoustic data of the user saying “OK COMPUTER” as “OK COM-UT-ER,” and train a neural network to detect those acoustic data as representing the hotword “OK COMPUTER” being spoken by the user 110 .
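The selection step in the FIG. 2 example can be sketched as a simple top-k pick. This is a minimal illustration, not the patent's implementation: the function name `select_training_set` and the dictionary layout are hypothetical, User E's score of 0.8 is inferred from the training set described above, and User F is left out because its score is not stated in the text shown here.

```python
# Similarity scores from the FIG. 2 example. User E's 0.8 is inferred from the
# training set described in the text; User F is omitted (score not stated here).
scored_candidates = {
    "User A": 0.6,
    "User B": 0.5,
    "User C": 0.3,
    "User D": 0.9,
    "User E": 0.8,
}


def select_training_set(scores, enrollment_count, target_size):
    """Pick the highest-scoring candidates needed to reach target_size.

    Hypothetical helper: the enrollment data fills enrollment_count slots and
    the remaining slots go to the most similar candidates.
    """
    needed = max(target_size - enrollment_count, 0)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:needed]


selected = select_training_set(scored_candidates, enrollment_count=1, target_size=3)
print(selected)  # ['User D', 'User E']
```

With one enrollment sample and a target of three, two candidates are chosen, which matches the "User D" and "User E" training set in the example.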
- FIG. 3 is a flowchart of an example process for generating an individualized hotword detection model. The following describes the process 300 as being performed by components of the system 100 that are described with reference to FIG. 1. However, the process 300 may be performed by other systems or system configurations.
- the process 300 may include obtaining enrollment acoustic data representing an enrollment utterance spoken by a user ( 310 ).
- the candidate acoustic data scorer 134 may obtain enrollment acoustic data from the client device 120 representing the user saying a hotword, “MY BUTLER,” after being prompted by the client device 120 to provide a sample enrollment utterance for training the client device 120 to detect when the user says the hotword, “MY BUTLER.”
- the process 300 may include obtaining a set of candidate acoustic data representing utterances spoken by other users ( 320 ).
- the candidate acoustic data scorer 134 may determine that the enrollment acoustic data is for the hotword, “MY BUTLER,” spoken by a male between the ages of twenty to thirty, and in response, obtain, from the candidate acoustic database 132 , candidate acoustic data representing other male users between the ages of twenty to thirty saying the hotword, “MY BUTLER.”
- the process may include determining, for each candidate acoustic data of the set of candidate acoustic data, a similarity score that represents a similarity between the enrollment acoustic data and the candidate acoustic data ( 330 ). For example, for each candidate acoustic data obtained from the candidate acoustic database 132, the candidate acoustic data scorer 134 may determine a similarity score representing a similarity between the enrollment acoustic data representing the user saying the hotword, “MY BUTLER,” and the candidate acoustic data representing another user saying the hotword, “MY BUTLER.”
- the process may include selecting a subset of candidate acoustic data from the set of candidate acoustic data based at least on the similarity scores ( 340 ).
- the candidate acoustic data selector 136 may select a predetermined number, e.g., one thousand, five thousand, twenty thousand, or some other number, of candidate acoustic data with the similarity scores that reflect the most similarity with the enrollment acoustic data.
- the candidate acoustic data selector 136 may select candidate acoustic data with similarity scores that satisfy a threshold similarity score, e.g., 0.7, 0.8, 0.9, or some other amount.
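As an alternative to picking a predetermined number, the threshold-based selection just described can be sketched as a filter. The sketch assumes that "satisfy" means "greater than or equal to"; the function name `select_by_threshold` and the example scores are hypothetical.

```python
def select_by_threshold(scores, threshold):
    """Keep every candidate whose similarity score meets the threshold.

    Assumption: a score "satisfies" the threshold when it is >= threshold.
    """
    return [name for name, score in scores.items() if score >= threshold]


# Hypothetical scores, for illustration only.
example_scores = {"cand_1": 0.95, "cand_2": 0.72, "cand_3": 0.40}
kept = select_by_threshold(example_scores, threshold=0.7)
print(kept)  # ['cand_1', 'cand_2']
```

Unlike the fixed-count approach, the size of the resulting subset depends on how many candidates happen to be similar to the enrollment data.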
- the process may include generating a detection model based on the subset of candidate acoustic data ( 350 ).
- the hotword detection model generator 140 may generate the personalized hotword detection model based on training a neural network to detect when the user speaks the hotword, “MY BUTLER,” using the selected candidate acoustic data of other users saying “MY BUTLER.”
- the process may include providing the detection model for use in detecting an utterance spoken by the user ( 360 ).
- the server 130 may provide the personalized hotword detection model 152 generated by the hotword detection model generator 140 to the client device 120 .
- the client device 120 may then use the personalized hotword detection model 152 for detecting when the user 110 says the hotword, “MY BUTLER.”
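The steps of process 300 can be tied together in a short sketch. Everything here is an assumption made for illustration: acoustic data is represented as small feature vectors, similarity is derived from Euclidean distance, and a centroid-plus-radius check stands in for the neural network that the description mentions. None of the names come from the patent.

```python
import math


def similarity(a, b):
    """Map Euclidean distance to a score in (0, 1]; an assumed scoring rule."""
    return 1.0 / (1.0 + math.dist(a, b))


def build_detection_model(enrollment, candidates, subset_size):
    # (330) Score each candidate against the enrollment data.
    scored = [(similarity(enrollment, c), c) for c in candidates]
    # (340) Keep the candidates most similar to the enrollment data.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    subset = [c for _, c in scored[:subset_size]]
    # (350) Stand-in "model": centroid of the training set plus a radius.
    # A real system would train a neural network here instead.
    training = [enrollment] + subset
    centroid = [sum(col) / len(training) for col in zip(*training)]
    radius = max(math.dist(centroid, v) for v in training)
    return centroid, radius


def detects_hotword(model, features):
    """Return True if the features fall within the model's radius."""
    centroid, radius = model
    return math.dist(centroid, features) <= radius


enrollment = [1.0, 2.0]                             # (310) hypothetical features
candidates = [[1.1, 2.1], [0.9, 1.8], [5.0, 5.0]]   # (320) other users' data
model = build_detection_model(enrollment, candidates, subset_size=2)
# (360) The model would then be provided to the client device for detection.
print(detects_hotword(model, enrollment))   # True
print(detects_hotword(model, [5.0, 5.0]))   # False
```

The dissimilar candidate `[5.0, 5.0]` is excluded at step 340, so it does not widen the region that the stand-in model accepts.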
- FIG. 4 shows an example of a computing device 400 and a mobile computing device 450 that can be used to implement the techniques described here.
- the computing device 400 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
- the mobile computing device 450 is intended to represent various forms of mobile computing devices, such as personal digital assistants, cellular telephones, smart-phones, and other similar computing devices.
- the components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to be limiting.
- the computing device 400 includes a processor 402 , a memory 404 , a storage device 406 , a high-speed interface 408 connecting to the memory 404 and multiple high-speed expansion ports 410 , and a low-speed interface 412 connecting to a low-speed expansion port 414 and the storage device 406 .
- Each of the processor 402 , the memory 404 , the storage device 406 , the high-speed interface 408 , the high-speed expansion ports 410 , and the low-speed interface 412 are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate.
- the processor 402 can process instructions for execution within the computing device 400 , including instructions stored in the memory 404 or on the storage device 406 to display graphical information for a graphical user interface (GUI) on an external input/output device, such as a display 416 coupled to the high-speed interface 408 .
- multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory.
- multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
- the memory 404 stores information within the computing device 400 .
- In some implementations, the memory 404 is a volatile memory unit or units; in other implementations, the memory 404 is a non-volatile memory unit or units.
- the memory 404 may also be another form of computer-readable medium, such as a magnetic or optical disk.
- the storage device 406 is capable of providing mass storage for the computing device 400 .
- the storage device 406 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
- Instructions can be stored in an information carrier.
- the instructions when executed by one or more processing devices (for example, processor 402 ), perform one or more methods, such as those described above.
- the instructions can also be stored by one or more storage devices such as computer- or machine-readable mediums (for example, the memory 404 , the storage device 406 , or memory on the processor 402 ).
- the high-speed interface 408 manages bandwidth-intensive operations for the computing device 400, while the low-speed interface 412 manages less bandwidth-intensive operations.
- the high-speed interface 408 is coupled to the memory 404 , the display 416 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 410 , which may accept various expansion cards (not shown).
- the low-speed interface 412 is coupled to the storage device 406 and the low-speed expansion port 414 .
- the low-speed expansion port 414, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
- the computing device 400 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 420 , or multiple times in a group of such servers. In addition, it may be implemented in a personal computer such as a laptop computer 422 . It may also be implemented as part of a rack server system 424 . Alternatively, components from the computing device 400 may be combined with other components in a mobile computing device (not shown), such as a mobile computing device 450 . Each of such devices may contain one or more of the computing device 400 and the mobile computing device 450 , and an entire system may be made up of multiple computing devices communicating with each other.
- the mobile computing device 450 includes a processor 452 , a memory 464 , an input/output device such as a display 454 , a communication interface 466 , and a transceiver 468 , among other components.
- the mobile computing device 450 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage.
- Each of the processor 452 , the memory 464 , the display 454 , the communication interface 466 , and the transceiver 468 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
- the processor 452 can execute instructions within the mobile computing device 450 , including instructions stored in the memory 464 .
- the processor 452 may be implemented as a chipset of chips that include separate and multiple analog and digital processors.
- the processor 452 may provide, for example, for coordination of the other components of the mobile computing device 450 , such as control of user interfaces, applications run by the mobile computing device 450 , and wireless communication by the mobile computing device 450 .
- the processor 452 may communicate with a user through a control interface 458 and a display interface 456 coupled to the display 454 .
- the display 454 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
- the display interface 456 may comprise appropriate circuitry for driving the display 454 to present graphical and other information to a user.
- the control interface 458 may receive commands from a user and convert them for submission to the processor 452 .
- an external interface 462 may provide communication with the processor 452 , so as to enable near area communication of the mobile computing device 450 with other devices.
- the external interface 462 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
- the memory 464 stores information within the mobile computing device 450 .
- the memory 464 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
- An expansion memory 474 may also be provided and connected to the mobile computing device 450 through an expansion interface 472 , which may include, for example, a SIMM (Single In Line Memory Module) card interface.
- the expansion memory 474 may provide extra storage space for the mobile computing device 450 , or may also store applications or other information for the mobile computing device 450 .
- the expansion memory 474 may include instructions to carry out or supplement the processes described above, and may include secure information also.
- the expansion memory 474 may be provided as a security module for the mobile computing device 450 , and may be programmed with instructions that permit secure use of the mobile computing device 450 .
- secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
- the memory may include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below.
- instructions are stored in an information carrier such that the instructions, when executed by one or more processing devices (for example, processor 452 ), perform one or more methods, such as those described above.
- the instructions can also be stored by one or more storage devices, such as one or more computer- or machine-readable mediums (for example, the memory 464 , the expansion memory 474 , or memory on the processor 452 ).
- the instructions can be received in a propagated signal, for example, over the transceiver 468 or the external interface 462 .
- the mobile computing device 450 may communicate wirelessly through the communication interface 466 , which may include digital signal processing circuitry where necessary.
- the communication interface 466 may provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others.
- a GPS (Global Positioning System) receiver module 470 may provide additional navigation- and location-related wireless data to the mobile computing device 450 , which may be used as appropriate by applications running on the mobile computing device 450 .
- the mobile computing device 450 may also communicate audibly using an audio codec 460 , which may receive spoken information from a user and convert it to usable digital information.
- the audio codec 460 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 450 .
- Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on the mobile computing device 450 .
- the mobile computing device 450 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 480 . It may also be implemented as part of a smart-phone 482 , personal digital assistant, or other similar mobile computing device.
- Embodiments of the subject matter, the functional operations and the processes described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible nonvolatile program carrier for execution by, or to control the operation of, data processing apparatus.
- the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- data processing apparatus encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- the apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
- the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- Computers suitable for the execution of a computer program can be based on, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit.
- a central processing unit will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few.
- Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a
- Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Game Theory and Decision Science (AREA)
- Business, Economics & Management (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Library & Information Science (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating an individualized hotword detection model. In one aspect, a method includes actions of obtaining enrollment acoustic data representing an enrollment utterance spoken by a user, obtaining a set of candidate acoustic data representing utterances spoken by other users, determining, for each candidate acoustic data of the set of candidate acoustic data, a similarity score that represents a similarity between the enrollment acoustic data and the candidate acoustic data, selecting a subset of candidate acoustic data from the set of candidate acoustic data based at least on the similarity scores, generating a detection model based on the subset of candidate acoustic data, and providing the detection model for use in detecting an utterance spoken by the user.
Description
- This application is a continuation of U.S. application Ser. No. 15/197,268, filed on Jun. 29, 2016, which is a continuation of U.S. application Ser. No. 14/805,753, filed on Jul. 22, 2015. The disclosures of the prior applications are considered part of and are incorporated by reference in the disclosure of this application.
- This disclosure generally relates to controlling computers using voice commands.
- A computer may analyze a user's utterance and may perform an action in response. For example, a user may say “DRIVE HOME” and a computer may respond with directions for the user to drive home from their current location.
- In general, an aspect of the subject matter described in this specification may involve a process for generating an individualized hotword detection model. As used by this specification, a “hotword” may refer to a term that wakes a device up from a sleep state or hibernation state, or a term that triggers semantic interpretation on the term or on one or more terms that follow the term, e.g., on voice commands that follow the hotword.
- For example, in the utterance “OK COMPUTER, DRIVE HOME,” the term “OK COMPUTER,” may be a hotword that triggers semantic interpretation on the following term “DRIVE HOME,” and the term “DRIVE HOME” may correspond to a voice command for providing directions to the user's home. When the system receives sound corresponding to the utterance “OK COMPUTER, DRIVE HOME,” the system may determine that the utterance begins with the hotword, “OK COMPUTER,” in response, transcribe the sound, perform semantic interpretation on the transcription of the voice command “DRIVE HOME,” and output directions for the user to drive home.
- Hotwords may be useful for “always on” systems that may potentially pick up sounds that are not directed to the system. For example, the use of hotwords may help the system discern when a given utterance is directed at the system, as opposed to an utterance that is directed to another individual present in the environment. In doing so, the system may avoid computationally expensive processing, e.g., semantic interpretation, on sounds or utterances that do not include a hotword.
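The gating role of a hotword can be sketched as follows. This is a simplified illustration that works on an already transcribed string, whereas the system described here detects the hotword from sound before transcribing; the function name `handle_utterance` is hypothetical.

```python
HOTWORD = "OK COMPUTER"


def handle_utterance(transcript):
    """Return the voice command if the utterance begins with the hotword.

    Returns None when the hotword is absent, so that the costly semantic
    interpretation step can be skipped for speech not directed at the system.
    """
    normalized = transcript.strip().upper()
    if not normalized.startswith(HOTWORD):
        return None
    # Drop the hotword and any separating comma or space.
    return normalized[len(HOTWORD):].lstrip(" ,")


print(handle_utterance("OK COMPUTER, DRIVE HOME"))  # DRIVE HOME
print(handle_utterance("Let's drive home"))         # None
```

Only the first utterance reaches the interpretation stage; the second is ignored because it does not begin with the hotword.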
- A system may detect that an utterance includes a hotword based on a hotword detection model. However, different users may pronounce the same hotword in different ways. Accordingly, the system may not detect when some users speak the hotword. The system may improve detection of hotwords by generating individualized hotword detection models. However, generating a hotword detection model may use thousands of utterances, and a user may not desire to provide thousands of enrollment utterances. Accordingly, after receiving one or more enrollment utterances by a user, the system may identify other utterances of the hotword by other users, select the utterances that are similar to the enrollment utterances by the user, and generate the individualized hotword detection model using the selected utterances and the enrollment utterances.
- In some aspects, the subject matter described in this specification may be embodied in methods that may include the actions of obtaining enrollment acoustic data representing an enrollment utterance spoken by a user, obtaining a set of candidate acoustic data representing utterances spoken by other users, determining, for each candidate acoustic data of the set of candidate acoustic data, a similarity score that represents a similarity between the enrollment acoustic data and the candidate acoustic data, selecting a subset of candidate acoustic data from the set of candidate acoustic data based at least on the similarity scores, generating a detection model based on the subset of candidate acoustic data, and providing the detection model for use in detecting an utterance spoken by the user.
- Other versions include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
- These and other versions may each optionally include one or more of the following features. For instance, in some implementations obtaining enrollment acoustic data representing an enrollment utterance spoken by a user includes obtaining enrollment acoustic data for multiple utterances of a predetermined phrase spoken by the user.
- In certain aspects, obtaining a set of candidate acoustic data representing utterances spoken by other users includes determining the enrollment utterance is of a predetermined phrase and identifying candidate acoustic data representing utterances of the predetermined phrase spoken by other users.
- In some aspects, determining, for each candidate acoustic data of the set of candidate acoustic data, the similarity score includes determining a distance between the enrollment acoustic data and the candidate acoustic data and determining the similarity score based on the distance.
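One way to turn a distance into a similarity score is sketched below. The text does not specify a distance measure, so cosine distance over non-zero feature vectors is an assumption, as are the function names and the linear mapping to the range 0 to 1.

```python
import math


def cosine_distance(a, b):
    """Cosine distance in [0, 2]; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similarity_score(enrollment, candidate):
    """Map cosine distance in [0, 2] linearly to a similarity in [0, 1]."""
    return 1.0 - cosine_distance(enrollment, candidate) / 2.0


# Hypothetical feature vectors, for illustration only.
same = similarity_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
opposite = similarity_score([1.0, 0.0], [-1.0, 0.0])
orthogonal = similarity_score([1.0, 0.0], [0.0, 1.0])
```

Under this mapping, identical directions score close to 1, orthogonal vectors score 0.5, and opposite directions score close to 0, so a smaller distance always yields a higher similarity score.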
- In some implementations, determining, for each candidate acoustic data of the set of candidate acoustic data, the similarity score includes determining the similarity scores based on demographic information of the other user that spoke the utterance represented by the candidate acoustic data and demographic information of the user that spoke the enrollment utterance.
- In certain aspects, selecting a subset of candidate acoustic data from the set of candidate acoustic data based at least on similarity scores of the candidate acoustic data that represent a similarity between the enrollment acoustic data and the candidate acoustic data is based on selecting a predetermined number of candidate acoustic data.
- In some aspects, generating a detection model based on the subset of candidate acoustic data includes training a neural network using the subset of candidate acoustic data. In some implementations, additional actions include detecting an utterance of a predetermined phrase using the detection model.
- The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
-
FIGS. 1 and 2 are illustrations of block diagrams of example systems for generating an individualized hotword detection model. -
FIG. 3 is a flowchart of an example process for generating an individualized hotword detection model. -
FIG. 4 is a diagram of exemplary computing devices. - Like reference symbols in the various drawings indicate like elements.
-
FIG. 1 is a block diagram of an example system 100 for generating an individualized hotword detection model. Briefly, and as described in further detail below, the system 100 may include a client device 120 and a server 130 that includes a candidate acoustic data scorer 134, a candidate acoustic data selector 136, and a hotword detection model generator 140. - The
client device 120 may be a smart phone, a laptop computer, a tablet computer, a desktop computer, or some other computing device that is configured to detect when a user 110 says a hotword. For example, the client device 120 may be configured to detect when the user 110 says “OK COMPUTER.” - The
client device 120 may detect when the user 110 speaks a hotword using a hotword detection model. For example, the client device 120 may detect that a user is speaking “OK COMPUTER” using a hotword detection model that has been trained to detect sounds corresponding to when the hotword, “OK COMPUTER,” is spoken. - However, different users may pronounce the same hotword in different ways. For example, the
user 110 may pronounce “OK COMPUTER” as “OK COM-UT-ER,” and a hotword detection model may not detect “OK COM-UT-ER” as “OK COMPUTER.” Accordingly, the client device 120 may improve detection of hotwords spoken by the user 110 by using a personalized hotword detection model 152 that is trained to detect when the user 110 says the hotword. For example, the personalized hotword detection model 152 may be trained to detect “OK COM-UT-ER” as the user 110's pronunciation of the hotword “OK COMPUTER.” - To obtain the personalized
hotword detection model 152, the client device 120 may prompt the user to provide enrollment utterances. For example, to obtain a personalized hotword detection model for detecting the hotword “OK COMPUTER,” the client device 120 may provide the prompt “NOW PERSONALIZING HOTWORD DETECTION, SAY ‘OK COMPUTER’ THREE TIMES” to the user 110. The client device 120 may include an acoustic data generator that captures sound as acoustic data. For example, the client device 120 may include a microphone that captures the user 110 speaking “OK COMPUTER” as “OK COM-UT-ER” as signals, and encodes the signals as enrollment acoustic data 122 represented by mel-frequency cepstral coefficients. - The
client device 120 may provide the enrollment acoustic data 122 to a server 130 and in response receive the personalized hotword detection model 152. For example, the client device 120 may provide enrollment acoustic data 122 representing the user 110 speaking “OK COMPUTER” as “OK COM-UT-ER” to the server 130, and in response, receive the personalized hotword detection model 152 trained based at least on the enrollment acoustic data. - The
client device 120 may then detect when the user speaks the hotword using the personalized hotword detection model 152. For example, using the personalized hotword detection model 152 trained based on the enrollment acoustic data 122 representing the user 110 speaking “OK COMPUTER” as “OK COM-UT-ER,” the client device 120 may detect that the user 110 is saying the hotword “OK COMPUTER” when the user says “OK COM-UT-ER.” - The
server 130 may be configured to generate a personalized hotword detection model based on enrollment acoustic data. For example, the server 130 may receive the enrollment acoustic data 122 representing the user 110 speaking “OK COMPUTER” as “OK COM-UT-ER” and train the personalized hotword detection model 152 based at least on the enrollment acoustic data. - However, generating a hotword detection model may use thousands of utterances and a user may not want to personally provide thousands of enrollment utterances. Accordingly, after receiving one or more enrollment utterances by a user, the
server 130 may identify other utterances of the hotword by other users, select the utterances that are similar to the enrollment utterances by the user, and generate the personalized hotword detection model 152 using the selected utterances and the enrollment utterances. - In more detail, the candidate
acoustic database 132 of the server 130 may store acoustic data representing utterances of various users. For example, the candidate acoustic database 132 of the server 130 may store acoustic data representing hundreds of thousands of utterances of different users. The candidate acoustic database 132 may store each acoustic data with data that indicates the hotword that was uttered. For example, the candidate acoustic database 132 may store fifty thousand sets of acoustic data labeled as being an utterance of the hotword “OK COMPUTER” and fifty thousand sets of acoustic data labeled as being an utterance of a different hotword “MY BUTLER.” In some implementations, the candidate acoustic database 132 may associate the acoustic data with demographic data that describes a user. For example, the candidate acoustic database 132 may associate the acoustic data with a location that the user was in when the hotword was spoken by the user. In another example, the candidate acoustic database 132 may associate the acoustic data with a gender of the user, an age range of the user, or some other information describing the user. - The candidate
acoustic data scorer 134 of the server 130 may be configured to obtain the enrollment acoustic data 122 and the candidate acoustic data from the candidate acoustic database 132 and generate a similarity score that represents a similarity between the enrollment acoustic data 122 and the candidate acoustic data. For example, the candidate acoustic data scorer 134 may receive enrollment acoustic data 122 of the user saying “OK COMPUTER” and candidate acoustic data representing another user saying “OK COMPUTER,” determine a 90% similarity, and associate a score of 0.9 with the candidate acoustic data. In the example, the candidate acoustic data scorer 134 may then obtain a second set of candidate acoustic data representing yet another user saying “OK COMPUTER,” determine a 30% similarity with the enrollment acoustic data 122, and associate a score of 0.3 with the second set of candidate acoustic data. - The similarity score of a candidate acoustic data representing a particular utterance may reflect an acoustic similarity between the particular utterance and an enrollment utterance. For example, the similarity score may range from 0 to 1 where higher similarity scores reflect greater acoustic similarity and lower scores reflect lower acoustic similarity. In other examples other types of scores and ranges may be used, e.g., 1-5, A-F, or 0%-100%.
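A distance-based similarity score in the range 0 to 1 could be sketched as follows. This is only an illustration under stated assumptions: the function names, the frame-by-frame Euclidean distance, the truncation to the shorter utterance, and the `1 / (1 + d)` mapping are all hypothetical choices and are not taken from the specification, which only says that the score is based on a distance between the enrollment and candidate acoustic data.

```python
import math
from typing import Sequence

# One frame of mel-frequency cepstral coefficients (assumed representation).
Frame = Sequence[float]


def aggregate_distance(enrollment: Sequence[Frame], candidate: Sequence[Frame]) -> float:
    """Mean Euclidean distance between corresponding frames.

    Assumption: frames are compared index by index and the longer utterance
    is truncated; a real system would likely time-align the utterances first.
    """
    n = min(len(enrollment), len(candidate))
    if n == 0:
        raise ValueError("both utterances need at least one frame")
    total = 0.0
    # zip() stops at the shorter sequence, so exactly n frame pairs are used.
    for enr_frame, cand_frame in zip(enrollment, candidate):
        total += math.dist(enr_frame, cand_frame)
    return total / n


def similarity_score(enrollment: Sequence[Frame], candidate: Sequence[Frame],
                     scale: float = 1.0) -> float:
    """Map a distance to (0, 1]: distance 0 gives 1.0, larger distances give lower scores."""
    return 1.0 / (1.0 + aggregate_distance(enrollment, candidate) / scale)
```

The hypothetical `scale` parameter only controls how quickly the score falls off with distance; any monotonically decreasing mapping would preserve the ranking of candidates.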
- The candidate
acoustic data scorer 134 may generate the score based on a distance between the enrollment acoustic data and the candidate acoustic data. For example, the candidate acoustic data scorer 134 may aggregate a difference between mel-frequency cepstral coefficients of the enrollment acoustic data and the candidate acoustic data across multiple frames, and determine a similarity score where greater aggregate distances result in scores that reflect less similarity and lower aggregate distances result in scores that reflect more similarity. - In some implementations, the candidate
acoustic data scorer 134 may determine the score based on demographic information of the other user. For example, instead of selecting only candidate acoustic data representing utterances of users with the same gender, the candidate acoustic data scorer 134 may obtain candidate acoustic data representing utterances of users of different genders, determine whether the gender of a user speaking the utterance represented by the candidate acoustic data matches the gender of the user 110, and in response to determining a match, assign a higher similarity score to candidate acoustic data representing utterances of users of the same gender as the user 110. - In some implementations, the candidate
acoustic data scorer 134 may select candidate acoustic data from among a larger set of candidate acoustic data stored in the candidate acoustic database 132. For example, the candidate acoustic data scorer 134 may choose to receive only the acoustic data from the candidate acoustic database 132 in which the hotword “OK COMPUTER” is spoken. The candidate acoustic data scorer 134 may obtain, with the enrollment acoustic data, one or more of an indication of the hotword spoken or an indication of the type of user saying the hotword, and query the candidate acoustic database 132 for acoustic data of users saying the same hotword or of users of a similar type to the user saying the hotword. For example, the candidate acoustic data scorer 134 may obtain an indication that the hotword “OK COMPUTER” was spoken by a female user, and in response, query the candidate acoustic database 132 for acoustic data representing the hotword “OK COMPUTER” being spoken by a female user. - The candidate
acoustic data selector 136 may obtain the scored candidate acoustic data from the candidate acoustic data scorer 134 and the enrollment acoustic data 122, and generate a training set 138 of acoustic data for training the personalized hotword detection model 152. For example, the candidate acoustic data selector 136 may obtain enrollment acoustic data representing the user 110 speaking “OK COMPUTER” and obtain fifty thousand candidate acoustic data representing different other users saying “OK COMPUTER,” where each of the candidate acoustic data is associated with a similarity score reflecting a similarity between the candidate acoustic data and the enrollment acoustic data 122, and generate a training set of acoustic data including ten thousand of the fifty thousand candidate acoustic data and the enrollment acoustic data 122. - The candidate
acoustic data selector 136 may generate the training set 138 based on selecting a subset of the candidate acoustic data based at least on the similarity scores. For example, the candidate acoustic data selector 136 may obtain a set of fifty thousand candidate acoustic data and select a subset of ten thousand candidate acoustic data of the set with similarity scores that reflect higher similarities between the candidate acoustic data and the enrollment acoustic data 122 than the other candidate acoustic data. - The candidate
acoustic data selector 136 may select the subset of candidate acoustic data based on selecting a predetermined number, e.g., one thousand, three thousand, ten thousand, or fifty thousand, of candidate acoustic data. For example, the candidate acoustic data selector 136 may obtain enrollment acoustic data representing a single utterance of “OK COMPUTER,” and select a subset of three thousand candidate acoustic data with similarity scores that reflect a higher similarity between the candidate acoustic data and the enrollment acoustic data. - Additionally or alternatively, the candidate
acoustic data selector 136 may select a subset of candidate acoustic data based on selecting candidate acoustic data that satisfies a threshold similarity score. For example, the candidate acoustic data selector 136 may select candidate acoustic data with similarity scores above a threshold similarity score of 0.8, 0.85, or 0.9 from a score range of 0.0-1.0, and include the selected candidate acoustic data in the training set 138. - In some implementations, the candidate
acoustic data selector 136 may weight the acoustic data in the training set 138. For example, the candidate acoustic data selector 136 may include an enrollment acoustic data multiple times in the training set 138 or associate the enrollment acoustic data in the training set 138 with a greater weight than candidate acoustic data. - In some implementations, the candidate
acoustic data selector 136 may select the subset of candidate acoustic data based on multiple enrollment acoustic data. For example, the candidate acoustic data selector 136 may receive enrollment acoustic data for three utterances of “OK COMPUTER” by the user 110, and for each enrollment acoustic data, select three thousand of the candidate acoustic data with similarity scores that reflect the most similarity to include in the training set 138. Accordingly, some candidate acoustic data may appear in the training set 138 multiple times if the candidate acoustic data is selected for multiple enrollment acoustic data. In some implementations, the candidate acoustic data selector 136 may remove duplicate candidate acoustic data from the training set 138 or prevent duplicate candidate acoustic data from being included in the training set 138. - In some implementations, the candidate
acoustic data selector 136 may determine the number of candidate acoustic data to select for an enrollment acoustic data based on a number of enrollment acoustic data received by the candidate acoustic data selector 136. For example, the candidate acoustic data selector 136 may receive five enrollment acoustic data, determine that the hotword detection model generator should receive a training set of at least ten thousand acoustic data, and in response, for each enrollment acoustic data received, select at least one thousand nine hundred ninety-nine candidate acoustic data to include in the training set with the enrollment acoustic data. In another example, the candidate acoustic data selector 136 may receive ten enrollment acoustic data, determine that the hotword detection model generator should receive a training set of at least ten thousand acoustic data, and in response, for each enrollment acoustic data received, select at least nine hundred ninety-nine candidate acoustic data to include in the training set with the enrollment acoustic data. - In another example, the candidate
acoustic data selector 136 may determine a similarity score for the candidate acoustic data based on determining sub-similarity scores for each of multiple enrollment acoustic data. For example, the candidate acoustic data selector 136 may receive three enrollment acoustic data, and for each candidate acoustic data, determine three sub-similarity scores each corresponding to one of the enrollment acoustic data, and determine the similarity score based on averaging the sub-similarity scores. In yet another example, the candidate acoustic data selector may take a median, floor, or ceiling of sub-similarity scores for a candidate acoustic data as the similarity score. - The hotword
detection model generator 140 may receive the training set 138 from the candidate acoustic data selector 136 and generate a personalized hotword detection model 152. For example, the hotword detection model generator 140 may receive a training set including nine thousand nine hundred and ninety-seven selected candidate acoustic data and three enrollment acoustic data, and generate a personalized hotword detection model 152 based on the training set. - The hotword
detection model generator 140 may generate the personalized hotword detection model 152 based on training a neural network to detect the acoustic data in the training set 138 as representing utterances of the hotword. For example, the hotword detection model generator 140 may generate the personalized hotword detection model 152 that detects the hotword “OK COMPUTER” based on the acoustic data in the training set 138. - Different configurations of the
system 100 may be used where functionality of the client device 120 and the server 130 that includes the candidate acoustic data scorer 134, the candidate acoustic data selector 136, and the hotword detection model generator 140 may be combined, further separated, distributed, or interchanged. The system 100 may be implemented in a single device or distributed across multiple devices. -
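The arithmetic behind the selector's handling of multiple enrollment acoustic data, described above for the candidate acoustic data selector 136, can be sketched as follows. All function names are hypothetical. Reading "floor" and "ceiling" of the sub-similarity scores as the minimum and maximum is an assumption, as is keeping the first occurrence when removing duplicates.

```python
from statistics import mean, median
from typing import Hashable, Iterable, List, Sequence


def per_enrollment_quota(min_training_size: int, num_enrollment: int) -> int:
    """Candidates to select per enrollment so that the training set
    (candidates plus enrollment data) reaches at least min_training_size."""
    if num_enrollment <= 0:
        raise ValueError("need at least one enrollment acoustic data")
    # Integer ceiling of min_training_size / num_enrollment, minus the
    # enrollment acoustic data itself.
    per_group = (min_training_size + num_enrollment - 1) // num_enrollment
    return max(per_group - 1, 0)


def combine_sub_scores(sub_scores: Sequence[float], how: str = "mean") -> float:
    """Reduce one sub-similarity score per enrollment utterance to a single score."""
    reducers = {"mean": mean, "median": median, "floor": min, "ceiling": max}
    return reducers[how](sub_scores)


def merge_without_duplicates(selections: Iterable[Iterable[Hashable]]) -> List[Hashable]:
    """Merge per-enrollment selections, keeping each candidate id only once."""
    seen = set()
    merged = []
    for selection in selections:
        for candidate_id in selection:
            if candidate_id not in seen:
                seen.add(candidate_id)
                merged.append(candidate_id)
    return merged
```

With a minimum training set of ten thousand acoustic data, `per_enrollment_quota(10000, 5)` gives 1999 and `per_enrollment_quota(10000, 10)` gives 999, matching the two examples in the text.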
FIG. 2 is a block diagram of an example server 130 for generating an individualized hotword detection model. The server 130 may be the server described in FIG. 1. As described above, the server 130 may include the candidate acoustic database 132, the candidate acoustic data scorer 134, the candidate acoustic data selector 136, and the hotword detection model generator 140. - The candidate
acoustic database 132 may include multiple candidate acoustic data of various users saying the hotword “OK COMPUTER.” For example, the candidate acoustic database 132 may include a candidate acoustic data of “User A” saying “OK COMPUTER” as “OK COM-PU-TER,” a candidate acoustic data of “User B” saying “OK COMPUTER” as “OOK COM-PU-TER,” a candidate acoustic data of “User C” saying “OK COMPUTER” as “OK COP-TER,” a candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,” a candidate acoustic data of “User E” saying “OK COMPUTER” as “OK COM-MUT-ER,” a candidate acoustic data of “User F” saying “OK COMPUTER” as “OK COM-PUT-EW,” and other candidate acoustic data of other users saying “OK COMPUTER.” - The candidate
acoustic data scorer 134 may receive enrollment acoustic data 202 of a user and obtain a set of candidate acoustic data from the candidate acoustic database 132. For example, the candidate acoustic data scorer 134 may receive enrollment acoustic data 202 of the user saying “OK COMPUTER” as “OK COM-UT-ER,” and in response, obtain a set of candidate acoustic data from the candidate acoustic database 132 including the candidate acoustic data of “User A” saying “OK COMPUTER” as “OK COM-PU-TER,” the candidate acoustic data of “User B” saying “OK COMPUTER” as “OOK COM-PU-TER,” the candidate acoustic data of “User C” saying “OK COMPUTER” as “OK COP-TER,” the candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,” the candidate acoustic data of “User E” saying “OK COMPUTER” as “OK COM-MUT-ER,” the candidate acoustic data of “User F” saying “OK COMPUTER” as “OK COM-PUT-EW,” and the other candidate acoustic data of other users saying “OK COMPUTER.” - The candidate
acoustic data scorer 134 may generate similarity scores for each of the set of candidate acoustic data. For example, for an enrollment acoustic data of the user 110 saying “OK COMPUTER” as “OK COM-UT-ER,” the candidate acoustic data scorer 134 may generate a similarity score of 0.6 reflecting a moderate similarity for candidate acoustic data of “User A” saying “OK COMPUTER” as “OK COM-PU-TER,” a similarity score of 0.5 reflecting a moderate similarity for candidate acoustic data of “User B” saying “OK COMPUTER” as “OOK COM-PU-TER,” a similarity score of 0.3 reflecting a low similarity for candidate acoustic data of “User C” saying “OK COMPUTER” as “OK COP-TER,” a similarity score of 0.9 reflecting a high similarity for candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,” a similarity score of 0.8 reflecting a high similarity for candidate acoustic data of “User E” saying “OK COMPUTER” as “OK COM-MUT-ER,” and a similarity score of 0.5 reflecting a moderate similarity for candidate acoustic data of “User F” saying “OK COMPUTER” as “OK COM-PUT-EW.” - The candidate
acoustic data selector 136 may receive the scored candidate acoustic data 204 from the candidate acoustic data scorer 134 and generate the training set 138 of acoustic data. For example, the candidate acoustic data selector 136 may receive a similarity score of 0.6 reflecting a moderate similarity for candidate acoustic data of “User A” saying “OK COMPUTER” as “OK COM-PU-TER,” a similarity score of 0.5 reflecting a moderate similarity for candidate acoustic data of “User B” saying “OK COMPUTER” as “OOK COM-PU-TER,” a similarity score of 0.3 reflecting a low similarity for candidate acoustic data of “User C” saying “OK COMPUTER” as “OK COP-TER,” a similarity score of 0.9 reflecting a high similarity for candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,” a similarity score of 0.8 reflecting a high similarity for candidate acoustic data of “User E” saying “OK COMPUTER” as “OK COM-MUT-ER,” a similarity score of 0.5 reflecting a moderate similarity for candidate acoustic data of “User F” saying “OK COMPUTER” as “OK COM-PUT-EW,” the corresponding candidate acoustic data, and the enrollment acoustic data, and in response may generate a training set of acoustic data including the candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,” the candidate acoustic data of “User E” saying “OK COMPUTER” as “OK COM-MUT-ER,” and the enrollment acoustic data of the user saying “OK COMPUTER” as “OK COM-UT-ER.” - The candidate
acoustic data selector 136 may generate the training set by selecting a subset of the set of candidate acoustic data based on the similarity scores. For example, the candidate acoustic data selector 136 may determine that the hotword detection model generator should receive a training set of three acoustic data, determine there is one enrollment acoustic data, determine to select two candidate acoustic data to obtain three total acoustic data, and select the candidate acoustic data with the similarity scores of 0.9 and 0.8 that reflect the greatest similarity with the enrollment acoustic data out of all of the candidate acoustic data. - The hotword
detection model generator 140 may receive the training set 138 and generate a personalized hotword detection model 152. For example, the hotword detection model generator 140 may receive a training set including the candidate acoustic data of “User D” saying “OK COMPUTER” as “OK COM-U-TER,” the candidate acoustic data of “User E” saying “OK COMPUTER” as “OK COM-MUT-ER,” and the enrollment acoustic data of the user saying “OK COMPUTER” as “OK COM-UT-ER,” and train a neural network to detect those acoustic data as representing the hotword “OK COMPUTER” being spoken by the user 110. -
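The FIG. 2 walkthrough can be reproduced with a few lines of code. The sketch below uses the similarity scores given above for Users A through F; the function names are hypothetical, and treating the threshold as inclusive (`>=`) is an assumption, since the text says both "above" and "satisfy" a threshold.

```python
from typing import Dict, List

# Similarity scores from the FIG. 2 example.
SCORED_CANDIDATES: Dict[str, float] = {
    "User A": 0.6,
    "User B": 0.5,
    "User C": 0.3,
    "User D": 0.9,
    "User E": 0.8,
    "User F": 0.5,
}


def select_top(scored: Dict[str, float], training_set_size: int,
               num_enrollment: int) -> List[str]:
    """Select the most similar candidates, leaving room for the enrollment data."""
    quota = max(training_set_size - num_enrollment, 0)
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:quota]]


def select_by_threshold(scored: Dict[str, float], threshold: float) -> List[str]:
    """Select every candidate whose score meets the threshold (inclusive by assumption)."""
    return [name for name, score in scored.items() if score >= threshold]


# A training set of three acoustic data with one enrollment acoustic data
# leaves room for two candidates.
selected = select_top(SCORED_CANDIDATES, training_set_size=3, num_enrollment=1)
print(selected)  # ['User D', 'User E']
```

With these scores, a threshold of 0.8 happens to select the same two candidates as the fixed quota of two.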
FIG. 3 is a flowchart of an example process for generating an individualized hotword detection model. The following describes the process 300 as being performed by components of the system 100 that are described with reference to FIG. 1. However, the process 300 may be performed by other systems or system configurations. - The process 300 may include obtaining enrollment acoustic data representing an enrollment utterance spoken by a user (310). For example, the candidate
acoustic data scorer 134 may obtain enrollment acoustic data from the client device 120 representing the user saying a hotword, “MY BUTLER,” after being prompted by the client device 120 to provide a sample enrollment utterance for training the client device 120 to detect when the user says the hotword, “MY BUTLER.” - The process 300 may include obtaining a set of candidate acoustic data representing utterances spoken by other users (320). For example, the candidate
acoustic data scorer 134 may determine that the enrollment acoustic data is for the hotword, “MY BUTLER,” spoken by a male between the ages of twenty and thirty, and in response, obtain, from the candidate acoustic database 132, candidate acoustic data representing other male users between the ages of twenty and thirty saying the hotword, “MY BUTLER.” - The process may include determining, for each candidate acoustic data of the set of candidate acoustic data, a similarity score that represents a similarity between the enrollment acoustic data and the candidate acoustic data (330). For example, for each candidate acoustic data obtained from the candidate
acoustic database 132, the candidate acoustic data scorer 134 may determine a similarity score that represents a similarity between the enrollment acoustic data representing the user saying the hotword, “MY BUTLER,” and the candidate acoustic data representing another user saying the hotword, “MY BUTLER.” - The process may include selecting a subset of candidate acoustic data from the set of candidate acoustic data based at least on the similarity scores (340). For example, the candidate
acoustic data selector 136 may select a predetermined number, e.g., one thousand, five thousand, twenty thousand, or some other number, of candidate acoustic data with the similarity scores that reflect the most similarity with the enrollment acoustic data. In another example, the candidate acoustic data selector 136 may select candidate acoustic data with similarity scores that satisfy a threshold similarity score, e.g., 0.7, 0.8, 0.9, or some other amount. - The process may include generating a detection model based on the subset of candidate acoustic data (350). For example, the hotword
detection model generator 140 may generate the personalized hotword detection model based on training a neural network to detect when the user speaks the hotword, “MY BUTLER,” using the selected candidate acoustic data of other users saying “MY BUTLER.” - The process may include providing the detection model for use in detecting an utterance spoken by the user (360). For example, the
server 130 may provide the personalized hotword detection model 152 generated by the hotword detection model generator 140 to the client device 120. The client device 120 may then use the personalized hotword detection model 152 for detecting when the user 110 says the hotword, “MY BUTLER.” -
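Steps 310 through 360 of the process 300 can be tied together in a short sketch. This is a hedged illustration only: the `AcousticData` type, the function names, and the pluggable `score_fn` and `train_fn` callables are hypothetical, and the neural network training of step 350 is left as a caller-supplied placeholder rather than implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class AcousticData:
    hotword: str                # label of the hotword that was uttered
    features: Sequence[float]   # assumed flat feature vector, e.g. MFCC-derived


def build_training_set(enrollment: AcousticData,
                       database: Sequence[AcousticData],
                       score_fn: Callable[[AcousticData, AcousticData], float],
                       num_to_select: int) -> List[AcousticData]:
    # (320) Obtain candidates labeled with the same hotword as the enrollment data.
    candidates = [c for c in database if c.hotword == enrollment.hotword]
    # (330) Determine a similarity score for each candidate.
    scored = [(score_fn(enrollment, c), c) for c in candidates]
    # (340) Select the predetermined number of most similar candidates.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected = [c for _, c in scored[:num_to_select]]
    # The training set holds the selected candidates plus the enrollment data.
    return selected + [enrollment]


def generate_detection_model(enrollment: AcousticData,
                             database: Sequence[AcousticData],
                             score_fn: Callable[[AcousticData, AcousticData], float],
                             train_fn: Callable[[List[AcousticData]], Any],
                             num_to_select: int) -> Any:
    # (350) Generate the model from the training set; train_fn stands in for
    # the neural network training, which is not sketched here.
    training_set = build_training_set(enrollment, database, score_fn, num_to_select)
    # (360) The returned model would then be provided to the client device.
    return train_fn(training_set)
```

Passing the scorer and trainer as callables keeps the selection logic independent of any particular distance measure or model architecture, which mirrors how the text treats the scorer 134, selector 136, and generator 140 as separable components.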
FIG. 4 shows an example of a computing device 400 and a mobile computing device 450 that can be used to implement the techniques described here. The computing device 400 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The mobile computing device 450 is intended to represent various forms of mobile computing devices, such as personal digital assistants, cellular telephones, smart-phones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to be limiting. - The
computing device 400 includes a processor 402, a memory 404, a storage device 406, a high-speed interface 408 connecting to the memory 404 and multiple high-speed expansion ports 410, and a low-speed interface 412 connecting to a low-speed expansion port 414 and the storage device 406. Each of the processor 402, the memory 404, the storage device 406, the high-speed interface 408, the high-speed expansion ports 410, and the low-speed interface 412, are interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 402 can process instructions for execution within the computing device 400, including instructions stored in the memory 404 or on the storage device 406 to display graphical information for a graphical user interface (GUI) on an external input/output device, such as a display 416 coupled to the high-speed interface 408. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system). - The
memory 404 stores information within the computing device 400. In some implementations, the memory 404 is a volatile memory unit or units. In some implementations, the memory 404 is a non-volatile memory unit or units. The memory 404 may also be another form of computer-readable medium, such as a magnetic or optical disk. - The
storage device 406 is capable of providing mass storage for the computing device 400. In some implementations, the storage device 406 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. Instructions can be stored in an information carrier. The instructions, when executed by one or more processing devices (for example, processor 402), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices such as computer- or machine-readable mediums (for example, the memory 404, the storage device 406, or memory on the processor 402). - The high-
speed interface 408 manages bandwidth-intensive operations for the computing device 400, while the low-speed interface 412 manages lower bandwidth-intensive operations. Such allocation of functions is an example only. In some implementations, the high-speed interface 408 is coupled to the memory 404, the display 416 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 410, which may accept various expansion cards (not shown). In the implementation, the low-speed interface 412 is coupled to the storage device 406 and the low-speed expansion port 414. The low-speed expansion port 414, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter. - The
computing device 400 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 420, or multiple times in a group of such servers. In addition, it may be implemented in a personal computer such as a laptop computer 422. It may also be implemented as part of a rack server system 424. Alternatively, components from the computing device 400 may be combined with other components in a mobile computing device (not shown), such as a mobile computing device 450. Each of such devices may contain one or more of the computing device 400 and the mobile computing device 450, and an entire system may be made up of multiple computing devices communicating with each other. - The
mobile computing device 450 includes a processor 452, a memory 464, an input/output device such as a display 454, a communication interface 466, and a transceiver 468, among other components. The mobile computing device 450 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage. Each of the processor 452, the memory 464, the display 454, the communication interface 466, and the transceiver 468, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate. - The
processor 452 can execute instructions within the mobile computing device 450, including instructions stored in the memory 464. The processor 452 may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor 452 may provide, for example, for coordination of the other components of the mobile computing device 450, such as control of user interfaces, applications run by the mobile computing device 450, and wireless communication by the mobile computing device 450. - The
processor 452 may communicate with a user through a control interface 458 and a display interface 456 coupled to the display 454. The display 454 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 456 may comprise appropriate circuitry for driving the display 454 to present graphical and other information to a user. The control interface 458 may receive commands from a user and convert them for submission to the processor 452. In addition, an external interface 462 may provide communication with the processor 452, so as to enable near area communication of the mobile computing device 450 with other devices. The external interface 462 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used. - The
memory 464 stores information within themobile computing device 450. Thememory 464 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. An expansion memory 474 may also be provided and connected to themobile computing device 450 through an expansion interface 472, which may include, for example, a SIMM (Single In Line Memory Module) card interface. The expansion memory 474 may provide extra storage space for themobile computing device 450, or may also store applications or other information for themobile computing device 450. Specifically, the expansion memory 474 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, the expansion memory 474 may be provided as a security module for themobile computing device 450, and may be programmed with instructions that permit secure use of themobile computing device 450. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner. - The memory may include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below. In some implementations, instructions are stored in an information carrier that the instructions, when executed by one or more processing devices (for example, processor 452), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices, such as one or more computer- or machine-readable mediums (for example, the
memory 464, the expansion memory 474, or memory on the processor 452). In some implementations, the instructions can be received in a propagated signal, for example, over thetransceiver 468 or theexternal interface 462. - The
mobile computing device 450 may communicate wirelessly through thecommunication interface 466, which may include digital signal processing circuitry where necessary. Thecommunication interface 466 may provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others. Such communication may occur, for example, through thetransceiver 468 using a radio-frequency. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, a GPS (Global Positioning System)receiver module 470 may provide additional navigation- and location-related wireless data to themobile computing device 450, which may be used as appropriate by applications running on themobile computing device 450. - The
mobile computing device 450 may also communicate audibly using anaudio codec 460, which may receive spoken information from a user and convert it to usable digital information. Theaudio codec 460 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of themobile computing device 450. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on themobile computing device 450. - The
mobile computing device 450 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as acellular telephone 480. It may also be implemented as part of a smart-phone 482, personal digital assistant, or other similar mobile computing device. - Embodiments of the subject matter, the functional operations and the processes described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible nonvolatile program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
- The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- Computers suitable for the execution of a computer program can be based, by way of example, on general or special purpose microprocessors or both, or on any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few.
- Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
- Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- While this specification contains many specific implementation details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
- Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
- Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous. Other steps may be provided, or steps may be eliminated, from the described processes. Accordingly, other implementations are within the scope of the following claims.
Claims (21)
1. (canceled)
2. A computer-implemented method comprising:
receiving audio data corresponding to a single utterance by a user of a predefined hotword, wherein the predefined hotword is pronounced by the user using a personalized, non-standard pronunciation;
in response to receiving the audio data corresponding to the single utterance by the user of the predefined hotword, downloading audio features corresponding to other users' utterances of the same, predefined hotword in a manner that is indicated as similar to the personalized, non-standard pronunciation;
dynamically generating a hotword detection model for the personalized, non-standard pronunciation of the predefined hotword using (i) the audio data corresponding to the single utterance by the user of the predefined hotword, and (ii) the downloaded audio features corresponding to other users' utterances of the same, predefined hotword; and
using the dynamically generated hotword detection model to detect a likely utterance of the predefined hotword in subsequently received audio data.
3. The computer implemented method of claim 2, comprising:
during an enrollment process, prompting, by a client device, the user to speak the predefined hotword; and
generating enrollment acoustic data using the received audio data from the user, wherein the audio data comprises the predefined hotword pronounced by the user using the personalized, non-standard pronunciation and additional one or more terms spoken by the user that trigger semantic interpretation of the one or more terms that follow the predefined hotword.
4. The computer implemented method of claim 3, comprising:
obtaining a set of candidate acoustic data representing utterances that were previously-spoken by the other users, wherein the other users are of a similar type of user to the user.
5. The computer implemented method of claim 4, wherein obtaining the set of candidate acoustic data representing the utterances that were spoken by the other users comprises:
determining, for each candidate acoustic data of the set of candidate acoustic data, a similarity score that represents an acoustic similarity between the enrollment acoustic data and the candidate acoustic data.
6. The computer implemented method of claim 5, wherein determining the similarity score that represents an acoustic similarity between the enrollment acoustic data and the candidate acoustic data comprises:
determining a plurality of sub-similarity scores between the enrollment acoustic data and the candidate acoustic data; and
determining the similarity score based on an averaging of the plurality of sub-similarity scores.
7. The computer implemented method of claim 2, wherein dynamically generating the hotword detection model for the personalized, non-standard pronunciation of the predefined hotword using (i) the audio data corresponding to the single utterance by the user of the predefined hotword, and (ii) the downloaded audio features corresponding to other users' utterances of the same, predefined hotword comprises:
training the hotword detection model to detect the likely utterance of the predefined hotword by the user in the subsequently received audio data corresponding to the single utterance of the predefined hotword by the user and without requiring the user to speak additional utterances of the predefined hotword.
8. The computer implemented method of claim 2 , wherein the hotword detection model is based at least on the single utterance and not based on another utterance of the predefined hotword.
9. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving audio data corresponding to a single utterance by a user of a predefined hotword, wherein the predefined hotword is pronounced by the user using a personalized, non-standard pronunciation;
in response to receiving the audio data corresponding to the single utterance by the user of the predefined hotword, downloading audio features corresponding to other users' utterances of the same, predefined hotword in a manner that is indicated as similar to the personalized, non-standard pronunciation;
dynamically generating a hotword detection model for the personalized, non-standard pronunciation of the predefined hotword using (i) the audio data corresponding to the single utterance by the user of the predefined hotword, and (ii) the downloaded audio features corresponding to other users' utterances of the same, predefined hotword; and
using the dynamically generated hotword detection model to detect a likely utterance of the hotword in subsequently received audio data.
10. The system of claim 9, the operations further comprise:
during an enrollment process, prompting, by a client device, the user to speak the predefined hotword; and
generating enrollment acoustic data using the received audio data from the user, wherein the audio data comprises the predefined hotword pronounced by the user using the personalized, non-standard pronunciation and additional one or more terms spoken by the user that trigger semantic interpretation of the one or more terms that follow the predefined hotword.
11. The system of claim 10, the operations further comprise:
obtaining a set of candidate acoustic data representing utterances that were previously-spoken by the other users, wherein the other users are of a similar type of user to the user.
12. The system of claim 11, wherein obtaining the set of candidate acoustic data representing the utterances that were spoken by the other users the operations further comprise:
determining, for each candidate acoustic data of the set of candidate acoustic data, a similarity score that represents an acoustic similarity between the enrollment acoustic data and the candidate acoustic data.
13. The system of claim 12, wherein determining the similarity score that represents an acoustic similarity between the enrollment acoustic data and the candidate acoustic data the operations further comprise:
determining a plurality of sub-similarity scores between the enrollment acoustic data and the candidate acoustic data; and
determining the similarity score based on an averaging of the plurality of sub-similarity scores.
14. The system of claim 9, wherein dynamically generating the hotword detection model for the personalized, non-standard pronunciation of the predefined hotword using (i) the audio data corresponding to the single utterance by the user of the predefined hotword, and (ii) the downloaded audio features corresponding to other users' utterances of the same, predefined hotword the operations further comprise:
training the hotword detection model to detect the likely utterance of the predefined hotword by the user in the subsequently received audio data corresponding to the single utterance of the predefined hotword by the user and without requiring the user to speak additional utterances of the predefined hotword.
15. The system of claim 9 , wherein the hotword detection model is based at least on the single utterance and not based on another utterance of the predefined hotword.
16. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving audio data corresponding to a single utterance by a user of a predefined hotword, wherein the predefined hotword is pronounced by the user using a personalized, non-standard pronunciation;
in response to receiving the audio data corresponding to the single utterance by the user of the predefined hotword, downloading audio features corresponding to other users' utterances of the same, predefined hotword in a manner that is indicated as similar to the personalized, non-standard pronunciation;
dynamically generating a hotword detection model for the personalized, non-standard pronunciation of the predefined hotword using (i) the audio data corresponding to the single utterance by the user of the predefined hotword, and (ii) the downloaded audio features corresponding to other users' utterances of the same, predefined hotword; and
using the dynamically generated hotword detection model to detect a likely utterance of the hotword in subsequently received audio data.
17. The computer-readable medium of claim 16, the operations comprising:
during an enrollment process, prompting, by a client device, the user to speak the predefined hotword; and
generating enrollment acoustic data using the received audio data from the user, wherein the audio data comprises the predefined hotword pronounced by the user using the personalized, non-standard pronunciation and additional one or more terms spoken by the user that trigger semantic interpretation of the one or more terms that follow the predefined hotword.
18. The computer-readable medium of claim 17, the operations comprising:
obtaining a set of candidate acoustic data representing utterances that were previously-spoken by the other users, wherein the other users are of a similar type of user to the user.
19. The computer-readable medium of claim 18, wherein obtaining the set of candidate acoustic data representing the utterances that were spoken by the other users the operations comprising:
determining, for each candidate acoustic data of the set of candidate acoustic data, a similarity score that represents an acoustic similarity between the enrollment acoustic data and the candidate acoustic data.
20. The computer-readable medium of claim 19, wherein determining the similarity score that represents an acoustic similarity between the enrollment acoustic data and the candidate acoustic data the operations comprising:
determining a plurality of sub-similarity scores between the enrollment acoustic data and the candidate acoustic data; and
determining the similarity score based on an averaging of the plurality of sub-similarity scores.
21. The computer-readable medium of claim 16, wherein dynamically generating the hotword detection model for the personalized, non-standard pronunciation of the predefined hotword using (i) the audio data corresponding to the single utterance by the user of the predefined hotword, and (ii) the downloaded audio features corresponding to other users' utterances of the same, predefined hotword the operations comprising:
training the hotword detection model to detect the likely utterance of the predefined hotword by the user in the subsequently received audio data corresponding to the single utterance of the predefined hotword by the user and without requiring the user to speak additional utterances of the predefined hotword.
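The similarity scoring recited in claims 5 and 6, and its use to pick other users' utterances for training, can be sketched in Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the claims do not say how a sub-similarity score is computed or how candidates are chosen, so the per-segment cosine similarity, the fixed-length feature segments, the top-k selection, and the function names `cosine_similarity`, `similarity_score`, and `select_training_set` are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Utterance = Sequence[Vector]  # assumed: one feature vector per audio segment


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_score(enrollment: Utterance, candidate: Utterance) -> float:
    """Average the per-segment sub-similarity scores, as in claim 6.

    Assumption: both utterances are already split into the same number
    of aligned segments; the claims do not specify the alignment.
    """
    if not enrollment or len(enrollment) != len(candidate):
        raise ValueError("expected the same non-zero number of segments")
    sub_scores = [cosine_similarity(e, c) for e, c in zip(enrollment, candidate)]
    return sum(sub_scores) / len(sub_scores)


def select_training_set(
    enrollment: Utterance, candidates: List[Utterance], top_k: int
) -> List[Utterance]:
    """Return the enrollment utterance plus the top_k most similar candidates.

    Top-k selection is an assumption made for this sketch.
    """
    ranked = sorted(
        candidates,
        key=lambda cand: similarity_score(enrollment, cand),
        reverse=True,
    )
    return [enrollment] + ranked[:top_k]
```

The returned list stands in for the training data of claim 2: the user's single utterance together with acoustically similar utterances from other users. Training the detection model itself is left out of the sketch.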
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/462,160 US20170194006A1 (en) | 2015-07-22 | 2017-03-17 | Individualized hotword detection models |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/805,753 US10438593B2 (en) | 2015-07-22 | 2015-07-22 | Individualized hotword detection models |
US15/197,268 US10535354B2 (en) | 2015-07-22 | 2016-06-29 | Individualized hotword detection models |
US15/462,160 US20170194006A1 (en) | 2015-07-22 | 2017-03-17 | Individualized hotword detection models |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/197,268 Continuation US10535354B2 (en) | 2015-07-22 | 2016-06-29 | Individualized hotword detection models |
Publications (1)
Publication Number | Publication Date |
---|---|
US20170194006A1 (en) | 2017-07-06 |
Family
ID=56204080
Family Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/805,753 Active US10438593B2 (en) | 2015-07-22 | 2015-07-22 | Individualized hotword detection models |
US15/197,268 Active US10535354B2 (en) | 2015-07-22 | 2016-06-29 | Individualized hotword detection models |
US15/462,160 Abandoned US20170194006A1 (en) | 2015-07-22 | 2017-03-17 | Individualized hotword detection models |
Family Applications Before (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/805,753 Active US10438593B2 (en) | 2015-07-22 | 2015-07-22 | Individualized hotword detection models |
US15/197,268 Active US10535354B2 (en) | 2015-07-22 | 2016-06-29 | Individualized hotword detection models |
Country Status (5)
Country | Link |
---|---|
US (3) | US10438593B2 (en) |
EP (2) | EP3125234B1 (en) |
JP (2) | JP6316884B2 (en) |
KR (2) | KR101859708B1 (en) |
CN (1) | CN106373564B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3692522A4 (en) * | 2017-12-31 | 2020-11-11 | Midea Group Co., Ltd. | Method and system for controlling home assistant devices |
US20200365138A1 (en) * | 2019-05-16 | 2020-11-19 | Samsung Electronics Co., Ltd. | Method and device for providing voice recognition service |
GB2588689A (en) * | 2019-11-04 | 2021-05-05 | Nokia Technologies Oy | Personalized models |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10437837B2 (en) * | 2015-10-09 | 2019-10-08 | Fujitsu Limited | Generating descriptive topic labels |
WO2017151443A1 (en) * | 2016-02-29 | 2017-09-08 | Myteamcalls Llc | Systems and methods for customized live-streaming commentary |
US9990926B1 (en) * | 2017-03-13 | 2018-06-05 | Intel Corporation | Passive enrollment method for speaker identification systems |
CN117577099A (en) * | 2017-04-20 | 2024-02-20 | Google LLC | Method, system and medium for multi-user authentication on a device |
CN109213777A (en) * | 2017-06-29 | 2019-01-15 | Hangzhou Joyoung Small Home Appliances Co., Ltd. | A kind of voice-based recipe processing method and system |
US10504511B2 (en) * | 2017-07-24 | 2019-12-10 | Midea Group Co., Ltd. | Customizable wake-up voice commands |
JP2019066702A (en) | 2017-10-02 | 2019-04-25 | Toshiba Visual Solutions Corporation | Interactive electronic device control system, interactive electronic device, and interactive electronic device control method |
JP2019086903A (en) | 2017-11-02 | 2019-06-06 | Toshiba Visual Solutions Corporation | Speech interaction terminal and speech interaction terminal control method |
US10244286B1 (en) * | 2018-01-30 | 2019-03-26 | Fmr Llc | Recommending digital content objects in a network environment |
JP2019210197A (en) | 2018-06-07 | 2019-12-12 | IHI Corporation | Ceramic matrix composite |
WO2019246239A1 (en) | 2018-06-19 | 2019-12-26 | Ellipsis Health, Inc. | Systems and methods for mental health assessment |
US20190385711A1 (en) | 2018-06-19 | 2019-12-19 | Ellipsis Health, Inc. | Systems and methods for mental health assessment |
WO2020005202A1 (en) * | 2018-06-25 | 2020-01-02 | Google Llc | Hotword-aware speech synthesis |
CN118737132A (en) * | 2018-07-13 | 2024-10-01 | Google LLC | End-to-end stream keyword detection |
KR102563817B1 (en) | 2018-07-13 | 2023-08-07 | Samsung Electronics Co., Ltd. | Method for processing user voice input and electronic device supporting the same |
KR20200023088A (en) * | 2018-08-24 | 2020-03-04 | Samsung Electronics Co., Ltd. | Electronic apparatus for processing user utterance and controlling method thereof |
EP3667512A1 (en) * | 2018-12-11 | 2020-06-17 | Siemens Aktiengesellschaft | A cloud platform and method for efficient processing of pooled data |
US10964324B2 (en) | 2019-04-26 | 2021-03-30 | Rovi Guides, Inc. | Systems and methods for enabling topic-based verbal interaction with a virtual assistant |
US11158305B2 (en) * | 2019-05-05 | 2021-10-26 | Microsoft Technology Licensing, Llc | Online verification of custom wake word |
US11132992B2 (en) | 2019-05-05 | 2021-09-28 | Microsoft Technology Licensing, Llc | On-device custom wake word detection |
US11222622B2 (en) | 2019-05-05 | 2022-01-11 | Microsoft Technology Licensing, Llc | Wake word selection assistance architectures and methods |
EP3857544B1 (en) | 2019-12-04 | 2022-06-29 | Google LLC | Speaker awareness using speaker dependent speech model(s) |
US11341954B2 (en) * | 2019-12-17 | 2022-05-24 | Google Llc | Training keyword spotters |
CN111105788B (en) * | 2019-12-20 | 2023-03-24 | Beijing Sankuai Online Technology Co., Ltd. | Sensitive word score detection method and device, electronic equipment and storage medium |
JP7274441B2 (en) * | 2020-04-02 | 2023-05-16 | Nippon Telegraph and Telephone Corporation | Learning device, learning method and learning program |
US11315575B1 (en) * | 2020-10-13 | 2022-04-26 | Google Llc | Automatic generation and/or use of text-dependent speaker verification features |
US11798530B2 (en) * | 2020-10-30 | 2023-10-24 | Google Llc | Simultaneous acoustic event detection across multiple assistant devices |
US11620993B2 (en) * | 2021-06-09 | 2023-04-04 | Merlyn Mind, Inc. | Multimodal intent entity resolver |
Family Cites Families (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5465318A (en) * | 1991-03-28 | 1995-11-07 | Kurzweil Applied Intelligence, Inc. | Method for generating a speech recognition model for a non-vocabulary utterance |
US5199077A (en) * | 1991-09-19 | 1993-03-30 | Xerox Corporation | Wordspotting for voice editing and indexing |
US5913192A (en) * | 1997-08-22 | 1999-06-15 | At&T Corp | Speaker identification with user-selected password phrases |
US6073096A (en) | 1998-02-04 | 2000-06-06 | International Business Machines Corporation | Speaker adaptation system and method based on class-specific pre-clustering training speakers |
JP2000089780A (en) | 1998-09-08 | 2000-03-31 | Seiko Epson Corp | Speech recognition method and device therefor |
US6978238B2 (en) | 1999-07-12 | 2005-12-20 | Charles Schwab & Co., Inc. | Method and system for identifying a user by voice |
US6405168B1 (en) * | 1999-09-30 | 2002-06-11 | Conexant Systems, Inc. | Speaker dependent speech recognition training using simplified hidden markov modeling and robust end-point detection |
EP1399915B1 (en) * | 2001-06-19 | 2009-03-18 | Speech Sentinel Limited | Speaker verification |
CN1409527A (en) | 2001-09-13 | 2003-04-09 | Matsushita Electric Industrial Co., Ltd. | Terminal device, server and voice identification method |
JP2005107550A (en) | 2001-09-13 | 2005-04-21 | Matsushita Electric Ind Co Ltd | Terminal device, server device and speech recognition method |
US7203652B1 (en) * | 2002-02-21 | 2007-04-10 | Nuance Communications | Method and system for improving robustness in a speech system |
EP1376537B1 (en) | 2002-05-27 | 2009-04-08 | Pioneer Corporation | Apparatus, method, and computer-readable recording medium for recognition of keywords from spontaneous speech |
US7212613B2 (en) | 2003-09-18 | 2007-05-01 | International Business Machines Corporation | System and method for telephonic voice authentication |
US7552055B2 (en) * | 2004-01-10 | 2009-06-23 | Microsoft Corporation | Dialog component re-use in recognition systems |
US7386448B1 (en) | 2004-06-24 | 2008-06-10 | T-Netix, Inc. | Biometric voice authentication |
US20070055517A1 (en) | 2005-08-30 | 2007-03-08 | Brian Spector | Multi-factor biometric authentication |
JP2007111169A (en) * | 2005-10-19 | 2007-05-10 | Nelson Precision Casting Co Ltd | Method for manufacturing wax pattern of golf club head |
US20090106025A1 (en) | 2006-03-24 | 2009-04-23 | Pioneer Corporation | Speaker model registering apparatus and method, and computer program |
CA2680210A1 (en) * | 2007-03-05 | 2008-09-12 | Paxfire, Inc. | Internet lookup engine |
US8635243B2 (en) * | 2007-03-07 | 2014-01-21 | Research In Motion Limited | Sending a communications header with voice recording to send metadata for use in speech recognition, formatting, and search mobile search application |
US9092781B2 (en) | 2007-06-27 | 2015-07-28 | Verizon Patent And Licensing Inc. | Methods and systems for secure voice-authenticated electronic payment |
CN101465123B (en) * | 2007-12-20 | 2011-07-06 | 株式会社东芝 | Verification method and device for speaker authentication and speaker authentication system |
CN101593519B (en) | 2008-05-29 | 2012-09-19 | 夏普株式会社 | Method and device for detecting speech keywords as well as retrieval method and system thereof |
WO2010008722A1 (en) * | 2008-06-23 | 2010-01-21 | John Nicholas Gross | Captcha system optimized for distinguishing between humans and machines |
US8676904B2 (en) * | 2008-10-02 | 2014-03-18 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US8332223B2 (en) * | 2008-10-24 | 2012-12-11 | Nuance Communications, Inc. | Speaker verification methods and apparatus |
CN101447185B (en) | 2008-12-08 | 2012-08-08 | 深圳市北科瑞声科技有限公司 | Audio frequency rapid classification method based on content |
JP5610304B2 (en) | 2011-06-24 | 2014-10-22 | 日本電信電話株式会社 | Model parameter array device, method and program thereof |
US8924219B1 (en) * | 2011-09-30 | 2014-12-30 | Google Inc. | Multi hotword robust continuous voice command detection in mobile devices |
US8818810B2 (en) | 2011-12-29 | 2014-08-26 | Robert Bosch Gmbh | Speaker verification in a health monitoring system |
GB2514943A (en) * | 2012-01-24 | 2014-12-10 | Auraya Pty Ltd | Voice authentication and speech recognition system and method |
US9323912B2 (en) | 2012-02-28 | 2016-04-26 | Verizon Patent And Licensing Inc. | Method and system for multi-factor biometric authentication |
US9646610B2 (en) | 2012-10-30 | 2017-05-09 | Motorola Solutions, Inc. | Method and apparatus for activating a particular wireless communication device to accept speech and/or voice commands using identification data consisting of speech, voice, image recognition |
US9378733B1 (en) * | 2012-12-19 | 2016-06-28 | Google Inc. | Keyword detection without decoding |
KR20240132105A (en) | 2013-02-07 | 2024-09-02 | 애플 인크. | Voice trigger for a digital assistant |
US9361885B2 (en) | 2013-03-12 | 2016-06-07 | Nuance Communications, Inc. | Methods and apparatus for detecting a voice command |
US9123330B1 (en) * | 2013-05-01 | 2015-09-01 | Google Inc. | Large-scale speaker identification |
US9620123B2 (en) * | 2013-05-02 | 2017-04-11 | Nice Ltd. | Seamless authentication and enrollment |
JP2014232258A (en) * | 2013-05-30 | 2014-12-11 | 株式会社東芝 | Coordination business supporting device, method and program |
US9336781B2 (en) * | 2013-10-17 | 2016-05-10 | Sri International | Content-aware speaker recognition |
US10019985B2 (en) * | 2013-11-04 | 2018-07-10 | Google Llc | Asynchronous optimization for sequence training of neural networks |
CN103559881B (en) | 2013-11-08 | 2016-08-31 | 科大讯飞股份有限公司 | Keyword recognition method that languages are unrelated and system |
US8768712B1 (en) * | 2013-12-04 | 2014-07-01 | Google Inc. | Initiating actions based on partial hotwords |
US9589564B2 (en) * | 2014-02-05 | 2017-03-07 | Google Inc. | Multiple speech locale-specific hotword classifiers for selection of a speech locale |
US9542948B2 (en) * | 2014-04-09 | 2017-01-10 | Google Inc. | Text-dependent speaker identification |
US10540979B2 (en) * | 2014-04-17 | 2020-01-21 | Qualcomm Incorporated | User interface for secure access to a device using speaker verification |
US9548979B1 (en) * | 2014-09-19 | 2017-01-17 | United Services Automobile Association (Usaa) | Systems and methods for authentication program enrollment |
US20160189730A1 (en) * | 2014-12-30 | 2016-06-30 | Iflytek Co., Ltd. | Speech separation method and system |
2015
- 2015-07-22: US14/805,753, published as US10438593B2, status: Active

2016
- 2016-06-29: US15/197,268, published as US10535354B2, status: Active
- 2016-07-12: EP16186281.8A, published as EP3125234B1, status: Active
- 2016-07-12: EP16179113.2A, published as EP3121809B1, status: Active
- 2016-07-21: KR1020160092851A, published as KR101859708B1, status: IP Right Grant
- 2016-07-21: JP2016143155A, published as JP6316884B2, status: Active
- 2016-07-22: CN201610586197.0A, published as CN106373564B, status: Active
- 2016-08-04: KR1020160099402A, published as KR102205371B1, status: IP Right Grant

2017
- 2017-03-17: US15/462,160, published as US20170194006A1, status: Abandoned

2018
- 2018-03-28: JP2018061958A, published as JP6630765B2, status: Active
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3692522A4 (en) * | 2017-12-31 | 2020-11-11 | Midea Group Co., Ltd. | Method and system for controlling home assistant devices |
US20200365138A1 (en) * | 2019-05-16 | 2020-11-19 | Samsung Electronics Co., Ltd. | Method and device for providing voice recognition service |
US11605374B2 (en) * | 2019-05-16 | 2023-03-14 | Samsung Electronics Co., Ltd. | Method and device for providing voice recognition service |
GB2588689A (en) * | 2019-11-04 | 2021-05-05 | Nokia Technologies Oy | Personalized models |
GB2588689B (en) * | 2019-11-04 | 2024-04-24 | Nokia Technologies Oy | Personalized models |
Also Published As
Publication number | Publication date |
---|---|
US10535354B2 (en) | 2020-01-14 |
CN106373564B (en) | 2019-11-22 |
JP2018109789A (en) | 2018-07-12 |
EP3121809A1 (en) | 2017-01-25 |
EP3125234B1 (en) | 2019-05-15 |
CN106373564A (en) | 2017-02-01 |
KR20170012112A (en) | 2017-02-02 |
JP6316884B2 (en) | 2018-04-25 |
US10438593B2 (en) | 2019-10-08 |
US20170186433A1 (en) | 2017-06-29 |
JP2017027049A (en) | 2017-02-02 |
KR102205371B1 (en) | 2021-01-20 |
EP3121809B1 (en) | 2018-06-06 |
KR20180010923A (en) | 2018-01-31 |
JP6630765B2 (en) | 2020-01-15 |
KR101859708B1 (en) | 2018-05-18 |
EP3125234A1 (en) | 2017-02-01 |
US20170025125A1 (en) | 2017-01-26 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US10535354B2 (en) | Individualized hotword detection models | |
US12094471B2 (en) | Providing answers to voice queries using user feedback | |
US9741339B2 (en) | Data driven word pronunciation learning and scoring with crowd sourcing based on the word's phonemes pronunciation scores | |
US10269346B2 (en) | Multiple speech locale-specific hotword classifiers for selection of a speech locale | |
JP6474827B2 (en) | Dynamic threshold for speaker verification | |
JP6507316B2 (en) | Speech re-recognition using an external data source | |
US9299347B1 (en) | Speech recognition using associative mapping | |
CN110349591B (en) | Automatic voice pronunciation attribution | |
US9576578B1 (en) | Contextual improvement of voice query recognition | |
US9401146B2 (en) | Identification of communication-related voice commands | |
US10102852B2 (en) | Personalized speech synthesis for acknowledging voice actions | |
CN107066494B (en) | Search result pre-fetching of voice queries | |
US9263033B2 (en) | Utterance selection for automated speech recognizer training | |
US10026396B2 (en) | Frequency warping in a speech recognition system | |
US20150006169A1 (en) | Factor graph for semantic parsing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: GOOGLE INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: GUEVARA, RAZIEL ALVAREZ; REEL/FRAME: 041623/0552. Effective date: 20150720 |
| AS | Assignment | Owner name: GOOGLE LLC, CALIFORNIA. Free format text: CHANGE OF NAME; ASSIGNOR: GOOGLE INC.; REEL/FRAME: 044129/0001. Effective date: 20170929 |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |