US20210050003A1 - Custom Wake Phrase Training - Google Patents

Custom Wake Phrase Training

Info

Publication number
US20210050003A1
Authority
US (United States)
Prior art keywords
phrase, audio, custom, model, audio samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/541,995
Inventor
Sameer Syed Zaheer
Newton Jain
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SoundHound Inc
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to US16/541,995 (US20210050003A1)
Priority to CN202010158115.9A (CN112447171A)
Publication of US20210050003A1
Priority to US17/584,780 (US20220148572A1)
Assigned to SOUNDHOUND, INC. Assignment of assignors' interest (see document for details). Assignors: Sameer Syed Zaheer; Newton Jain
Current legal status: Abandoned

Classifications

    • G10L 15/063 — Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/0631 — Creating reference templates; clustering
    • G10L 2015/0638 — Training; interactive procedures
    • G10L 2015/088 — Word spotting
    • G10L 2015/223 — Execution procedure of a spoken command
    • G06F 40/289 — Handling natural language data; phrasal analysis, e.g. finite state techniques or chunking
    • G06F 8/41 — Arrangements for software engineering; transformation of program code; compilation

Definitions

  • the present disclosure relates to systems and methods for providing custom wake phrase training, and more particularly, providing systems and methods for automatically generating an executable from the training for deployment on a device to recognize the custom wake phrase.
  • Wake phrase recognition generally involves spotting the occurrence of a particular spoken wake phrase in a continuous and potentially noisy audio signal, while ignoring all other words, phrases, sounds, noises and other acoustic events in the audio signal.
  • a user may utilize a wake word or phrase to voice a command to control the operation of a device such as a mobile device, an appliance, a car, a robot or other device.
  • natural language processing is used to understand and act upon the speech input commands associated with the wake phrase.
  • speech recognition can be used to recognize the input audio as having corresponding text such that the text can then be analyzed to determine a specific command to be performed on the device.
  • a developer has a proprietary device design (e.g. a car with “smart” capabilities) and wants to create a custom wake phrase that is associated with their brand and easily spotted by their smart devices.
  • such manual customization, even if implemented, can also lead to inaccuracies in the custom wake phrase recognition on the virtual assistant device, which is generally unacceptable to a user.
  • the developer is forced to utilize a pre-defined custom wake phrase provided by the platform, e.g. “OK ALEXA”.
  • Disclosed embodiments provide systems and methods for providing custom wake phrase training for a computer-implemented model for subsequently spotting the custom wake phrase when deployed on a voice enabled computing device such as a natural language controlled virtual assistant device.
  • a computing system for training custom phrase spotter executables for virtual assistants comprising a processor and a memory in communication with the processor, the memory storing instructions that, when executed by the processor, configure the computing system to: receive a request for training a custom phrase spotter executable and an identification of a specific virtual assistant; responsive to receiving the request, receive: one or more positive audio samples corresponding to spoken audio of a custom wake phrase; train, using the positive audio samples, a model for the custom wake phrase audio; and compile the executable, including the model, such that, when deployed on the specific virtual assistant as identified by the identification, the executable recognizes the custom wake phrase.
  • a computer implemented method for training a custom phrase spotter executable comprising: receiving a request for training a custom phrase spotter executable; receiving one or more positive audio samples corresponding to spoken audio of a custom wake phrase; training, using the positive audio samples, a model for the custom wake phrase; and compiling the executable, including the model, such that, when deployed for a virtual assistant, the executable recognizes the custom wake phrase.
  • a non-transitory computer readable medium storing code for a software development kit (SDK) for training a custom phrase spotter executable for a virtual assistant, the code is executable by a processor and that, when executed by the processor, causes the SDK to: receive a request for training a custom phrase spotter executable; receive one or more positive audio samples corresponding to spoken audio of a custom wake phrase; train, using the positive audio samples, a model for the custom phrase spotter executable; and compile the phrase spotter executable, including the model, such that, when deployed on the virtual assistant, the executable recognizes the custom wake phrase.
  • a computing system for training custom phrase spotter executables for virtual assistants comprising a processor and a memory in communication with the processor, the memory storing instructions that, when executed by the processor, configure the computing system to: receive a request for training a custom phrase spotter executable and an identification of a specific virtual assistant; responsive to receiving the request, receive: text corresponding to the custom wake phrase; search within a corpus of audio samples, stored on a database of the computing system, for one or more stored positive audio samples corresponding to the text; and train, using the positive audio samples, a model for the custom wake phrase audio; and compile the executable, including the model, such that, when deployed on the specific virtual assistant as identified by the identification, the executable recognizes the custom wake phrase.
  • FIG. 1A is a block diagram illustrating portions of a representative computer environment, in accordance with one embodiment
  • FIG. 1B illustrates details of an example computing device in the computer environment of FIG. 1A , in accordance with one embodiment
  • FIG. 2 is a block diagram illustrating example computer components of a training data processing module of FIG. 1B for generating training data, in accordance with one embodiment
  • FIG. 3 is a block diagram illustrating example computer components of a phrase spotter executable generation module of FIG. 1B , in accordance with one embodiment
  • FIG. 4 is a flowchart of an exemplary process for dynamically training a computerized model (e.g. a computerized machine learning model) for an executable for spotting a custom wake phrase, in accordance with one embodiment
  • FIGS. 5A-5C are schematic diagrams illustrating example functionality of neural-network based wake phrase detectors 500 , 510 , 520 provided by the phrase spotter executable 308 of FIG. 3 , in accordance with one or more embodiments.
  • the present disclosure is directed to systems and methods for allowing customization of wake phrases for recognition and control of virtual assistant computing devices by training of computerized models using input positive and/or negative audio and/or text samples to improve accuracy of the training.
  • virtual assistant computing devices are passive listening devices that are configured to understand natural language and are pre-configured to “wake up” or activate upon hearing their name such as “Hey Siri”, “OK Google”, or “Alexa”.
  • FIG. 1A illustrates a block diagram of a high-level architecture of a representative computer environment 100 .
  • the computer environment 100 comprises an example computing device 102 communicating with a network 104 and configured to dynamically train a computerized model for a phrase spotter executable operable for spotting a custom wake phrase as selected by a developer of a requesting computing device 106 .
  • the phrase spotter executable generated by the computing device 102 is for deployment on a voice-enabled virtual assistant computing device 108 , in accordance with one or more aspects of the present disclosure.
  • Simplified component details of computing device 102 are illustrated and enlarged in FIG. 1B . Referring to FIGS. 1A and 1B, computing device 102 communicates using one or more communication networks 104 with one or more requesting computing devices 106 .
  • Computing device 102 receives respective request data 107 for custom wake phrase training including custom wake phrase particulars (e.g. in audio and/or text format) from the requesting computing device 106 .
  • computing device 102 is configured to generate and output a respective phrase spotter executable to the requesting computing device 106 and/or a virtual assistant computing device 108 for subsequent deployment of the phrase spotter executable (e.g. see 308 in FIG. 3 ) on the virtual assistant computing device 108 .
  • a particular request 107 may include a request for wake phrase customization and developer-provided details for the custom wake phrase (e.g. see input data 214 in FIG. 2 ), which can comprise audio and/or text data relating to positive and/or negative samples for use in training of the computerized model for wake phrase recognition.
  • the details can be provided via a developer interface on the requesting computing device 106 displayed in response to the initiation of the request 107 .
  • Computing device 102 may further comprise one or more servers.
  • other examples of the computing device 102 include a single computer, cloud computing services, a laptop computer, a desktop computer, a touch computing device or another type of computing device.
  • the requesting computing device 106 is a laptop computer but may also be any computing device such as a cell phone, a desktop computer, or another type of computing device comprising at least a processor, a memory and a communication interface capable of communicating custom wake phrase requests (e.g. via a graphical interface) and receiving responses from computing device 102 .
  • virtual assistant computing device 108 is a cell phone but may be any voice enabled computing device capable of voice interaction and control of its computer services and/or one or more associated smart devices.
  • Example existing virtual assistant computing devices 108 include mobile phones, automobiles, smart speakers, appliances, kiosks, vending machines, and helper robots.
  • Computing device 102 , requesting computing device 106 and virtual assistant computing device 108 are coupled for communication to one another via network 104 , which may be a wide area network (WAN) such as the Internet. Additional networks may also be coupled to the WAN of network 104 , such as a wireless network and/or a local area network (LAN) between the WAN and computing device 102 or between the WAN and any of computing devices 106 and 108 .
  • FIG. 1B shows example computer components of device 102 , in accordance with one or more aspects of the present disclosure, for example, to provide a system and perform a method to train a model for a custom wake phrase spotter executable, which is operable to spot spoken instances of the custom wake phrase (e.g. once the executable is deployed on device 108 ), as selected by a developer of the requesting computing device 106 of FIG. 1A .
  • Computing device 102 comprises one or more processors 122 , one or more input devices 124 , one or more communication units 126 and one or more output devices 128 .
  • Computing device 102 also comprises one or more storage devices 130 storing one or more computer modules such as graphical interface 110 , operating system module 112 , phrase spotter executable generation module 114 , training data processing module 116 , audio data repository 118 (e.g. a corpus of audio samples labelled with their associated transcriptions) and negative audio data repository 120 (e.g. a corpus of negative audio data such as spoken phrases that are similar but different from custom wake phrase audio data, background noise, environmental sounds, non-speech music, etc.).
  • Communication channel 144 may couple each of the components 122 , 124 , 126 , 128 , and 130 (and the computer modules contained therein), for inter-component communications.
  • communications channels 144 may include a system bus, a network connection, an inter-process communication data structure, or any other method for communicating data.
  • Processor(s) 122 may implement functionality and/or execute computer instructions within computing device 102 .
  • processors 122 may be configured to receive instructions and/or data from storage devices 130 to execute the functionality of the modules 110 , 112 , 114 , 116 , 118 , and 120 shown in FIG. 1B .
  • One or more communication units 126 are operable to allow communications with external computing devices including requesting computing device 106 and virtual assistant computing device 108 via one or more networks 104 by transmitting and/or receiving network signals on the one or more networks.
  • the communication units 126 may include various antennae and/or network interface cards, etc. for wireless and/or wired communications with external computing devices and network 104 .
  • Input devices 124 and output devices 128 may include any of one or more buttons, switches, pointing devices, one or more cameras, a keyboard, a pointing device, a microphone, one or more sensors (e.g., biometric, etc.), a speaker, a bell, one or more lights, a display screen (which may be a touchscreen device providing I/O capabilities), etc.
  • One or more of same may be coupled via a universal serial bus (USB), Bluetooth or other communication units (e.g., 126 ). That is, input 124 and output 128 devices may be on device 102 or coupled thereto via wired or wireless communication.
  • the computing device 102 may store data/information to storage devices 130 , which may comprise, for example, input data (e.g. 214 , see FIG. 2 ) providing particulars of the custom wake phrase (e.g. a positive audio sample 208 of the custom wake phrase), output training data (e.g. 216 in FIG. 2 ) generated by the training data processing module, machine learning trained processes, the audio data repository with transcriptions (e.g. storing a mapping of audio data to transcriptions) used to label input data, and the negative audio data repository (e.g. containing samples of negative data) used to generate the training data 216 to train a model for generating a phrase spotter executable by module 114 for computing device 108 .
  • the one or more storage devices 130 may store instructions and/or data for processing during operation of computing device 102 .
  • the one or more storage devices 130 may take different forms and/or configurations, for example, as short-term memory or long-term memory.
  • Storage devices 130 may be configured for short-term storage of information as volatile memory, which does not retain stored contents when power is removed.
  • Volatile memory examples include random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), etc.
  • Storage devices 130 in some examples, also include one or more computer-readable storage media, for example, to store larger amounts of information than volatile memory and/or to store such information for long-term, retaining information when power is removed.
  • Non-volatile memory examples include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable (EEPROM) memory.
  • upon receiving the request 107 , the computing device 102 is configured to provide display of a developer interface on the requesting computing device 106 for collecting particulars of the custom wake phrase (e.g. input data 214 ).
  • the input data 214 comprises one or more input positive audio samples 208 and can additionally comprise one or more input positive text samples 210 and/or input negative audio samples 212 .
  • the input data 214 is received and processed at the training data processing module 116 , in communication with data repositories 118 and 120 , to generate training data 216 for subsequent use in training a computerized model for a phrase spotter executable as generated by the module 114 .
  • the training data processing module 116 is generally configured to generate a set of positive and/or negative audio samples as training data 216 for subsequent training of a machine learning model (see FIG. 3 ) and generating an executable for spotting the custom wake phrase on a particular virtual assistant device (e.g. 108 ).
  • the training data processing module 116 comprises a text to audio conversion module 202 , a repository search module 204 (for searching within associated repositories 118 and 120 ) and a data manipulation module 206 for facilitating generation of the positive and negative samples in the training data 216 .
  • any audio that contains the exact phrase of the custom wake phrase is considered a positive sample (e.g. 208 , 210 ).
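  • As a concrete illustration of this stage, the following minimal Python sketch (not from the patent; file layout and helper names are assumptions) models the module's output as a labelled list of audio samples:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Sample:
    """One labelled training example for the wake phrase model."""
    audio_path: Path
    is_positive: bool  # True iff the audio contains the exact custom wake phrase

def build_training_data(positive_dir: Path, negative_dir: Path) -> list[Sample]:
    """Assemble training data 216: developer-supplied positive samples (208)
    plus negative/bait samples (212) gathered from the repositories."""
    samples = [Sample(p, True) for p in sorted(positive_dir.glob("*.wav"))]
    samples += [Sample(n, False) for n in sorted(negative_dir.glob("*.wav"))]
    return samples
```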
  • Each of the computing devices described herein may comprise a processor and non-transitory computer readable medium.
  • the computer readable medium may store computer program code that is executable by each of the device's processor and that, when executed by the processor, causes each of the computing devices (e.g. 102 , 106 , and 108 ) to perform the functionality attributed to them therein.
  • Example embodiments depicting example input data 214 scenarios, and the processing of such input data 214 to generate the training data 216 , are described below with reference to FIG. 2 .
  • the embodiments below may also be combined, such that various combinations of input data 214 may be envisaged for improving accuracy of the training data 216 .
  • the training data processing module 116 is configured to receive input data 214 comprising input positive audio sample 208 data corresponding to audio of a person speaking the custom wake phrase (e.g. a person saying “Hey, Mercedes” or “Hey, Pandora”) from the requesting computing device 106 .
  • Such input positive audio sample 208 may be provided via a developer interface of the requesting computing device 106 (e.g. by uploading a pre-existing audio file or by speaking into a microphone of the requesting computing device 106 and recording an input positive sample).
  • the input positive audio sample 208 is output as training data 216 for training of the model (see FIG. 3 ).
  • the input positive text sample 210 may be obtained by developer input (e.g. keyboard input via an interface for custom wake phrase data collection) or previously stored on the computing device 106 and/or uploaded onto the computing device 106 .
  • the training data processing module 116 is then configured to process the text to generate a corresponding stored positive audio sample signal.
  • this includes searching, via the repository search module 204 , within the audio data repository with text transcriptions 118 to determine a stored positive audio sample corresponding to the text provided in the sample 210 and provide that as training data 216 .
  • alternatively, generating audio from the input positive text sample 210 comprises using the text to audio conversion module 202 , which applies text-to-speech (TTS) conversion to generate an audio signal of the custom wake phrase that is output as training data 216 .
  • in yet another aspect, both the output from the text to audio conversion module 202 and the output from the repository search module 204 (providing a mapping of the text to audio via the audio data repository 118 ) are included as training data 216 for the training of the computerized model in FIG. 3 .
  • the audio data repository 118 contains a collection of audio data labelled or associated with text transcriptions and can be augmented to include additional audio data as the developer provides further input positive audio samples 208 and/or input positive text samples 210 .
  • in another embodiment, the training data processing module 116 is configured to receive, from the requesting computing device 106 , input data 214 comprising input negative audio sample 212 data corresponding to audio that sounds similar to a person speaking the custom wake phrase (e.g. a person saying “OK Alexandra” instead of “OK Alexa”, or “Hey Doodle” instead of “Hey Google”).
  • This input negative audio sample 212 , being phonetically (as represented by phonemes) and/or acoustically (as represented by a time sequence of samples or a spectrogram) similar to the input positive audio sample 208 but an incorrect representation of the custom wake phrase, is considered a “bait” phrase and is output as a negative sample in the training data 216 .
  • the bait data in the training data 216 is used to train the computerized model to differentiate bait phrases from actual spoken instances of the custom wake phrase.
  • Such input negative audio samples 212 , including bait samples, may be provided via an interface of the requesting computing device 106 (e.g. by uploading a pre-existing audio file or by speaking into a microphone of the requesting computing device 106 and recording an input negative sample).
  • the training data processing module 116 uses the input positive audio sample 208 to search, via the repository search module 204 , repositories (e.g. 118 and 120 ) for a stored negative audio sample having audible similarities to the sample 208 but an incorrect representation of the custom wake phrase.
  • the stored negative audio sample is then provided as training data 216 .
  • the input positive text sample 210 is used to retrieve a stored negative audio sample.
  • the input positive text sample 210 provided to the processing module 116 is used by the repository search module 204 to search within the audio data repository 118 for a stored negative audio sample having a transcription that is different but phonetically and/or acoustically similar to the text sample 210 representing the custom wake phrase.
  • the stored negative audio sample is then provided as training data 216 (e.g. as an example of an incorrect representation of the custom wake phrase).
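  • The patent does not specify how "phonetically similar" is measured; one plausible sketch of the repository search module 204's bait lookup uses a Levenshtein distance over phoneme sequences (the corpus layout and threshold below are assumptions):

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over phoneme sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (pa != pb))
    return dp[-1]

def find_bait_candidates(target: list[str], corpus, max_dist: int = 4):
    """Return stored utterances whose transcription is phonetically close to,
    but not identical with, the custom wake phrase."""
    return [(utt, d) for utt, phones in corpus
            if 0 < (d := edit_distance(target, phones)) <= max_dist]

# "OK Alexa" vs. a stored "OK Alexandra" recording (illustrative phonemes):
target = ["OW", "K", "EY", "AH", "L", "EH", "K", "S", "AH"]
corpus = [("utt_042.wav",
           ["OW", "K", "EY", "AH", "L", "EH", "K", "S", "AE", "N", "D", "R", "AH"])]
print(find_bait_candidates(target, corpus))  # [('utt_042.wav', 4)]
```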
  • the repository search module 204 searches negative audio data repository 120 for a stored negative audio sample representative of non-speech audio.
  • the data repository 120 contains, for example, non-speech audio of typical environmental noise and a collection of speech-independent negative audio samples such as music.
  • the requests 107 relate to multiple different virtual assistant computing devices 108 .
  • each of the virtual assistant computing devices 108 is provided, via the computing device 102 , with a corresponding custom phrase spotter executable 308 such as to respectively recognize custom wake phrases intended for a particular device 108 .
  • the input data 214 provided subsequent to the request 107 further comprises virtual assistant device identification 218 which uniquely identifies each of the different virtual assistant computing devices 108 in the environment 100 . Accordingly, once the phrase spotter executable 308 is generated, it is linked by the phrase spotter generation module 114 with the particular device 108 via the identification 218 .
  • the training data processing module 116 is configured to access, via the repository search module 204 , negative audio samples within repository 120 that are specific to the particular device 108 (e.g. its environment, the type of background sounds expected, etc.). For example, if the identification 218 identifies device 108 as being within a car, then the data manipulation module 206 augments the training data 216 with non-speech audio of typical environmental noise of a car as stored in the repository 120 .
  • the positive and negative audio samples processed by the training data processing module 116 can further be augmented by adding sound effects such as white noise, babble noise, or radio/car noise to the audio samples, with the results provided as training data 216 , so as to improve the accuracy of the trained model 306 and diversify the types of audio data that a phrase spotter executable 308 can discriminate.
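  • A minimal NumPy sketch of such augmentation (the SNR-based mixing policy is an assumption; the patent only names the noise types):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay background noise on a wake phrase recording at a target
    signal-to-noise ratio, yielding one augmented training sample.
    Both inputs are float PCM arrays at the same sample rate."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. augmented = mix_at_snr(wake_audio, car_cabin_noise, snr_db=10.0)
```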
  • the processor 122 of the computing device 102 co-ordinates for the training data 216 to be input into the phrase spotter executable generation module 114 .
  • the phrase spotter executable generation module 114 is configured to perform training via the training module 302 and generate a phrase spotter executable 308 that can run on a virtual assistant computing device (e.g. 108 ) for detecting custom wake phrase to “wake up” the device 108 .
  • Training module 302 in FIG. 3 uses the training data 216 and generates a trained model 306 for use by the phrase spotter executable generation module 114 to generate the phrase spotter executable 308 .
  • Training module 302 receives training data 216 as input to machine learning algorithm 304 .
  • Training data 216 is a collection of positive audio samples (indicating correct audio representations of the custom wake phrase) and negative audio samples (indicating incorrect representations of the custom wake phrase).
  • Machine learning algorithm 304 can be a regression or a classification method. The machine learning algorithm 304 utilizes the positive and/or negative sample sets and generates an optimal trained model 306 .
  • the trained model 306 provides functionality that recognizes positive audio samples in the training data 216 as the custom wake phrase and disregards negative audio samples in the training data 216 as not matching the custom wake phrase.
  • the trained model 306 is then used by the processor 122 to generate the phrase spotter executable 308 to be deployed locally on the virtual assistant computing device 108 (or alternatively as part of a software development kit (SDK) platform, which is a collection of software used for developing applications for a specific device or operating system.)
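  • The patent does not fix a particular algorithm 304 or packaging format; the sketch below stands in for this step with a scikit-learn logistic regression over crude spectral features, reducing the "executable" to a serialized model artifact (a production spotter would use the neural detectors of FIGS. 5A-5C and a real compilation step):

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(audio: np.ndarray) -> np.ndarray:
    """Crude placeholder featurizer: mean log-magnitude spectrum over
    160-sample frames (assumes at least one frame of audio)."""
    frames = audio[: len(audio) // 160 * 160].reshape(-1, 160)
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1))).mean(axis=0)

def train_model(samples):
    """samples: list of (audio array, is_positive) pairs from training data 216."""
    X = np.stack([extract_features(audio) for audio, _ in samples])
    y = np.array([label for _, label in samples])
    return LogisticRegression(max_iter=1000).fit(X, y)  # stand-in for algorithm 304

def export_model(model, path: str = "wake_model.pkl") -> None:
    """Serialize the trained model 306; module 114 would link such an artifact
    with a detection runtime to produce the phrase spotter executable 308."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
```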
  • the computing device 102 may receive feedback from a developer of the requesting computing device 106 , indicative of the model for the wake phrase spotter executable 308 recognizing incorrect audio samples as the custom wake phrase.
  • the processor 122 is configured to include the incorrect audio sample as a negative sample in the training data 216 and dynamically re-train the model 306 to generate a new trained process and new phrase spotter executable for subsequent deployment.
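  • In code, this feedback loop reduces to appending the misrecognized audio as a negative example and retraining (a sketch using the (audio, label) pairs and train_model from the training sketch above):

```python
def retrain_with_feedback(samples, false_accept_audio, train_fn):
    """Dynamically re-train after a developer reports a false accept: the
    offending audio joins training data 216 as a negative sample."""
    updated = samples + [(false_accept_audio, False)]
    return train_fn(updated), updated

# new_model, samples = retrain_with_feedback(samples, bad_audio, train_model)
```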
  • a developer interface such as a web interface is provided on the requesting computing device 106 in response to the request 107 for uploading positive and/or negative audio sample data (likely a .zip file of .wav or .mp3 files) and downloading a phrase spotter executable (e.g. 308 in FIG. 3 ) onto a requesting computing device 106 and/or virtual assistant computing device 108 , in one embodiment.
  • in another embodiment, a software development kit (SDK) is provided comprising the relevant modules of computing device 102 (e.g. training data processing module 116 , phrase spotter executable generation module 114 , audio data repository with transcriptions 118 , negative audio data repository 120 ) to allow developers to run the training process provided in the training module 302 of FIG. 3 independently on their desired computing device (e.g. 106 ); a hypothetical sketch of such an interface follows.
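  • A hypothetical developer-facing surface for such an SDK, mirroring the claimed steps (the class and method names are illustrative, not an actual SoundHound API):

```python
class PhraseSpotterSDK:
    """Illustrative SDK facade: request -> samples -> train -> compile."""

    def create_request(self, assistant_id: str) -> dict:
        # Step 1: request training and identify the target virtual assistant (218).
        return {"assistant_id": assistant_id, "positives": [], "negatives": []}

    def add_positive_audio(self, request: dict, wav_path: str) -> None:
        # Step 2: attach spoken samples of the custom wake phrase (208).
        request["positives"].append(wav_path)

    def add_negative_audio(self, request: dict, wav_path: str) -> None:
        # Optional: attach bait/negative samples (212).
        request["negatives"].append(wav_path)

    def train_and_compile(self, request: dict) -> bytes:
        # Steps 3-4: train the model (302/304) and compile the phrase spotter
        # executable (308). Backend omitted; a placeholder artifact is returned.
        return b"<phrase-spotter-executable>"
```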
  • FIG. 4 is a flowchart illustrating example operations 400 of the computing device 102 of FIGS. 1A, 1B-3 , in accordance with an aspect of the present disclosure.
  • Operations 400 receive a developer-defined custom wake phrase for awakening a specific voice controlled virtual assistant device and train a model to spot the custom wake phrase in an audio signal.
  • the computing environment 100 of FIG. 1A comprises a plurality of different virtual assistant computing devices 108 , each requiring a unique custom phrase spotter executable 308 for recognizing a particular custom wake phrase as selected by a developer (e.g., via a developer interface of the requesting computing device 106 ).
  • computing device 102 receives a request from a requesting computing device (e.g. 106 ) for training a custom phrase spotter executable for spotting a custom wake phrase.
  • via a developer interface, wake phrase customization particulars (e.g. input data 214 ) are then received.
  • the input data 214 comprises identification data 218 for identifying a specific virtual assistant device 108 for deploying the custom phrase spotter executable 308 thereon.
  • the particulars (input data 214 ) further comprise at least one positive audio sample corresponding to at least one person speaking the custom wake phrase.
  • the input positive audio sample may be a sound recording, or corresponding text characters, uploaded and provided to the computing device 102 .
  • the computing device 102 may further process the received customization particulars (e.g. providing positive and/or negative text and/or audio data samples of the custom wake phrase) to generate positive and/or negative training data for training module 302 .
  • Such processing may include for example, manipulating the positive sample(s) with environmental sounds or noise characteristics associated with the specific virtual assistant device 108 .
  • training of the model for the custom wake phrase audio is performed using the positive audio sample(s) (e.g. 208 ), and if applicable, other positive and/or negative audio/text samples in the input data 214 .
  • the trained model is for generating the phrase spotter executable 308 that, when deployed on a specific virtual assistant computing device (e.g. 108 ), recognizes subsequent audio input instances of the custom wake phrase based upon the training.
  • FIGS. 5A-5C shown are schematic diagrams illustrating example functionality of wake phrase detectors 500 , 510 , 520 provided by the phrase spotter executable 308 of FIG. 3 , which can simultaneously detect one or more custom wake phrases to “wake up” the device 108 of FIG. 1A , in accordance with one or more embodiments.
  • FIG. 5A shows a neural network based wake phrase detector 500 that recognizes a single custom wake phrase.
  • FIGS. 5B and 5C show neural network based wake phrase detectors 510 and 520 respectively, configured to listen for and recognize multiple wake phrases such that the same model is triggered to indicate a positive detection with any one of the multiple wake phrases (e.g. “Hey Hound” and “Hi Hound”).
  • the detector 500 comprises a set of input audio features 501 and a set of outputs 502 that are the most likely sub phrase units or partial-phrase units to have been spoken in a recent sequence of speech.
  • Sub phrase units can be phonemes or sequences of multiple phonemes, including words of multi-word wake phrases. Additionally, the sub phrase units may be a unit smaller than a phoneme, such as a senone, or a unit larger than a phoneme such as a di-phone or tri-phone or multiple phonemes.
  • the detector 500 may have one or more hidden layers between the input nodes 501 and output nodes 502 .
  • the detector 500 comprises a recurrent neural network, such as a long short-term memory (LSTM) neural network for recognizing a time-varying sequence.
  • the single wake phrase detector 500 further comprises a matching block 503 that identifies when the sequence of sub phrase units provided at output 502 matches a pre-defined sequence for the wake phrase and provides an indication 504 of whether a match exists (e.g. the single custom wake phrase has been detected).
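  • A toy version of the matching block 503 , assuming the network emits one most-likely sub phrase unit per frame ('_' marking silence/blank; the CTC-style collapsing rule is an assumption):

```python
def collapse(units: list[str]) -> list[str]:
    """Merge consecutive repeats and drop the blank symbol '_', turning
    frame-level outputs 502 into a sub phrase unit sequence."""
    out: list[str] = []
    for u in units:
        if u != "_" and (not out or out[-1] != u):
            out.append(u)
    return out

def matches(recent_units: list[str], target: list[str]) -> bool:
    """Matching block 503: fire when the collapsed recent sequence ends with
    the wake phrase's pre-defined sub phrase unit sequence."""
    seq = collapse(recent_units)
    return len(seq) >= len(target) and seq[-len(target):] == target

# Frame-level outputs for a spoken "hey hound" (illustrative phoneme labels):
frames = ["_", "HH", "HH", "EY", "_", "HH", "AW", "AW", "N", "D"]
print(matches(frames, ["HH", "EY", "HH", "AW", "N", "D"]))  # True
```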
  • a multiple wake phrase detector 510 comprising a neural network similar to the neural network-based wake phrase detector of FIG. 5A but configured to recognize two wake phrases.
  • the detector 510 comprises a set of input audio features 511 and a set of outputs 512 that are the most likely sub phrase units to have been spoken in a recent sequence of sub phrase units for each of the two wake phrases.
  • a first matching block 513 identifies when the sequence of sub phrase units matches the sequence for the first wake phrase and provides a first indication 515 of whether a match exists.
  • a second matching block 514 identifies when the sequence of sub phrase units matches the sequence for the second wake phrase and provides a second indication 516 of whether a match exists for the second wake phrase.
  • the first and second matching blocks 513 and 514 share some, but also each have some unique, sub phrase units as input.
  • the matching blocks 513 and 514 are configured to support, for example, two wake phrases having some overlapping sub phrase units such as “hey hound” and “okay hound”, which have some common sub phrase units (e.g., “hound”).
  • the detector 510 further comprises a decision block 517 which performs a logical OR operation to determine whether either of the first indication 515 OR the second indication 516 indicate a positive match, and output a final indication 518 indicative of whether the positive match exists for at least one of the two wake phrases.
  • a positive match on the final indication 518 indicates to the associated virtual assistant device 108 that it should “wake up” and respond to subsequent spoken requests.
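  • Reusing matches() from the sketch above, the two matching blocks and decision block 517 reduce to a few lines (the phoneme sequences are illustrative):

```python
TARGETS = {
    "hey hound": ["HH", "EY", "HH", "AW", "N", "D"],
    "okay hound": ["OW", "K", "EY", "HH", "AW", "N", "D"],
}

def detect_any(recent_units: list[str]) -> bool:
    """Matching blocks 513/514 scan the shared network outputs; decision
    block 517 is a logical OR over their indications 515/516."""
    return any(matches(recent_units, seq) for seq in TARGETS.values())
```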
  • referring to FIG. 5C , the detector 520 comprises a set of input audio features 521 , a hidden layer of features 522 , and a final output node 523 that indicates when the phrase detector 520 has recognized any one of a number of spoken wake phrases and provides an indication of recognition as output 524 .
  • the neural network of FIG. 5C is recurrent to capture the time-varying nature of speech audio for a wake phrase.
  • the hidden layer might not be necessary, but might represent features analogous to sub phrase units.
  • Such a neural network depicted in FIG. 5C can be trained for any number of wake phrases by using a training data set of positive examples of each desired wake phrase. This approach works for wake phrases that are similar and wake phrases that are very different such as ones that are meaningful in different human languages. For very large numbers of wake phrases, good accuracy would benefit from a somewhat larger number of hidden nodes or hidden layers shown as hidden features 522 .
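  • A minimal PyTorch sketch in the spirit of FIG. 5C (layer sizes, feature dimensions and the toy training step are illustrative assumptions, not the patent's specified topology):

```python
import torch
import torch.nn as nn

class WakePhraseNet(nn.Module):
    """Recurrent detector: feature frames in, one wake/no-wake score out."""

    def __init__(self, n_features: int = 40, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)  # recurrent core
        self.out = nn.Linear(hidden, 1)                            # final output node 523

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.lstm(x)            # x: (batch, frames, n_features)
        return self.out(h[-1]).squeeze(-1)  # one logit per utterance (output 524)

# One illustrative training step (label 1 = utterance contains any wake phrase).
model = WakePhraseNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
features = torch.randn(8, 100, 40)  # stand-in for log-mel frames of 8 utterances
labels = torch.tensor([1., 0., 1., 0., 0., 1., 0., 1.])
loss = loss_fn(model(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```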

Abstract

There is provided a computing system for training a model for spotting of a custom wake phrase, the system comprising a processor and a memory in communication with the processor, the memory storing instructions that, when executed by the processor, configure the computing system to: receive a request for training the model for a phrase spotter executable for spotting the custom wake phrase within audio input. The computing system is configured to: responsive to receiving the request, receive, at a user interface of the computing system: an input positive audio sample corresponding to spoken audio of the custom wake phrase; and train, using the input positive audio sample, the model for the phrase spotter executable for a wake phrase recognition subsystem that, when deployed on a voice enabled computing device, recognizes subsequent audio input instances of the custom wake phrase based upon the training.

Description

    FIELD
  • The present disclosure relates to systems and methods for providing custom wake phrase training, and more particularly, providing systems and methods for automatically generating an executable from the training for deployment on a device to recognize the custom wake phrase.
  • BACKGROUND
  • Wake phrase recognition generally involves spotting the occurrence of a particular spoken wake phrase in a continuous and potentially noisy audio signal, while ignoring all other words, phrases, sounds, noises and other acoustic events in the audio signal. For example, a user may utilize a wake word or phrase to voice a command to control the operation of a device such as a mobile device, an appliance, a car, a robot or other device. Generally, natural language processing is used to understand and act upon the speech input commands associated with the wake phrase. For example, speech recognition can be used to recognize the input audio as having corresponding text such that the text can then be analyzed to determine a specific command to be performed on the device.
  • In some cases, a developer has a proprietary device design (e.g. a car with “smart” capabilities) and wants to create a custom wake phrase that is associated with their brand and easily spotted by their smart devices. However, it is not feasible for providers of speech enabled virtual assistant platforms to manually customize a wake phrase for each developer's device design. Additionally, such manual customization, even if implemented, can also lead to inaccuracies in the custom wake phrase recognition in the virtual assistant device, which is generally unacceptable to a user. Instead, the developer is forced to utilize a pre-defined custom wake phrase provided by the platform, e.g. “OK ALEXA”.
  • Accordingly, there exists a need to obviate or mitigate at least some of the above-mentioned disadvantages of existing systems and methods for custom wake phrase training and spotting. Notably, there is a need for an automated process and system of training custom wake phrase recognition for use on a speech enabled virtual assistant device to recognize developer-defined custom wake phrases. Embodiments of the present disclosure are directed to this and other considerations.
  • SUMMARY
  • Disclosed embodiments provide systems and methods for providing custom wake phrase training for a computer-implemented model for subsequently spotting the custom wake phrase when deployed on a voice enabled computing device such as a natural language controlled virtual assistant device.
  • In one aspect, there is provided a computing system for training custom phrase spotter executables for virtual assistants, the system comprising a processor and a memory in communication with the processor, the memory storing instructions that, when executed by the processor, configure the computing system to: receive a request for training a custom phrase spotter executable and an identification of a specific virtual assistant; responsive to receiving the request, receive: one or more positive audio samples corresponding to spoken audio of a custom wake phrase; train, using the positive audio samples, a model for the custom wake phrase audio; and compile the executable, including the model, such that, when deployed on the specific virtual assistant as identified by the identification, the executable recognizes the custom wake phrase.
  • In another aspect, there is provided a computer implemented method for training a custom phrase spotter executable, the method comprising: receiving a request for training a custom phrase spotter executable; receiving one or more positive audio samples corresponding to spoken audio of a custom wake phrase; training, using the positive audio samples, a model for the custom wake phrase; and compiling the executable, including the model, such that, when deployed for a virtual assistant, the executable recognizes the custom wake phrase.
  • In yet another aspect, there is provided a non-transitory computer readable medium storing code for a software development kit (SDK) for training a custom phrase spotter executable for a virtual assistant, the code is executable by a processor and that, when executed by the processor, causes the SDK to: receive a request for training a custom phrase spotter executable; receive one or more positive audio samples corresponding to spoken audio of a custom wake phrase; train, using the positive audio samples, a model for the custom phrase spotter executable; and compile the phrase spotter executable, including the model, such that, when deployed on the virtual assistant, the executable recognizes the custom wake phrase.
  • In yet another aspect, there is provided a computing system for training custom phrase spotter executables for virtual assistants, the system comprising a processor and a memory in communication with the processor, the memory storing instructions that, when executed by the processor, configure the computing system to: receive a request for training a custom phrase spotter executable and an identification of a specific virtual assistant; responsive to receiving the request, receive: text corresponding to the custom wake phrase; search within a corpus of audio samples, stored on a database of the computing system, for one or more stored positive audio samples corresponding to the text; and train, using the positive audio samples, a model for the custom wake phrase audio; and compile the executable, including the model, such that, when deployed on the specific virtual assistant as identified by the identification, the executable recognizes the custom wake phrase.
  • These and other aspects will be apparent including computer program products that store instructions in a non-transitory manner (e.g. in a storage device) that, when executed by a computing device, configure the device to perform operations as described herein.
  • Further features of the disclosed systems and methods, and the advantages offered thereby, are explained in greater detail hereinafter with reference to specific embodiments illustrated in the accompanying drawings, wherein like elements are indicated by like reference numbers and designators.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are not necessarily drawn to scale and which are incorporated into and constitute a portion of this disclosure, illustrate various implementations and aspects of the disclosed technology and, together with the description, serve to explain the principles of the disclosed technology. In the drawings:
  • FIG. 1A is a block diagram illustrating portions of a representative computer environment, in accordance with one embodiment;
  • FIG. 1B illustrates details of an example computing device in the computer environment of FIG. 1A, in accordance with one embodiment;
  • FIG. 2 is a block diagram illustrating example computer components of a training data processing module of FIG. 1B for generating training data, in accordance with one embodiment;
  • FIG. 3 is a block diagram illustrating example computer components of a phrase spotter executable generation module of FIG. 1B, in accordance with one embodiment;
  • FIG. 4 is a flowchart of an exemplary process for dynamically training a computerized model (e.g. a computerized machine learning model) for an executable for spotting a custom wake phrase, in accordance with one embodiment; and,
  • FIGS. 5A-5C are schematic diagrams illustrating example functionality of neural-network based wake phrase detectors 500, 510, 520 provided by the phrase spotter executable 308 of FIG. 3, in accordance with one or more embodiments.
  • While references to “an embodiment” are used herein, nothing should be implied or understood that features of one embodiment cannot be used or combined with features of another embodiment unless otherwise stated. The various systems, methods and devices shown and described herein may be used together unless otherwise stated.
  • DETAILED DESCRIPTION
  • Generally, in at least some embodiments, the present disclosure is directed to systems and methods for allowing customization of wake phrases for recognition and control of virtual assistant computing devices by training of computerized models using input positive and/or negative audio and/or text samples to improve accuracy of the training.
  • By way of background, virtual assistant computing devices are passive listening devices that are configured to understand natural language and are pre-configured to “wake up” or activate upon hearing their name such as “Hey Siri”, “OK Google”, or “Alexa”. However, as discussed herein, there is a need to train and automatically configure virtual assistants to recognize custom wake phrases.
  • In accordance with one embodiment, FIG. 1A illustrates a block diagram of a high-level architecture of a representative computer environment 100. The computer environment 100 comprises an example computing device 102 communicating with a network 104 and configured to dynamically train a computerized model for a phrase spotter executable operable for spotting a custom wake phrase as selected by a developer of a requesting computing device 106. The phrase spotter executable generated by the computing device 102 is for deployment on a voice-enabled virtual assistant computing device 108, in accordance with one or more aspects of the present disclosure. Simplified component details of computing device 102 are illustrated and enlarged in FIG. 1B. Referring to FIGS. 1A and 1B, computing device 102 communicates using one or more communication networks 104 with one or more requesting computing devices 106. Computing device 102 receives respective request data 107 for custom wake phrase training including custom wake phrase particulars (e.g. in audio and/or text format) from the requesting computing device 106. In response, computing device 102 is configured to generate and output a respective phrase spotter executable to the requesting computing device 106 and/or a virtual assistant computing device 108 for subsequent deployment of the phrase spotter executable (e.g. see 308 in FIG. 3) on the virtual assistant computing device 108.
  • Referring again to FIGS. 1A and 1B, a particular request 107 may include a request for wake phrase customization and developer-provided details for the custom wake phrase (e.g. see input data 214 in FIG. 2), which can comprise audio and/or text data relating to positive and/or negative samples for use in training of the computerized model for wake phrase recognition. The details can be provided via a developer interface on the requesting computing device 106 displayed in response to the initiation of the request 107. Computing device 102 may further comprise one or more servers.
  • Other examples of the computing device 102 include a single computer, cloud computing services, a laptop computer, a desktop computer, a touch computing device or another type of computing device. In the example of FIG. 1A, the requesting computing device 106 is a laptop computer but may also be any computing device such as a cell phone, a desktop computer, or another type of computing device comprising at least a processor, a memory and a communication interface capable of communicating custom wake phrase requests (e.g. via a graphical interface) and receiving responses from computing device 102. Further, in the example embodiment of FIG. 1A, virtual assistant computing device 108 is a cell phone but may be any voice enabled computing device capable of voice interaction and control of its computer services and/or one or more associated smart devices. Example existing virtual assistant computing devices 108 include mobile phones, automobiles, smart speakers, appliances, kiosks, vending machines, and helper robots.
  • Computing device 102, requesting computing device 106 and virtual assistant computing device 108 are coupled for communication to one another via network 104, which may be a wide area network (WAN) such as the Internet. Additional networks may also be coupled to the WAN of network 104, such as a wireless network and/or a local area network (LAN) between the WAN and computing device 102 or between the WAN and any of computing devices 106 and 108.
  • FIG. 1B shows example computer components of device 102, in accordance with one or more aspects of the present disclosure, for example, to provide a system and perform a method to train a model for a custom wake phrase spotter executable, which is operable to spot spoken instances of the custom wake phrase (e.g. once the executable is deployed on device 108), as selected by a developer of the requesting computing device 106 of FIG. 1A.
  • Computing device 102 comprises one or more processors 122, one or more input devices 124, one or more communication units 126 and one or more output devices 128. Computing device 102 also comprises one or more storage devices 130 storing one or more computer modules such as graphical interface 110, operating system module 112, phrase spotter executable generation module 114, training data processing module 116, audio data repository 118 (e.g. a corpus of audio samples labelled with their associated transcriptions) and negative audio data repository 120 (e.g. a corpus of negative audio data such as spoken phrases that are similar but different from custom wake phrase audio data, background noise, environmental sounds, non-speech music, etc.).
  • Communication channel 144 may couple each of the components 122, 124, 126, 128, and 130 (and the computer modules contained therein), for inter-component communications. In some examples, communications channels 144 may include a system bus, a network connection, and inter-process communication data structure, or any other method for communicating data.
  • Processor(s) 122 may implement functionality and/or execute computer instructions within computing device 102. For example, processors 122 may be configured to receive instructions and/or data from storage devices 130 to execute the functionality of the modules 110, 112, 114, 116, 118, and 120 shown in FIG. 1B.
  • One or more communication units 126 are operable to allow communications with external computing devices including requesting computing device 106 and virtual assistant computing device 108 via one or more networks 104 by transmitting and/or receiving network signals on the one or more networks. The communication units 126 may include various antennae and/or network interface cards, etc. for wireless and/or wired communications with external computing devices and network 104.
  • Input devices 124 and output devices 128 may include any of one or more buttons, switches, pointing devices, one or more cameras, a keyboard, a pointing device, a microphone, one or more sensors (e.g., biometric, etc.), a speaker, a bell, one or more lights, a display screen (which may be a touchscreen device providing I/O capabilities), etc. One or more of same may be coupled via a universal serial bus (USB), Bluetooth or other communication units (e.g., 126). That is, input 124 and output 128 devices may be on device 102 or coupled thereto via wired or wireless communication.
  • Referring to FIGS. 1A, 1B and 2, the computing device 102 may store data/information to storage devices 130, which may comprise, for example, input data (e.g. 214, see FIG. 2) providing particulars of the custom wake phrase (e.g. a positive audio sample 208 of the custom wake phrase), output training data (e.g. 216 in FIG. 2) generated by the training data processing module, machine learning trained processes, the audio data repository with transcriptions (e.g. storing a mapping of audio data to transcriptions) used to label input data, and the negative audio data repository (e.g. containing samples of negative data) used to generate the training data 216 to train a model for generating a phrase spotter executable by module 114 for computing device 108. Some of the functionality is described further herein below. The one or more storage devices 130 may store instructions and/or data for processing during operation of computing device 102. The one or more storage devices 130 may take different forms and/or configurations, for example, as short-term memory or long-term memory. Storage devices 130 may be configured for short-term storage of information as volatile memory, which does not retain stored contents when power is removed. Volatile memory examples include random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), etc. Storage devices 130, in some examples, also include one or more computer-readable storage media, for example, to store larger amounts of information than volatile memory and/or to store such information for the long term, retaining information when power is removed. Non-volatile memory examples include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable (EEPROM) memory.
  • Referring again to FIGS. 1A, 1B and 2, upon receiving the request 107 at the computing device 102, computing device 102 is configured to provide a developer interface for display on the requesting computing device 106 for collecting particulars of the custom wake phrase (e.g. input data 214). The input data 214 comprises one or more input positive audio samples 208. Additionally, the input data 214 can comprise one or more of: an input positive text sample 210, and an input negative audio sample 212. The input data 214 is received and processed at the training data processing module 116, in communication with data repositories 118 and 120, to generate training data 216 for subsequent use in training a computerized model for a phrase spotter executable as generated by the module 114.
  • The training data processing module 116 is generally configured to generate a set of positive and/or negative audio samples as training data 216 for subsequent training of a machine learning model (see FIG. 3) and for generating an executable for spotting the custom wake phrase on a particular virtual assistant device (e.g. 108). The training data processing module 116 comprises a text to audio conversion module 202, a repository search module 204 (for searching within associated repositories 118 and 120), and a data manipulation module 206 for facilitating generation of the positive and negative samples in the training data 216. Generally, any audio that contains the exact phrase of the custom wake phrase is considered a positive sample (e.g. 208, 210); one way this labelling convention might be represented is sketched below.
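As a non-limiting illustration of that labelling convention, the following Python sketch shows one way the training data 216 might be held in memory. The class names TrainingSample and TrainingData and their helper methods are hypothetical conveniences for this description, not structures defined by the disclosure.

```python
# A minimal sketch of how training data (216) might be represented.
# The class and field names are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrainingSample:
    audio: bytes          # raw audio of the sample
    is_positive: bool     # True if the audio contains the exact wake phrase

@dataclass
class TrainingData:
    wake_phrase: str
    samples: List[TrainingSample] = field(default_factory=list)

    def add_positive(self, audio: bytes) -> None:
        # Any audio containing the exact wake phrase counts as positive.
        self.samples.append(TrainingSample(audio, is_positive=True))

    def add_negative(self, audio: bytes) -> None:
        # Bait phrases, noise, and other non-matching audio are negatives.
        self.samples.append(TrainingSample(audio, is_positive=False))
```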
  • Each of the computing devices described herein (e.g. 102, 106, 108) may comprise a processor and a non-transitory computer readable medium. The computer readable medium may store computer program code that is executable by each device's processor and that, when executed by the processor, causes each of the computing devices (e.g. 102, 106, and 108) to perform the functionality attributed to them herein.
  • Example embodiments depicting example input data 214 scenarios, and the processing of such input data 214 to generate the training data 216, are described below with reference to FIG. 2. The embodiments below may also be combined, such that various combinations of input data 214 may be envisaged for improving the accuracy of the training data 216.
  • EXAMPLE 1 Input Data Comprising Input Positive Audio Sample 208
  • In one embodiment, the training data processing module 116 is configured to receive input data 214 comprising input positive audio sample 208 data corresponding to audio of a person speaking the custom wake phrase (e.g. a person saying “Hey, Mercedes” or “Hey, Pandora”) from the requesting computing device 106. Such input positive audio sample 208 may be provided via a developer interface of the requesting computing device 106 (e.g. by uploading a pre-existing audio file or by speaking into a microphone of the requesting computing device 106 and recording an input positive sample). In this embodiment, the input positive audio sample 208 is output as training data 216 for training of the model (see FIG. 3).
  • EXAMPLE 2 Input Data Comprising Input Positive Text Sample 210
  • In another embodiment, the training data processing module 116 is further configured to receive input data 214 comprising an input positive text sample 210 corresponding to a text representation or transcription of the custom wake phrase (e.g. text=“Hey Mercedes” or “Hey Pandora”). The input positive text sample 210 may be obtained by developer input (e.g. keyboard input via an interface for custom wake phrase data collection), or may be previously stored on the computing device 106 and/or uploaded onto the computing device 106. The training data processing module 116 is then configured to process the text to obtain a corresponding positive audio sample. In one aspect, this includes searching, via the repository search module 204, within the audio data repository with text transcriptions 118 to determine a stored positive audio sample corresponding to the text provided in the sample 210, and providing that as training data 216. Alternatively, generating audio from the input positive text sample 210 comprises using the text to audio conversion module 202, which applies text-to-speech (TTS) conversion to generate an audio signal of the custom wake phrase that is output as training data 216. In yet another aspect, both the output from the text to audio conversion module 202 and the output from the repository search module 204 (providing a mapping of the text to audio via the audio data repository 118) are included as training data 216 for the training of the computerized model in FIG. 3; a combined flow along these lines is sketched below. Notably, the audio data repository 118 contains a collection of audio data labelled or associated with text transcriptions, and can be augmented to include additional audio data as the developer provides further input positive audio samples 208 and/or input positive text samples 210.
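The following sketch illustrates, under stated assumptions, the combined aspect of Example 2: looking up stored audio by transcription and supplementing it with synthesized audio. The repository is modelled as a plain transcription-to-clips mapping and `synthesize` stands in for an unspecified TTS engine; both are assumptions for illustration only.

```python
# A hedged sketch of Example 2: obtaining positive audio from a text
# sample, by repository lookup and by TTS. `synthesize` is an assumed
# callable wrapping any text-to-speech engine.
from typing import Callable, Dict, List

def positives_from_text(text: str,
                        audio_repository: Dict[str, List[bytes]],
                        synthesize: Callable[[str], bytes]) -> List[bytes]:
    positives: List[bytes] = []
    # Repository search (module 204): transcriptions map to stored clips;
    # keys are assumed to be stored lower-cased.
    stored = audio_repository.get(text.lower())
    if stored:
        positives.extend(stored)
    # Text-to-audio conversion (module 202): TTS as a supplement.
    positives.append(synthesize(text))
    return positives
```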
  • EXAMPLE 3 Input Data Comprising Input Negative Audio Sample 212
  • In yet another embodiment, the training data processing module 116 is configured to receive, from the requesting computing device 106, input data 214 comprising input negative audio sample 212 data corresponding to audio that sounds similar to a person speaking the custom wake phrase (e.g. a person saying “OK Alexandra” instead of “OK Alexa”, or “Hey Doodle” instead of “Hey Google”). Because this input negative audio sample 212 is phonetically (as represented by phonemes) and/or acoustically (as represented by a time sequence of samples or a spectrogram) similar to the input positive audio sample 208 but is an incorrect representation of the custom wake phrase, it is considered a “bait” phrase and is output as a negative sample in the training data 216; this labelling step is sketched below. The bait data in the training data 216 is used to train the computerized model to differentiate bait phrases from actual spoken instances of the custom wake phrase. Such an input negative audio sample 212, including bait samples, may be provided via an interface of the requesting computing device 106 (e.g. by uploading a pre-existing audio file, or by speaking into a microphone of the requesting computing device 106 and recording an input negative sample).
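A minimal sketch of that step, assuming the hypothetical TrainingData container sketched earlier: the point is only that bait audio enters the training data 216 with a negative label.

```python
# A minimal sketch of Example 3, assuming the hypothetical TrainingData
# container above. A developer-supplied near-miss recording (e.g.
# "OK Alexandra" for "OK Alexa") is labelled negative, so training
# teaches the model to reject near-miss "bait" phrases.
def add_bait_sample(training_data: "TrainingData", bait_audio: bytes) -> None:
    training_data.add_negative(bait_audio)
```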
  • EXAMPLE 4 Training Data Comprising Stored Negative Audio Sample
  • In yet another embodiment, in addition to using the input positive audio sample 208 as training data 216, the training data processing module 116 uses the input positive audio sample 208 to search, via the repository search module 204, repositories (e.g. 118 and 120) for a stored negative audio sample having audible similarities to the sample 208 but that is an incorrect representation of the custom wake phrase. The stored negative audio sample is then provided as training data 216.
  • EXAMPLE 5 Training Data Comprising Stored Negative Audio Sample Generated from Input Text Sample 210
  • In yet another embodiment, in addition to using the input positive audio sample 208 as training data 216, the input positive text sample 210 is used to retrieve a stored negative audio sample. Specifically, the input positive text sample 210 provided to the processing module 116 is used by the repository search module 204 to search within the audio data repository 118 for a stored negative audio sample having a transcription that is different from, but phonetically and/or acoustically similar to, the text sample 210 representing the custom wake phrase; one plausible form of such a search is sketched below. In this embodiment, the stored negative audio sample is then provided as training data 216 (e.g. as an example of an incorrect representation of the custom wake phrase).
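One plausible realization of this search, offered as a sketch only: compare phoneme sequences by edit distance and treat near-but-not-equal transcriptions as bait. The `to_phonemes` grapheme-to-phoneme helper and the distance threshold are assumptions; the disclosure does not prescribe a particular similarity measure.

```python
# A sketch of Example 5's repository search: find stored audio whose
# transcription differs from the wake-phrase text but is phonetically
# close to it.
from typing import Callable, Dict, List

def phoneme_edit_distance(a: List[str], b: List[str]) -> int:
    # Standard Levenshtein distance over phoneme sequences.
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a)][len(b)]

def find_bait_candidates(wake_text: str,
                         repository: Dict[str, List[bytes]],
                         to_phonemes: Callable[[str], List[str]],
                         max_distance: int = 2) -> List[bytes]:
    target = to_phonemes(wake_text)
    bait: List[bytes] = []
    for transcription, clips in repository.items():
        if transcription.lower() == wake_text.lower():
            continue  # an identical transcription would be a positive, not bait
        if phoneme_edit_distance(target, to_phonemes(transcription)) <= max_distance:
            bait.extend(clips)
    return bait
```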
  • EXAMPLE 6 Training Data Comprising Stored Negative Audio Sample Generated from Negative Audio Data Repository
  • In yet another embodiment, in addition to using the input positive audio sample 208 as training data 216, the repository search module 204 searches the negative audio data repository 120 for a stored negative audio sample representative of non-speech audio. The data repository 120 contains, for example, non-speech audio of typical environmental noise and a collection of speech-independent negative audio samples such as music.
  • EXAMPLE 7 Input Data Comprising Virtual Assistant Device Identification 218
  • In at least some embodiments and referring to FIGS. 1A, 1B, 2 and 3, the requests 107 relate to multiple different virtual assistant computing devices 108. Accordingly, each of the virtual assistant computing devices 108 is provided, via the computing device 102, with a corresponding custom phrase spotter executable 308 so as to respectively recognize the custom wake phrases intended for a particular device 108. Accordingly, the input data 214 provided subsequent to the request 107 further comprises virtual assistant device identification 218, which uniquely identifies each of the different virtual assistant computing devices 108 in the environment 100. Once the phrase spotter executable 308 is generated, it is linked by the phrase spotter generation module 114 with the particular device 108 via the identification 218. Additionally, in some aspects, in the case of multiple different computing devices 108, the training data processing module 116 is configured to access, via the repository search module 204, negative audio samples within repository 120 that are specific to the particular device 108 (e.g. its environment, type of background sounds expected . . . ). For example, if the identification 218 identifies device 108 as being within a car, then the data manipulation module 206 augments the training data 216 with non-speech audio of typical environmental noise of a car as stored in the repository 120.
  • In the example illustrated in FIG. 2 and FIG. 3, the positive and negative audio samples processed by the training data processing module 116 can further be augmented by adding sound effects such as white noise, babble noise, or radio/car noise to the audio samples and providing the result as training data 216, so as to improve the accuracy of the trained model 306 and diversify the types of audio data that a phrase spotter executable 308 can discriminate. One way such mixing might be performed is sketched below.
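A hedged sketch of such augmentation: additive mixing of a stored noise clip (e.g. car cabin noise selected via identification 218) into a sample at a target signal-to-noise ratio. It assumes both signals are mono float arrays at the same sample rate; SNR-based scaling is one common choice, not one mandated by the disclosure.

```python
# A sketch of noise augmentation for training data (216): mix a noise
# clip into a speech sample at a requested SNR. Assumes 1-D float32
# NumPy arrays sharing one sample rate.
import numpy as np

def mix_noise(sample: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or trim the noise so it covers the whole sample.
    reps = int(np.ceil(len(sample) / len(noise)))
    noise = np.tile(noise, reps)[:len(sample)]
    # Scale the noise so the mixture hits the requested SNR:
    # SNR_dB = 10 * log10(P_sample / P_scaled_noise).
    sample_power = np.mean(sample ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(sample_power / (noise_power * 10 ** (snr_db / 10)))
    return sample + scale * noise
```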
  • Referring to FIGS. 2 and 3, subsequent to generating the training data 216 from the various positive and/or negative audio samples (including bait samples), the processor 122 of the computing device 102 coordinates input of the training data 216 into the phrase spotter executable generation module 114. The phrase spotter executable generation module 114 is configured to perform training via the training module 302 and generate a phrase spotter executable 308 that can run on a virtual assistant computing device (e.g. 108) for detecting the custom wake phrase to “wake up” the device 108.
  • Training module 302 in FIG. 3 uses the training data 216 and generates a trained model 306 for use by the phrase spotter executable generation module 114 to generate the phrase spotter executable 308. Training module 302 receives training data 216 as input to machine learning algorithm 304. Training data 216 is a collection of positive audio samples (indicating correct audio representations of the custom wake phrase) and negative audio samples (indicating incorrect representations of the custom wake phrase). Machine learning algorithm 304 can be a regression or a classification method; a minimal illustrative training flow is sketched below. The machine learning algorithm 304 utilizes the positive and/or negative sample sets and generates an optimal trained model 306. The trained model 306 provides functionality that recognizes positive audio samples in the training data 216 as the custom wake phrase and disregards negative audio samples in the training data 216 as not matching the custom wake phrase. The trained model 306 is then used by the processor 122 to generate the phrase spotter executable 308 to be deployed locally on the virtual assistant computing device 108 (or alternatively as part of a software development kit (SDK) platform, which is a collection of software used for developing applications for a specific device or operating system).
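For concreteness, a minimal sketch of that training flow, assuming fixed-length feature vectors: since the disclosure leaves algorithm 304 open ("a regression or a classification method"), ordinary logistic regression is used here purely as a placeholder, and `extract_features` stands in for an unspecified audio front end.

```python
# A minimal sketch of training module 302. Logistic regression over
# pooled audio features is an illustrative stand-in for algorithm 304;
# `extract_features` is an assumed front end, not part of the disclosure.
from typing import Callable, List
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_wake_model(samples: List[np.ndarray],
                     labels: List[int],
                     extract_features: Callable[[np.ndarray], np.ndarray]):
    # samples: audio clips; labels: 1 = positive, 0 = negative.
    X = np.stack([extract_features(clip) for clip in samples])
    y = np.asarray(labels)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)  # the trained model 306
    return model

# Usage: the trained model then classifies new clips, e.g.
#   model.predict(extract_features(new_clip).reshape(1, -1))
```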
  • In an alternative embodiment of FIG. 3, subsequent to the deploying of the phrase spotter executable 308, the computing device 102 may receive feedback from a developer of the requesting computing device 106 indicating that the model for the wake phrase spotter executable 308 is recognizing incorrect audio samples as the custom wake phrase. In response to this feedback, the processor 122 is configured to include the incorrect audio samples as negative samples in the training data 216 and dynamically re-train the model 306 to generate a newly trained model and a new phrase spotter executable for subsequent deployment; this feedback loop is sketched below.
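A sketch of that feedback loop, reusing the hypothetical helpers above: the reported false accept becomes a negative sample and the model is retrained.

```python
# A sketch of the feedback loop: a false accept reported by the
# developer is folded back into the training data as a negative, and
# `retrain` (e.g. a wrapper around train_wake_model over the updated
# data) produces the new trained model for a new executable.
from typing import Callable

def retrain_on_false_accept(training_data: "TrainingData",
                            false_accept_audio: bytes,
                            retrain: Callable[["TrainingData"], object]):
    training_data.add_negative(false_accept_audio)
    return retrain(training_data)
```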
  • Referring again to FIGS. 1A, 1B-3, in the examples illustrated, in one embodiment a developer interface such as a web interface is provided on the requesting computing device 106 in response to the request 107, for uploading positive and/or negative audio sample data (e.g. a .zip file of .wav or .mp3 files) and downloading a phrase spotter executable (e.g. 308 in FIG. 3) onto a requesting computing device 106 and/or virtual assistant computing device 108. In an alternate embodiment (not illustrated), in response to the request 107, a software development kit (SDK) is provided to the requesting computing device 106 (e.g. containing one or more of the training data processing module 116, the phrase spotter executable generation module 114, the audio data repository with transcriptions 118, and the negative audio data repository 120) to allow developers to run the training process provided in the training module 302 of FIG. 3 independently on their desired computing device (e.g. 106).
  • FIG. 4 is a flowchart illustrating example operations 400 of the computing device 102 of FIGS. 1A, 1B-3, in accordance with an aspect of the present disclosure. Operations 400 receive a developer-defined custom wake phrase for awakening a specific voice controlled virtual assistant device and train a model of the custom wake phrase to spot the custom wake phrase in an audio signal. Notably, in at least some embodiments, the computing environment 100 of FIG. 1A comprises a plurality of different virtual assistant computing devices 108, each requiring a unique custom phrase spotter executable 308 for recognizing a particular custom wake phrase as selected by a developer (e.g., via a developer interface of the requesting computing device 106).
  • At 402, computing device 102 receives a request from a requesting computing device (e.g. 106) for training a custom phrase spotter executable for spotting a custom wake phrase. At step 404, responsive to the request, the computing device 102 receives, at a developer interface, wake phrase customization particulars (e.g. input data 214) for training a model 306 using a machine learning algorithm 304. The input data 214 comprises identification data 218 for identifying a specific virtual assistant device 108 for deploying the custom phrase spotter executable 308 thereon. The particulars (input data 214) further comprise at least one positive audio sample corresponding to at least one person speaking the custom wake phrase. The positive audio sample may be provided as a sound recording, or the particulars may be provided as text characters, uploaded to the computing device 102. In some cases, the computing device 102 may further process the received customization particulars (e.g. positive and/or negative text and/or audio data samples of the custom wake phrase) to generate positive and/or negative training data for training module 302. Such processing may include, for example, manipulating the positive sample(s) with environmental sounds or noise characteristics associated with the specific virtual assistant device 108. Some particulars (e.g. input data 214) may not need any transformation.
  • At 406, training of the model for the custom wake phrase audio is performed using the positive audio sample(s) (e.g. 208), and if applicable, other positive and/or negative audio/text samples in the input data 214. The trained model is for generating the phrase spotter executable 308 that, when deployed on a specific virtual assistant computing device (e.g. 108), recognizes subsequent audio input instances of the custom wake phrase based upon the training. Preferably, at step 408, the phrase spotter executable (e.g. 308 in FIG. 3) is then sent to the requesting computing device 106.
  • Referring now to FIGS. 5A-5C, shown are schematic diagrams illustrating example functionality of wake phrase detectors 500, 510, 520 provided by the phrase spotter executable 308 of FIG. 3, which can simultaneously detect one or more custom wake phrases to “wake up” the device 108 of FIG. 1A, in accordance with one or more embodiments.
  • FIG. 5A shows a neural network based wake phrase detector 500 that recognizes a single custom wake phrase. FIGS. 5B and 5C show neural network based wake phrase detectors 510 and 520 respectively, configured to listen for and recognize multiple wake phrases such that the same model is triggered to indicate a positive detection with any one of the multiple wake phrases (e.g. “Hey Hound” and “Hi Hound”).
  • Referring now to FIG. 5A, shown is a single wake phrase detector 500 with a neural network for detecting a single custom wake phrase. The detector 500 comprises a set of input audio features 501 and a set of outputs 502 that are the most likely sub phrase units or partial-phrase units to have been spoken in a recent sequence of speech. Sub phrase units can be phonemes, units smaller than a phoneme such as senones, or units larger than a phoneme such as di-phones, tri-phones, or sequences of multiple phonemes, including words of multi-word wake phrases. In one aspect, the neural network depicted in FIG. 5A may have one or more hidden layers between the input nodes 501 and output nodes 502. Preferably, the detector 500 comprises a recurrent neural network, such as a long short-term memory (LSTM) neural network, for recognizing a time-varying sequence. The single wake phrase detector 500 further comprises a matching block 503 that identifies when the sequence of sub phrase units provided at output 502 matches a pre-defined sequence for the wake phrase and provides an indication 504 of whether a match exists (e.g. the single custom wake phrase has been detected); one plausible form of this matching is sketched below.
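One plausible form of matching block 503, sketched under the assumption that the network emits one most-likely sub phrase unit per frame: collapse repeated units (CTC-style, with an assumed blank symbol) and scan for the wake phrase's predefined unit sequence. The disclosure does not specify this particular decoding.

```python
# A hedged sketch of matching block 503: collapse per-frame sub-phrase-
# unit predictions from the network and test whether the wake phrase's
# predefined unit sequence appears. CTC-style collapsing is an assumed
# choice, not one required by the disclosure.
from typing import List, Sequence

BLANK = "<blank>"  # assumed "no unit" symbol

def collapse(frame_units: Sequence[str]) -> List[str]:
    # Merge repeated units and drop blanks.
    out: List[str] = []
    prev = None
    for unit in frame_units:
        if unit != prev and unit != BLANK:
            out.append(unit)
        prev = unit
    return out

def matches_wake_phrase(frame_units: Sequence[str],
                        target_units: Sequence[str]) -> bool:
    collapsed = collapse(frame_units)
    n, m = len(collapsed), len(target_units)
    # Scan for the target unit sequence anywhere in the collapsed output;
    # a hit corresponds to indication 504.
    return any(collapsed[i:i + m] == list(target_units)
               for i in range(n - m + 1))
```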
  • Referring to FIG. 5B, shown is a multiple wake phrase detector 510 comprising a neural network similar to the neural network-based wake phrase detector of FIG. 5A but configured to recognize two wake phrases. The detector 510 comprises a set of input audio features 511 and a set of outputs 512 that are the most likely sub phrase units to have been spoken in a recent sequence of sub phrase units for each of the two wake phrases. A first matching block 513 identifies when the sequence of sub phrase units matches the sequence for the first wake phrase and provides a first indication 515 of whether a match exists. A second matching block 514 identifies when the sequence of sub phrase units matches the sequence for the second wake phrase and provides a second indication 516 of whether a match exists for the second wake phrase. The first and second matching blocks 513 and 514 share some sub phrase units as input, but each also receives some unique ones. Thus, the matching blocks 513 and 514 are configured to support, for example, two wake phrases having some overlapping sub phrase units, such as “hey hound” and “okay hound”, which have some common sub phrase units (e.g., “hound”). The detector 510 further comprises a decision block 517, which performs a logical OR operation to determine whether either the first indication 515 or the second indication 516 indicates a positive match, and outputs a final indication 518 of whether a positive match exists for at least one of the two wake phrases; this decision logic is sketched below. The positive match 518 indicates to the associated virtual assistant device 108 that it should “wake up” and respond to subsequent spoken requests.
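The decision logic of blocks 513, 514 and 517 can then be sketched as two applications of the hypothetical matcher above, OR-ed together:

```python
# A sketch of decision block 517, reusing matches_wake_phrase from the
# single-detector sketch above. Two matching blocks share one unit
# stream; the final indication 518 is their logical OR.
from typing import Sequence

def multi_phrase_match(frame_units: Sequence[str],
                       first_units: Sequence[str],
                       second_units: Sequence[str]) -> bool:
    first = matches_wake_phrase(frame_units, first_units)    # indication 515
    second = matches_wake_phrase(frame_units, second_units)  # indication 516
    return first or second                                   # decision block 517

# e.g. "hey hound" and "okay hound", sharing the unit "hound":
# multi_phrase_match(stream, ["hey", "hound"], ["okay", "hound"])
```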
  • Referring to FIG. 5C, shown is a fully neural-network based wake phrase detector 520 for spotting any number of spoken wake phrases. The detector 520 comprises a set of input audio features 521, a hidden layer of features 522, and a final output node 523 that indicates when the phrase detector 520 has recognized any one of the number of spoken wake phrases and provides an indication of recognition as output 524. Preferably, the neural network of FIG. 5C is recurrent, to capture the time-varying nature of speech audio for a wake phrase. In some aspects, the hidden layer might not be necessary, but might represent features analogous to sub phrase units. Such a neural network can be trained for any number of wake phrases by using a training data set of positive examples of each desired wake phrase. This approach works for wake phrases that are similar as well as for wake phrases that are very different, such as ones that are meaningful in different human languages. For very large numbers of wake phrases, good accuracy may require a somewhat larger number of hidden nodes or hidden layers (shown as hidden features 522). A minimal structural sketch follows.
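The sketch below uses PyTorch for illustration only; the layer sizes and the use of the last frame's hidden state are assumptions, not prescriptions of the disclosure.

```python
# A hedged sketch of the fully neural detector 520: a recurrent network
# mapping input audio features straight to a single "any wake phrase
# spoken" output, with the LSTM's hidden state standing in for hidden
# features 522.
import torch
import torch.nn as nn

class EndToEndWakeDetector(nn.Module):
    def __init__(self, n_features: int, n_hidden: int = 128):
        super().__init__()
        # Recurrence captures the time-varying nature of wake phrase audio.
        self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, 1)  # final output node 523

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, n_features) — input audio features 521
        hidden, _ = self.lstm(features)    # hidden features 522
        logits = self.out(hidden[:, -1])   # score at the last frame
        return torch.sigmoid(logits)       # indication of recognition 524
```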
  • While this specification contains many specifics, these should not be construed as limitations, but rather as descriptions of features specific to particular implementations. Certain features that are described in this specification in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.
  • Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.
  • Various embodiments have been described herein with reference to the accompanying drawings. It will, however, be evident that various modifications and changes may be made thereto, and additional embodiments may be implemented, without departing from the broader scope of the disclosed embodiments as set forth in the claims that follow. Further, other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of one or more embodiments of the present disclosure. It is intended, therefore, that this disclosure and the examples herein be considered as exemplary only, with a true scope and spirit of the disclosed embodiments being indicated by the following listing of exemplary claims.

Claims (28)

1. A computing system for training custom phrase spotter executables for virtual assistants, the system comprising a processor and a memory in communication with the processor, the memory storing instructions that, when executed by the processor, configure the computing system to:
receive a request for training a custom phrase spotter executable and an identification of a specific virtual assistant;
responsive to receiving the request, receive:
one or more positive audio samples corresponding to spoken audio of a custom wake phrase;
train, using the positive audio samples, a model for the custom wake phrase audio; and
compile the executable, including the model, such that, when deployed on the specific virtual assistant as identified by the identification, the executable recognizes the custom wake phrase.
2. The computing system of claim 1 further configured to:
receive text corresponding to the custom wake phrase;
search within a corpus of audio samples, stored on a database of the computing system, for one or more stored positive audio samples corresponding to the text; and
include the stored positive audio samples in the training of the model.
3. The computing system of claim 1 further configured to:
receive text corresponding to the custom wake phrase;
apply text-to-speech (TTS) to the text to generate a synthesized positive audio sample of the custom wake phrase; and
include the synthesized positive audio sample in the training of the model.
4. The computing system of claim 1, further configured to:
responsive to receiving the request, receive one or more negative audio samples having audible similarities to the positive audio samples but that are not the custom wake phrase; and
include the negative audio samples in the training of the model.
5. The computing system of claim 1, further configured to:
search within a corpus of audio samples, stored on a database of the computing system, for one or more stored negative audio samples having audible similarities to the positive audio samples but that are not the custom wake phrase; and
include the stored negative audio samples in the training of the model as negative samples.
6. The computing system of claim 2, further configured to:
generate a phoneme representation for the custom wake phrase, in dependence upon the text;
search, within the database, for a phonetically similar wake phrase sharing phonetic features with the phoneme representation and retrieve, from the database, a stored positive audio sample corresponding to the phonetically similar wake phrase; and
utilize the stored positive audio sample in the training of the model.
7. The computing system of claim 1, further configured to:
search, within a corpus of audio samples stored in a database on the computing system, for a stored positive audio sample having an alternate pronunciation of the custom wake phrase but that is an accurate representation of the custom wake phrase; and
include the stored positive audio sample in the training of the model.
8. The computing system of claim 1, wherein the positive audio samples comprise one of: a spoken input provided directly via a developer interface of the computing system; and an audio file provided to the developer interface.
9. The computing system of claim 1, wherein the model for the custom wake phrase audio comprises a neural network receiving input audio features of the positive audio samples and outputting one or more sub phrase units for the input audio features, and the model further comprises a sub phrase unit sequence detector for detecting the custom wake phrase within the one or more output sub phrase units.
10. The computing system of claim 1, wherein the custom wake phrase audio comprises a first wake phrase audio and a second wake phrase audio, the model comprising a neural network receiving input audio features of the positive audio samples of both the first and the second wake phrase audio and outputting one or more sub phrase units for the input audio features, and the model further comprises a first and a second sub phrase unit sequence detector each for respectively detecting a presence of either one of the first and the second wake phrase audio within the one or more output sub phrase units.
11. The computing system of claim 1, wherein the custom wake phrase audio comprises a plurality of wake phrase audio, the model comprising a recurrent neural network receiving input audio features of the positive audio samples of each of the plurality of wake phrase audio and outputting one or more hidden audio features, the model configured to detect a presence of any of the plurality of wake phrase audio.
12. A computer implemented method for training a custom phrase spotter executable, the method comprising:
receiving a request for training a custom phrase spotter executable;
receiving one or more positive audio samples corresponding to spoken audio of a custom wake phrase;
training, using the positive audio samples, a model for the custom wake phrase; and
compiling the executable, including the model, such that, when deployed for a virtual assistant, the executable recognizes the custom wake phrase.
13. The method of claim 12 further comprising:
receiving text corresponding to the custom wake phrase;
searching within a corpus of audio samples for stored positive audio samples corresponding to the text; and
including the stored positive audio samples in the training of the model.
14. The method of claim 12 further comprising:
receiving text corresponding to the custom wake phrase;
applying text-to-speech to the text to generate a synthesized positive audio sample of the custom wake phrase; and
including the synthesized positive audio sample in the training of the model.
15. The method of claim 12, further comprising:
receiving one or more negative audio samples having audible similarities to the positive audio samples but that are not the custom wake phrase; and
including the negative audio samples in the training of the model.
16. The method of claim 12, further comprising:
searching within a corpus of audio samples for negative audio samples having audible similarities to the positive audio samples but that are not the custom wake phrase; and
including the stored negative audio samples in the training of the model.
17. The method of claim 12, further comprising:
searching within a corpus of audio samples for stored positive audio samples acoustically similar to the received one or more positive audio samples; and
including the stored positive audio samples in the training of the model.
18. The method of claim 12, wherein the model for the custom wake phrase comprises:
a neural network receiving input audio features of the positive audio samples and outputting one or more sub phrase units for the input audio features; and
a sub phrase unit sequence detector for detecting the custom wake phrase within the one or more output sub phrase units.
19. The method of claim 12, wherein the positive audio samples comprise audio samples of a first wake phrase and audio samples of a second wake phrase, the model comprising:
a neural network receiving input audio features of the positive audio samples of both the first and the second wake phrase audio and outputting one or more sub phrase units for the input audio features; and
a first and a second sub phrase unit sequence detector each for respectively detecting a presence of either one of the first and the second wake phrase audio within the one or more output sub phrase units.
20. The method of claim 12, wherein the positive audio samples comprise audio samples of a plurality of wake phrases, the model comprising a recurrent neural network configured to detect audio of any of the plurality of wake phrases.
21. A non-transitory computer readable medium storing code for a software development kit (SDK) for training a custom phrase spotter executable for a virtual assistant, the code being executable by a processor and, when executed by the processor, causing the SDK to:
receive a request for training a custom phrase spotter executable;
receive one or more positive audio samples corresponding to spoken audio of a custom wake phrase;
train, using the positive audio samples, a model for the custom phrase spotter executable; and
compile the phrase spotter executable, including the model, such that, when deployed on the virtual assistant, the executable recognizes the custom wake phrase.
22. A computing system for training custom phrase spotter executables for virtual assistants, the system comprising a processor and a memory in communication with the processor, the memory storing instructions that, when executed by the processor, configure the computing system to:
receive a request for training a custom phrase spotter executable and an identification of a specific virtual assistant;
responsive to receiving the request, receive:
text corresponding to a custom wake phrase;
search within a corpus of audio samples, stored on a database of the computing system, for one or more stored positive audio samples corresponding to the text;
train, using the positive audio samples, a model for the custom wake phrase audio; and
compile the executable, including the model, such that, when deployed on the specific virtual assistant as identified by the identification, the executable recognizes the custom wake phrase.
23. The computing system of claim 22 further configured to:
apply text-to-speech (TTS) to the text to generate a synthesized positive audio sample of the custom wake phrase; and
include the synthesized positive audio sample in the training of the model.
24. The computing system of claim 22, further configured to:
search within a corpus of audio samples, stored on a database of the computing system, for one or more negative audio samples having audible similarities to the positive audio samples but that are not the custom wake phrase; and
include the negative audio samples in the training of the model.
25. The computing system of claim 22, further configured to:
receive input from a developer indicating a modification request to modify the model;
responsive to the modification request, search within the corpus of audio samples, stored on the database of the computing system, for one or more additional stored positive audio samples corresponding to an additional custom wake phrase; and
include the additional stored positive audio samples in the training of the model.
26. The computing system of claim 22 further configured to:
subsequent to the deploying of the model, receive feedback from a developer, indicative of the model for the phrase spotter executable recognizing incorrect audio samples as the custom wake phrase; and
dynamically re-train the model by including the incorrect audio samples as negative samples to generate an updated model.
27. The computing system of claim 22, wherein the model for the custom wake phrase audio comprises a neural network receiving input audio features of the positive audio samples and outputting one or more sub phrase units for the input audio features, the model further comprises a sub phrase unit sequence detector for detecting the custom wake phrase within the one or more output sub phrase units.
28. The computing system of claim 22, wherein the custom wake phrase audio comprises a plurality of wake phrase audio, the model comprising a recurrent neural network receiving input audio features of the positive audio samples of each of the plurality of wake phrase audio and outputting one or more hidden audio features, the model configured to detect a presence of any of the plurality of wake phrase audio.
US16/541,995 2019-08-15 2019-08-15 Custom Wake Phrase Training Abandoned US20210050003A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US16/541,995 US20210050003A1 (en) 2019-08-15 2019-08-15 Custom Wake Phrase Training
CN202010158115.9A CN112447171A (en) 2019-08-15 2020-03-09 System and method for providing customized wake phrase training
US17/584,780 US20220148572A1 (en) 2019-08-15 2022-01-26 Server supported recognition of wake phrases

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US16/541,995 US20210050003A1 (en) 2019-08-15 2019-08-15 Custom Wake Phrase Training

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/584,780 Continuation US20220148572A1 (en) 2019-08-15 2022-01-26 Server supported recognition of wake phrases

Publications (1)

Publication Number Publication Date
US20210050003A1 (en) 2021-02-18

Family

ID=74568422

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/541,995 Abandoned US20210050003A1 (en) 2019-08-15 2019-08-15 Custom Wake Phrase Training
US17/584,780 Pending US20220148572A1 (en) 2019-08-15 2022-01-26 Server supported recognition of wake phrases

Family Applications After (1)

Application Number Title Priority Date Filing Date
US17/584,780 Pending US20220148572A1 (en) 2019-08-15 2022-01-26 Server supported recognition of wake phrases

Country Status (2)

Country Link
US (2) US20210050003A1 (en)
CN (1) CN112447171A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11689868B2 (en) * 2021-04-26 2023-06-27 Mun Hoong Leong Machine learning based hearing assistance system

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9536528B2 (en) * 2012-07-03 2017-01-03 Google Inc. Determining hotword suitability
US9640194B1 (en) * 2012-10-04 2017-05-02 Knowles Electronics, Llc Noise suppression for speech processing based on machine-learning mask estimation
US9600231B1 (en) * 2015-03-13 2017-03-21 Amazon Technologies, Inc. Model shrinking for embedded keyword spotting
US9443517B1 (en) * 2015-05-12 2016-09-13 Google Inc. Generating sounds for detectability by neural networks
CN106098059B (en) * 2016-06-23 2019-06-18 上海交通大学 Customizable voice awakening method and system
US11545146B2 (en) * 2016-11-10 2023-01-03 Cerence Operating Company Techniques for language independent wake-up word detection
US10311876B2 (en) * 2017-02-14 2019-06-04 Google Llc Server side hotwording
CN107134279B (en) * 2017-06-30 2020-06-19 百度在线网络技术(北京)有限公司 Voice awakening method, device, terminal and storage medium
US11037555B2 (en) * 2017-12-08 2021-06-15 Google Llc Signal processing coordination among digital voice assistant computing devices
US10991367B2 (en) * 2017-12-28 2021-04-27 Paypal, Inc. Voice activated assistant activation prevention system
US11145298B2 (en) * 2018-02-13 2021-10-12 Roku, Inc. Trigger word detection with multiple digital assistants
US10959029B2 (en) * 2018-05-25 2021-03-23 Sonos, Inc. Determining and adapting to changes in microphone performance of playback devices
DE102018212902A1 (en) * 2018-08-02 2020-02-06 Bayerische Motoren Werke Aktiengesellschaft Method for determining a digital assistant for performing a vehicle function from a multiplicity of digital assistants in a vehicle, computer-readable medium, system, and vehicle
CN109147779A (en) * 2018-08-14 2019-01-04 苏州思必驰信息科技有限公司 Voice data processing method and device
CN110288978B (en) * 2018-10-25 2022-08-30 腾讯科技(深圳)有限公司 Speech recognition model training method and device
CN109448725A (en) * 2019-01-11 2019-03-08 百度在线网络技术(北京)有限公司 A kind of interactive voice equipment awakening method, device, equipment and storage medium
EP3906549B1 (en) * 2019-02-06 2022-12-28 Google LLC Voice query qos based on client-computed content metadata
US11158305B2 (en) * 2019-05-05 2021-10-26 Microsoft Technology Licensing, Llc Online verification of custom wake word
US11282500B2 (en) * 2019-07-19 2022-03-22 Cisco Technology, Inc. Generating and training new wake words

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220013111A1 (en) * 2019-11-14 2022-01-13 Tencent Technology (Shenzhen) Company Limited Artificial intelligence-based wakeup word detection method and apparatus, device, and medium
US11848008B2 (en) * 2019-11-14 2023-12-19 Tencent Technology (Shenzhen) Company Limited Artificial intelligence-based wakeup word detection method and apparatus, device, and medium
US20200090657A1 (en) * 2019-11-22 2020-03-19 Intel Corporation Adaptively recognizing speech using key phrases
CN113012682A (en) * 2021-03-24 2021-06-22 北京百度网讯科技有限公司 False wake-up rate determination method, device, apparatus, storage medium, and program product
CN113223499A (en) * 2021-04-12 2021-08-06 青岛信芯微电子科技股份有限公司 Audio negative sample generation method and device
CN115116442A (en) * 2022-08-30 2022-09-27 荣耀终端有限公司 Voice interaction method and electronic equipment

Also Published As

Publication number Publication date
US20220148572A1 (en) 2022-05-12
CN112447171A (en) 2021-03-05


Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: SOUNDHOUND, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JAIN, NEWTON;ZAHEER, SAMEER SYED;SIGNING DATES FROM 20170828 TO 20220920;REEL/FRAME:061260/0689