CN118339609A - Warm word arbitration between automated assistant devices - Google Patents

Warm word arbitration between automated assistant devices

Info

Publication number
CN118339609A
Authority
CN
China
Prior art keywords
assistant
warm
word
assistant device
devices
Legal status
Pending
Application number
CN202280082492.6A
Other languages
Chinese (zh)
Inventor
Matthew Sharifi
Victor Carbune
Current Assignee
Google LLC
Original Assignee
Google LLC
Priority claimed from US17/573,418 (published as US20230197072A1)
Application filed by Google LLC
Priority claimed from PCT/US2022/052296 (published as WO2023114087A1)
Publication of CN118339609A

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

Techniques for warm word arbitration between automated assistant devices are described herein. A method comprising: determining that warm word arbitration is to be initiated between the first assistant device and one or more additional assistant devices including the second assistant device; broadcasting, by the first assistant device, an active set of warm words of the first assistant device to the one or more additional assistant devices; for each of the one or more additional assistant devices, receiving an active set of warm words for the additional assistant device from the additional assistant device; identifying a matching warm word that is included in the active warm word set of the first assistant device and included in the active warm word set of the second assistant device; and enabling or disabling detection of the matching warm word by the first assistant device in response to identifying the matching warm word.

Description

Warm word arbitration between automated assistant devices
Background
Humans may conduct human-machine conversations with interactive software applications referred to herein as "automated assistants" (also referred to as "digital assistants," "digital agents," "interactive personal assistants," "intelligent personal assistants," "assistant applications," "conversation agents," etc.). For example, humans (who may be referred to as "users" when they interact with an automated assistant) may provide commands and/or requests to the automated assistant using spoken natural language input (i.e., spoken utterances), which in some cases may be converted to text and then processed, by providing textual (e.g., typed) natural language input, and/or through touch and/or utterance-free physical movements (e.g., gestures, eye gaze, facial movements, etc.). An automated assistant typically responds to a request by providing responsive user interface output (e.g., auditory and/or visual user interface output), controlling one or more smart devices, and/or controlling one or more functions of a device implementing the automated assistant (e.g., controlling other applications of the device).
The automated assistant may be a software application executing on the client device. The client device may be a stand-alone interactive speaker, a stand-alone interactive display device (which may also include a speaker and/or a camera), a smart appliance (such as a smart television (or a standard television equipped with a networked dongle with automated assistant functionality)), a desktop computing device, a notebook computing device, a tablet computing device, a mobile phone computing device, a computing device of a user vehicle (e.g., an in-vehicle communication system, an in-vehicle entertainment system, an in-vehicle navigation system), and/or a user wearable apparatus that includes a computing device (e.g., a watch of a user with a computing device, glasses of a user with a computing device, a virtual or augmented reality computing device).
Automated assistants typically rely on a pipeline of components to process user requests. For example, a hotword detection engine may be used to process audio data, monitoring for the occurrence of a spoken hotword (wake word), such as "OK Assistant," and causing other components to perform processing in response to detecting the occurrence. As another example, audio data including a spoken utterance may be processed using an Automatic Speech Recognition (ASR) engine to generate a transcription (i.e., a sequence of terms and/or other tokens) of the user utterance. The ASR engine may process the audio data based on the hotword detection engine having detected the occurrence of a spoken hotword, and/or in response to another invocation of the automated assistant. As another example, the text of a request (e.g., text converted from a spoken utterance using ASR) may be processed using a Natural Language Understanding (NLU) engine to generate a symbolic representation, or belief state, that is a semantic representation of the text. For example, the belief state may include an intent corresponding to the text and optionally intent parameters (e.g., slot values). Once the belief state is fully formed through one or more dialog turns (e.g., all mandatory parameters have been resolved), it represents an action to be performed in response to the spoken utterance. A separate fulfillment component may then utilize the fully formed belief state to perform the action corresponding to the belief state.
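As a non-limiting illustration of the pipeline described above, the following minimal Python sketch chains a hotword detection engine, ASR engine, NLU engine, and fulfillment engine; the engine objects and their method names are hypothetical placeholders and are not drawn from the disclosure.

```python
def handle_audio_stream(frames, hotword_engine, asr_engine, nlu_engine, fulfillment_engine):
    """Sketch of the hotword -> ASR -> NLU -> fulfillment pipeline."""
    for i, frame in enumerate(frames):
        if hotword_engine.detect(frame):              # e.g., spoken "OK Assistant"
            utterance_frames = frames[i + 1:]         # only frames after the hotword are processed further
            transcript = asr_engine.transcribe(utterance_frames)
            belief_state = nlu_engine.parse(transcript)     # intent plus slot values
            if belief_state.is_fully_formed():              # all mandatory parameters resolved
                fulfillment_engine.execute(belief_state)    # perform the mapped action
            return belief_state
    return None                                       # frames without a hotword are discarded
```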
The automated assistant may also include one or more warm word detection engines operable to process audio data by monitoring for the occurrence of particular spoken warm words (e.g., "stop," "volume up," "volume down," or "next") and, in response to detecting the occurrence of a particular warm word, cause execution of a command mapped to the particular warm word. An environment (e.g., a particular location, such as a room in a user's home) may include multiple client devices that execute an automated assistant (i.e., automated assistant devices) and that are located in proximity to one another. Due to their proximate locations in the environment, each of the multiple automated assistant devices may be able to detect (e.g., via a microphone) the same spoken utterance from the user. For a particular warm word, multiple automated assistant devices in the environment may be capable of detecting the particular warm word. Thus, in an environment, two or more automated assistant devices may be able to respond to a spoken utterance that includes a particular warm word.
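As an illustrative sketch (not part of the disclosure), the mapping below shows how each active warm word could correspond directly to a command, and why overlapping detection across nearby devices wastes resources; the dictionary contents and method names are assumptions.

```python
# Each active warm word maps directly to a command, so no wake word is required before speaking it.
WARM_WORD_COMMANDS = {
    "stop": "clear_alarm",
    "volume up": "increase_volume",
    "volume down": "decrease_volume",
    "next": "next_track",
}

def on_warm_word_detected(device, warm_word):
    # Without arbitration, every nearby device that hears "stop" would execute this same command.
    command = WARM_WORD_COMMANDS[warm_word]
    device.execute(command)
```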
Due to resource limitations (e.g., memory limitations and processing limitations), a particular automated assistant device may only be able to use a limited number of warm word detection engines and/or may only be able to detect a limited number of warm words at any given time. This is especially true for older and/or lower-resource assistant devices, which may lack: (a) processing power and/or memory capacity to execute the various components and/or utilize their associated models; and/or (b) disk space capacity to store the various associated models. In addition, computing resources may be wasted in the event that each of two or more automated assistant devices detects the same spoken utterance, each detects a particular warm word in the spoken utterance, and/or each responds such that the command mapped to the particular warm word is executed.
Disclosure of Invention
Some implementations disclosed herein relate to warm word arbitration between automated assistant devices. As described in more detail herein, two or more automated assistant devices located in proximity within an environment may perform warm word arbitration to determine a warm word that each automated assistant device in the environment will detect. Implementations may reduce overall processing costs by reducing or avoiding the occurrence of two or more automated assistant devices detecting the same warm word. Additionally, implementations may allow for detection of a larger set of warm words among multiple automated assistant devices by more efficiently utilizing the overall memory and processing resources among the multiple automated assistant devices within a particular environment.
In various implementations, a method implemented by one or more processors may include: determining to initiate warm word arbitration between the first assistant device and one or more additional assistant devices, the one or more additional assistant devices including the second assistant device, and the first assistant device and the one or more additional assistant devices included in the group of assistant devices; in response to determining to initiate warm word arbitration, performing warm word arbitration, the warm word arbitration including: broadcasting, by the first assistant device, an active warm word set of the first assistant device to one or more additional assistant devices in the set of assistant devices; for each of one or more additional assistant devices in the set of assistant devices, receiving an active set of warm words for the additional assistant device from the additional assistant device; identifying the matching warm word based on the matching warm word being included in the active warm word set of the first assistant device and included in the active warm word set of the second assistant device; and enabling or disabling detection of the matching warm word by the first assistant device in response to identifying the matching warm word.
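For illustration, the following Python sketch mirrors the arbitration flow summarized above from the perspective of the first assistant device; the networking primitives (broadcast/receive) and the keep-or-yield policy are assumptions, not the claimed implementation.

```python
def arbitrate_warm_words(first_device, additional_devices):
    """One round of warm word arbitration from the perspective of the first assistant device."""
    active = set(first_device.active_warm_words)

    # Broadcast this device's active warm word set to the other devices in the group.
    first_device.broadcast(active)

    for other in additional_devices:
        other_active = first_device.receive_active_set(other)   # peer's active warm word set
        for matching_warm_word in active & other_active:
            # Enable or disable detection of the matching warm word on the first device,
            # e.g., based on the affinity scores described below.
            if first_device.wins_arbitration_for(matching_warm_word, other):
                first_device.enable_detection(matching_warm_word)
            else:
                first_device.disable_detection(matching_warm_word)
```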
In some implementations, the method may further include discovering additional assistant devices in the set of assistant devices using a wireless protocol. In some implementations, determining to initiate warm word arbitration may be based on discovering a new assistant device in the group of assistant devices. In some implementations, determining to initiate warm word arbitration may be based on adding a warm word or removing a warm word in the active set of warm words of the first assistant device. In some implementations, determining to initiate warm word arbitration may be based on a change in the surrounding context detected by the first assistant device. In some implementations, determining to initiate warm word arbitration may be based on determining that an assistant device has been removed from the set of assistant devices.
In some implementations, the method can further include: broadcasting, by the first assistant device, for each warm word in the active set of warm words of the first assistant device, an affinity score for the warm word to one or more additional assistant devices in the set of assistant devices; and for each of the one or more additional assistant devices in the set of assistant devices, for each warm word in the active set of warm words of the additional assistant device, receiving an affinity score for the warm word from the additional assistant device. In some implementations, in response to identifying the matching warm word, enabling or disabling detection of the matching warm word by the first assistant device may be based on the affinity score of the matching warm word of the first assistant device and the affinity score of the matching warm word of the second assistant device.
In some implementations, for each warm word in the active set of warm words of the first assistant device, an affinity score for the warm word may be determined based on a frequency of detection of the warm word by the first assistant device; and for each of the one or more additional assistant devices: for each warm word in the active set of warm words for the additional assistant device, an affinity score for the warm word may be determined based on a frequency of detection of the warm word by the additional assistant device.
In some implementations, for each warm word in the active set of warm words of the first assistant device, an affinity score for the warm word may be determined based on a time when the warm word was last detected by the first assistant device; and for each of the one or more additional assistant devices: for each warm word in the active set of warm words for the additional assistant device, an affinity score for the warm word may be determined based on a time at which the warm word was last detected by the additional assistant device.
In some implementations, for each warm word in the active set of warm words for the first assistant device, an affinity score for the warm word may be determined based on a device characteristic of the first assistant device; and for each of the one or more additional assistant devices: for each warm word in the active set of warm words for the additional assistant device, an affinity score for the warm word may be determined based on the device characteristics of the additional assistant device.
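A minimal sketch of how the affinity-score signals named above (detection frequency, time of last detection, and device characteristics) might be combined is shown below; the weighting, field names, and normalization are assumptions for illustration only.

```python
import time

def affinity_score(device, warm_word, now=None):
    """Combine detection frequency, recency, and a device-characteristic weight into one score."""
    now = now if now is not None else time.time()
    stats = device.warm_word_stats[warm_word]
    frequency = stats.detection_count / max(stats.days_active, 1)        # detections per day
    recency = 1.0 / (1.0 + (now - stats.last_detected_ts) / 3600.0)      # decays with hours since last detection
    capability = device.characteristics_weight                           # e.g., microphone quality, model accuracy
    return 0.5 * frequency + 0.3 * recency + 0.2 * capability
```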
In some implementations, enabling or disabling detection of the matching warm word by the first assistant device may include disabling detection of the matching warm word by the first assistant device. In some implementations, the method can further include: in response to disabling detection of the matching warm word by the first assistant device, adding a new warm word to the active set of warm words of the first assistant device. In some implementations, the method can further include repeating the warm word arbitration until no matching warm word is identified in an iteration of the warm word arbitration.
In some implementations, the method can further include: detecting, via a microphone of the first assistant device, a spoken utterance; identifying, by the first assistant device, occurrences of the matched warm word in the spoken utterance using an on-device warm word detection model for the matched warm word; determining that the second assistant device is a target of a command mapped to the matched warm word based on performing automatic speech recognition on at least a portion of the spoken utterance preceding the matched warm word or following the matched warm word; and in response to determining that the second assistant device is the target of the command mapped to the matching warm word, sending the command mapped to the matching warm word to the second assistant device.
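The following sketch illustrates the cross-device handoff just described: the first device detects a matching warm word, transcribes the surrounding audio to identify the intended target, and forwards the mapped command. All helper names (detect, transcribe, is_named_in, send_command) are hypothetical.

```python
def handle_matching_warm_word(first_device, second_device, audio):
    """Detect a matching warm word locally, then route its command to the intended target device."""
    warm_word = first_device.warm_word_engine.detect(audio)        # on-device warm word model
    if warm_word is None:
        return
    # Transcribe the audio around the warm word, e.g., "... on the kitchen speaker".
    surrounding_text = first_device.asr_engine.transcribe(audio)
    command = first_device.command_for(warm_word)
    if second_device.is_named_in(surrounding_text):
        first_device.send_command(second_device, command)          # second device performs the command
    else:
        first_device.execute(command)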
In some implementations, the method can further include determining to include the first assistant device and the one or more additional assistant devices in the group of assistant devices based on determining that the first assistant device and the one or more additional assistant devices are in proximity. In some implementations, the method can further include determining that the group of assistant devices includes the first assistant device and the one or more additional assistant devices based on the first assistant device and the one or more additional assistant devices each detecting the same spoken utterance.
In some additional or alternative implementations, a computer program product may include one or more computer-readable storage media having program instructions stored collectively on the one or more computer-readable storage media. The program instructions may be executable to: identifying a set of first on-device warm word detection models active on a first assistant device and a set of second on-device warm word detection models active on a second assistant device, the first assistant device and the second assistant device being in proximity; identifying a duplicate on-device warm word detection model that is included in the set of first on-device warm word detection models and included in the set of second on-device warm word detection models; and disabling the duplicate on-device warm word detection model on one of the first and second assistant devices in response to identifying the duplicate on-device warm word detection model and based on determining that the first and second assistant devices are in proximity.
In some implementations, the first and second assistant devices may be determined to be in proximity based on the first and second assistant devices each detecting the same spoken utterance.
In some implementations, the program instructions are further executable to identify, for each of the first and second assistant devices, an affinity score for a warm word associated with the repeated on-device warm word detection model. In some implementations, disabling the duplicate on-device warm word detection model on one of the first and second assistant devices may be based on the affinity score.
In some implementations, the affinity score of the warm word associated with the repeated on-device warm word detection model may be determined based on the frequency of detection of the warm word. In some implementations, the affinity score of a warm word associated with the repeated on-device warm word detection model may be determined based on the time when the warm word was last detected.
In some additional or alternative implementations, a system may include a processor, a computer-readable memory, one or more computer-readable storage media, and program instructions stored together on the one or more computer-readable storage media. The program instructions may be executable to: determining to initiate warm word arbitration between the first assistant device and one or more additional assistant devices, the one or more additional assistant devices including the second assistant device, and the first assistant device and the one or more additional assistant devices included in the group of assistant devices; in response to determining to initiate warm word arbitration, performing warm word arbitration, the warm word arbitration including: broadcasting, by the first assistant device, an active warm word set of the first assistant device to one or more additional assistant devices in the set of assistant devices; for each of one or more additional assistant devices in the set of assistant devices, receiving an active set of warm words for the additional assistant device from the additional assistant device; identifying the matching warm word based on the matching warm word being included in the active warm word set of the first assistant device and included in the active warm word set of the second assistant device; and enabling or disabling detection of the matching warm word by the first assistant device in response to identifying the matching warm word.
By utilizing one or more techniques described herein, overall processing costs among multiple automated assistant devices in an environment may be reduced, and overall memory and processing resources among multiple automated assistant devices may be more efficiently utilized. This improves performance by allowing a larger set of warm words to be detected among multiple automated assistant devices.
The foregoing description is provided as an overview of some implementations of the present disclosure. Further descriptions of these and other implementations are described in more detail below.
Various implementations may include a non-transitory computer-readable storage medium storing instructions executable by one or more processors (e.g., a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), and/or a Tensor Processing Unit (TPU)) to perform methods, such as one or more of the methods described herein. Other implementations may include an automated assistant client device (e.g., a client device including at least one automated assistant interface for interacting with a cloud-based automated assistant component) including a processor operable to execute stored instructions to perform a method, such as one or more of the methods described herein. Still other implementations may include a system of one or more servers including one or more processors operable to execute stored instructions to perform methods, such as one or more of the methods described herein.
Drawings
FIG. 1 schematically depicts an example environment in which selected aspects of the present disclosure may be implemented, according to various implementations.
Fig. 2 depicts a flowchart illustrating an example method for practicing selected aspects of the present disclosure.
Fig. 3 depicts another flow chart illustrating an example method for practicing selected aspects of the present disclosure.
FIG. 4 depicts an example architecture of a computing device.
Detailed Description
A user may interact with an automated assistant using any of a variety of automated assistant devices. For example, some users may possess a coordinated "ecosystem" of automated assistant devices that may receive user input for an automated assistant and/or may be controlled by an automated assistant, such as one or more smartphones, one or more tablet computers, one or more vehicle computing systems, one or more wearable computing devices, one or more smart televisions, one or more standalone interactive speakers with a display, one or more IoT devices, and other assistant devices.
A user may use any of these assistant devices to conduct a human-machine conversation with the automated assistant (assuming that the automated assistant client is installed and the assistant device is able to receive input). In some cases, these automated assistant devices may be dispersed around a user's primary residence, secondary residence, workplace, and/or other structure. For example, mobile assistant devices such as smartphones, tablet computers, smartwatches, and the like may be carried by the user and/or may be wherever the user last placed them. Other automated assistant devices, such as traditional desktop computers, smart televisions, standalone interactive speakers, and IoT devices, may be more stationary, but may still be located in various places (e.g., rooms) within a user's home or workplace.
FIG. 1 schematically depicts an example environment 100 in which selected aspects of the present disclosure may be implemented, according to various implementations. Any of the computing devices shown in fig. 1 or elsewhere in the drawings may include logic, such as one or more microprocessors (e.g., central processing units or "CPUs," graphics processing units or "GPUs") executing computer-readable instructions stored in memory, or other types of logic, such as application-specific integrated circuits ("ASICs"), field-programmable gate arrays ("FPGAs"), and the like. Some of the systems shown in fig. 1, such as the cloud-based automated assistant component 140, may be implemented using one or more server computing devices, sometimes referred to as a "cloud infrastructure," although this is not required.
In an implementation, the environment 100 may include an automated assistant ecosystem, for example, including a first assistant device 110A, a second assistant device 110B, a third assistant device 110C, and a fourth assistant device 110D. The assistant devices 110A-D may all be located in a home, business, or other environment. In an implementation, the assistant devices 110A-D may be proximately located within the environment 100. In particular, the physical location of the assistant devices 110A-D in the environment 100 may enable each of the assistant devices 110A-D to detect (e.g., via a microphone) the same spoken utterance from the user, e.g., when the user's physical location is in the vicinity of the assistant devices 110A-D. For example, each of the assistant devices 110A-D may be physically located within the same room (and/or within the same area of the room) in a home, business, or other environment, and when the user is also located within the same room and utterances are made, the microphone of each of the assistant devices 110A-D may detect the utterances. In another example, each of the assistant devices 110A-D may be physically located in a different room in a home, business, or other environment, but still be close enough to each other such that when a user utters an utterance detected by a microphone of one of the assistant devices 110A-D, the microphones of the other assistant devices 110A-D will also detect the same utterance.
Further, the assistant devices 110A-D may all be linked together in one or more data structures or otherwise associated with each other. For example, the four assistant devices 110A-D may all be registered with the same user account, registered with the same set of user accounts, registered with a particular structure, and/or all assigned to a particular structure in the device topology representation. For each of the assistant devices 110A-D, the device topology representation may include a corresponding unique identifier, and may optionally include corresponding unique identifiers of other devices that are not assistant devices (but may be interacted with via assistant devices), such as IoT devices that do not include an assistant interface. Further, the device topology representation may specify device attributes associated with the respective assistant devices 110A-D. The device attributes of a given assistant device may indicate, for example, one or more input and/or output modes supported by the respective assistant device, processing capabilities of the respective assistant device, a brand, model, and/or unique identifier (e.g., serial number) of the respective assistant device (based on which processing capabilities may be determined), and/or other attributes. As another example, the assistant devices 110A-D may all be linked together or otherwise associated with each other by virtue of being connected to the same wireless network (e.g., a secure-access wireless network), and/or by virtue of collectively being in peer-to-peer communication with each other (e.g., via Bluetooth and after pairing). In other words, in some implementations, multiple assistant devices may be considered linked together by virtue of being in secure network communication with each other, and need not necessarily be associated with each other in any data structure.
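The snippet below sketches what two entries of such a device topology representation could look like; the field names, identifiers, and values are illustrative assumptions rather than the patent's schema.

```python
DEVICE_TOPOLOGY = {
    "assistant_device_110A": {
        "unique_id": "110A-serial-0001",
        "structure": "home",
        "room": "living_room",
        "input_modes": ["voice", "touch", "camera"],
        "output_modes": ["audio", "display"],
        "model": "standalone-speaker-with-display-and-camera",
    },
    "assistant_device_110C": {
        "unique_id": "110C-serial-0003",
        "structure": "home",
        "room": "living_room",
        "input_modes": ["voice"],
        "output_modes": ["audio"],
        "model": "standalone-speaker",
    },
}
```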
As a non-limiting working example, the first assistant device 110A may be a first type of assistant device, such as a particular model of interactive independent speaker with a display and camera. The second assistant device 110B may be a second type of assistant device, such as a second model of an interactive independent speaker without a display or camera. The assistant devices 110C and 110D may each be a third type of assistant device, such as a third model of an interactive independent speaker without a display. The third type (assistant devices 110C and 110D) may have lower processing power than the second type (assistant device 110B). For example, the third type of processor may have a lower processing power than the second type of processor. For example, a third type of processor may lack any GPU, while a second type of processor includes a GPU. Also, for example, the third type of processor may have a smaller cache and/or a lower operating frequency than the second type of processor. As another example, the size of the memory on the third type of device may be smaller than the size of the memory on the second type of device (e.g., 1 GB compared to 2 GB). As yet another example, the third type of available disk space may be smaller than the first type of available disk space. The available disk space may be different from the currently available disk space. For example, the available disk space may be determined as the current available disk space plus the disk space currently occupied by the model on one or more devices. As another example, the available disk space may be the total disk space minus any space occupied by the operating system and/or other specific software. Continuing with the working example, the first type and the second type may have the same processing power.
In addition to being linked together in a data structure, two or more (e.g., all) of the assistant devices 110A-D also communicate with each other at least selectively via a Local Area Network (LAN) 108. LAN 108 may include a wireless network (such as a network utilizing Wi-Fi), a direct peer-to-peer network (such as a network utilizing bluetooth), and/or other communication topologies utilizing other communication protocols.
The assistant device 110A includes an assistant client 120A, which may be a stand-alone application on an operating system or may form all or part of the operating system of the assistant device 110A. The assistant client 120A in fig. 1 includes a wake/call (hotword) engine 121A1 and one or more associated on-device wake/call (hotword) models 131A1. The wake/call engine 121A1 may monitor for the occurrence of one or more wake or call prompts (e.g., hotwords) and, in response to detecting one or more of the prompts, may invoke one or more previously inactive functions of the assistant client 120A. For example, invoking assistant client 120A may include causing ASR engine 122A1, NLU engine 123A1, and/or other engines to be activated. For example, the ASR engine 122A1 may be caused to process frames of audio data that follow the wake or call prompt (whereas frames of audio data occurring prior to the invocation are not further processed), and/or the assistant client 120A may be caused to transmit additional frames of audio data and/or other data to the cloud-based assistant component 140 for processing (e.g., processing of the frames of audio data by a remote ASR engine of the cloud-based assistant component 140).
In some implementations, the wake prompt engine 121A1 may continuously process (e.g., if not in an "inactive" mode) a stream of audio data frames that are based on output from one or more microphones of the assistant device 110A to monitor for the occurrence of a spoken wake word or call phrase (e.g., "OK Assistant," "Hey Assistant"). This processing may be performed by the wake prompt engine 121A1 using one or more wake models 131A1. For example, one of the wake models 131A1 may be a trained neural network model for processing frames of audio data and generating an output indicating whether one or more wake words are present in the audio data. While monitoring for the occurrence of a wake word, the wake prompt engine 121A1 discards (e.g., after temporary storage in a buffer) any frames of audio data that do not include a wake word. In addition to or instead of monitoring for the occurrence of wake words, the wake prompt engine 121A1 can also monitor for the occurrence of other call prompts. For example, the wake prompt engine 121A1 may also monitor for presses of a call hardware button and/or a call software button. As another example, continuing with the working example, when the assistant device 110A includes a camera, the wake prompt engine 121A1 may also optionally process image frames from the camera to monitor for call gestures (such as a hand wave) occurring while the user's gaze is directed toward the camera, and/or other call prompts such as the user's gaze being directed toward the camera along with an indication that the user is speaking.
The assistant client 120A in fig. 1 also includes an Automatic Speech Recognition (ASR) engine 122A1 and one or more associated on-device ASR models 132A1. The ASR engine 122A1 may be used to process audio data including a spoken utterance to generate a transcription (i.e., a sequence of terms and/or other tokens) of the user utterance. The ASR engine 122A1 may process the audio data using the on-device ASR model 132A1. The on-device ASR model 132A1 may include, for example, a two-pass ASR model that is a neural network model and that is used by the ASR engine 122A1 to generate a sequence of probabilities over tokens (and a transcription from those probabilities). As another example, the on-device ASR model 132A1 may include an acoustic model that is a neural network model, and a language model that includes a mapping of phoneme sequences to words. The ASR engine 122A1 may process the audio data using the acoustic model to generate a sequence of phonemes, and may map the sequence of phonemes to particular terms using the language model. Additional or alternative ASR models may be used.
The assistant client 120A in fig. 1 also includes a Natural Language Understanding (NLU) engine 123A1 and one or more associated on-device NLU models 133A1. The NLU engine 123A1 can generate a symbolic representation, or belief state, that is a semantic representation of natural language text, such as text in a transcription generated by the ASR engine 122A1 or typed text (e.g., typed using a virtual keyboard of the assistant device 110A). For example, the belief state may include an intent corresponding to the text and optionally intent parameters (e.g., slot values). Once the belief state is fully formed through one or more dialog turns (e.g., all mandatory parameters have been resolved), it represents an action to be performed in response to the spoken utterance. In generating the symbolic representation, the NLU engine 123A1 may utilize one or more on-device NLU models 133A1. The NLU models 133A1 may include one or more neural network models that are trained to process text and generate output that indicates an intent expressed by the text and/or indicates which portion(s) of the text correspond to which parameter(s) of the intent. The NLU models may additionally or alternatively include one or more models that include mappings of text and/or templates to corresponding symbolic representations. For example, the mappings may include a mapping of the text "what time is it" to a "current time" intent with a "current location" parameter. As another example, the mappings may include a mapping of the template "add [merchandise] to my shopping list" to an "insert shopping list" intent with a merchandise parameter whose value is the actual natural language corresponding to [merchandise] in the template.
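A minimal sketch of the template-to-intent mapping just described is shown below; the regular-expression patterns and the fallback behavior are assumptions for illustration.

```python
import re

TEMPLATE_MAPPINGS = [
    (re.compile(r"^what time is it$", re.IGNORECASE),
     lambda m: {"intent": "current_time", "params": {"location": "current_location"}}),
    (re.compile(r"^add (?P<item>.+) to my shopping list$", re.IGNORECASE),
     lambda m: {"intent": "insert_shopping_list", "params": {"item": m.group("item")}}),
]

def parse_with_templates(text):
    for pattern, build in TEMPLATE_MAPPINGS:
        match = pattern.match(text.strip())
        if match:
            return build(match)
    return None  # unmatched text falls back to a neural NLU model

# parse_with_templates("add milk to my shopping list")
# -> {"intent": "insert_shopping_list", "params": {"item": "milk"}}
```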
The assistant client 120A in fig. 1 also includes a fulfillment engine 124A1 and one or more associated on-device fulfillment models 134A1. The fulfillment engine 124A1 may perform, or cause to be performed, an action corresponding to a fully formed symbolic representation from the NLU engine 123A1. These actions may include providing responsive user interface output (e.g., audible and/or visual user interface output), controlling a smart device, and/or performing other actions. In performing or causing performance of an action, the fulfillment engine 124A1 may utilize the fulfillment model 134A1. As one example, for an "on" intent with a parameter specifying a particular smart light, the fulfillment engine 124A1 may utilize the fulfillment model 134A1 to identify a network address of the particular smart light and/or a command to be transmitted to cause the particular smart light to transition to an "on" state. As another example, for a "current time" intent with a "current location" parameter, the fulfillment engine 124A1 may utilize the fulfillment model 134A1 to identify that the current time at the assistant device 110A should be retrieved and audibly presented (utilizing the TTS engine 125A1).
The assistant client 120A in fig. 1 also includes a text-to-speech (TTS) engine 125A1 and one or more associated on-device TTS models 135A1. The TTS engine 125A1 can process text (or a phonetic representation thereof) using the on-device TTS model 135A1 to generate synthesized speech. The synthesized speech may be audibly presented through one or more speakers local to the assistant device 110A. Synthesized speech may be generated and presented as all or part of a response from the automated assistant, and/or to prompt the user to define and/or clarify parameters and/or an intent (e.g., as coordinated by the NLU engine 123A1 and/or a separate dialog state engine).
The assistant client 120A in fig. 1 also includes an authentication engine 126A1 and one or more associated on-device authentication models 136A1. The authentication engine 126A1 may utilize one or more authentication techniques to verify which of a plurality of registered users is interacting with the assistant device 110A or, if only one user is registered with the assistant device 110A, to verify whether it is the registered user who is interacting with the assistant device 110A (or instead a guest/unregistered user). As one example, with permission of the associated user, a text-dependent speaker verification (TD-SV) may be generated and stored for each registered user (e.g., in association with the user's corresponding user profile). The authentication engine 126A1 may utilize a TD-SV model of the on-device authentication models 136A1 to process a corresponding portion of audio data and generate a corresponding current TD-SV, which may then be compared to the stored TD-SVs to determine whether there is a match. As other examples, the authentication engine 126A1 may additionally or alternatively utilize text-independent speaker verification (TI-SV) techniques, other speaker verification techniques, facial verification techniques, and/or other verification techniques (e.g., PIN entry), and the corresponding on-device authentication models 136A1, to authenticate a particular user.
Assistant client 120A in fig. 1 also includes a warm word engine 127A1 and one or more associated on-device warm word models 137A1. The warm word engine 127A1 can at least selectively monitor for the occurrence of one or more warm words or other warm cues and, in response to detecting one or more warm words or other warm cues, cause the assistant client 120A to perform a particular action. The warm words may be in addition to any wake words or other wake cues, and each warm word may be at least selectively active (e.g., enabled). Notably, detecting the occurrence of a warm cue may cause a particular action to be performed even though the detected occurrence was not preceded by any wake cue. Thus, when the warm cue is one or more particular words, the user can simply speak the word(s), without providing any wake cue, and cause the corresponding particular action to be performed.
As one example, a "stop" warm word may be active (e.g., enabled) at least when a timer or alarm is audibly presented at the assistant device 110A via the assistant client 120A. For example, at such times, the warm word engine 127A1 may continuously (or at least when voice activity is detected by the VAD engine 128A1) process a stream of audio data frames based on the output of one or more microphones of the assistant device 110A to monitor for the occurrence of "stop," "pause," or another limited set of specific warm words. This processing may be performed by the warm word engine 127A1 using one of the warm word models 137A1, such as a neural network model trained to process frames of audio data and generate an output indicating whether a spoken occurrence of "stop" is present in the audio data. In response to detecting the occurrence of "stop," the warm word engine 127A1 may cause a command to be implemented to clear the audibly sounding timer or alarm. Also at such times, the warm word engine 127A1 may continuously (or at least when a presence sensor detects presence) process an image stream from the camera of the assistant device 110A to monitor for the presence of a hand in a "stop" pose. This processing may be performed by the warm word engine 127A1 using one of the warm word models 137A1, such as a neural network model trained to process frames of visual data and generate an output indicating whether a hand in a "stop" pose is present. In response to detecting the occurrence of the "stop" gesture, the warm word engine 127A1 can likewise cause a command to be implemented to clear the audibly sounding timer or alarm.
As another example, the "volume up," "volume down," and "next" warm words may be active (e.g., enabled) at least when music or other media content is audibly presented at the assistant device 110A via the assistant client 120A. For example, at such times, the warm word engine 127A1 may continuously process a stream of audio data frames based on the output of one or more microphones of the assistant device 110A. The processing may include monitoring for the occurrence of "volume up" using a first one of the warm word models 137A1, monitoring for the occurrence of "volume down" using a second one of the warm word models 137A1, and monitoring for the occurrence of "next" using a third one of the warm word models 137A1. In some implementations, the different warm word models 137A1 may be loaded into memory from storage of the assistant device 110A, or may be downloaded by the assistant device 110A from a local model repository 150 accessible through interaction with the cloud-based assistant component 140. In response to detecting the occurrence of "volume up," the warm word engine 127A1 may cause a command to be implemented that increases the volume of the music being presented; in response to detecting the occurrence of "volume down," it may cause a command to be implemented that decreases the volume of the music; and in response to detecting the occurrence of "next," it may cause a command to be implemented that presents the next track in place of the current music track.
Continuing with this example, when music is not audibly presented at the assistant device 110A via the assistant client 120A, the "volume up," "volume down," and "next" warm words may be inactive (e.g., disabled), and the warm word engine 127A1 may not monitor for the occurrence of those warm words. Specifically, the first one of the warm word models 137A1 for monitoring for the occurrence of "volume up," the second one of the warm word models 137A1 for monitoring for the occurrence of "volume down," and the third one of the warm word models 137A1 for monitoring for the occurrence of "next" may not be loaded into memory by the assistant client 120A.
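The sketch below illustrates, under assumed state names and method names, how the set of active warm words could follow the current device state as in the timer and media examples above.

```python
CONTEXT_WARM_WORDS = {
    "alarm_or_timer_sounding": {"stop", "pause"},
    "media_playing": {"volume up", "volume down", "next"},
    "idle": set(),
}

def update_active_warm_words(device):
    desired = CONTEXT_WARM_WORDS.get(device.current_state(), set())
    for warm_word in desired - device.active_warm_words:
        device.load_warm_word_model(warm_word)      # load the model into memory, enabling detection
    for warm_word in device.active_warm_words - desired:
        device.unload_warm_word_model(warm_word)    # unload the model, disabling detection
    device.active_warm_words = set(desired)
```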
In some implementations, the set of warm words detected by the assistant device 110A using the warm word models 137A1 may be user-configured and/or automatically selected based on the current context and/or functionality of the assistant device 110A.
In another example, the assistant client 120A may enable detection of a new warm word by the warm word engine 127A1 by causing a new warm word model 137A1 to be loaded into memory, e.g., from storage of the assistant device 110A, or downloaded from the local model repository 150 accessible via interaction with the cloud-based assistant component 140. The assistant client 120A may disable detection of a warm word by the warm word engine 127A1 by avoiding loading the warm word model 137A1 corresponding to the warm word into memory and/or by unloading the warm word model 137A1 corresponding to the warm word from memory. While unloaded from memory, the warm word model 137A1 may remain stored in the storage of the assistant device 110A and may be loaded or reloaded into memory at another point in time, e.g., to re-enable detection of the corresponding warm word. Alternatively, the assistant client 120A may disable detection of a warm word by the warm word engine 127A1 by purging the warm word model 137A1 from the assistant device 110A, e.g., by unloading the warm word model 137A1 from memory and also deleting the warm word model 137A1 from the storage of the assistant device 110A.
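A minimal sketch of the memory-versus-storage distinction described above follows; the class, method names, and use of dictionaries as stand-ins for memory, disk, and the model repository are assumptions.

```python
class WarmWordModelManager:
    """Track where each warm word model lives: loaded in memory, stored on disk, or absent."""

    def __init__(self, device_storage, model_repository):
        self.device_storage = device_storage        # dict acting as the device's local disk
        self.model_repository = model_repository    # e.g., a local model repository such as repository 150
        self.loaded = {}                            # models currently in memory (detection enabled)

    def enable(self, warm_word):
        if warm_word not in self.device_storage:
            self.device_storage[warm_word] = self.model_repository.download(warm_word)
        self.loaded[warm_word] = self.device_storage[warm_word]   # load into memory

    def disable(self, warm_word):
        self.loaded.pop(warm_word, None)            # unload from memory; the model stays on disk

    def purge(self, warm_word):
        self.disable(warm_word)
        self.device_storage.pop(warm_word, None)    # also delete the model from device storage
```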
The assistant client 120A also includes a Voice Activity Detector (VAD) engine 128A1 and one or more associated on-device VAD models 138A1. The VAD engine 128A1 can at least selectively monitor the audio data for the occurrence of voice activity and, in response to detecting the occurrence, cause the assistant client 120A to perform one or more functions. For example, in response to detecting voice activity, the VAD engine 128A1 may cause the wake prompt engine 121A1 to be activated. As another example, the VAD engine 128A1 may be used in a continuous listening mode to monitor for the occurrence of voice activity in audio data and cause the ASR engine 122A1 to be activated in response to detecting the occurrence. The VAD engine 128A1 may process the audio data using the VAD model 138A1 to determine whether voice activity is present in the audio data.
The particular engine and corresponding model have been described with respect to assistant client 120A. However, it should be noted that some engines may be omitted and/or additional engines may be included. It should also be noted that with its various on-device engines and corresponding models, the assistant client 120A can fully process many assistant requests, including many assistant requests provided in spoken utterances. However, since the processing power of the assistant device 110A is relatively limited, many assistant requests cannot be completely processed locally on the assistant device 110A. For example, NLU engine 123A1 and/or corresponding NLU model 133A1 may cover only a subset of all available intents and/or parameters available via the automated assistant. As another example, the fulfillment engine 124A1 and/or corresponding fulfillment model may only cover a subset of the available fulfillment. As yet another example, the ASR engine 122A1 and the corresponding ASR model 132A1 may not be robust and/or accurate enough to properly transcribe various spoken utterances.
In view of these and other considerations, the cloud-based assistant component 140 can still be at least selectively employed to perform at least some processing of assistant requests received at the assistant device 110A. The cloud-based automated assistant component 140 can include engines and/or models that are counterparts to (and/or are in addition to or in lieu of) the engines and/or models of the assistant device 110A. However, because the cloud-based automated assistant component 140 can leverage the virtually limitless resources of the cloud, one or more of the cloud-based counterparts can be more robust and/or more accurate than the corresponding components of the assistant client 120A. As one example, in response to a spoken utterance seeking performance of an assistant action not supported by the local NLU engine 123A1 and/or the local fulfillment engine 124A1, the assistant client 120A may transmit audio data of the spoken utterance, and/or a transcription thereof generated by the ASR engine 122A1, to the cloud-based automated assistant component 140. The cloud-based automated assistant component 140 (e.g., its NLU engine and/or its fulfillment engine) can perform more robust processing of such data, enabling the assistant action to be resolved and/or performed. Such data is transmitted to the cloud-based automated assistant component 140 via one or more Wide Area Networks (WANs) 109, such as the internet or a private WAN.
The second assistant device 110B includes an assistant client 120B, which may be a stand-alone application on an operating system or may form all or part of the operating system of the assistant device 110B. Similar to assistant client 120A, assistant client 120B includes: wake/call engine 121B1 and one or more associated on-device wake/call models 131B1; the ASR engine 122B1 and one or more associated on-device ASR models 132B1; NLU engine 123B1 and one or more associated on-device NLU models 133B1; fulfillment engine 124B1 and one or more associated on-device fulfillment models 134B1; TTS engine 125B1 and one or more associated on-device TTS models 135B1; authentication engine 126B1 and one or more associated on-device authentication models 136B1; warm word engine 127B1 and one or more associated on-device warm word models 137B1; and a VAD engine 128B1 and one or more associated on-device VAD models 138B1.
Some or all of the engines and/or models of the assistant client 120B may be the same as the engines and/or models of the assistant client 120A, and/or some or all may be different. For example, the wake prompt engine 121B1 may lack functionality for detecting wake prompts in images and/or the wake models 131B1 may lack a model for processing images to detect wake prompts, whereas the wake prompt engine 121A1 includes such functionality and the wake models 131A1 include such a model. This may be due to, for example, the assistant device 110A including a camera while the assistant device 110B does not include a camera. As another example, the ASR model 132B1 used by the ASR engine 122B1 may be different from the ASR model 132A1 used by the ASR engine 122A1. This may be due to, for example, different models being optimized for the different processor and/or memory capabilities of the assistant device 110A and the assistant device 110B.
The particular engine and corresponding model have been described with respect to assistant client 120B. However, it should be noted that some engines may be omitted and/or additional engines may be included. It should also be noted that with its various on-device engines and corresponding models, the assistant client 120B can fully process many assistant requests, including many assistant requests provided in spoken utterances. However, since the processing power of the client device 110B is relatively limited, many assistant requests cannot be completely processed locally at the assistant device 110B. In view of these and other considerations, the cloud-based assistant component 140 can still be at least selectively employed to perform at least some processing of the assistant request received at the assistant device 110B.
The third assistant device 110C includes an assistant client 120C, which may be a stand-alone application on the operating system or may form all or part of the operating system of the assistant device 110C. As with assistant client 120A and assistant client 120B, assistant client 120C includes: wake/call engine 121C1 and one or more associated on-device wake/call models 131C1; authentication engine 126C1 and one or more associated on-device authentication models 136C1; warm word engine 127C1 and one or more associated on-device warm word models 137C1; and a VAD engine 128C1 and one or more associated on-device VAD models 138C1. Some or all of the engines and/or models of assistant client 120C may be the same as and/or different from the engines and/or models of assistant client 120A and/or assistant client 120B.
However, it should be noted that, unlike assistant client 120A and assistant client 120B, assistant client 120C does not include: any ASR engine or associated model; any NLU engine or associated model; any fulfillment engine or associated model; and any TTS engine or associated model. Further, it should also be noted that, with its various on-device engines and corresponding models, assistant client 120C can only fully process certain assistant requests (i.e., assistant requests that conform to a warm word detected by warm word engine 127C1) and cannot process many assistant requests, such as assistant requests provided in spoken utterances that do not conform to warm cues. In view of these and other considerations, the cloud-based assistant component 140 can still be at least selectively employed to perform at least some processing of assistant requests received at the assistant device 110C.
The fourth assistant device 110D includes an assistant client 120D, which may be a stand-alone application on the operating system or may form all or part of the operating system of the assistant device 110D. As with assistant client 120A, assistant client 120B, and assistant client 120C, assistant client 120D includes: wake/call engine 121D1 and one or more associated on-device wake/call models 131D1; authentication engine 126D1 and one or more associated on-device authentication models 136D1; warm word engine 127D1 and one or more associated on-device warm word models 137D1; and a VAD engine 128D1 and one or more associated on-device VAD models 138D1. Some or all of the engines and/or models of assistant client 120D may be the same as, and/or may be different from, the engines and/or models of assistant client 120A, assistant client 120B, and/or assistant client 120C.
It should be noted, however, that, unlike assistant client 120A and assistant client 120B, and like assistant client 120C, assistant client 120D does not include: any ASR engine or associated model; any NLU engine or associated model; any fulfillment engine or associated model; and any TTS engine or associated model. Further, it should also be noted that, with its various on-device engines and corresponding models, assistant client 120D can only fully process certain assistant requests (i.e., assistant requests that conform to the warm cues detected by warm word engine 127D1) and cannot process many assistant requests, such as assistant requests provided in spoken utterances that do not conform to warm words. In view of these and other considerations, the cloud-based assistant component 140 can still be at least selectively employed to perform at least some processing of assistant requests received at the assistant device 110D.
Fig. 2 is a flowchart illustrating an example method 200 of warm word arbitration between automated assistant devices according to an implementation disclosed herein. For convenience, the operations of the flowcharts are described with reference to systems performing the operations. The system may include various components of various computer systems, such as one or more components of the assistant devices 110A-D. Furthermore, although the operations of method 200 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, or added.
At block 205, the system discovers, by the first assistant device, one or more additional assistant devices in the set of assistant devices using a wireless protocol. In an implementation, at block 205, an assistant client 120A executing on a first assistant device (e.g., assistant device 110A) discovers one or more additional assistant devices (e.g., assistant devices 110B-D) in a group of assistant devices (e.g., the group of assistant devices 110A-D that are located in proximity in environment 100) using a wireless protocol (e.g., Bluetooth, Wi-Fi, ultrasonic audio, etc.). In some implementations, the one or more additional assistant devices include a second assistant device (e.g., assistant device 110B).
At block 210, the system determines to include a first assistant device and one or more additional assistant devices in the set of assistant devices. In an implementation, at block 210, the assistant client 120A executing on the assistant device 110A determines to include a first assistant device (e.g., the assistant device 110A) and one or more additional assistant devices (e.g., the assistant devices 110B-D) discovered at block 205 in the set of assistant devices. In some implementations, determining to include the first assistant device and the one or more additional assistant devices in the set of assistant devices is based on determining that the first assistant device and the one or more additional assistant devices are in proximity. In other implementations, determining that the group of assistant devices includes the first assistant device and the one or more additional assistant devices is based on the first assistant device and the one or more additional assistant devices each detecting the same spoken utterance. In some implementations, determining to include the first assistant device and the one or more additional assistant devices in the group of assistant devices may also be based on the first assistant device and the one or more additional assistant devices both registering in the same user account, registering in the same group of user accounts, and so on.
Still referring to block 210, in some implementations, assistant device proximity may be determined based on detecting occurrences of the same spoken utterance that have at least a threshold "loudness" and/or a threshold signal-to-noise ratio (SNR), with the loudness and/or SNR being similar across the assistant devices for the occurrence. In other words, the proximity of two or more assistant devices may be determined based on all of the assistant devices clearly detecting the same spoken utterance. In other implementations, the proximity of two or more assistant devices may be determined based on their being assigned to the same room in a home graph. In yet other implementations, the first assistant device may emit a sound (optionally at a frequency that is not audible to humans) via its speaker and request that the other assistant devices use their microphones to listen for the sound and report whether the sound was detected and/or characteristics of the detected sound (e.g., loudness, SNR, timestamp of detection, etc.). The assistant devices may use this information regarding whether the sound was detected and/or the characteristics of the detected sound to determine which assistant devices are in proximity. In yet other implementations, Near Field Communication (NFC) may be used to determine which assistant devices are in proximity.
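A minimal sketch of the loudness/SNR heuristic described above is given below; the threshold values and the similarity criterion are assumptions chosen only for illustration.

```python
LOUDNESS_THRESHOLD_DB = 50.0
SNR_THRESHOLD_DB = 10.0
MAX_SPREAD_DB = 6.0

def devices_in_proximity(detections):
    """detections: list of (device_id, loudness_db, snr_db) reported for the same spoken utterance."""
    clear = [d for d in detections
             if d[1] >= LOUDNESS_THRESHOLD_DB and d[2] >= SNR_THRESHOLD_DB]
    if len(clear) < 2:
        return []
    loudest = max(d[1] for d in clear)
    # Devices whose measurements are similar to the loudest detection are treated as proximate.
    return [d[0] for d in clear if loudest - d[1] <= MAX_SPREAD_DB]
```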
At block 215, the system determines that warm word arbitration is to be initiated between the first assistant device and one or more additional assistant devices. In an implementation, at block 215, the assistant client 120A executing on the assistant device 110A determines that warm word arbitration is to be initiated between a first assistant device (e.g., the assistant device 110A) and one or more additional assistant devices (e.g., the assistant devices 110B-D). In response to determining that warm word arbitration is to be initiated at block 215, the system performs warm word arbitration according to blocks 220 through 235.
Still referring to block 215, in some implementations, determining to initiate warm word arbitration is based on discovering a new assistant device in the set of assistant devices (e.g., adding a new assistant device to environment 100 in the vicinity of assistant devices 110A-D) and/or based on determining that an assistant device has been removed from the set of assistant devices (e.g., an assistant device previously in the vicinity of assistant devices 110A-D is no longer detected in environment 100). In other implementations, determining to initiate warm word arbitration is based on adding or removing warm words in the active set of warm words of the first assistant device (e.g., assistant device 110A), such as based on a configuration change requested by the user. In still other implementations, determining to initiate warm word arbitration is based on a change in the surrounding context detected by the first assistant device (e.g., assistant device 110A), e.g., a new user entering the space in which the first assistant device is located, and/or the user initiating a new activity such as cooking. In some implementations, any of the assistant devices 110A-D in the environment 100 may initiate arbitration.
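The trigger conditions above can be summarized in a short sketch. The event names are assumptions used only to label the conditions already described; any of these events would cause the device to start the arbitration exchange of blocks 220 through 235.

```python
# Assumed event labels for the arbitration triggers described at block 215.
ARBITRATION_TRIGGERS = {
    "device_added_to_group",
    "device_removed_from_group",
    "active_warm_word_added",
    "active_warm_word_removed",
    "ambient_context_changed",  # e.g., a new user enters the room, cooking starts
}

def should_initiate_arbitration(event_type: str) -> bool:
    return event_type in ARBITRATION_TRIGGERS
```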
At block 220, the system broadcasts, by the first assistant device, an active warm word set of the first assistant device to one or more additional assistant devices in the set of assistant devices. In an implementation, at block 220, the assistant client 120A executing on the assistant device 110A broadcasts an active warm word set of the assistant device 110A to one or more additional assistant devices in the set of assistant devices (e.g., assistant devices 110B-D).
Still referring to block 220, in some implementations, for each warm word in the active set of warm words of the first assistant device, the system broadcasts an affinity score for the warm word to one or more additional assistant devices (e.g., assistant devices 110B-D) in the set of assistant devices through the first assistant device (e.g., assistant device 110A). Affinity scores are intended to be comparable between the assistant devices 110A-D and may be determined based on shared metrics. In some implementations, for each warm word in the active set of warm words of the first assistant device, an affinity score for the warm word is determined based on a frequency of detection of the warm word by the first assistant device. In other implementations, for each warm word in the active set of warm words of the first assistant device, an affinity score for the warm word is determined based on a time the first assistant device most recently detected the warm word. Affinity scores may also be determined in consideration of the current context and/or explicit user signals from past interactions. In other implementations, the affinity score of a warm word is determined taking into account the accuracy of a particular warm word model used by the assistant device (e.g., assistant device 110A). For example, a higher resource device may have a more accurate version of the warm word model loaded on the device, and thus the affinity score for the warm word for the higher resource device may be higher than the affinity score for the warm word for the lower resource device.
Referring still to block 220, in still other implementations, for each warm word in the active set of warm words for the first assistant device, an affinity score for the warm word may be determined based on the device characteristics of the first assistant device. Affinity scores may also be determined based on user characteristics and/or warm word embeddings. The affinity score may be determined taking into account warm words that are semantically related (e.g., "play" and "pause") and/or phonetically similar, and that may be more relevant to a particular assistant device (e.g., based on a user's preferred media device). In some implementations, the affinity score can be determined based on an output of a machine learning model.
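One hypothetical way to combine the signals discussed above (detection frequency, recency, and device characteristics) into a single comparable score is sketched below. The weights, normalization, and half-life are assumptions; the disclosure leaves the exact scoring open and also allows the score to come from a machine learning model.

```python
import math
import time

def affinity_score(
    detections_per_day: float,
    last_detected_unix: float,
    device_capability: float,        # assumed 0.0-1.0, higher for more capable devices
    recency_half_life_hours: float = 24.0,
) -> float:
    """Combine frequency, recency, and device characteristics into one score."""
    frequency_term = detections_per_day / (1.0 + detections_per_day)  # saturates at 1
    hours_since = max(0.0, (time.time() - last_detected_unix) / 3600.0)
    recency_term = math.exp(-math.log(2) * hours_since / recency_half_life_hours)
    # Assumed weights; shared across devices so that scores remain comparable.
    return 0.5 * frequency_term + 0.3 * recency_term + 0.2 * device_capability
```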
At block 225, for each of one or more additional assistant devices in the set of assistant devices, the system receives an active warm word set of additional assistant devices from the additional assistant devices. In an implementation, at block 225, for each of one or more additional assistant devices in the set of assistant devices (e.g., assistant devices 110B-D), the assistant client 120A executing on the assistant device 110A receives an active set of warm words for the additional assistant device from the additional assistant device.
Still referring to block 225, in some implementations, for each of one or more additional assistant devices in the set of assistant devices, for each warm word in the active set of warm words for the additional assistant device, the system receives an affinity score for the warm word from the additional assistant device. In some implementations, for each of the one or more additional assistant devices, for each warm word in the active set of warm words for the additional assistant device, an affinity score for the warm word is determined based on a frequency of detection of the warm word by the additional assistant device. In other implementations, for each of the one or more additional assistant devices, for each warm word in the active set of warm words for the additional assistant device, an affinity score for the warm word is determined based on a time at which the additional assistant device most recently detected the warm word. In yet other implementations, for each of the one or more additional assistant devices, for each warm word in the active set of warm words for the additional assistant device, an affinity score for the warm word is determined based on the device characteristics of the additional assistant device.
At block 230, the system determines whether a matching warm word is identified based on the matching warm word being included in the set of active warm words of the first assistant device and included in the set of active warm words of the second assistant device. In an implementation, at block 230, the assistant client 120A executing on the assistant device 110A determines whether a matching warm word is identified based on the matching warm word being included in the active warm word set of the first assistant device (e.g., assistant device 110A) and included in the active warm word set of the second assistant device (e.g., assistant device 110B) received at block 225.
Still referring to block 230, in response to the assistant client 120A determining that a matching warm word is identified based on the matching warm word being included in the active warm word set of the first assistant device (e.g., assistant device 110A) and being included in the active warm word set of the second assistant device (e.g., assistant device 110B), flow proceeds to block 235. On the other hand, in response to assistant client 120A determining that a matching warm word is not identified, flow proceeds to block 240.
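In effect, block 230 amounts to intersecting the local active warm word set with each received active warm word set, as in the following sketch (the function name and data structures are assumptions for illustration).

```python
def find_matching_warm_words(
    local_active: set[str],
    peer_active_sets: dict[str, set[str]],   # device_id -> that device's active warm words
) -> dict[str, set[str]]:
    """Return, per peer device, the warm words it shares with the local device."""
    matches = {}
    for device_id, peer_set in peer_active_sets.items():
        shared = local_active & peer_set
        if shared:
            matches[device_id] = shared
    return matches
```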
In response to identifying the matching warm word, the system enables or disables detection of the matching warm word by the first assistant device at block 235. In an implementation, at block 235, in response to identifying the matching warm word at block 230, the assistant client 120A executing on the assistant device 110A enables or disables detection of the matching warm word by the first assistant device (e.g., assistant device 110A). Subsequently, flow returns to block 220 and the warm word arbitration process is repeated until no matching warm word is detected at block 230 in an iteration of the warm word arbitration process.
Still referring to block 235, in some implementations, assistant client 120A may enable detection of a new warm word by warm word engine 127A by causing a new warm word model 137A1 to be loaded into memory, for example, from a storage of assistant device 110A or downloaded from local model repository 150 accessible via interaction with cloud-based assistant component 140. In some implementations, assistant client 120A may disable detection of a warm word by warm word engine 127A by avoiding loading warm word model 137A1 corresponding to the warm word into memory and/or by unloading warm word model 137A1 corresponding to the warm word from memory. While unloaded from memory, the warm word model 137A1 may remain stored in the storage of the assistant device 110A and may be loaded or reloaded into memory at another point in time, e.g., to enable detection of a corresponding warm word. Alternatively, assistant client 120A may disable detection of warm words by warm word engine 127A by clearing warm word model 137A1 from assistant device 110A, for example, by unloading warm word model 137A1 from memory and also deleting warm word model 137A1 from the storage of assistant device 110A.
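A minimal sketch of the three model states implied above is shown below: loaded in memory (detection enabled), unloaded but retained in local storage (detection disabled, quick to re-enable), and purged from the device entirely. The class, method names, and storage interface are assumptions for illustration.

```python
class WarmWordModelManager:
    """Hypothetical manager for per-warm-word detection models on one device."""

    def __init__(self, storage):
        self.storage = storage   # assumed local model store with a download fallback
        self.loaded = {}         # warm word -> in-memory model (detection enabled)

    def enable(self, warm_word: str):
        if warm_word not in self.loaded:
            # Load from local storage, downloading from the model repository if needed.
            self.loaded[warm_word] = self.storage.load_or_download(warm_word)

    def disable(self, warm_word: str, purge: bool = False):
        self.loaded.pop(warm_word, None)       # unload from memory
        if purge:
            self.storage.delete(warm_word)     # also remove from local storage
```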
Still referring to block 235, in some implementations, in response to identifying a matching warm word, detection of the matching warm word by the first assistant device (e.g., assistant device 110A) is enabled or disabled based on the affinity score of the matching warm word for the first assistant device (e.g., assistant device 110A) and the affinity score of the matching warm word for the second assistant device (e.g., assistant device 110B). Specifically, in some implementations, if the affinity score of the matching warm word for the first assistant device (e.g., assistant device 110A) is higher than the affinity score of the matching warm word for each of the one or more additional assistant devices, including the second assistant device (e.g., assistant device 110B), the assistant client 120A executing on assistant device 110A may enable detection of the matching warm word. On the other hand, if the affinity score of the matching warm word for the first assistant device (e.g., assistant device 110A) is lower than the affinity score of the matching warm word for one of the one or more additional assistant devices, including the second assistant device (e.g., assistant device 110B), the assistant client 120A executing on assistant device 110A may disable detection of the matching warm word.
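The comparison described above can be sketched as follows: the local device keeps a matching warm word enabled only if its affinity score is the highest among the devices that have the warm word active. The tie-break by device identifier is an assumption; the disclosure does not specify how ties are resolved.

```python
def should_keep_matching_warm_word(
    local_device_id: str,
    local_score: float,
    peer_scores: dict[str, float],   # device_id -> affinity score for the same warm word
) -> bool:
    """Return True to keep detection enabled locally, False to disable it."""
    best_peer = max(peer_scores.items(), key=lambda kv: kv[1], default=None)
    if best_peer is None:
        return True
    peer_id, peer_score = best_peer
    if local_score != peer_score:
        return local_score > peer_score
    return local_device_id > peer_id   # assumed deterministic tie-break
```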
Still referring to block 235, in some implementations, in response to disabling detection of a warm word by a first assistant device (e.g., assistant device 110A), the first assistant device adds a new warm word to the active warm word set of the first assistant device. In some implementations, the new warm word may be selected (e.g., by the assistant client 120A) based on the new warm word having a second highest affinity score for the first assistant device (e.g., assistant device 110A). In other implementations, the new warm word may be selected (e.g., by the assistant client 120A) based on a relationship between the new warm word and the existing warm word or set of warm words enabled for detection on the first assistant device. For example, if the existing warm word set includes "volume up", "volume down", and "next track", a new warm word "last track" may be added based on "last track" and the existing warm word set associated with music play control.
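Selection of the replacement warm word can be sketched as below. The relatedness map is an assumed stand-in for the semantic or embedding-based similarity mentioned above, and the function name is introduced only for illustration.

```python
# Assumed relatedness map standing in for semantic/embedding similarity.
RELATED_WARM_WORDS = {
    "volume up": {"volume down", "mute"},
    "next track": {"last track", "play", "pause"},
}

def pick_replacement_warm_word(
    candidates: dict[str, float],    # inactive warm word -> local affinity score
    still_enabled: set[str],
) -> str | None:
    """Prefer candidates related to the warm words still enabled, then by affinity."""
    if not candidates:
        return None
    related = {w for enabled in still_enabled for w in RELATED_WARM_WORDS.get(enabled, set())}
    return max(candidates, key=lambda w: (w in related, candidates[w]))
```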
Still referring to block 235, in some implementations, the system may provide an indication to the user (e.g., on a user interface displayed on a display of the assistant device 110A) as to which warm words are active and on which device. For example, the assistant device 110A may display a list of available warm words with device icons to indicate which device of the assistant devices 110A-D is to process each warm word.
At block 240, the system detects a spoken utterance via a microphone of the first assistant device. In an implementation, at block 240, the assistant client 120A executing on the assistant device 110A detects the spoken utterance via a microphone of the assistant device 110A. For example, the assistant client 120A may detect the spoken utterance "SET A TIMER for 3 minutes on my phone". In this example, the user's cell phone may be the assistant device 110B.
At block 245, the system identifies, by the first assistant device, the occurrence of the matched warm word in the spoken utterance using an on-device warm word detection model for the matched warm word. In an implementation, at block 245, the assistant client 120A uses the on-device warm word model 137A for the matched warm word to identify the occurrence of the matched warm word in the spoken utterance detected at block 240. For example, the assistant client 120A may identify "SET A TIMER" as an occurrence of the matching warm word.
At block 250, the system determines that the second assistant device is the target of the command mapped to the matched warm word based on performing automatic speech recognition on at least a portion of the spoken utterance preceding the matched warm word or following the matched warm word. In an implementation, at block 250, the assistant client 120A, based on performing automatic speech recognition (e.g., using the ASR engine 122A1) on at least a portion of the spoken utterance (detected at block 240) that precedes or follows the matched warm word, determines that the device name provided in the spoken utterance does not match the device name of the processing device (assistant device 110A), and instead determines that the second assistant device (e.g., assistant device 110B) is the target of the command mapped to the matched warm word. In the above example, the assistant client 120A determines that "my phone" (i.e., assistant device 110B) is the target of the command mapped to the matching warm word ("SET A TIMER").
Still referring to block 250, in some implementations, in addition to modifying the processing of the current spoken utterance, the user's explicit designation of another device as the target of the command mapped to the matching warm word may also be used to modify ongoing processing of commands mapped to the matching warm word. For example, the device specified in the spoken utterance may be assigned, at least for a period of time, to execute commands mapped to the matching warm word. In this case, assistant client 120A may disable detection of the matching warm word and assistant client 120B may enable detection of the matching warm word. The assistant client 120A may enable another warm word to replace the disabled matching warm word. In some implementations, such a change may take effect for a particular amount of time (e.g., n hours, or the remainder of the day) or for the duration of the current user activity.
At block 255, in response to determining that the second assistant device is the target of the command mapped to the matching warm word, the system sends the command mapped to the matching warm word to the second assistant device. In an implementation, at block 255, in response to determining at block 250 that the second assistant device is the target of the command mapped to the matching warm word, the assistant client 120A sends the command mapped to the matching warm word to the second assistant device (e.g., assistant device 110B). In this example, the assistant client 120A sends the command mapped to the matching warm word ("SET A TIMER") to the user's cell phone (assistant device 110B).
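Blocks 240 through 255 can be illustrated end to end with the sketch below: detect the matching warm word with the on-device model, transcribe the surrounding portion of the utterance, check whether another device is named as the target, and forward the command if so. The helper objects (warm_word_model, asr_engine, device_registry, transport) are assumed interfaces introduced for illustration, not APIs drawn from the disclosure.

```python
def handle_utterance(audio, local_device_id, warm_word_model, asr_engine,
                     device_registry, transport):
    warm_word = warm_word_model.detect(audio)               # e.g., "set a timer"
    if warm_word is None:
        return
    transcript = asr_engine.transcribe(audio)               # e.g., "set a timer for 3 minutes on my phone"
    target_id = device_registry.resolve_target(transcript)  # e.g., "my phone" -> assistant device 110B
    command = {"warm_word": warm_word, "text": transcript}
    if target_id and target_id != local_device_id:
        transport.send(target_id, command)                  # block 255: forward to the named device
    else:
        execute_locally(command)                            # assumed local fulfillment path

def execute_locally(command):
    print(f"executing {command['warm_word']!r} locally")
```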
Fig. 3 is a flowchart illustrating an example method 300 of warm word arbitration between automated assistant devices according to an implementation disclosed herein. For convenience, the operations of the flowcharts are described with reference to systems performing the operations. The system may include various components of various computer systems, such as one or more components of the assistant devices 110A-D. Furthermore, although the operations of method 300 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, or added.
At block 310, the system determines that the first and second assistant devices are in proximity. In an implementation, at block 310, the cloud-based assistant component 140 determines that the first assistant device (e.g., assistant device 110A) and the second assistant device (e.g., assistant device 110B) are in proximity based on the first assistant device and the second assistant device each detecting the same spoken utterance.
Still referring to block 310, in some implementations, proximity of the assistant devices may be determined based on detecting occurrences of the same spoken utterance having at least a threshold "loudness" and/or a threshold signal-to-noise ratio (SNR), with the loudness and/or SNR being similar across the assistant devices for the occurrence. In other words, two or more assistant devices may be determined to be in proximity based on all of the assistant devices clearly detecting the same spoken utterance. In other implementations, two or more assistant devices may be determined to be in proximity based on being assigned to the same room in a map of the home. In yet other implementations, the first assistant device may emit a sound (optionally at a frequency that is not audible to humans) via its speaker and request that the other assistant devices listen for the sound using their microphones and report whether the sound was detected and/or characteristics of the detected sound (e.g., loudness, SNR, timestamp of detection, etc.). The assistant devices may use this information regarding whether the sound was detected and/or the characteristics of the detected sound to determine which assistant devices are in proximity. In yet other implementations, Near Field Communication (NFC) may be used to determine which assistant devices are in proximity.
At block 320, the system identifies a set of first on-device warm word detection models active on a first assistant device and a set of second on-device warm word detection models active on a second assistant device. In an implementation, at block 320, the cloud-based assistant component 140 identifies a set of first on-device warm word detection models (e.g., on-device warm word models 137A1) that are active on a first assistant device (e.g., assistant device 110A) and a set of second on-device warm word detection models (e.g., on-device warm word models 137B1) that are active on a second assistant device (e.g., assistant device 110B), the first and second assistant devices having been determined to be in proximity at block 310.
At block 330, the system identifies a duplicate on-device warm word detection model that is included in both the first set of on-device warm word detection models and the second set of on-device warm word detection models. In an implementation, at block 330, the cloud-based assistant component 140 identifies duplicate on-device warm word detection models that are included in both the first set of on-device warm word detection models (e.g., on-device warm word models 137A1) and the second set of on-device warm word detection models (e.g., on-device warm word models 137B1).
Still referring to block 330, in some implementations, for each of the first and second assistant devices, the system identifies an affinity score for a warm word associated with the duplicate on-device warm word detection model. In some implementations, the affinity score of the warm word associated with the duplicate on-device warm word detection model is determined based on the frequency of detection of the warm word. In other implementations, the affinity score of the warm word associated with the duplicate on-device warm word detection model is determined based on the time when the warm word was last detected.
At block 340, in response to identifying the duplicate on-device warm word detection model and based on determining that the first and second assistant devices are in proximity, the system disables the duplicate on-device warm word detection model on one of the first and second assistant devices. In an implementation, at block 340, responsive to identifying the duplicate on-device warm word detection model at block 330, and based on determining at block 310 that the first assistant device (e.g., assistant device 110A) and the second assistant device (e.g., assistant device 110B) are in proximity, the cloud-based assistant component 140 disables the duplicate on-device warm word detection model identified at block 330 on one of the first assistant device and the second assistant device.
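Method 300 can be summarized, from the perspective of the cloud-based component, with the following sketch: find the warm word models active on both nearby devices and disable each duplicate on the device with the lower affinity score. The data structures and the disable_model callable are assumptions for illustration.

```python
def arbitrate_duplicate_models(
    device_a: str,
    device_b: str,
    active_models: dict[str, set[str]],      # device_id -> warm words with active on-device models
    affinity: dict[tuple[str, str], float],  # (device_id, warm word) -> affinity score
    disable_model,                           # callable(device_id, warm_word)
) -> None:
    duplicates = active_models[device_a] & active_models[device_b]
    for warm_word in duplicates:
        score_a = affinity.get((device_a, warm_word), 0.0)
        score_b = affinity.get((device_b, warm_word), 0.0)
        loser = device_a if score_a < score_b else device_b
        disable_model(loser, warm_word)
```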
FIG. 4 is a block diagram of an example computing device 410 that may optionally be used to perform one or more aspects of the techniques described herein. In some implementations, one or more of the client device, cloud-based automation assistant component, and/or other components may include one or more components of the example computing device 410.
The computing device 410 typically includes at least one processor 414 that communicates with a number of peripheral devices via a bus subsystem 412. These peripheral devices may include a storage subsystem 424 (including, for example, a memory subsystem 425 and a file storage subsystem 426), a user interface output device 420, a user interface input device 422, and a network interface subsystem 416. The input devices and output devices allow user interaction with the computing device 410. The network interface subsystem 416 provides an interface to external networks and couples with corresponding interface devices in other computing devices.
User interface input devices 422 may include a keyboard; a pointing device such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen integrated into a display; an audio input device such as a voice recognition system or microphone; and/or other types of input devices. Generally, use of the term "input device" is intended to include all possible types of devices and ways of inputting information into computing device 410 or onto a communication network.
The user interface output device 420 may include a display subsystem, a printer, a facsimile machine, or a non-visual display such as an audio output device. The display subsystem may include a Cathode Ray Tube (CRT), a flat panel device such as a Liquid Crystal Display (LCD), a projection device, or some other mechanism for producing a viewable image. The display subsystem may also provide for non-visual display, such as via an audio output device. Generally, use of the term "output device" is intended to include all possible types of devices and ways to output information from computing device 410 to a user or to another machine or computing device.
Storage subsystem 424 stores programming and data structures that provide the functionality of some or all of the modules described herein. For example, storage subsystem 424 may include logic for performing selected aspects of the methods disclosed herein, as well as for implementing the various components depicted in fig. 1.
These software modules are typically executed by processor 414 alone or in combination with other processors. Memory subsystem 425 included in storage subsystem 424 may include a number of memories, including a main Random Access Memory (RAM) 430 for storing instructions and data during program execution and a Read Only Memory (ROM) 432 in which fixed instructions are stored. File storage subsystem 426 may provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. Modules implementing the functionality of certain implementations may be stored by file storage subsystem 426 in storage subsystem 424, or in other machines accessible to processor 414.
Bus subsystem 412 provides a mechanism for allowing the various components and subsystems of computing device 410 to communicate with each other as intended. Although bus subsystem 412 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple buses.
Computing device 410 may be of different types including a workstation, a server, a computing cluster, a blade server, a server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 410 depicted in FIG. 4 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 410 are possible, with more or fewer components than the computing device depicted in fig. 4.
Although several implementations have been described and illustrated herein, a variety of other means and/or structures for performing the functions and/or obtaining the results and/or one or more of the advantages described herein may be utilized, and each of such variations and/or modifications is deemed to be within the scope of the implementations described herein. More generally, all parameters, dimensions, materials, and configurations described herein are meant to be exemplary, and the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the implementations described herein. It is, therefore, to be understood that the foregoing implementations are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, the implementations may be practiced otherwise than as specifically described and claimed. Implementations of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.

Claims (22)

1. A method implemented by one or more processors, the method comprising:
determining to initiate warm word arbitration between a first assistant device and one or more additional assistant devices, the one or more additional assistant devices including a second assistant device, and the first assistant device and the one or more additional assistant devices being included in a group of assistant devices;
in response to determining to initiate warm word arbitration, performing warm word arbitration, the warm word arbitration comprising:
Broadcasting, by the first assistant device, an active warm word set of the first assistant device to the one or more additional assistant devices in the set of assistant devices;
For each of the one or more additional assistant devices in the set of assistant devices, receiving an active set of warm words for the additional assistant device from the additional assistant device;
Identifying a matching warm word based on the matching warm word being included in the set of active warm words of the first assistant device and included in the set of active warm words of the second assistant device; and
Responsive to identifying the matching warm word, enabling or disabling detection of the matching warm word by the first assistant device.
2. The method of claim 1, further comprising discovering the additional assistant device in the set of assistant devices using a wireless protocol.
3. The method of claim 1 or 2, wherein determining to initiate warm word arbitration is based on discovering a new assistant device in the set of assistant devices.
4. The method of any of the preceding claims, wherein determining to initiate warm word arbitration is based on determining that an assistant device has been removed from the set of assistant devices.
5. The method of any of the preceding claims, wherein determining to initiate warm word arbitration is based on adding warm words or removing warm words in the active set of warm words of the first assistant device.
6. The method of any of the preceding claims, wherein determining to initiate warm word arbitration is based on a change in ambient context detected by the first assistant device.
7. The method of any of the preceding claims, further comprising:
Broadcasting, by the first assistant device, for each warm word in the active set of warm words of the first assistant device, an affinity score for the warm word to the one or more additional assistant devices in the set of assistant devices; and
For each of the one or more additional assistant devices in the set of assistant devices, for each warm word in the active set of warm words for the additional assistant device, receiving an affinity score for the warm word from the additional assistant device,
Wherein enabling or disabling detection of the matching warm word by the first assistant device in response to identifying the matching warm word is based on the affinity score of the matching warm word of the first assistant device and the affinity score of the matching warm word of the second assistant device.
8. The method of claim 7, wherein:
For each warm word in the active set of warm words of the first assistant device, the affinity score for the warm word is determined based on a frequency of detection of the warm word by the first assistant device; and
For each of the one or more additional assistant devices:
For each warm word in the active set of warm words of the additional assistant device, the affinity score for the warm word is determined based on a frequency of detection of the warm word by the additional assistant device.
9. The method of claim 7, wherein:
for each warm word in the active set of warm words of the first assistant device, the affinity score for the warm word is determined based on a time the warm word was last detected by the first assistant device; and
For each of the one or more additional assistant devices:
For each warm word in the active set of warm words of the additional assistant device, the affinity score for the warm word is determined based on a time the additional assistant device most recently detected the warm word.
10. The method of claim 7, wherein:
for each warm word in the active set of warm words for the first assistant device, the affinity score for the warm word is determined based on a device characteristic of the first assistant device; and
For each of the one or more additional assistant devices:
For each warm word in the active set of warm words for the additional assistant device, the affinity score for the warm word is determined based on device characteristics of the additional assistant device.
11. The method of any of the preceding claims, wherein enabling or disabling detection of the matching warm word by the first assistant device comprises disabling, by the first assistant device, detection of the matching warm word, and
wherein the method further comprises, in response to the first assistant device disabling detection of the matching warm word, adding a new warm word to the active set of warm words of the first assistant device.
12. The method of any of the preceding claims, further comprising repeating the warm word arbitration process until no matching warm word is detected in an iteration of the warm word arbitration process.
13. The method of any of the preceding claims, further comprising:
detecting a spoken utterance via a microphone of the first assistant device;
Identifying, by the first assistant device, an occurrence of the matched warm word in the spoken utterance using an on-device warm word detection model for the matched warm word;
Determining that the second assistant device is a target of a command mapped to the matched warm word based on performing automatic speech recognition on at least a portion of the spoken utterance preceding the matched warm word or following the matched warm word; and
Responsive to determining that the second assistant device is the target of the command mapped to the matching warm word, sending the command mapped to the matching warm word to the second assistant device.
14. The method of any of the preceding claims, further comprising determining to include the first assistant device and the one or more additional assistant devices in the set of assistant devices based on determining that the first assistant device and the one or more additional assistant devices are in proximity.
15. The method of any of the preceding claims, further comprising determining that the set of assistant devices includes the first assistant device and the one or more additional assistant devices based on the first assistant device and the one or more additional assistant devices each detecting a same spoken utterance.
16. A method implemented by one or more processors, the method comprising:
identifying a set of first on-device warm word detection models active on a first assistant device and a set of second on-device warm word detection models active on a second assistant device, the first assistant device and the second assistant device being in proximity;
Identifying a duplicate on-device warm word detection model that is included in both the first set of on-device warm word detection models and the second set of on-device warm word detection models; and
Responsive to identifying the duplicate on-device warm word detection model and based on determining that the first assistant device and the second assistant device are in proximity, disabling the duplicate on-device warm word detection model on one of the first assistant device and the second assistant device.
17. The method of claim 16, wherein the first assistant device and the second assistant device are determined to be in proximity based on the first assistant device and the second assistant device each detecting the same spoken utterance.
18. The method of claim 16 or 17, further comprising, for each of the first and second assistant devices, identifying an affinity score for a warm word associated with the duplicate on-device warm word detection model,
Wherein disabling the duplicate on-device warm word detection model on one of the first assistant device and the second assistant device is based on the affinity score.
19. The method of claim 18, wherein the affinity score of the warm word associated with the duplicate on-device warm word detection model is determined based on a frequency of detection of the warm word.
20. A computer program product comprising instructions which, when executed by one or more processors, cause the one or more processors to perform the method of any of claims 1 to 19.
21. A computer-readable storage medium comprising instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any of claims 1-19.
22. A system comprising a processor, a computer-readable memory, one or more computer-readable storage media, and program instructions collectively stored on the one or more computer-readable storage media, the program instructions being executable to perform the method of any one of claims 1 to 19.