WO2023113877A1 - Selecting between multiple automated assistants based on invocation properties - Google Patents
Selecting between multiple automated assistants based on invocation properties
- Publication number
- WO2023113877A1 (PCT/US2022/042726)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- automated assistant
- invocation
- input
- features
- user
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/02—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail using automatic reactions or user delegation, e.g. automatic replies or chatbot-generated messages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/9032—Query formulation
- G06F16/90332—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Definitions
- Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as "automated assistants" (also referred to as "digital agents," "chatbots," "interactive personal assistants," "intelligent personal assistants," "assistant applications," "conversational agents," etc.).
- Humans (which, when they interact with automated assistants, may be referred to as "users") may provide commands and/or requests to an automated assistant using spoken natural language input (i.e., utterances).
- An automated assistant responds to a request by providing responsive user interface output, which can include audible and/or visual user interface output.
- automated assistants are configured to be interacted with via spoken utterances, such as an invocation indication followed by a spoken query.
- a user must often explicitly invoke an automated assistant before the automated assistant will fully process a spoken utterance.
- the explicit invocation of an automated assistant typically occurs in response to certain user interface input being received at a client device.
- the client device includes an assistant interface that provides, to a user of the client device, an interface for interfacing with the automated assistant (e.g., receives spoken and/or typed input from the user, and provides audible and/or graphical responses), and that interfaces with one or more additional components that implement the automated assistant (e.g., remote server device(s) that process user inputs and generate appropriate responses).
- Some user interface inputs that can invoke an automated assistant via a client device include a hardware and/or virtual button at the client device for invoking the automated assistant (e.g., a tap of a hardware button, a selection of a graphical interface element displayed by the client device).
- Many automated assistants can additionally or alternatively be invoked in response to one or more spoken invocation phrases, which are also known as "hot words/phrases" or “trigger words/phrases”.
- a spoken invocation phrase such as "Hey Assistant,” “OK Assistant”, and/or "Assistant” can be spoken to invoke an automated assistant.
- an assistant may be invoked based on one or more gestures of the user, such as pressing a button on a device and/or motioning in a particular manner such that the motion can be captured by a camera of a device.
- a client device that includes an assistant interface includes one or more locally stored models that the client device utilizes to monitor for an occurrence of a spoken invocation phrase.
- Such a client device can locally process received audio data utilizing the locally stored model, and discards any audio data that does not include the spoken invocation phrase.
- the client device will then cause that audio data and/or following audio data to be further processed by the automated assistant.
- For example, if a spoken invocation phrase is "Hey, Assistant" and a user speaks "Hey, Assistant, what time is it", audio data corresponding to "what time is it" can be processed by an automated assistant based on detection of "Hey, Assistant" and utilized to provide an automated assistant response of the current time.
- If the user simply speaks "what time is it" (without first speaking an invocation phrase or providing alternate invocation input), no response from the automated assistant will be provided as a result of "what time is it" not being preceded by an invocation phrase (or other invocation input).
- Techniques are described herein for selecting, from multiple candidate automated assistants, a particular automated assistant to invoke for processing a request when an invocation is received that is capable of selectively invoking any of the multiple candidate automated assistants.
- In some situations the invocation, when received, can result in invocation of the particular automated assistant exclusively, while in other situations the invocation, when received, can result in the invocation of an alternate automated assistant, of the multiple candidate automated assistants, exclusively.
- various techniques are directed to utilizing a general automated assistant, which can receive an invocation from a user, to determine which of a plurality of secondary automated assistants to invoke based on features of the invocation and/or based on features of audio data that is provided with the invocation.
- the user can utter a general invocation phrase, such as "OK Assistant", that is capable of invoking both a first automated assistant and a second automated assistant.
- the general automated assistant can determine, based on one or more features of the invocation, features of a query provided by the user, and/or additional features other than speech recognition of any voice input (e.g., invocation voice input and/or query) to determine whether to invoke and subsequently provide the query to the first automated assistant in lieu of invoking and providing the query to the second automated assistant.
- the user may have a client device that has multiple automated assistants installed (e.g., at least client applications for the multiple automated assistants). Both a first automated assistant and a second automated assistant executing on the device are capable of being invoked with the invocation phrase "OK Assistant."
- the invocation input of "OK Assistant" can be processed to determine, based on one or more features of the invocation input (e.g., prosodic features, usage of vocabulary) whether to invoke the first automated assistant or the second automated assistant.
- one or more other features, in addition to any speech recognition features, can be processed to determine whether to invoke the first automated assistant or the second automated assistant (e.g., other applications executing on the client device, the location of the client device, a classification of the location where the device is located, and/or the presence of others in proximity to the client device).
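- As a concrete illustration of routing on such features, the hypothetical sketch below scores a few invocation-time signals and chooses between a "work" and a "home" assistant; the feature names, weights, and threshold are invented for illustration and are not taken from this disclosure.

```python
# Hypothetical sketch: route a general invocation to one of two assistants
# using features other than speech recognition of the voice input.
# All feature names, weights, and thresholds are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class InvocationContext:
    speech_rate_wps: float        # prosodic feature: words per second
    mean_volume_db: float         # prosodic feature: average loudness
    foreground_app: str           # e.g., "calendar", "game", "browser"
    location_class: str           # e.g., "work", "home", "public", "private"
    others_present: bool          # from visual input near the client device

def choose_assistant(ctx: InvocationContext) -> str:
    """Return 'work_assistant' or 'home_assistant' for a shared invocation."""
    work_score = 0.0
    if ctx.location_class == "work":
        work_score += 2.0
    if ctx.foreground_app == "calendar":
        work_score += 1.0
    if ctx.speech_rate_wps > 3.0 and ctx.mean_volume_db < 55:
        # faster, quieter speech treated as more "formal" in this sketch
        work_score += 0.5
    if ctx.others_present:
        work_score += 0.5
    return "work_assistant" if work_score >= 2.0 else "home_assistant"

if __name__ == "__main__":
    ctx = InvocationContext(3.4, 52.0, "calendar", "work", True)
    print(choose_assistant(ctx))  # -> work_assistant
```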
- the user may have a client device that has a first instantiation of an automated assistant installed on the device and a second instantiation of the same automated assistant installed on the same device.
- the user may have a "home” automated assistant that the user utilizes when in a home setting and a "work” automated assistant instantiation that the user utilizes when in a work setting.
- the invocation input of "OK Assistant” may be capable of invoking both the "home” assistant and the "work” assistant, and the invocation input can be processed to determine, based on one or more features as described herein, whether to invoke the "home” instantiation or the "work” instantiation.
- a location can be a geographical location and/or can be a classification of the location of the client device, such as "public” versus "private” location.
- utilization of techniques described herein mitigates occurrences of users inadvertently initially invoking incorrect automated assistant(s) in attempting to invoke an intended automated assistant.
- utilization of computational and/or network resources is mitigated, as the computational and/or network resource(s) that would otherwise be utilized if the unintended automated assistant were invoked are not utilized.
- the quantity of inputs and/or duration of input(s) from a user to invoke a particular automated assistant can be reduced, thereby reducing processing time and memory resources.
- a user can utilize a single invocation to selectively invoke one of a plurality of automated assistants without requiring additional input from the user specifying which of the plurality of automated assistants the user is intending to invoke.
- the need to load and/or utilize multiple invocation models, each being used in monitoring for invocation(s) for a respective automated assistant can be reduced or eliminated. For example, having a first hot word detection model for a first automated assistant and a second hotword detection model for a second automated assistant continuously loaded and continuously being utilized can consume more processor, memory resources, and power than having a general hot word detection model loaded and utilized.
- For example, a client device's digital signal processor (DSP) can have constrained memory and/or processor resources, preventing multiple hot word detection models from being loaded and/or executed in parallel (or at least preventing that without having to forego executing other process(es) on the DSP).
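- A minimal sketch of this resource argument is shown below: a single always-on detector for the shared phrase stays resident and defers the assistant choice to a later routing step, rather than keeping one detector model loaded per assistant. The detector and router here are stubs assumed for illustration.

```python
# Illustrative sketch of the resource argument: keep a single shared hot word
# detector resident (e.g., on a DSP) instead of one detector per assistant,
# and defer the assistant choice to a later routing step.
# The detector here is a stub; a real deployment would use an on-device model.

from typing import Callable, Optional

class GeneralHotwordDetector:
    """Single always-on detector for a shared phrase such as 'OK Assistant'."""

    def __init__(self, phrase: str, router: Callable[[bytes], str]):
        self.phrase = phrase.lower()
        self.router = router  # decides which assistant handles the audio

    def process_transcript_frame(self, transcript: str, audio: bytes) -> Optional[str]:
        # Stub: a real detector matches on audio features, not a transcript.
        if self.phrase in transcript.lower():
            return self.router(audio)   # e.g., "assistant_1" or "assistant_2"
        return None                     # discard frames without the hot word

def toy_router(audio: bytes) -> str:
    # Placeholder routing rule; the approach described above would use
    # invocation and contextual features here instead of audio length.
    return "assistant_1" if len(audio) < 16000 else "assistant_2"

detector = GeneralHotwordDetector("OK Assistant", toy_router)
print(detector.process_transcript_frame("ok assistant what time is it", b"\x00" * 8000))
```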
- a first assistant and a second assistant can both be configured to be invoked by the same invocation.
- the invocation can be a spoken utterance from the user that indicates which automated assistant, of a plurality of automated assistants, the user has interest in providing a spoken query.
- both the first and second automated assistant can be invoked by the invocation phrase "OK Assistant.”
- the invocation can be an action from the user.
- both the first and second automated assistant can be invoked by the user performing a particular gesture, pressing a button on the client device, looking at the client device, and/or one or more other actions that can be captured by one or more cameras of the client device.
- the invocation is received and/or detected by a general automated assistant that is not configured to generate responses to a query.
- the general automated assistant can be a "meta assistant" that is configured to receive invocation input, process the invocation input to determine which of a plurality of secondary automated assistants to invoke based on one or more features, and invoke the intended secondary automated assistant.
- the invocation input, and/or additional audio input that precedes and/or follows the invocation input can be processed before providing the invocation input and/or additional audio data to a secondary automated assistant for further processing.
- one or more invocation features of the invocation input can be processed to determine whether to invoke a first automated assistant or a second automated assistant.
- one or more prosodic features of spoken invocation input, determined from audio data detected by one or more microphones of the client device, can be utilized to determine whether to invoke the first automated assistant or the second automated assistant.
- a first automated assistant can be invoked in lieu of invoking a second automated assistant based on determining that the prosodic features that are detected in the audio data are more likely to be present in invocations of the first automated assistant than in invocations of the second automated assistant.
- one or more additional features detected by the client device can be processed, either in addition to the invocation input features or in lieu of the invocation input features, to determine whether to invoke the first automated assistant or the second automated assistant. Additional features that are processed to determine whether to invoke the first automated assistant or the second automated assistant can be in addition to any automatic speech recognition features of the invocation input. For example, additional features can include one or more terms that are utilized by the user in a query that precedes or follows the invocation input that are more likely to be utilized when invoking (and/or providing a query) to a "home" automated assistant over invoking (and/or providing a query) to a "work" automated assistant.
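- The following hypothetical snippet illustrates one such term-usage feature: it checks whether the words surrounding the invocation lean toward a "work" or a "home" vocabulary. The word lists are invented for the sketch; in practice such associations could be learned from the user's past interactions.

```python
# Illustrative term-usage feature: check whether the words around the
# invocation lean toward a "work" or "home" vocabulary. The word lists are
# invented for the sketch.

WORK_TERMS = {"meeting", "quarterly", "client", "deadline", "standup"}
HOME_TERMS = {"recipe", "movie", "kids", "grocery", "playlist"}

def vocabulary_lean(query_text: str) -> str:
    words = set(query_text.lower().split())
    work_hits = len(words & WORK_TERMS)
    home_hits = len(words & HOME_TERMS)
    if work_hits > home_hits:
        return "work"
    if home_hits > work_hits:
        return "home"
    return "neutral"

print(vocabulary_lean("What time is my client meeting"))   # -> work
print(vocabulary_lean("Add milk to the grocery list"))      # -> home
```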
- the one or more additional features can include one or more applications that are executing (or that have recently been accessed and/or executed) by the client device when the invocation input is detected.
- a client device may be executing a calendar application that includes calendar information for the "work" profile of a user.
- the "work” automated assistant may be invoked based on a likelihood that the user is currently engaged in work activities.
- a client device may be executing a game, a web browser, and/or another application that would be more likely to be accessed by the user in a non-work setting, and a non-work profile of an automated assistant may be invoked in lieu of invoking a "work" profile of the automated assistant.
- the one or more additional features can include location information related to the location of the client device when the invocation input is detected.
- a location can be a physical location of the client device, such as the geographic location of the device.
- the location can be a classification of a location, such as a "public" location versus a "private" location. In some implementations, the location can be a specific area within a geographic location, such as "living room" and "home office." Location classifications can be set as user preferences by the user and/or can be determined based on past interactions of the user with the client device (e.g., locations where the user commonly utilizes a "work" calendar can be classified as "work" or "private" locations).
- one or more additional features can include visual input features that are captured by one or more cameras of the client device when the invocation input is detected.
- visual input that is received during the invocation input can indicate the presence of others in proximity to the client device.
- a "public" automated assistant can be invoked in lieu of invoking a "private” automated assistant.
- visual input may indicate that the user is in a home setting, and a "home" automated assistant may be invoked in lieu of a "work" automated assistant.
- the user may be provided with an indication of the automated assistant that was invoked by the invocation input.
- the user may be provided, via an interface of the client device, with an indication that a "work" automated assistant has been invoked.
- the user may be provided with an audio indication that a first automated assistant has been invoked in lieu of invoking a second automated assistant.
- the user may be provided with an audio indication of "Invoking your work assistant” when a "work” automated assistant has been invoked in lieu of invoking a "home” automated assistant.
- one or more audio prompts and/or responses from an invoked automated assistant can be provided using a different voice profile such that the user can differentiate between multiple automated assistants and/or profiles (e.g., a synthesized male voice for a "home” automated assistant and a synthesized female voice for a "work” automated assistant).
- one or more machine learning models can be utilized to process features of the invocation input and/or one or more additional features.
- Invocation input, audio data received with the invocation input, and/or visual data that is received with the invocation input can be utilized to generate one or more vectors in an embedding space that indicate the location information, prosodic features of the input, and/or other features described herein.
- the vector can be processed utilizing a machine learning model that can generate, as output, probabilities (and/or other indications) of whether to invoke a first automated assistant in lieu of invoking a second automated assistant.
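- A minimal sketch of this scoring step appears below: a feature vector is passed through a small linear model whose softmax output gives a probability per candidate assistant. The feature encoding and weights are placeholders, not values from this disclosure.

```python
# Minimal sketch of scoring an invocation-feature vector with a small model
# that outputs a probability for each candidate assistant. The feature
# encoding and weights are invented placeholders; in practice the vector
# would come from learned embeddings of prosodic, location, and visual features.

import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_invocation(features, weight_rows, biases):
    """features: list of floats; weight_rows: one weight list per assistant."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight_rows, biases)
    ]
    return softmax(scores)

# feature vector: [speech_rate, volume, is_work_location, others_present]
features = [0.8, 0.3, 1.0, 0.0]
weights = [[0.5, -0.2, 1.5, 0.4],    # "work" assistant
           [-0.3, 0.6, -1.0, -0.2]]  # "home" assistant
probs = score_invocation(features, weights, [0.0, 0.1])
print({"work": round(probs[0], 3), "home": round(probs[1], 3)})
```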
- the user may provide feedback (e.g., additional audio feedback and/or visual feedback) that indicates whether the invoked automated assistant was the intended automated assistant to be invoked.
- the noninvoked automated assistant can be invoked and one or more of the features that were processed to invoke the first automated assistant can be provided to the machine learning model as training data for further training based on the negative feedback.
- a user may utter an invocation phrase of "OK Assistant," which may invoke both a first and a second automated assistant. Based on the invocation features and/or additional features, the first automated assistant can be invoked.
- the user may have instead intended to invoke the second automated assistant, and respond to the invoked automated assistant with "Use the second automated assistant instead.”
- the second automated assistant can be invoked, and the originally provided invocation input can be utilized to further train the associated machine learning model so that future invocations with the same features are less likely to result in the first automated assistant being invoked.
- the invocation input can be processed and any additional audio data that is captured immediately preceding or following the invocation input can be provided to the invoked automated assistant without further processing.
- the client device may detect audio data that includes an invocation phrase and, based on the invocation features and/or additional features, as described herein, invoke the first automated assistant.
- the audio data that precedes and/or follows the invocation input can be directly provided to the invoked automated assistant.
- the invoked automated assistant can then process the audio data, such as performing natural language processing, automatic speech recognition, STT, and/or other processing to generate a response and/or perform an action based on the provided input.
- the invocation input can be processed and any additional audio data that is captured immediately preceding or following the invocation input can be further processed to determine one or more features before the audio data is provided to the invoked automated assistant.
- a general automated assistant can receive the invocation input and/or additional audio input that precedes and/or follows the invocation input. Based on the invocation features and/or additional features, a first automated assistant can be invoked.
- the general automated assistant can perform, for example, automatic speech recognition, natural language processing, and/or other processing of the invocation input and/or additional input from the user, and detected features and/or processed input can be provided with the audio data (or in lieu of the audio data) to the invoked automated assistant.
- FIG. 1 is an illustration of an example environment in which implementations disclosed herein can be implemented.
- FIG. 2 is a block diagram of an example environment in which various methods disclosed herein can be implemented.
- FIG. 3A and FIG. 3B are block diagrams of example implementations of multiple automated assistants.
- FIG. 4 is a block diagram illustrating components of a general automated assistant in which implementations disclosed herein can be implemented.
- FIG. 5 depicts a flowchart illustrating an example method according to various implementations disclosed herein.
- FIG. 6 illustrates an example architecture of a computing device.
- FIG. 1 illustrates an example environment which includes multiple automated assistants that may be invoked by a user 101.
- the environment includes a first standalone interactive speaker 105 with a microphone (not depicted) and a camera (also not depicted), and a second standalone interactive speaker 110 with a microphone (not depicted) and a camera (also not depicted).
- the first speaker may be executing, at least in part, a first automated assistant that may be invoked with an invocation phrase.
- the second speaker 110 may be executing a second automated assistant that may be invoked with an invocation phrase, either the same invocation phrase as the first automated assistant or a different phrase to allow the user, based on the phrase uttered, to select which automated assistant to invoke.
- the user 101 is speaking a spoken utterance 115 of "OK Assistant, What's on my calendar" in proximity to the first speaker 105 and the second speaker 110.
- the invoked assistant may process the query that follows the invocation phrase (i.e., "What's on my calendar").
- one or both of the automated assistants 105 and 110 can be capable of being invoked by the user performing one or more actions that can be captured by the cameras of the automated assistants.
- automated assistant 105 can be invoked by the user looking in the direction of automated assistant 105, making a waving motion in the direction of automated assistant 105, and/or one or more other actions that can be captured by the camera of automated assistant 105.
- In FIG. 2, an example environment is illustrated that includes multiple client devices executing multiple automated assistants.
- the system includes a first client device 105 that is executing a first automated assistant 215 and a second automated assistant 220.
- Each of the first and second automated assistants may be invoked by uttering an invocation phrase (unique to each assistant or the same phrase to invoke both assistants) proximate to the client device 105 such that the audio may be captured by a microphone 225 of client device 105 and/or performing an action that may be captured by camera 235 of client device 105.
- user 101 may invoke the first automated assistant 215 by uttering "OK Assistant 1" in proximity to the client device 105, and further invoke the second automated assistant 220 by uttering the phrase "OK Assistant 2" in proximity to client device 105. Further, user 101 may invoke the first automated assistant 215 by performing a first action and invoke the second automated assistant 220 by performing a second action. Based on which invocation phrase is uttered and/or which action is performed, the user can indicate which of the multiple assistants that are executing on the client device 105 that the user has interest in processing a spoken query.
- the example environment further includes a second client device 110 that is executing a third automated assistant 245.
- the third automated assistant may be configured to be invoked using a third invocation phrase, such as "OK Assistant 3," such that it may be captured by microphone 230. Further, the third automated assistant 245 can be configured to be invoked using a third gesture and/or action that may be captured by camera 250. In some implementations, one or more of the automated assistants of FIG. 2 may be absent. Further, the example environment may include additional automated assistants that are not present in FIG. 2. For example, the system may include a third device executing additional automated assistants, and/or client device 110 and/or client device 105 may be executing additional automated assistants and/or fewer automated assistants than illustrated.
- one or more automated assistants can be capable of being invoked based on constraints of the devices that are executing the automated assistants.
- first client device 205 may include a camera to capture gestures of the user
- second client device 210 may include a microphone (and not a camera), thus being capable of only identifying audio invocations.
- the gesture may be identified by first client device 205 and can invoke at least one of the first automated assistant 215 and/or second automated assistant 220.
- if a user utters an invocation phrase, only automated assistants on client devices that include a microphone may be invoked.
- first automated assistant 215 and third automated assistant 245 are capable of being invoked with the same invocation input
- the user can indicate a preference for one of the invocable automated assistants over the other based on the type of invocation input that is detected by one or more of the client devices 205 and 210.
- Each of the automated assistants 215, 220, and 245 can include one or more components of the automated assistants described herein.
- automated assistant 215 may include its own speech capture component to process incoming queries, visual capture component to process incoming visual data, hotword detection engine, and/or other components.
- automated assistants that are executing on the same device such as automated assistants 215 and 220, can share one or more components that may be utilized by both of the automated assistants.
- automated assistant 315 and automated assistant 320 may share an on-device speech recognizer, on-device NLU engine, and/or one or more of the other components.
- two or more of the automated assistants may be invoked by the same invocation phrase, such as "OK Assistant," that is not unique to a single automated assistant.
- the user utters an invocation phrase and/or provides other invocation input (e.g., a gesture that can invoke two or more of the automated assistants)
- one or more of the automated assistants may function as a general automated assistant and determine which, of the automated assistants that may be invoked, to invoke based on the invocation input.
- In FIG. 3A, a general automated assistant 305 is illustrated along with two additional automated assistants 310 and 320.
- the general automated assistant 305 may be configured to process invocation input, such as an utterance that includes the phrase "OK Assistant" or other invocation input, which may indicate that the user has interest in providing a query to one of multiple automated assistants that can be invoked by the invocation input.
- the general automated assistant 305 may not include all of the functionality of an automated assistant.
- the general automated assistant 305 may not include a query processing engine and/or functionality to perform actions other than processing invocation input to determine which of multiple automated assistants to invoke.
- the general automated assistant 305 may include the functionality of other automated assistants and may determine, for invocation input, whether to invoke itself or invoke a different automated assistant that is configured to be invoked by the same invocation input.
- both general automated assistant 305 and first automated assistant 310 may be configured to be invoked in response to detecting a spoken invocation phrase such as "Hey Assistant,” “OK Assistant", and/or "Assistant”.
- General automated assistant 305 can continuously process (e.g., if not in an "inactive" mode) a stream of audio data frames that are based on output from one or more microphones 320 of the client device 301, to monitor for an occurrence of a spoken invocation phrase.
- While monitoring for the occurrence of the spoken invocation phrase, the general automated assistant 305 discards (e.g., after temporary storage in a buffer) any audio data frames that do not include the spoken invocation phrase. However, when the general automated assistant 305 detects an occurrence of a spoken invocation phrase in processed audio data frames, the general automated assistant 305 can determine whether the invocation input is directed to the general automated assistant 305 or directed to one or more other automated assistants 310 and 320 that can be invoked with the same invocation input.
- Automated assistants 305 and 310 can include multiple components for processing a query, once invoked, for example, a local speech-to-text (“STT”) engine (that converts captured audio to text), a local text-to-speech (“TTS”) engine (that converts text to speech), a local natural language processor (that determines semantic meaning of audio and/or text converted from audio), and/or other local components.
- the client devices executing automated assistants may be relatively constrained in terms of computing resources (e.g., processor cycles, memory, battery, etc.)
- the local components may have limited functionality relative to any counterparts that are included in any cloud-based automated assistant components that are executing remotely in conjunction with the automated assistant(s).
- one or more of the automated assistants may be invoked by one or more gestures that indicate that the user has interest in interacting with the primary automated assistant.
- a user may demonstrate intention to invoke an automated assistant by interacting with a device (such as pressing a button or a touchscreen), performing a movement that is visible and may be captured by an image capture device such as a camera, and/or looking at a device such that the image capture device can recognize the user's movement and/or positioning.
- the automated assistant may be invoked and begin capturing audio data that follows the gesture or action, as described above.
- one or more automated assistants 305 and 310 may share one or more modules, such as a natural language processor and/or the results of a natural language, TTS, and/or STT processor.
- both first automated assistant 215 and second automated assistant 220 may share natural language processing so that, when client device 105 receives audio data, the audio data is processed once into text that may then be provided to both automated assistants 215 and 220.
- one or more components of client device 105 may process audio data into text and provide the textual representation of the audio data to third automated assistant 245, as further described herein.
- the audio data may not be processed into text and may instead be provided to one or more of the automated assistants as raw audio data.
- a user may utter a query after uttering an invocation phrase, indicating that the user has interest in receiving a response to the query from a primary automated assistant.
- the user may utter a query before or in the middle of an invocation phrase, such as "What is the weather, Assistant" and/or "What is the weather today, Assistant, and what is the weather tomorrow.”
- the general automated assistant 305 can process the invocation input (e.g., "Assistant") and other captured audio data (e.g., "What is the weather”) to determine which automated assistant to invoke based on features further described herein.
- In FIG. 3B, two instantiations of an automated assistant 325 are illustrated, each with a different profile for the same user.
- the user may configure the two instantiations of the automated assistant 325 such that both are responsive to the same user voice and are both capable of being invoked with the same invocation phrase.
- Depending on which instantiation is invoked, different results may be provided to the user.
- the user may have a work calendar and a home calendar, each of which operates independently and handles appointments and/or other calendar functionality for particular purposes.
- the user can be provided with information from the "work" profile 335 of the user.
- both automated assistants 325 have the same general invocation input that is capable of generally invoking the automated assistant 325 but does not specify between the instantiations.
- one or both automated assistants 325 may be invoked with the invocation input "OK Assistant" without specifying whether the invocation is intended for the instantiation with the home profile 330 or work profile 335.
- one or both instantiations of the automated assistant 325 can be configured to determine which profile to utilize upon detecting a general invocation phrase, in a manner similar to the general automated assistant 305 of FIG. 3A.
- In FIG. 4, components of a general automated assistant 305 are illustrated in which implementations described herein can be implemented. Although described herein for an environment whereby a general automated assistant processes invocation input and determines which automated assistant to invoke, components described with respect to general automated assistant 305 may be present in instantiations of automated assistant 325 and be utilized to determine whether to selectively invoke automated assistant 325 utilizing home profile 330 over utilizing automated assistant 325 with work profile 335.
- Invocation input analysis engine 410 can process invocation input to determine one or more invocation features that can be utilized to determine which automated assistant to invoke.
- invocation features can be determined based on general invocation input that is capable of invoking multiple automated assistants.
- general automated assistant 305 can process invocation input of the user uttering an invocation phrase of "OK Assistant" that is capable of invoking both first automated assistant 310 and second automated assistant 320.
- general automated assistant 305 can process invocation input of the user performing a gesture that is captured by one or more cameras and is capable of invoking both first automated assistant 310 and second automated assistant 320.
- one or more invocation features can include one or more prosodic features of audio input that includes the invocation input.
- Prosodic features can include, for example, a tone of the speaker, speech rate, inflection, volume, and/or other features of human speech that can be indications of whether the user intends to invoke one automated assistant in lieu of invoking a second automated assistant.
- a user may utilize first automated assistant 310 for non-work purposes, and may, when speaking a general invocation phrase, speak in a more relaxed manner (e.g., slower, friendly, louder).
- a user may utilize second automated assistant 320 for work purposes, and may, when speaking a general invocation phrase, speak in a more formal manner (e.g., more quietly, with less inflection, more rapidly).
- invocation features can be determined that may be utilized by invocation determination engine 430 to determine which automated assistant to invoke.
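- As a rough, assumed sketch of deriving such descriptors, the snippet below computes RMS energy, zero-crossing rate, and duration from invocation audio as crude stand-ins for volume, pitch-like character, and speaking rate; real systems would use dedicated signal processing or learned features.

```python
# Rough sketch of deriving simple prosodic descriptors from invocation audio.
# RMS energy and zero-crossing rate are only crude stand-ins for volume and
# pitch-like characteristics.

import math

def prosodic_features(samples, sample_rate=16000):
    """samples: iterable of PCM values in [-1.0, 1.0] for the invocation phrase."""
    samples = list(samples)
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)           # loudness proxy
    zero_crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    zcr = zero_crossings * sample_rate / n                      # coarse pitch proxy
    duration_s = n / sample_rate                                # speaking-rate proxy
    return {"rms": rms, "zcr_hz": zcr, "duration_s": duration_s}

# Example with a synthetic 440 Hz tone standing in for captured audio.
tone = [0.1 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
print(prosodic_features(tone))
```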
- Additional input analysis engine 420 can determine one or more additional features that can be utilized to determine which automated assistant to invoke.
- additional features can be based on a location that is associated with the client device that is executing the general automated assistant 305. For example, a user may have interest in utilizing a particular automated assistant when at work and a different automated assistant when at home. In instances where both automated assistants are invocable utilizing the same invocation input, the location of the user can be an indication of whether to invoke a first automated assistant (e.g., a work automated assistant) in lieu of invoking a second automated assistant (e.g., a home automated assistant).
- a location can be based on a geographic location of the client device that is executing the general automated assistant 305.
- additional input analysis engine 420 can identify a current location of the client device that is executing the automated assistant based on GPS and determine whether the user has previously indicated that the location is a particular classification of location.
- additional input analysis engine 420 can identify a current location of the client device that is executing the automated assistant based on WiFi, signal strength of a wireless communication signal, and/or other indication of a location of the device.
- one or more locations can be associated with a location type, such as "airport” and/or "restaurant.” In some implementations, one or more locations can be associated with an area within an identified geographic location, such as a room of a house and/or a particular office of an office building.
- a location can be based on a classification of the location where the client device that is executing the general automated assistant 305 is located. For example, a user may be located in a location that has been tagged as an "airport" location and additional input analysis engine 420 can determine that the location is a "public" location based on the type of location. Also, for example, additional input analysis engine 420 can determine that the user is at a location that the user has previously indicated is a "home" location, and additional input analysis engine 420 can determine that the location is classified as a "private" location.
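- A hypothetical helper along these lines is sketched below: it maps a device location to a coarse class ("work", "home", "public", "private") by first consulting user-provided tags and then falling back to a venue-type table. The tags and table are illustrative assumptions.

```python
# Hypothetical location-classification helper for the additional-input analysis
# step: user-provided tags take precedence, with a venue-type table as fallback.
# Tag coordinates and the type table are illustrative only.

USER_TAGS = {            # locations the user has explicitly labeled
    (37.422, -122.084): "work",
    (37.386, -122.083): "home",
}

TYPE_TO_CLASS = {        # fallback classification by venue type
    "airport": "public",
    "restaurant": "public",
    "office": "work",
    "residence": "private",
}

def classify_location(lat, lng, venue_type=None, tolerance=0.01):
    for (tag_lat, tag_lng), label in USER_TAGS.items():
        if abs(lat - tag_lat) < tolerance and abs(lng - tag_lng) < tolerance:
            return label
    return TYPE_TO_CLASS.get(venue_type, "unknown")

print(classify_location(37.4221, -122.0841))            # -> work
print(classify_location(40.64, -73.78, "airport"))      # -> public
```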
- additional features can be determined based on additional audio data that precedes and/or follows the invocation input.
- additional features can include prosodic features of the user speaking a query that precedes and/or follows the invocation input.
- additional input analysis engine 420 can determine, based on word usage, vocabulary selections, and/or other terms that are included in the audio data, whether the spoken utterance of the user is more closely associated with an intent of the user to invoke a first automated assistant in lieu of invoking a second automated assistant.
- the user may utilize a more formal vocabulary when uttering a query when intending to utilize a "private" automated assistant and additional input analysis engine 420 can process audio input from the user to determine whether the user's vocabulary selection is more "formal” or more "casual.”
- additional features can be determined based on background and/or other audio data other than the query and/or invocation that was uttered by the user. For example, if audio data that precedes and/or follows the invocation input includes background noise (e.g., other speakers), an additional feature can be determined that indicates that the user is likely in a public location. Also, for example, if audio data that precedes and/or follows the invocation input includes noise from a television and/or radio, an additional feature can be determined that indicates that the user is more likely in a private setting.
- In some implementations, additional features can include features that are determined based on visual input that is received proximate to detecting the invocation input.
- the client device that is executing general automated assistant 305 can include a camera that can capture visual input while (or proximate to) the user providing invocation input.
- Additional input analysis engine 420 can determine, based on the visual input, one or more visual input features that can indicate whether the user has interest in accessing one of the invocable automated assistants over another automated assistant.
- visual input features can include identifying whether additional users are in proximity of the user when the user provided the invocation input. For example, when the user provides the invocation input, additional input analysis engine 420 can determine, based on captured video, whether the user is alone or whether there are additional people in the vicinity of the user. In some implementations, the presence of others may be an indication that the user intends to access a "public" automated assistant in lieu of accessing a "private" automated assistant.
- the user may be provided with an indication of the automated assistant that was invoked when the invocation input was received.
- the indication can be a visual indication, such as an icon and/or message that is displayed on an interface of a client device of the user.
- the indication can be audible, such as a synthesized voice indicating the name of the invoked automated assistant and/or a sound (e.g., a beep of a particular frequency) that indicates one automated assistant has been invoked in lieu of invoking another automated assistant.
- the indication can be a variation in a synthesized speech that is provided to the user by the automated assistant. For example, a first automated assistant may have a synthesized male voice when invoked and a second automated assistant may have a synthesized female voice when invoked such that the user can determine which automated assistant was invoked when multiple automated assistants are capable of being invoked.
- Invocation determination engine 430 can determine, based on the processed invocation input and/or additional input features, whether to invoke a first automated assistant in lieu of invoking a second automated assistant. Invocation determination engine 430 can receive the invocation features and/or the additional input features from the invocation input analysis engine 410 and the additional input analysis engine 420, and determine, based on the features, whether to invoke a first automated assistant over invoking a second automated assistant. In some implementations, invocation determination engine 430 can utilize one or more machine learning models to determine which automated assistant to invoke. For example, invocation determination engine 430 can provide a machine learning model with one or more vectors representing invocation and additional features in an embedding space. The machine learning model can provide, as output, probabilities that a first automated assistant is to be invoked and that a second automated assistant is to be invoked.
- an automated assistant once an automated assistant has been invoked, additional audio data and/or other data can be provided to the invoked automated assistant.
- general automated assistant 305 can provide a spoken utterance of the user that precedes and/or follows the invocation input.
- the general automated assistant 305 can communicate with the invoked automated assistant via one or more communication protocols, such as API 440.
- general automated assistant 305 can communicate via a signal that is emitted by a speaker and received by the invoked automated assistant at a microphone (e.g., an ultrasonic signal that includes audio data).
- general automated assistant 305 can provide audio data that includes the user speaking an utterance. For example, once general automated assistant 305 has determined that a first automated assistant is to be invoked in lieu of invoking a second automated assistant, audio data of the user uttering a query can be directly provided to the invoked automated assistant.
- general automated assistant 305 can process audio data that includes a spoken utterance of the user prior to providing the audio data and/or additional data to the invoked automated assistant. For example, general automated assistant 305 can process at least a portion of the audio data utilizing STT, natural language processing, and/or automatic speech recognition.
- the general automated assistant 305 can provide, in addition to or in lieu of the audio data, the processed information to further reduce latency in the invoked automated assistant generating a response for the user.
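- The sketch below illustrates such a handoff: the general assistant forwards the captured audio and, optionally, an already-computed transcript and metadata so the invoked assistant need not repeat that processing. The payload fields and the AssistantClient interface are assumptions for illustration, not an actual inter-assistant API.

```python
# Sketch of handing off the user's query to the invoked assistant. The general
# assistant may forward raw audio only, or attach already-computed results
# (e.g., a transcript) to reduce duplicate processing.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HandoffPayload:
    audio: bytes                          # captured utterance audio
    transcript: Optional[str] = None      # optional STT result computed upstream
    metadata: dict = field(default_factory=dict)

class AssistantClient:
    def __init__(self, name: str):
        self.name = name

    def invoke(self, payload: HandoffPayload) -> str:
        # A real client would call the assistant over an API, broker, or an
        # audio channel (e.g., an ultrasonic signal); here we just acknowledge.
        source = "transcript" if payload.transcript else "raw audio"
        return f"{self.name} invoked with {source}"

work_assistant = AssistantClient("work_assistant")
payload = HandoffPayload(
    audio=b"\x00" * 32000,
    transcript="what's on my calendar",
    metadata={"location_class": "work"},
)
print(work_assistant.invoke(payload))
```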
- the user can provide feedback once an automated assistant has been invoked. For example, based on features described herein, general automated assistant 305 may determine that a first automated assistant is to be invoked in lieu of invoking a second automated assistant. The first automated assistant can then be invoked and provided with a spoken query of the user. Further, the user may be provided with an indication that the first automated assistant was invoked.
- the user may provide a spoken utterance of "No, I was talking to Assistant 2," "I was speaking to the other Assistant,” and/or other negative feedback indicating that the incorrect automated assistant was invoked.
- general automated assistant 305 can invoke the intended automated assistant (and/or the next most likely automated assistant to invoke, in instances wherein the user does not specify the intended automated assistant), and provide the intended automated assistant with the spoken query of the user.
- one or more of the invocation and/or additional features that were utilized to initially determine to invoke the first automated assistant can be provided, along with a supervised output generated based on the negative feedback, as training data for training a machine learning model that was utilized by invocation determination engine 430.
- a training example can be generated that includes the feature(s) as input and that includes, as a supervised output, an indication that Assistant 2 should be invoked based on those feature(s).
- the training example can be used in training the machine learning model.
- positive feedback from the user can additionally or alternatively be utilized to generate training data for training the machine learning model. For example, if Assistant 1 is invoked based on processing of feature(s) using the machine learning model, and the user continues to interact with Assistant 1 (implicit positive feedback) and/or has explicit positive feedback regarding invoking of Assistant 1, then a training example can be generated that includes the feature(s) and, as supervised output, an indication that Assistant 1 should be invoked.
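- The snippet below sketches how such feedback could be turned into supervised training examples: negative feedback relabels the example with the assistant the user actually wanted, while positive (including implicit) feedback keeps the invoked assistant as the label. Data shapes and names are illustrative.

```python
# Sketch of turning user feedback into supervised training examples for the
# assistant-selection model: negative feedback relabels the example with the
# intended assistant; positive feedback labels it with the invoked assistant.

def make_training_example(features, invoked, feedback):
    """
    features: the feature vector used when the selection was made
    invoked:  assistant that was invoked, e.g. "assistant_1"
    feedback: {"type": "negative", "intended": "assistant_2"} or
              {"type": "positive"}
    """
    if feedback["type"] == "negative":
        label = feedback.get("intended", other_assistant(invoked))
    else:
        label = invoked
    return {"input": features, "label": label}

def other_assistant(name):
    return "assistant_2" if name == "assistant_1" else "assistant_1"

example = make_training_example(
    features=[0.8, 0.3, 1.0, 0.0],
    invoked="assistant_1",
    feedback={"type": "negative", "intended": "assistant_2"},
)
print(example)   # -> {'input': [...], 'label': 'assistant_2'}
```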
- FIG. 5 depicts a flowchart illustrating an example method 500 of selectively determining which automated assistant to invoke.
- the operations of the method 500 are described with reference to a system that performs the operations.
- This system of method 500 includes one or more processors and/or other component(s) of a client device.
- operations of the method 500 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, or added.
- invocation input is detected.
- the invocation input can be audio input from the user.
- the invocation input can be the user uttering a particular phrase that, when uttered, is capable of invoking both a first and a second automated assistant.
- the invocation input can be the user performing one or more actions that are captured by a camera of a device that is executing one or more of the automated assistants.
- the user may wave in the direction of a client device that is executing both a first and a second instantiation of an automated assistant, both of which are invocable utilizing the same gesture.
- At step 510, invocation input is processed to determine one or more invocation input features that can be utilized to determine whether to invoke a first automated assistant in lieu of invoking a second automated assistant.
- Invocation features can include, for example, prosodic features of the user uttering an invocation phrase. For example, a user may speak with a particular tone, speed, and/or inflection when intending to invoke a first automated assistant and speak with a different tone, speed, and/or inflection when intending to invoke a second automated assistant.
- invocation input is a gesture that is visible via a camera of a client device that is executing one or more of the automated assistants
- visual input features can be identified that can indicate a particular automated assistant that the user has interest in invoking (e.g., the presence of other users).
- invocation input features can be determined by a component that shares one or more characteristics with invocation input analysis engine 410.
- At step 515, additional input is processed to determine additional features that can be indications of whether the user has interest in invoking a first automated assistant in lieu of invoking a second automated assistant. Additional features can be determined by a component that shares one or more characteristics with additional input analysis engine 420.
- Additional features can include, for example, a location and/or classification of a location where the client device of the user is located, visual input indicating the presence of one or more other users when the invocation input was provided, vocabulary and/or terms utilized by the user when providing additional audio (e.g., a query) that precedes and/or follows the invocation input, and/or other features that can indicate an intent of the user to invoke a first automated assistant in lieu of invoking a second automated assistant that is capable of being invoked with the same general invocation input.
- At step 520, the output from steps 510 and 515 is processed to determine whether to invoke the first automated assistant or the second automated assistant based on the invocation and additional features.
- the determination is performed by a component that shares one or more characteristics with invocation determination engine 430.
- invocation determination engine 430 can utilize one or more machine learning models that receive, as input, invocation and additional feature vectors, and provide, as output, probabilities of the user intending to invoke a first and second automated assistant.
- invocation determination engine 430 can invoke a first automated assistant or a second automated assistant, in lieu of invoking the other automated assistant.
- the invoked automated assistant can be provided with a spoken utterance of the user that precedes and/or follows the invocation input.
- the automated assistant can generate a response to the query.
- the second automated assistant can be invoked (e.g., in the case that the user indicates that the incorrect automated assistant was invoked). Feedback from the user can be utilized to further train a machine learning model that can be utilized to determine whether to invoke the first and/or second automated assistant.
- FIG. 6 is a block diagram of an example computing device 610 that may optionally be utilized to perform one or more aspects of techniques described herein.
- Computing device 610 typically includes at least one processor 614 which communicates with a number of peripheral devices via bus subsystem 612. These peripheral devices may include a storage subsystem 624, including, for example, a memory subsystem 625 and a file storage subsystem 626, user interface output devices 620, user interface input devices 622, and a network interface subsystem 616. The input and output devices allow user interaction with computing device 610.
- Network interface subsystem 616 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.
- User interface input devices 622 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices.
- use of the term "input device” is intended to include all possible types of devices and ways to input information into computing device 610 or onto a communication network.
- User interface output devices 620 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
- the display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
- the display subsystem may also provide non-visual display such as via audio output devices.
- use of the term "output device" is intended to include all possible types of devices and ways to output information from computing device 610 to the user or to another machine or computing device.
- Storage subsystem 624 stores programming and data constructs that provide the functionality of some or all of the modules described herein.
- the storage subsystem 624 may include the logic to perform selected aspects of the method of FIG. 5, and/or to implement various components depicted in FIG. 2, FIG. 3, and FIG. 4.
- Memory 625 used in the storage subsystem 624 can include a number of memories including a main random access memory (RAM) 630 for storage of instructions and data during program execution and a read only memory (ROM) 632 in which fixed instructions are stored.
- a file storage subsystem 626 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
- the modules implementing the functionality of certain implementations may be stored by file storage subsystem 626 in the storage subsystem 624, or in other machines accessible by the processor(s) 614.
- Bus subsystem 612 provides a mechanism for letting the various components and subsystems of computing device 610 communicate with each other as intended. Although bus subsystem 612 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
- Computing device 610 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 610 depicted in FIG. 6 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 610 are possible having more or fewer components than the computing device depicted in FIG. 6.
- a method implemented by one or more processors includes detecting, at a client device, an invocation input that at least selectively invokes a first automated assistant and a second automated assistant, determining whether the invocation input is directed to the first automated assistant or is directed to the second automated assistant, wherein the determining is based on processing at least one of: one or more invocation features of the invocation input, wherein the invocation features are in addition to any features that are based on speech recognition of voice input received in association with the invocation input, and one or more additional features detected by the client device, the one or more additional features being in addition to the invocation features; and in response to determining that the invocation input is directed to the first automated assistant: invoking the first automated assistant in lieu of invoking the second automated assistant.
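Read procedurally, the claimed flow could be sketched as follows. The client_device methods and the assistant objects are hypothetical placeholders (not APIs from the disclosure), and the scorer is the InvocationScorer sketch from above.

```python
def handle_possible_invocation(client_device, scorer, first_assistant, second_assistant):
    """Sketch of the claimed flow: detect an invocation input that could belong to either
    assistant, classify it without speech recognition of the accompanying voice input,
    and invoke exactly one assistant in lieu of the other."""
    event = client_device.detect_invocation_input()          # e.g., hotword, squeeze, button press
    if event is None:
        return None

    invocation_features = client_device.invocation_features(event)   # prosody, press duration, ...
    additional_features = client_device.additional_features(event)   # foreground app, location, ...

    p_first, p_second = scorer.probabilities(invocation_features, additional_features)
    if p_first >= p_second:
        return first_assistant.invoke(event)    # the second assistant is not invoked
    return second_assistant.invoke(event)
```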
- the one or more invocation features includes one or more prosodic features determined from audio data that includes the invocation input.
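Prosodic features of this sort (loudness, a crude voicing cue, a rough pitch estimate) can be computed directly from audio samples without any speech recognition. The numpy-only sketch below is one illustrative way to do that and is not the feature set defined in the disclosure.

```python
import numpy as np

def prosodic_features(samples: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Compute a few speech-recognition-free prosodic features from mono PCM audio."""
    samples = samples.astype(np.float64)
    if samples.size == 0:
        return np.zeros(3)

    rms_energy = np.sqrt(np.mean(samples ** 2))                       # overall loudness
    zero_crossings = np.mean(np.abs(np.diff(np.sign(samples)))) / 2   # crude voicing/noisiness cue

    # Rough pitch estimate via autocorrelation, restricted to roughly 60-400 Hz.
    ac = np.correlate(samples, samples, mode="full")[len(samples) - 1:]
    lo, hi = sample_rate // 400, sample_rate // 60
    pitch_hz = sample_rate / (lo + np.argmax(ac[lo:hi])) if hi < len(ac) else 0.0

    return np.array([rms_energy, zero_crossings, pitch_hz])
```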
- determining whether the invocation input is directed to the first automated assistant or is directed to the second automated assistant includes identifying, independent of speech recognition, one or more terms included in audio data that includes the invocation input, and determining that the one or more terms are indicative of an intent of the user to invoke the first automated assistant.
- the one or more additional features includes one or more prosodic features determined from audio data detected by one or more microphones of the client device that captures an utterance that precedes or follows the invocation input.
- the one or more additional features includes one or more applications executing at the client device within a threshold time period from when the invocation input is detected.
- the one or more additional features include a location of the client device when the invocation input is detected.
- the one or more additional features includes an activity that the user is performing when the invocation input is detected.
- the one or more additional features include one or more visual input features that are based on vision data captured by one or more cameras of the client device when the invocation input is detected.
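One plausible way to fold context signals like these (foreground application, activity, coarse location, a camera-based cue) into a numeric vector for the invocation decision is sketched below; the vocabularies and encodings are illustrative assumptions only.

```python
import numpy as np

# Hypothetical vocabularies; the disclosure does not enumerate specific values.
KNOWN_APPS = ["music_player", "navigation", "shopping", "other"]
ACTIVITIES = ["driving", "cooking", "watching_tv", "unknown"]

def one_hot(value, vocabulary):
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value) if value in vocabulary else len(vocabulary) - 1] = 1.0
    return vec

def additional_features(foreground_app, activity, at_home, face_detected):
    """Encode context detected around the time of the invocation input."""
    return np.concatenate([
        one_hot(foreground_app, KNOWN_APPS),     # application executing at the client device
        one_hot(activity, ACTIVITIES),           # activity the user is performing
        [1.0 if at_home else 0.0],               # coarse location signal
        [1.0 if face_detected else 0.0],         # visual input feature from a camera
    ])

vec = additional_features("navigation", "driving", at_home=False, face_detected=True)
```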
- processing the invocation input includes processing, by the client device, one or more of the invocation features and the additional features using a machine learning model that is stored locally at the client device.
- the method further includes receiving feedback from the user in response to invoking the first automated assistant, wherein the feedback indicates whether the invocation input was intended to invoke the first automated assistant, and training the machine learning model based on the feedback.
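A minimal sketch of using such feedback as a training signal, assuming the logistic-style InvocationScorer from the earlier sketch; a production system would likely use a more elaborate on-device or federated training procedure.

```python
import numpy as np

def feedback_update(scorer, invocation_features, additional_features,
                    intended_first: bool, learning_rate: float = 0.05):
    """One gradient step on the hypothetical InvocationScorer, using explicit user
    feedback as the label (True = the first assistant was the intended one)."""
    x = np.concatenate([invocation_features, additional_features])
    probs = scorer.probabilities(invocation_features, additional_features)
    target = np.array([1.0, 0.0]) if intended_first else np.array([0.0, 1.0])

    grad_logits = probs - target            # gradient of cross-entropy w.r.t. the logits
    scorer.w -= learning_rate * np.outer(grad_logits, x)
    scorer.b -= learning_rate * grad_logits
```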
- the method further includes rendering, at the client device and in response to determining that the invocation input is directed to the first automated assistant, an indication that the first automated assistant has been invoked.
- the method further includes receiving user input in response to invoking the first automated assistant, determining, based on processing the user input, that the user input indicates that the invocation is not directed to the first automated assistant, and in response to determining that the user input indicates that the invocation is not directed to the first automated assistant, invoking the second automated assistant.
- the indication comprises a visual indication rendered by a display of the client device. In some of those implementations, the indication comprises an audible indication rendered by a speaker of the client device.
- the method further includes providing, to the first automated assistant and in response to invoking the first automated assistant, audio data that precedes or follows the invocation input.
- the audio data is provided without providing additional audio-based data that is based on additional processing of the audio data.
- the method further includes processing the audio data to identify one or more features of the audio data, and providing, to the first automated assistant and in response to invoking the first automated assistant, the one or more features with the audio data.
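Supplying audio that precedes the invocation input implies the client device keeps a short rolling buffer of microphone audio. The sketch below illustrates that idea with an assumed one-second window, optionally attaching locally computed features (e.g., the prosodic_features helper sketched earlier).

```python
import collections
import numpy as np

class AudioRingBuffer:
    """Keeps the most recent audio so speech preceding the invocation can be forwarded."""

    def __init__(self, sample_rate=16000, seconds=1.0):
        self.buffer = collections.deque(maxlen=int(sample_rate * seconds))

    def append(self, frame: np.ndarray):
        self.buffer.extend(frame.tolist())

    def snapshot(self) -> np.ndarray:
        return np.array(self.buffer)

def payload_for_invoked_assistant(ring: AudioRingBuffer, feature_fn=None):
    """Audio preceding the invocation, optionally with locally computed features attached."""
    audio = ring.snapshot()
    payload = {"audio": audio}
    if feature_fn is not None and audio.size:
        payload["features"] = feature_fn(audio)
    return payload
```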
- in situations in which the systems discussed herein may collect or use personal information about users (e.g., user data extracted from other electronic communications, information about a user's social network, a user's location, a user's time, a user's biometric information, a user's activities and demographic information, relationships between users, etc.), users are provided with one or more opportunities to control whether the information is collected, whether the personal information is stored, whether the personal information is used, and how information about the user is collected, stored, and used. That is, the systems and methods discussed herein collect, store, and/or use user personal information only upon receiving explicit authorization from the relevant users to do so.
- a user is provided with control over whether programs or features collect user information about that particular user or other users relevant to the program or feature.
- Each user for which personal information is to be collected is presented with one or more options to allow control over the information collection relevant to that user, to provide permission or authorization as to whether the information is collected and as to which portions of the information are to be collected.
- users can be provided with one or more such control options over a communication network.
- certain data may be treated in one or more ways before it is stored or used so that personally identifiable information is removed.
- a user's identity may be treated so that no personally identifiable information can be determined.
- a user's geographic location may be generalized to a larger region so that the user's particular location cannot be determined.
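As one minimal, assumed illustration of such generalization, coordinates can be truncated so that only a coarse region (roughly 11 km at one decimal place of latitude) is retained:

```python
def generalize_location(latitude: float, longitude: float, decimals: int = 1) -> tuple:
    """Round coordinates so only a larger region is retained, not the exact position."""
    return (round(latitude, decimals), round(longitude, decimals))

# (37.4220541, -122.0853242) -> (37.4, -122.1)
coarse = generalize_location(37.4220541, -122.0853242)
```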
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- User Interface Of Digital Computer (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202280080386.4A CN118369641A (en) | 2021-12-13 | 2022-09-07 | Selecting between multiple automated assistants based on call attributes |
EP22783119.5A EP4217845A1 (en) | 2021-12-13 | 2022-09-07 | Selecting between multiple automated assistants based on invocation properties |
KR1020247018248A KR20240094013A (en) | 2021-12-13 | 2022-09-07 | Selection between multiple automated assistants based on call properties |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163288795P | 2021-12-13 | 2021-12-13 | |
US63/288,795 | 2021-12-13 | ||
US17/550,060 US20230186909A1 (en) | 2021-12-13 | 2021-12-14 | Selecting between multiple automated assistants based on invocation properties |
US17/550,060 | 2021-12-14 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023113877A1 (en) | 2023-06-22 |
Family
ID=83508992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/042726 WO2023113877A1 (en) | 2021-12-13 | 2022-09-07 | Selecting between multiple automated assistants based on invocation properties |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023113877A1 (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170060839A1 (en) * | 2015-09-01 | 2017-03-02 | Casio Computer Co., Ltd. | Dialogue control device, dialogue control method and non-transitory computer-readable information recording medium |
US20170300831A1 (en) * | 2016-04-18 | 2017-10-19 | Google Inc. | Automated assistant invocation of appropriate agent |
WO2018164781A1 (en) * | 2017-03-06 | 2018-09-13 | Google Llc | Shared experiences |
US20180278750A1 (en) * | 2017-03-24 | 2018-09-27 | Microsoft Technology Licensing, Llc | Insight based routing for help desk service |
US20180286414A1 (en) * | 2017-03-31 | 2018-10-04 | Binuraj K. Ravindran | Systems and methods for energy efficient and low power distributed automatic speech recognition on wearable devices |
US20180324115A1 (en) * | 2017-05-08 | 2018-11-08 | Google Inc. | Initializing a conversation with an automated agent via selectable graphical element |
US20180336045A1 (en) * | 2017-05-17 | 2018-11-22 | Google Inc. | Determining agents for performing actions based at least in part on image data |
US20210289607A1 (en) * | 2016-08-05 | 2021-09-16 | Sonos, Inc. | Playback Device Supporting Concurrent Voice Assistants |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11749284B2 (en) | Dynamically adapting on-device models, of grouped assistant devices, for cooperative processing of assistant requests | |
KR20210112403A (en) | Voice query QoS based on client-computed content metadata | |
US20240347060A1 (en) | Contextual suppression of assistant command(s) | |
US20230025709A1 (en) | Transferring dialog data from an initially invoked automated assistant to a subsequently invoked automated assistant | |
US12080293B2 (en) | Combining responses from multiple automated assistants | |
US20220310089A1 (en) | Selectively invoking an automated assistant based on detected environmental conditions without necessitating voice-based invocation of the automated assistant | |
US20230186909A1 (en) | Selecting between multiple automated assistants based on invocation properties | |
US20230061929A1 (en) | Dynamically configuring a warm word button with assistant commands | |
WO2023113877A1 (en) | Selecting between multiple automated assistants based on invocation properties | |
JP2024538771A (en) | Digital signal processor-based continuous conversation | |
US11972764B2 (en) | Providing related queries to a secondary automated assistant based on past interactions | |
US20230169963A1 (en) | Selectively masking query content to provide to a secondary digital assistant | |
WO2023086229A1 (en) | Providing related queries to a secondary automated assistant based on past interactions | |
US12106755B2 (en) | Warm word arbitration between automated assistant devices | |
US20240203411A1 (en) | Arbitration between automated assistant devices based on interaction cues | |
US20240312455A1 (en) | Transferring actions from a shared device to a personal device associated with an account of a user | |
WO2023003585A1 (en) | Transferring dialog data from an initially invoked automated assistant to a subsequently invoked automated assistant |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | ENP | Entry into the national phase | Ref document number: 2022783119; Country of ref document: EP; Effective date: 20221207 |
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22783119; Country of ref document: EP; Kind code of ref document: A1 |
 | WWE | Wipo information: entry into national phase | Ref document number: 202280080386.4; Country of ref document: CN |
 | ENP | Entry into the national phase | Ref document number: 2024535282; Country of ref document: JP; Kind code of ref document: A |
 | NENP | Non-entry into the national phase | Ref country code: DE |