US20190079724A1 - Intercom-style communication using multiple computing devices - Google Patents

Intercom-style communication using multiple computing devices

Info

Publication number
US20190079724A1
Authority
US
United States
Prior art keywords
user
message
users
voice input
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/114,494
Inventor
Sandro Feuz
Sebastian Millius
Jan Althaus
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Priority to US16/114,494 priority Critical patent/US20190079724A1/en
Assigned to GOOGLE INC. reassignment GOOGLE INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ALTHAUS, Jan, MILLIUS, Sebastian, FEUZ, Sandro
Assigned to GOOGLE LLC reassignment GOOGLE LLC CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: GOOGLE INC.
Publication of US20190079724A1 publication Critical patent/US20190079724A1/en
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/14Session management
    • H04L67/141Setup of application sessions
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/52Network services specially adapted for the location of the user terminal
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/12Messaging; Mailboxes; Announcements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech
    • H04L67/18

Definitions

  • Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “chat bots,” “interactive personal assistants,” “intelligent personal assistants,” “personal voice assistants,” “conversational agents,” etc.).
  • humans (which when they interact with automated assistants may be referred to as “users”) may provide commands, queries, and/or requests using spoken natural language input (i.e. utterances) which may in some cases be converted into text and then processed, and/or by providing textual (e.g., typed) natural language input.
  • automated assistants may include automated assistant “clients” that are installed locally on client devices and that are engaged directly by users, as well as cloud-based counterpart(s) that leverage the virtually limitless resources of the cloud to help automated assistant clients respond to users' queries.
  • the automated assistant client may provide, to the cloud-based counterpart(s), an audio recording of the user's query (or a text conversion thereof) and data indicative of the user's identity (e.g., credentials).
  • the cloud-based counterpart may perform various processing on the query to return various results to the automated assistant client, which may then provide corresponding output to the user.
  • automated assistant when described herein as “serving” a particular user, may refer to the automated assistant client installed on the particular user's client device and any cloud-based counterpart that interacts with the automated assistant client to respond to the user's queries.
  • Many users may engage automated assistants using multiple devices.
  • some users may possess a coordinated “ecosystem” of computing devices that includes one or more smart phones, one or more tablet computers, one or more vehicle computing systems, one or more wearable computing devices, one or more smart televisions, and/or one or more standalone interactive speakers, among other more traditional computing devices.
  • a user may engage in human-to-computer dialog with an automated assistant using any of these devices (assuming an automated assistant client is installed). In some cases these devices may be scattered around the user's home or workplace.
  • mobile computing devices such as smart phones, tablets, smart watches, etc., may be on the user's person and/or wherever the user last placed them (e.g., at a charging station).
  • Other computing devices, such as traditional desktop computers, smart televisions, and standalone interactive speakers may be more stationary but nonetheless may be located at various places (e.g., rooms) within the user's home or workplace.
  • Techniques exist to enable multiple users (e.g., a family, co-workers, co-inhabitants, etc.) to leverage the distributed nature of a plurality of computing devices to facilitate intercom-style spoken communication between the multiple users.
  • these techniques are limited to users issuing explicit commands to convey messages to explicitly-defined computing devices. For example, a first user who wishes to convey a message to a second user at another location out of earshot (e.g., in another room) must first determine where the second user is located. Only then can the first user explicitly invoke an intercom communication channel to a computing device at or near the second user's location, so that the first user can convey a message to the second user at the second user's location.
  • the first user may be forced to simply cause the message to be broadcast at all computing devices that are available for intercom-style communication. Moreover, if the first user is unaware that the second user is not within earshot (e.g., the first user is cooking and didn't notice the second user leaving the kitchen), the first user may not realize that intercom-style communication is necessary, and may speak the message to an empty room.
  • Techniques are described herein for improved intercom-style communication using a plurality of computing devices distributed about an environment such as a house, an apartment, a place of business, etc. For example, techniques are described herein for enabling determination of location(s) of multiple users within the environment, so that (i) it can be determined automatically whether an intended recipient of a spoken message is within earshot of the speaker, and (ii) a suitable computing device near the intended recipient can be identified and used to output the message so that the intended recipient receives it.
  • techniques are described herein for automatically determining whether a user utterance constitutes (a) a command to invoke an automated assistant for normal use; (b) an attempt to convey a spoken message to another user that may potentially require the intercom-style communication described herein; and/or (c) other background noise/conversation that requires no action. Additionally, techniques are described herein for allowing a recipient of an intercom-style message received using disclosed techniques to issue a request (e.g., a search query or other commands to an automated assistant such as ordering pizza, playing a song, etc.) that is processed (e.g., using natural language processing) based at least in part on the initial message conveyed by the speaker.
  • users' locations may be determined within an environment or area by computing devices configured with selected aspects of the present disclosure using various techniques.
  • one or more computing devices may be equipped with various types of presence sensors, such as passive infrared (“PIR”) sensors, cameras, microphones, ultrasonic sensors, and so forth, which can determine whether a user is nearby.
  • These computing devices can come in various forms, such as smart phones, standalone interactive speakers, smart televisions, other smart appliances (e.g., smart thermostats, smart refrigerators, etc.), networked cameras, and so forth.
  • signals emitted by mobile computing devices may be detected by other computing devices and used to determine the users' locations (e.g., using time-of-flight, triangulation, etc.).
  • the determination of a user's location within an environment for utilization in various techniques described herein can be contingent on explicit user-provided authorization for such determination.
  • users' locations may be determined “on demand” in response to determining that a user utterance constitutes an attempt to convey a spoken message to another user that may require intercom-style communication.
  • the users' locations may be determined periodically and/or at other intervals, and the most recently determined locations may be utilized in determining whether an intended recipient of a spoken message is within earshot of a speaker of the spoken message and/or in identifying a suitable computing device near an intended recipient of the spoken message.
  • a variety of standalone interactive speakers and/or smart televisions may be distributed at various locations in a home.
  • Each of these devices may include one or more sensors (e.g., microphone, camera, PIR sensor, etc.) capable of detecting a nearby human presence.
  • these devices may simply detect whether a person is present.
  • these devices may be able to not only detect presence, but distinguish the detected person, e.g., from other known members of a household. Presence signals generated by these standalone interactive speakers and/or smart televisions may be collected and used to determine/track where people are located at a particular point in time.
  • These detected locations may then be used for various purposes in accordance with techniques described herein, such as determining whether an utterance provided by a speaker is likely to be heard by the intended recipient (e.g., whether the speaker and intended recipients are in different rooms or the same room), and/or to select which of the multiple speakers and/or televisions should be used to output the utterance to the intended recipient.
  • techniques are described herein for automatically determining whether a user utterance constitutes (a) a command to invoke an automated assistant for normal use; (b) an attempt to convey a spoken message to another user that may potentially require the intercom-style communication described herein; and/or (c) other background noise/conversation that requires no action.
  • in some implementations, a machine learning classifier (e.g., a neural network) may be trained to make this determination; in some such implementations, speech-to-text processing may not be performed automatically on every utterance.
  • the machine learning classifier may be trained to recognize phonemes in the audio recording of the voice input, and in particular to classify the collective phonemes with one of the aforementioned labels.
  • conventional automated assistants are typically invoked using one or more invocation phrases.
  • a simple invocation machine learning model (e.g., a classifier) is trained to distinguish these invocation phrases from anything else to determine when a user invokes the automated assistant, e.g., to recognize phonemes associated with “Hey, Assistant.”
  • the same invocation machine learning model or a different machine learning model may be (further) trained to classify utterances as being intended to convey a message to another user, which may or may not require use of intercom-style communications described herein.
  • such a machine learning model may be used, e.g., in parallel with an invocation machine learning model or after the invocation machine learning model determines that the user is not invoking the automated assistant, to determine whether the user may benefit from using intercom-style communication to cause a remote computing device to convey a message to another user.
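  • By way of illustration only, the following Python sketch shows one way such a cascade of models might be arranged; the invocation_model and message_model objects and their predict_proba() interface are hypothetical stand-ins, not part of this disclosure.

```python
# Hypothetical two-stage cascade: first check whether the utterance is an
# invocation phrase, then check whether it conveys a message to another
# person. The model objects and predict_proba() interface are assumptions.

def classify_utterance(audio, invocation_model, message_model, threshold=0.5):
    """Return 'invoke_assistant', 'convey_message', or 'background_noise'."""
    if invocation_model.predict_proba(audio) >= threshold:
        return "invoke_assistant"
    # Consulted only when the utterance is not an invocation; as noted above,
    # the two models could also be run in parallel.
    if message_model.predict_proba(audio) >= threshold:
        return "convey_message"
    return "background_noise"
```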
  • a machine learning model may be trained, or “customized,” so that it is possible to recognize names spoken by a user and to associate those names with other individuals. For example, an automated assistant may detect a first utterance such as “Jan, can you pass me the salt?” The automated assistant may detect a second utterance, presumably from Jan, such as “Sure, here you go.” From these utterances and the associated phonemes, the automated assistant may learn that when a user makes a request to Jan, it should locate the individual with Jan's voice. Suppose that later, Jan is talking on the phone in a separate room. When the user says something like “Jan, where are my shoes,” the automated assistant may determine from this utterance (particularly, the prefatory “Jan, . . . ”) that the message is intended for Jan.
  • the automated assistant may also determine that Jan is probably out of earshot, and therefore the message should be conveyed to Jan as an intercom message. By detecting Jan's voice on a nearby client device, the automated assistant may locate Jan and select the nearby client device to output the speaker's message.
  • a user may invoke an automated assistant using traditional invocation phrases and then explicitly command the automated assistant to cause some other computing device to output a message to be conveyed to a recipient.
  • the other computing device may be automatically selected based on the recipient's detected location as described above, or explicitly designated by the speaking user.
  • techniques are described herein for allowing a recipient of an intercom-style message received using disclosed techniques to leverage context provided in the received intercom message to perform other actions, such as issuing a search query or a command to an automated assistant.
  • the recipient may issue a search query, e.g., at the computing device at which she received the conveyed intercom message or another computing device.
  • Search results may then be obtained, e.g., by an automated assistant serving the second user, that are responsive to the search query.
  • the search results may be biased or ranked based at least in part on content of the originally conveyed intercom message.
  • the recipient's search query may be disambiguated based at least in part on content of the originally conveyed intercom message.
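  • As a hedged illustration of how such biasing might look, the simple term-overlap re-ranking below boosts search results that share terms with the originally conveyed intercom message; the scoring scheme and function name are illustrative assumptions, not the disclosed method.

```python
# Illustrative re-ranking: results sharing terms with the original intercom
# message receive a score boost. Base scores and tokenization are simplified.

def rerank_results(results, intercom_message, boost=0.2):
    """results: list of (title, base_score) pairs; returns them re-sorted."""
    context_terms = set(intercom_message.lower().split())
    reranked = []
    for title, base_score in results:
        overlap = len(context_terms & set(title.lower().split()))
        reranked.append((title, base_score + boost * overlap))
    return sorted(reranked, key=lambda r: r[1], reverse=True)

# e.g., if the original message was "should we order pizza tonight?", a
# result titled "Hypothetical Pizza Place menu" would be boosted.
```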
  • the original speaker's utterance may be transcribed (STT) only if it is determined that the recipient makes a downstream request. If the recipient simply listens to the message and does nothing further, no STT may be performed.
  • the original speaker's utterance may always be processed using STT (e.g., on determination that the utterance is to be conveyed through intercom-style communication), but the resulting transcription may be stored only locally and/or for a limited amount of time (e.g., long enough to give the recipient user ample time to make some downstream request).
  • one or more computing devices may wait until an intended recipient is able to perceive a message (e.g., is within earshot) before conveying the message using techniques described herein. For example, suppose a first user conveys a message to an intended recipient but the intended recipient has stepped outside momentarily. In some implementations, the first computing device to detect the recipient upon their return may output the original message, as sketched below.
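  • A minimal sketch of such deferred delivery follows, assuming a hypothetical PendingMessage record and a presence callback wired to the devices' presence sensors; none of these names come from the disclosure.

```python
# Deferred delivery sketch: hold a message until some client device reports
# the intended recipient as present, then output it on that device.
from dataclasses import dataclass, field

@dataclass
class PendingMessage:
    recipient: str
    audio: bytes

@dataclass
class DeferredDelivery:
    pending: list = field(default_factory=list)

    def queue(self, recipient, audio):
        self.pending.append(PendingMessage(recipient, audio))

    def on_presence_detected(self, person, device):
        """Called when a presence sensor on `device` detects `person`."""
        still_waiting = []
        for msg in self.pending:
            if msg.recipient == person:
                device.play(msg.audio)  # first device to detect them outputs it
            else:
                still_waiting.append(msg)
        self.pending = still_waiting
```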
  • a method performed by one or more processors includes: receiving, at a microphone of a first computing device of a plurality of computing devices, from a first user, voice input; analyzing the voice input; determining, based on the analyzing, that the first user intends to convey a message to a second user; determining a location of the second user relative to the plurality of computing devices; selecting, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user; and causing the second computing device to exclusively provide audio or visual output that conveys the message to the second user (e.g., only the second computing device provides the output, to the exclusion of other computing devices).
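  • Purely as an illustrative composition of the steps recited above, the sketch below strings them together; the classifier, location service, and device objects are hypothetical stand-ins, not a disclosed API.

```python
# Compact sketch of the recited flow. Every helper here (classifier,
# location_service, devices) is an assumed stand-in.

def handle_voice_input(audio, first_user, classifier, location_service, devices):
    intent = classifier.classify(audio)            # analyze the voice input
    if intent.label != "convey_message":
        return                                      # handled elsewhere or ignored
    recipient = intent.recipient                    # e.g., the second user
    if location_service.within_earshot(first_user, recipient):
        return                                      # no intercom output needed
    location = location_service.locate(recipient)   # relative to the device set
    target = min(devices, key=lambda d: d.distance_to(location))
    target.play(audio)                              # only the selected device outputs
```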
  • the analyzing may include applying an audio recording of the voice input as input across a trained machine learning model to generate output, wherein the output indicates that the first user intends to convey the message to the second user.
  • the machine learning model may be trained using a corpus of labelled utterances, and wherein labels applied to the utterances include a first label indicative of a command to convey a message to another user and a second label indicative of a command to engage in a human-to-computer dialog with an automated assistant.
  • labels applied to the utterances may further include a third label indicative of background conversation.
  • the selecting may be performed in response to a determination, based on the location of the second user, that the second user is not within earshot of the first user.
  • the location of the second user may be determined based at least in part on one or more signals generated by a mobile computing device operated by the second user.
  • the technical equipment may include the multiple computing devices referred to above, as well as a network over which the messages may be conveyed between the devices.
  • the efficiency in the manner in which, and the times at which, messages are conveyed may result in at least more efficient use of the network between the computing devices and also more efficient use of the computational resources, within the computing devices, which are employed to convey and receive the messages.
  • the location of the second user may be determined based at least in part on one or more signals generated by one or more of the plurality of computing devices other than the first computing device.
  • the one or more signals may include a signal indicative of the second user being detected by one or more of the plurality of computing devices other than the first computing device using passive infrared or ultrasound.
  • the one or more signals may include a signal indicative of the second user being detected by one or more of the plurality of computing devices other than the first computing device using a camera or a microphone.
  • the analyzing may include determining that the voice input includes an explicit command to convey the message to the second user as an intercom message via one or more of the plurality of computing devices. In various implementations, the analyzing may include performing speech-to-text processing on the voice input to generate textual input, and performing natural language processing on the textual input to determine that the user intends to convey the message to the second user.
  • the method may further include: identifying a search query issued by the second user after the audio or visual output is provided by the second computing device; obtaining search results that are responsive to the search query, wherein the obtaining is based at least in part on the voice input from the first user; and causing one or more of the plurality of computing devices to provide output indicative of at least some of the search results.
  • implementations include one or more processors of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the aforementioned methods. Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the aforementioned methods.
  • FIG. 1A is a block diagram of an example environment in which implementations disclosed herein may be implemented.
  • FIG. 1B schematically depicts one example of how a trained classifier may be applied to generate output based on user utterances and/or locations, in accordance with various implementations.
  • FIGS. 2, 3, and 4 depict example dialogs between various users and automated assistants, including intercom-style communications, in accordance with various implementations.
  • FIG. 5 depicts a flowchart illustrating an example method according to implementations disclosed herein.
  • FIG. 6 illustrates an example architecture of a computing device.
  • the example environment includes a plurality of client computing devices 106 1-N .
  • Each client device 106 may execute a respective instance of an automated assistant client 118 .
  • One or more cloud-based automated assistant components 119 , such as a natural language processor 122 , may be implemented on one or more computing systems (collectively referred to as a “cloud” computing system) that are communicatively coupled to client devices 106 1-N via one or more local and/or wide area networks (e.g., the Internet) indicated generally at 110 .
  • the plurality of client devices 106 1-N may be communicatively coupled with each other via one or more local area networks (“LANs,” including Wi-Fi LANs, mesh networks, etc.).
  • plurality of client computing devices 106 1-N may be associated with each other in various ways in order to facilitate performance of techniques described herein.
  • plurality of client computing devices 106 1-N may be associated with each other by virtue of being communicatively coupled via one or more LANs. This may be the case, for instance, where plurality of client computing devices 106 1-N are deployed across a particular area or environment, such as a home, a building, a campus, and so forth.
  • plurality of client computing devices 106 1-N may be associated with each other by virtue of them being members of a coordinated ecosystem of client devices 106 that are operated by one or more users (e.g., an individual, a family, employees of an organization, other predefined groups, etc.).
  • an instance of an automated assistant client 118 by way of its interactions with one or more cloud-based automated assistant components 119 , may form what appears to be, from the user's perspective, a logical instance of an automated assistant 120 with which the user may engage in a human-to-computer dialog. Two instances of such an automated assistant 120 are depicted in FIG. 1A .
  • a first automated assistant 120 A encompassed by a dashed line serves a first user (not depicted) operating first client device 106 1 and includes automated assistant client 118 1 and one or more cloud-based automated assistant components 119 .
  • a second automated assistant 120 B encompassed by a dash-dash-dot line serves a second user (not depicted) operating another client device 106 N and includes automated assistant client 118 N and one or more cloud-based automated assistant components 119 . It thus should be understood that each user that engages with an automated assistant client 118 executing on a client device 106 may, in effect, engage with his or her own logical instance of an automated assistant 120 .
  • an automated assistant described herein as “serving” a particular user refers to the combination of an automated assistant client 118 executing on a client device 106 operated by the user and one or more cloud-based automated assistant components 119 (which may be shared amongst multiple automated assistant clients 118 ). It should also be understood that in some implementations, automated assistant 120 may respond to a request from any user regardless of whether the user is actually “served” by that particular instance of automated assistant 120 .
  • the client devices 106 1-N may include, for example, one or more of: a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a computing device of a vehicle of the user (e.g., an in-vehicle communications system, an in-vehicle entertainment system, an in-vehicle navigation system), a standalone interactive speaker, a smart appliance such as a smart television, and/or a wearable apparatus of the user that includes a computing device (e.g., a watch of the user having a computing device, glasses of the user having a computing device, a virtual or augmented reality computing device). Additional and/or alternative client computing devices may be provided.
  • one or more of the client computing devices 106 1-N may include one or more presence sensors 105 1-N that are configured to provide signals indicative of detected presence, particularly human presence.
  • Presence sensors 105 1-N may come in various forms.
  • Some client devices 106 may be equipped with one or more digital cameras that are configured to capture and provide signal(s) indicative of movement detected in their fields of view.
  • some client devices 106 may be equipped with other types of light-based presence sensors 105 , such as passive infrared (“PIR”) sensors that measure infrared (“IR”) light radiating from objects within their fields of view.
  • some client devices 106 may be equipped with presence sensors 105 that detect acoustic (or pressure) waves, such as one or more microphones.
  • presence sensors 105 may be configured to detect other phenomena associated with human presence.
  • a client device 106 may be equipped with a presence sensor 105 that detects various types of waves (e.g., radio, ultrasonic, electromagnetic, etc.) emitted by, for instance, a mobile client device 106 carried/operated by a particular user.
  • some client devices 106 may be configured to emit waves that are imperceptible to humans, such as ultrasonic waves or infrared waves, that may be detected by other client devices 106 (e.g., via ultrasonic/infrared receivers such as ultrasonic-capable microphones).
  • various client devices 106 may emit other types of human-imperceptible waves, such as radio waves (e.g., Wi-Fi, Bluetooth, cellular etc.) that may be detected by one or more other client devices 106 and used to determine an operating user's particular location.
  • Wi-Fi triangulation may be used to detect a person's location, e.g., based on Wi-Fi signals to/from a client device 106 .
  • other wireless signal characteristics such as time-of-flight, signal strength, etc., may be used by various client devices 106 , alone or collectively, to determine a particular person's location based on signals emitted by a client device 106 they carry.
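  • One simple way to use such signal-strength cues is sketched below as a weighted-centroid estimate, offered only as an illustration; the device coordinates and weighting are assumptions rather than the disclosed technique.

```python
# Weighted-centroid position estimate from RSSI readings at fixed devices.
# A simple stand-in for the triangulation / signal-strength approaches above.

def estimate_position(readings):
    """readings: list of ((x, y), rssi_dbm) tuples, one per detecting device."""
    # Stronger (less negative) RSSI -> larger weight.
    weighted = [((x, y), 10 ** (rssi / 20.0)) for (x, y), rssi in readings]
    total = sum(w for _, w in weighted)
    x = sum(p[0] * w for p, w in weighted) / total
    y = sum(p[1] * w for p, w in weighted) / total
    return (x, y)

# estimate_position([((0, 0), -40), ((5, 0), -60), ((0, 5), -70)]) lands near
# (0, 0), the device that hears the carried phone most strongly.
```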
  • one or more client devices 106 may perform voice recognition to recognize an individual from their voice.
  • some automated assistants 120 may be configured to match a voice to a user's profile, e.g., for purposes of providing/restricting access to various resources.
  • movement of the speaker may then be tracked, e.g., by one or more other presence sensors that may be incorporated, for instance, in lights, light switches, smart thermostats, security cameras, etc.
  • a location of the individual may be predicted, and this location may be assumed to be the individual's location when another individual (i.e., a speaker) provides an utterance with a message for the first individual.
  • an individual may simply be assumed to be in the last location at which he or she engaged with automated assistant 120 , especially if not much time has passed since the last engagement.
  • Each of the client computing devices 106 1-N may operate a variety of different applications, such as a corresponding one of a plurality of message exchange clients 107 1-N .
  • Message exchange clients 107 1-N may come in various forms and the forms may vary across the client computing devices 106 1-N and/or multiple forms may be operated on a single one of the client computing devices 106 1-N .
  • one or more of the message exchange clients 107 1-N may come in the form of a short messaging service (“SMS”) and/or multimedia messaging service (“MMS”) client, an online chat client (e.g., instant messenger, Internet relay chat, or “IRC,” etc.), a messaging application associated with a social network, a personal assistant messaging service dedicated to conversations with automated assistant 120 , and so forth.
  • one or more of the message exchange clients 107 1-N may be implemented via a webpage or other resources rendered by a web browser (not depicted) or other application of a client computing device.
  • automated assistant 120 engages in human-to-computer dialog sessions with one or more users via user interface input and output devices of one or more client devices 106 1-N .
  • automated assistant 120 may engage in a human-to-computer dialog session with a user in response to user interface input provided by the user via one or more user interface input devices of one of the client devices 106 1-N .
  • the user interface input is explicitly directed to automated assistant 120 .
  • one of the message exchange clients 107 1-N may be a personal assistant messaging service dedicated to conversations with automated assistant 120 and user interface input provided via that personal assistant messaging service may be automatically provided to automated assistant 120 .
  • the user interface input may be explicitly directed to automated assistant 120 in one or more of the message exchange clients 107 1-N based on particular user interface input that indicates automated assistant 120 is to be invoked.
  • the particular user interface input may be one or more typed characters (e.g., @AutomatedAssistant), user interaction with a hardware button and/or virtual button (e.g., a tap, a long tap), an oral command (e.g., “Hey Automated Assistant”), and/or other particular user interface input.
  • automated assistant 120 may engage in a dialog session in response to user interface input, even when that user interface input is not explicitly directed to automated assistant 120 .
  • automated assistant 120 may examine the contents of user interface input and engage in a dialog session in response to certain terms being present in the user interface input and/or based on other cues.
  • automated assistant 120 may engage interactive voice response (“IVR”), such that the user can utter commands, searches, etc., and the automated assistant may utilize natural language processing and/or one or more grammars to convert the utterances into text, and respond to the text accordingly.
  • the automated assistant 120 can additionally or alternatively respond to utterances without converting the utterances into text.
  • the automated assistant 120 can convert voice input into an embedding, into entity representation(s) (that indicate entity/entities present in the voice input), and/or other “non-textual” representation and operate on such non-textual representation. Accordingly, implementations described herein as operating based on text converted from voice input may additionally and/or alternatively operate on the voice input directly and/or other non-textual representations of the voice input.
  • Each of the client computing devices 106 1-N and computing device(s) operating cloud-based automated assistant components 119 may include one or more memories for storage of data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network.
  • the operations performed by one or more of the client computing devices 106 1-N and/or by automated assistant 120 may be distributed across multiple computer systems.
  • Automated assistant 120 may be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network.
  • each of the client computing devices 106 1-N may operate an automated assistant client 118 .
  • each automated assistant client 118 may include a corresponding speech capture/text-to-speech (“TTS”)/STT module 114 .
  • one or more aspects of speech capture/TTS/STT module 114 may be implemented separately from automated assistant client 118 .
  • Each speech capture/TTS/STT module 114 may be configured to perform one or more functions: capture a user's speech, e.g., via a microphone (which in some cases may comprise presence sensor 105 ); convert that captured audio to text (and/or to other representations or embeddings); and/or convert text to speech.
  • the speech capture/TTS/STT module 114 that is local to each client device 106 may be configured to convert a finite number of different spoken phrases—particularly phrases that invoke automated assistant 120 and/or intercom-style communication—to text (or to other forms, such as lower dimensionality embeddings).
  • Other speech input may be sent to cloud-based automated assistant components 119 , which may include a cloud-based TTS module 116 and/or a cloud-based STT module 117 .
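  • A hedged sketch of that local/cloud split follows; the local_model matcher, its confidence value, and the cloud_stt service are illustrative assumptions.

```python
# Routing sketch: a small on-device matcher handles a finite phrase set
# (invocation and intercom phrases); everything else goes to cloud STT.

LOCAL_PHRASES = {"hey assistant", "broadcast", "tell"}  # assumed examples

def transcribe(audio, local_model, cloud_stt):
    guess, confidence = local_model.match(audio)  # limited-vocabulary matcher
    if confidence > 0.8 and guess in LOCAL_PHRASES:
        return guess                              # handled entirely on-device
    return cloud_stt.transcribe(audio)            # full STT in the cloud
```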
  • components that contribute to implementation of intercom-style communication as described herein may intentionally be operated exclusively on one or more client devices 106 that are associated with each other, for instance, by virtue of being on the same LAN.
  • any machine learning models described elsewhere herein may be trained and/or stored on one or more client devices 106 , e.g., behind an Internet firewall, so that training data and other information generated by or associated with the machine learning models may be maintained in privacy.
  • the cloud-based STT module 117 , cloud-based TTS module 116 , and/or cloud-based aspects of natural language processor 122 may not be involved in invocation of intercom-style communications.
  • Cloud-based STT module 117 may be configured to leverage the virtually limitless resources of the cloud to convert audio data captured by speech capture/TTS/STT module 114 into text (which may then be provided to natural language processor 122 ).
  • Cloud-based TTS module 116 may be configured to leverage the virtually limitless resources of the cloud to convert textual data (e.g., natural language responses formulated by automated assistant 120 ) into computer-generated speech output.
  • TTS module 116 may provide the computer-generated speech output to client device 106 to be output directly, e.g., using one or more speakers.
  • textual data (e.g., natural language responses) generated by automated assistant 120 may be provided to speech capture/TTS/STT module 114 , which may then convert the textual data into computer-generated speech that is output locally.
  • Automated assistant 120 may include a natural language processor 122 , the aforementioned TTS module 116 , the aforementioned STT module 117 , and other components, some of which are described in more detail below.
  • one or more of the engines and/or modules of automated assistant 120 may be omitted, combined, and/or implemented in a component that is separate from automated assistant 120 .
  • one or more of the components of automated assistant 120 , such as natural language processor 122 , speech capture/TTS/STT module 114 , etc., may be implemented at least in part on client devices 106 (e.g., to the exclusion of the cloud).
  • speech capture/TTS/STT module 114 may be sufficiently configured to perform selected aspects of the present disclosure to enable intercom-style communication, while in some cases leaving other, non-intercom-related natural language processing aspects to cloud-based components when suitable.
  • automated assistant 120 generates responsive content in response to various inputs generated by a user of one of the client devices 106 1-N during a human-to-computer dialog session with automated assistant 120 .
  • Automated assistant 120 may provide the responsive content (e.g., over one or more networks when separate from a client device of a user) for presentation to the user as part of the dialog session.
  • automated assistant 120 may generate responsive content in response to free-form natural language input provided via one of the client devices 106 1-N .
  • free-form input is input that is formulated by a user and that is not constrained to a group of options presented for selection by the user.
  • a “dialog session” may include a logically-self-contained exchange of one or more messages between a user and automated assistant 120 (and in some cases, other human participants).
  • Automated assistant 120 may differentiate between multiple dialog sessions with a user based on various signals, such as passage of time between sessions, change of user context (e.g., location, before/during/after a scheduled meeting, etc.) between sessions, detection of one or more intervening interactions between the user and a client device other than dialog between the user and the automated assistant (e.g., the user switches applications for a while, the user walks away from then later returns to a standalone voice-activated product), locking/sleeping of the client device between sessions, change of client devices used to interface with one or more instances of automated assistant 120 , and so forth.
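  • For illustration, a heuristic using two of those signals (elapsed time and a change of client device) might look like the sketch below; the 120-second threshold is an arbitrary assumed value.

```python
# Heuristic session-boundary check based on elapsed time and device change.

def starts_new_session(prev, current, max_gap_seconds=120):
    """prev/current: dicts with 'timestamp' (seconds) and 'device_id' keys."""
    if prev is None:
        return True
    if current["timestamp"] - prev["timestamp"] > max_gap_seconds:
        return True
    return current["device_id"] != prev["device_id"]
```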
  • Natural language processor 122 of automated assistant 120 processes natural language input generated by users via client devices 106 1-N and may generate annotated output for use by one or more other components of automated assistant 120 .
  • the natural language processor 122 may process natural language free-form input that is generated by a user via one or more user interface input devices of client device 106 1 .
  • the generated annotated output includes one or more annotations of the natural language input and optionally one or more (e.g., all) of the terms of the natural language input.
  • the natural language processor 122 is configured to identify and annotate various types of grammatical information in natural language input.
  • the natural language processor 122 may include a part of speech tagger configured to annotate terms with their grammatical roles.
  • the part of speech tagger may tag each term with its part of speech such as “noun,” “verb,” “adjective,” “pronoun,” etc.
  • the natural language processor 122 may additionally and/or alternatively include a dependency parser (not depicted) configured to determine syntactic relationships between terms in natural language input.
  • the dependency parser may determine which terms modify other terms, subjects and verbs of sentences, and so forth (e.g., a parse tree)—and may make annotations of such dependencies.
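  • The same kinds of part-of-speech and dependency annotations can be produced with an off-the-shelf library; the spaCy example below is offered only as an illustration, not as the implementation of natural language processor 122 .

```python
# Example POS tags and dependency relations using spaCy (illustrative only).
# Requires the small English model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jan, where are my shoes?")

for token in doc:
    # token text, part-of-speech tag, dependency label, and syntactic head
    print(token.text, token.pos_, token.dep_, token.head.text)
```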
  • the natural language processor 122 may additionally and/or alternatively include an entity tagger (not depicted) configured to annotate entity references in one or more segments such as references to people (including, for instance, literary characters, celebrities, public figures, etc.), organizations, locations (real and imaginary), and so forth.
  • data about entities may be stored in one or more databases, such as in a knowledge graph (not depicted).
  • the knowledge graph may include nodes that represent known entities (and in some cases, entity attributes), as well as edges that connect the nodes and represent relationships between the entities.
  • a “banana” node may be connected (e.g., as a child) to a “fruit” node, which in turn may be connected (e.g., as a child) to “produce” and/or “food” nodes.
  • a restaurant called “Hypothetical Café” may be represented by a node that also includes attributes such as its address, type of food served, hours, contact information, etc.
  • the “Hypothetical Café” node may in some implementations be connected by an edge (e.g., representing a child-to-parent relationship) to one or more other nodes, such as a “restaurant” node, a “business” node, a node representing a city and/or state in which the restaurant is located, and so forth.
  • the entity tagger of the natural language processor 122 may annotate references to an entity at a high level of granularity (e.g., to enable identification of all references to an entity class such as people) and/or a lower level of granularity (e.g., to enable identification of all references to a particular entity such as a particular person).
  • the entity tagger may rely on content of the natural language input to resolve a particular entity and/or may optionally communicate with a knowledge graph or other entity database to resolve a particular entity.
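  • A toy, in-memory rendering of the “Hypothetical Café” example above is sketched below; the plain-dict representation is an illustrative assumption, not the knowledge graph referenced in the disclosure.

```python
# Toy knowledge graph: nodes carry attributes, edges carry relationships.

nodes = {
    "Hypothetical Café": {"type": "restaurant", "hours": "8am-10pm"},
    "restaurant": {"type": "class"},
    "business": {"type": "class"},
}

edges = [
    ("Hypothetical Café", "is_a", "restaurant"),  # child-to-parent edge
    ("restaurant", "is_a", "business"),
]

def parents(entity):
    """Follow is_a edges upward from an entity, one hop."""
    return [dst for src, rel, dst in edges if src == entity and rel == "is_a"]

print(parents("Hypothetical Café"))  # ['restaurant']
```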
  • the natural language processor 122 may additionally and/or alternatively include a coreference resolver (not depicted) configured to group, or “cluster,” references to the same entity based on one or more contextual cues.
  • the coreference resolver may be utilized to resolve the term “there” to “Hypothetical Café” in the natural language input “I liked Hypothetical Café last time we ate there.”
  • one or more components of the natural language processor 122 may rely on annotations from one or more other components of the natural language processor 122 .
  • the named entity tagger may rely on annotations from the coreference resolver and/or dependency parser in annotating all mentions of a particular entity.
  • the coreference resolver may rely on annotations from the dependency parser in clustering references to the same entity.
  • one or more components of the natural language processor 122 may use related prior input and/or other related data outside of the particular natural language input to determine one or more annotations.
  • cloud-based automated assistant components 119 may include an intercom communication analysis service (“ICAS”) 138 and/or an intercom communication location service (“ICLS”) 140 .
  • services 138 and/or 140 may be implemented separately from cloud-based automated assistant components 119 , e.g., on one or more client devices 106 and/or on another computer system (e.g., in the so-called “cloud”).
  • ICAS 138 may be configured to determine, based on a variety of signals and/or data points, how and/or when to facilitate intercom-style communication between multiple users using multiple client devices 106 .
  • ICAS 138 may be configured to analyze voice input provided by a first user at a microphone of a client device 106 of a plurality of associated client devices 106 1-N .
  • ICAS 138 may analyze the first user's voice input and determine, based on the analysis, that the voice input contains a message intended for a second user.
  • an audio recording of the first user's voice input may be applied as input across a trained machine learning classifier to generate output.
  • the output may indicate that the first user's voice input contained a message intended for the second user.
  • various machine learning classifiers (or, more generally, “models”) may be trained to provide such output, including but not limited to various types of neural networks (e.g., feed-forward, convolutional, etc.).
  • labeled phonemes of user's utterances may be used to train a machine learning model such as a neural network to learn embeddings of utterances into lower dimensionality representations. These embeddings, which may include lower dimensionality representations of the original phonemes, may then be used (e.g., as input for the trained model) to identify when a user intends to use the intercom-style communication described herein, and/or when a user's utterance contains a message intended for another person.
  • labeled utterances may be embedded into reduced dimensionality space, e.g., such that they are clustered into groups associated with intercom-style communication and not-intercom-style communication.
  • the new, unlabeled utterance may then be embedded, and may be classified based on which cluster its embedding is nearest (e.g., in Euclidean space).
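  • A minimal nearest-centroid sketch of that classification step is shown below; the embeddings and centroids are assumed to come from a previously trained model and labeled corpus.

```python
# Nearest-centroid classification in the embedding space (illustrative).
import numpy as np

def classify_by_centroid(utterance_embedding, centroids):
    """centroids: dict label -> np.ndarray; returns the nearest label."""
    return min(
        centroids,
        key=lambda label: np.linalg.norm(utterance_embedding - centroids[label]),
    )

centroids = {
    "intercom": np.array([0.9, 0.1]),
    "not_intercom": np.array([0.1, 0.8]),
}
print(classify_by_centroid(np.array([0.8, 0.2]), centroids))  # 'intercom'
```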
  • a neural network may be trained using training data in the form of a corpus of labelled user utterances (in which case the training is “supervised”). Labels applied to the corpus of utterances may include, for instance, a first label indicative of an utterance that contains a message intended for another user, a second label indicative of a command to engage in a human-to-computer dialog with automated assistant 120 , and/or a third label indicative of background noise (which may be ignored).
  • the labelled training examples may be applied as input to an untrained neural network. Differences between the output of the untrained (or not fully-trained) neural network and the labels—a.k.a. error—may be determined and used with techniques such as back propagation, stochastic gradient descent, objective function optimization, etc., to adjust various weights of one or more hidden layers of the neural network to reduce the error.
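  • A minimal supervised-training sketch along those lines is shown below in PyTorch; the feature size, network architecture, and random stand-in data are illustrative assumptions rather than the disclosed training setup.

```python
# Minimal supervised training of a three-class utterance classifier over
# fixed-size audio features, using backpropagation and gradient descent.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Stand-in training data: labels 0 = convey message, 1 = invoke assistant,
# 2 = background noise.
features = torch.randn(32, 128)
labels = torch.randint(0, 3, (32,))

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)  # error vs. the labels
    loss.backward()                          # backpropagation
    optimizer.step()                         # weight update to reduce error
```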
  • machine learning classifiers such as neural networks may already be trained to recognize (e.g., classify) phonemes or other audio characteristics of utterances that are intended to invoke automated assistant 120 .
  • the same classifier may be trained further to both recognize (e.g., classify) explicit invocation of automated assistant 120 , and to determine whether an utterance contains a message intended for a second user.
  • separate machine learning classifiers may be used for each of these two tasks, e.g., one after the other or in parallel.
  • ICLS 140 may determine a location of the intended recipient relative to the plurality of client devices 106 1-N , e.g., using presence sensor(s) 105 associated with one or more of the client devices 106 1-N . For example, ICLS 140 may determine which client device 106 is nearest the intended recipient, and/or which room the intended recipient is in (which in some cases may be associated with a client device deployed in that room).
  • ICAS 138 may select, from the plurality of client devices 106 1-N , a second client device 106 that is capable of providing audio or visual output that is perceptible to the intended recipient. For example, if the intended recipient was last detected walking into a particular area, then a client device 106 nearest that area may be selected.
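  • One simple form of that selection is sketched below using an assumed room map and last-seen record; the device identifiers are hypothetical.

```python
# Output-device selection sketch: pick the device in the area where the
# intended recipient was last detected.

DEVICE_ROOMS = {"speaker_kitchen": "kitchen",
                "tv_den": "den",
                "speaker_bedroom": "bedroom"}  # assumed deployment

def select_output_device(recipient, last_seen):
    """last_seen: dict person -> room; returns a device id or None."""
    room = last_seen.get(recipient)
    for device, device_room in DEVICE_ROOMS.items():
        if device_room == room:
            return device
    return None  # recipient not located; delivery may be deferred

print(select_output_device("Jan", {"Jan": "den"}))  # tv_den
```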
  • ICLS 140 may be provided, e.g., as part of cloud-based automated assistant components 119 and/or separately therefrom.
  • ICAS 138 and ICLS 140 may be implemented together in a single model or engine.
  • ICLS 140 may be configured to track locations of persons within an area of interest, such as within a home, a workplace, a campus, etc., based on signals provided by, for example, presence sensors 105 integral with a plurality of client devices 106 1-N that are distributed throughout the area. Based on these tracked locations, ICLS 140 and/or ICAS 138 may be configured to facilitate intercom-style communication between persons in the area using the plurality of client devices 106 1-N as described herein.
  • ICLS 140 may create and/or maintain a list or database of persons located in a particular area, and/or their last known locations relative to a plurality of client devices 106 1-N deployed in the area. In some implementations, this list/database may be updated, e.g., in real time, as persons are detected by different client devices as having moved to different locations. For example, ICLS 140 may drop a particular person from the list/database if, for example, that person is not detected in the overall area for some predetermined time interval (e.g., one hour) and/or if the person is last detected passing through an ingress or egress area (e.g., front door, back door, etc.). In other implementations, ICLS 140 may update the list/database periodically, e.g., every few minutes, hours, etc.
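  • The sketch below illustrates such a last-known-location registry with a time-based drop and an egress-area drop; apart from the one-hour interval mentioned above, the details are illustrative assumptions.

```python
# Last-known-location registry: refreshed on each detection, pruned after a
# timeout, and cleared immediately when a person is last seen at an egress.
import time

EGRESS_AREAS = {"front door", "back door"}
TIMEOUT_SECONDS = 3600  # the one-hour example interval

registry = {}  # person -> (room, timestamp)

def record_detection(person, room):
    if room in EGRESS_AREAS:
        registry.pop(person, None)       # assume the person left the area
    else:
        registry[person] = (room, time.time())

def prune():
    now = time.time()
    for person in [p for p, (_, ts) in registry.items()
                   if now - ts > TIMEOUT_SECONDS]:
        del registry[person]
```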
  • ICAS 138 and/or ICLS 140 may be configured to distinguish between different people using signals from presence sensors 105 , rather than simply detect presence of a generic person. For example, suppose a client device 106 includes a microphone as a presence sensor 105 . Automated assistant 120 may be configured to use a variety of speaker recognition and/or voice recognition techniques to determine not only that someone is present nearby, but who is present. These speaker recognition and/or voice recognition techniques may include but are not limited to hidden Markov models, Gaussian mixture models, frequency estimation, trained classifiers, deep learning, pattern matching algorithms, matrix representation, vector quantization, decision trees, etc.
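  • As one hedged example drawn from that family of techniques, the sketch below compares a voice embedding of a detected utterance against enrolled user profiles by cosine similarity; the embedding model and threshold are assumptions.

```python
# Speaker identification by cosine similarity against enrolled embeddings.
import numpy as np

def identify_speaker(utterance_embedding, profiles, threshold=0.75):
    """profiles: dict name -> enrolled embedding; returns a name or None."""
    best_name, best_score = None, threshold
    for name, enrolled in profiles.items():
        score = np.dot(utterance_embedding, enrolled) / (
            np.linalg.norm(utterance_embedding) * np.linalg.norm(enrolled))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```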
  • a client device 106 includes, as a presence sensor 105 , a camera and/or a PIR sensor.
  • a machine learning visual recognition classifier may be trained using labelled training data captured by such a presence sensor 105 to recognize the person visually.
  • a user may cause the visual recognition classifier to be trained by invoking a training routine at one or more camera/PIR sensor-equipped client devices 106 .
  • a user may stand in a field of view of presence sensor 105 and invoke automated assistant 120 with a phrase such as “Hey Assistant, I am Jan and this is what I look like.”
  • automated assistant 120 may provide audible or visual output that prompts the user to move around to various positions within a field of view of presence sensor 105 , while presence sensor 105 captures one or more snapshots of the user. These snapshots may then be labelled (e.g., with “Jan”) and used as labelled training examples for supervised training of the visual recognition classifier.
  • labelled training examples for visual recognition may be generated automatically, e.g., without the user being aware.
  • in some implementations, a signal (e.g., a radio wave, an ultrasonic wave, etc.) emitted by a mobile client device 106 carried by the user may be analyzed, e.g., by automated assistant 120 , to determine the user's identity (and hence, a label) for snapshots captured by presence sensor 105 .
  • a user's mobile client device 106 may include a network identifier, such as “Jan's Smartphone,” that may be used to identify the user.
  • In FIG. 1B , an example data flow is depicted schematically to demonstrate one possible way in which a trained machine learning classifier may be applied to analyze user utterances and determine, among other things, whether to employ intercom-style communication.
  • a phoneme classifier 142 (which may be a component of automated assistant 120 ) may be trained such that one or more utterances and one or more person locations may be applied across phoneme classifier 142 as input.
  • Phoneme classifier 142 may then generate, as output, a classification of the utterance(s).
  • these classifications include “invoke assistant,” “convey message,” and “background noise,” but additional and/or alternative labels are possible.
  • as noted above, automated assistant 120 is typically invoked using one or more invocation phrases; phoneme classifier 142 may include the same functionality such that when an input utterance includes such an invocation phrase, the output of phoneme classifier 142 is “invoke assistant.” Once automated assistant 120 is invoked, the user may engage in human-to-computer dialog with automated assistant 120 as is known in the art.
  • phoneme classifier 142 may be further trained to recognize other phonemes that signal a user intent to convey a message to another user. For example, users may often use phrases such as “Hey, ⁇ name>” to get another person's attention. More generally, phoneme classifier 142 may operate to match custom phrases, words, etc. Additionally or alternatively, to get another person's attention, it may be common to first speak the other person's name, sometimes in a slightly elevated volume and/or with particular intonations, or to use other types of intonations. In various implementations, phoneme classifier 142 may be trained to recognize such phonemes and generate output such as “convey message” to signal a scenario in which intercom-style communication may potentially be warranted.
  • a separate intonation model may optionally be separately trained to recognize utterances that seek communication with another person (e.g., to differentiate such utterances from casual utterances) and generate output that indicates the presence of such utterances (e.g., a likelihood that such an utterance is present).
  • the outputs from the phoneme classifier and the intonation model, for a given user utterance, may be collectively considered in determining if intercom-style communication may be warranted.
  • one or more person locations may be provided, e.g., by ICLS 140 , as input to phoneme classifier 142 . These person locations may be used, in addition to or instead of the utterance(s), to determine whether intercom-style communication is warranted. For example, if the recipient location is sufficiently near (e.g., within earshot of) a speaker's location, that may influence phoneme classifier 142 to produce output such as “background noise,” even if the utterance contains a message intended for another. On the other hand, suppose the intended recipient's location is out of earshot of the speaker's location.
  • In that case, the intended recipient's detected location may influence phoneme classifier 142 to produce output such as “convey message,” which may increase a likelihood that intercom-style communication is employed.
  • a two-step approach may be implemented in which it is first determined whether a speaker's utterance contains a message intended for another user, and it is then determined whether the other user is within earshot of the speaker. If the answer to both questions is yes, then intercom-style communication may be implemented to convey the message to the intended recipient.
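  • One possible realization of this two-step approach is sketched below in Python. The helper names (utterance_contains_message_for, distance_between, should_use_intercom) and the earshot threshold are hypothetical; step one could equally be backed by phoneme classifier 142 rather than the simple prefix check shown.

      EARSHOT_METERS = 10.0  # assumed threshold for "within earshot"

      def utterance_contains_message_for(utterance_text):
          # Step 1 (placeholder): return the intended recipient's name, or None.
          for name in ("Jan", "Jack", "Oliver"):
              if utterance_text.lower().startswith(name.lower() + ","):
                  return name
          return None

      def distance_between(a, b):
          return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

      def should_use_intercom(utterance_text, speaker_location, person_locations):
          recipient = utterance_contains_message_for(utterance_text)
          if recipient is None:
              return None  # step 1 failed: not a message for another user
          recipient_location = person_locations.get(recipient)
          if recipient_location is None:
              return recipient  # recipient not located; may need fallback handling
          if distance_between(speaker_location, recipient_location) <= EARSHOT_METERS:
              return None  # step 2: recipient can already hear the speaker
          return recipient

      print(should_use_intercom("Jan, dinner is ready",
                                speaker_location=(0, 0),
                                person_locations={"Jan": (15, 4)}))
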
  • In FIG. 2 , a home floorplan is depicted that includes a plurality of rooms, 250 - 262 .
  • a plurality of client devices 206 1-4 are deployed throughout at least some of the rooms.
  • Each client device 206 may implement an instance of automated assistant client 118 configured with selected aspects of the present disclosure and may include one or more input devices, such as microphones, that are capable of capturing utterances spoken by a person nearby.
  • a first client device 206 1 taking the form of a standalone interactive speaker is deployed in room 250 , which in this example is a kitchen.
  • a second client device 206 2 taking the form of a so-called “smart” television (e.g., a networked television with one or more processors that implement an instance of automated assistant client 118 ) is deployed in room 252 , which in this example is a den.
  • a third client device 206 3 taking the form of an interactive standalone speaker is deployed in room 254 , which in this example is a bedroom.
  • a fourth client device 206 4 taking the form of another interactive standalone speaker is deployed in room 256 , which in this example is a living room.
  • The plurality of client devices 206 1-4 may be communicatively coupled with each other and/or other resources (e.g., the Internet) via one or more wired or wireless LANs (e.g., 110 2 in FIG. 1A ).
  • Other client devices, particularly mobile devices such as smart phones, tablets, laptops, wearable devices, etc., may also be present, e.g., carried by one or more persons in the home, and may or may not also be connected to the same LAN.
  • The configuration of client devices depicted in FIG. 2 and elsewhere in the Figures is just one example; more or fewer client devices 106 may be deployed across any number of rooms and/or areas other than a home.
  • Suppose Jack, while in kitchen 250 , speaks a question intended for Jan. First client device 206 1 , which as noted above is configured with selected aspects of the present disclosure, may detect Jack's utterance. A recording of the utterance may be analyzed using techniques described above to determine that Jack's utterance contains a message intended for Jan.
  • First client device 206 1 also may determine, e.g., based on information shared amongst all of the plurality of client devices 206 1-4 , that Jan is in living room 256 (or at least nearest fourth client device 206 4 ). For example, client device 206 4 may have detected, e.g., using one or more integral presence sensors (e.g., 105 in FIG. 1A ), that Jan is in living room 256 .
  • first client device 206 1 may determine that Jack intended his message for Jan and that Jan is out of earshot of Jack. Consequently, first client device 206 1 may push (over one or more of the aforementioned LANs) a recording of Jack's utterance (or in some cases, transcribed text of Jack's utterance) to the client device nearest Jan, which in this example is fourth client device 206 4 .
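  • A simplified sketch of this selection-and-push step follows. The room map, the last-detected locations, and the push_recording stub are assumptions made for illustration; an actual implementation would stream the recording over the LAN (e.g., via HTTP or a local RPC mechanism) rather than print a message, and would use a more principled earshot test than the same-room check shown here.

      # Which room each client device is deployed in (cf. FIG. 2).
      DEVICE_ROOMS = {
          "client_device_206_1": "kitchen",
          "client_device_206_2": "den",
          "client_device_206_3": "bedroom",
          "client_device_206_4": "living room",
      }

      # Last-detected person locations shared amongst the client devices.
      last_detected_room = {"Jack": "kitchen", "Jan": "living room"}

      def select_device_near(person):
          room = last_detected_room.get(person)
          for device, device_room in DEVICE_ROOMS.items():
              if device_room == room:
                  return device
          return None

      def push_recording(device, recording_bytes):
          # Placeholder for actually streaming the recording over the LAN.
          print(f"pushing {len(recording_bytes)} bytes to {device}")

      target = select_device_near("Jan")           # -> "client_device_206_4"
      # Crude stand-in for "Jan is out of earshot of Jack": different rooms.
      if target is not None and target != select_device_near("Jack"):
          push_recording(target, b"\x00" * 32000)  # Jack's recorded utterance
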
  • fourth client device 206 4 may, e.g., by way of automated assistant 120 executing at least in part on fourth client device 206 4 , audibly output Jack's message to Jan as depicted in FIG. 2 , thus effecting intercom-style communication between Jack and Jan.
  • Jack's question is output to Jan audibly using fourth client device 206 4 , which as noted above is a standalone interactive speaker.
  • Jack's message may be conveyed to Jan using other output modalities.
  • For example, if a mobile client device (not depicted) carried by Jan is connected to the Wi-Fi LAN, that mobile device may output Jack's message, either as an audible recording or as a textual message that is conveyed to Jan visually, e.g., using an application such as message exchange client 107 executing on Jan's mobile client device.
  • recordings and/or STT transcriptions of utterances that are exchanged between client devices 106 to facilitate intercom communication may be used for a variety of additional purposes. In some embodiments, they may be used to provide context to downstream human-to-computer dialogs between user(s) and automated assistant 120 . For example, in some scenarios, a recorded utterance and/or its STT transcription may be used to disambiguate a request provided to an instance of automated assistant 120 , whether that request be from the user who originally provided the utterance, an intended recipient of the utterance, or even another user who engages automated assistant 120 subsequent to an intercom-style communication involving a plurality of client devices 106 .
  • FIG. 3 depicts the same home and distribution of client devices 206 1-4 as was depicted in FIG. 2 .
  • Suppose that Jan, still in living room 256 , provides an utterance that contains a message intended for Jack.
  • ICLS 140 may determine, e.g., based on a signal provided by an onboard camera and/or PIR sensor of a “smart” thermostat 264 , that Jack is located in den 252 .
  • Consequently, a client device near Jack's detected location, such as client device 206 2 , may be identified to output Jan's utterance.
  • Jan's recorded utterance may be pushed from another computing device near Jan that recorded it, such as client device 206 4 , to client device 206 2 identified near Jack and output audibly (or visually since client device 206 2 is a smart television with display capabilities).
  • FIG. 4 demonstrates an example follow up scenario to that depicted in FIG. 3 .
  • Jack says “OK Assistant—when is the next tram leaving?”
  • By itself, this request, or search query, may be too ambiguous to answer, and automated assistant 120 may be required to solicit disambiguating information from Jack.
  • automated assistant 120 may disambiguate Jack's request based on Jan's original utterance to determine that the tram to the airport is the one Jack is interested in.
  • automated assistant 120 could simply retrieve normal results for all nearby trams, and then rank those results based on Jan's original utterance, e.g., so that the tram to the airport is ranked highest. Whichever the case, in FIG. 4 , automated assistant 120 provides audio output at client device 206 2 of “Next tram to the airport leaves in 10 minutes.”
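  • The ranking variant can be sketched very simply: score each candidate result by how many terms it shares with the earlier utterance and sort accordingly. The tokenization, the candidate results, and the wording used below for Jan's earlier utterance are hypothetical stand-ins, not a description of any particular search system.

      def rank_by_context(results, context_utterance):
          context_terms = set(context_utterance.lower().split())

          def overlap(result):
              return len(set(result.lower().split()) & context_terms)

          return sorted(results, key=overlap, reverse=True)

      tram_results = [
          "Next tram to downtown leaves in 4 minutes",
          "Next tram to the airport leaves in 10 minutes",
          "Next tram to the stadium leaves in 12 minutes",
      ]
      # Hypothetical wording for Jan's earlier utterance mentioning the airport.
      print(rank_by_context(tram_results, "we need the tram to the airport soon")[0])
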
  • FIG. 5 is a flowchart illustrating an example method 500 according to implementations disclosed herein. For convenience, the operations of the flow chart are described with reference to a system that performs the operations. This system may include various components of various computer systems, such as one or more components of computing systems that implement automated assistant 120 . Moreover, while operations of method 500 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.
  • the system may receive, at an input device of a first computing device of a plurality of computing devices, from a first user, free form natural language input.
  • this free form natural language input may come in the form of voice input, i.e., an utterance from the first user, though this is not required. It should be understood that this voice input need not necessarily be directed by the first user at automated assistant 120 , and instead may include any utterance provided by the first user that is captured and/or recorded by a client device configured with selected aspects of the present disclosure.
  • the system may analyze the voice input.
  • Various aspects (e.g., phonemes) of the voice input may be analyzed, including but not limited to intonation, volume, recognized phrases, etc.
  • the system may analyze other signals in addition to the voice input.
  • These other signals may include, for instance, a number of people in an environment such as a house. For example, if only one person is present, no intercom capabilities may be utilized. If only two people are present, then the location of the other person may automatically be determined to be the location at which intercom output should be provided. If more than two people are present, then the system may employ various techniques (e.g., voice recognition, facial recognition, wireless signal recognition, etc.) to attempt to distinguish people from each other.
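  • These person-count heuristics might be expressed roughly as follows; the helper name route_message and its sentinel return values are illustrative assumptions.

      def route_message(people_locations, speaker):
          # people_locations: mapping of detected person -> location label
          others = {p: loc for p, loc in people_locations.items() if p != speaker}
          if not others:
              return None                         # only the speaker: no intercom
          if len(others) == 1:
              return next(iter(others.values()))  # one other person: use their location
          # More than two people present: the intended recipient must first be
          # identified (e.g., via voice, face, or wireless-signal recognition).
          return "needs-recipient-identification"

      print(route_message({"Jack": "kitchen", "Jan": "living room"}, speaker="Jack"))
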
  • the system may determine, based on the analyzing, that the first user intends to convey a message to a second user (e.g., that the voice input contains a message intended for the second user).
  • automated assistant 120 may employ various techniques, such as a classifier trained on labeled training examples in the form of recorded utterances, to analyze the voice input and/or determine whether the first user's voice input is a command to invoke automated assistant 120 (e.g., “Hey, Assistant”) to engage in further human-to-computer dialog, an utterance intended to convey a message to the second user (or multiple other users), or other background noise.
  • a rules-based approach may be implemented. For example, one or more simple IVR grammars may be defined, e.g., using technologies such as voice extensible markup language, or “VXML,” that are designed to match utterances that are intended to convey messages between users.
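  • As an illustration of such a rules-based approach, the sketch below uses Python regular expressions as a stand-in for a VXML grammar to decide whether a transcription carries a message for another user. The patterns and names are examples only and would likely be far richer in practice.

      import re

      MESSAGE_PATTERNS = [
          # e.g., "Hey Jan, dinner is ready"
          re.compile(r"^hey\s+(?P<recipient>\w+)\s*,\s*(?P<message>.+)$", re.IGNORECASE),
          # e.g., "Jan, dinner is ready"
          re.compile(r"^(?P<recipient>\w+)\s*,\s*(?P<message>.+)$", re.IGNORECASE),
      ]

      def match_message(transcript):
          for pattern in MESSAGE_PATTERNS:
              m = pattern.match(transcript.strip())
              if m:
                  return m.group("recipient"), m.group("message")
          return None

      print(match_message("Hey Jan, dinner is ready"))   # ('Jan', 'dinner is ready')
      print(match_message("what a nice day"))            # None
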
  • the system may determine a location of the second user relative to the plurality of computing devices.
  • ICLS 140 may maintain a list or database of people in an area such as a home or workplace and their last-known (i.e., last-detected) locations. In some such implementations, the system may simply consult this list or database for the second user's location.
  • the system may actively poll a plurality of client devices in the environment to seek out the second user, e.g., on an as-needed basis (e.g., when it is determined that the first user's utterance contains a message intended to be conveyed to the second user). This may cause the client devices to activate presence sensors ( 105 in FIG. 1A ) so that they can detect whether someone (e.g., the second user) is nearby.
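  • On-demand polling might look roughly like the following sketch, in which each client device is represented by a callable standing in for an RPC that activates its presence sensors and reports who was detected; all names are hypothetical.

      def poll_for_person(person, client_devices):
          # client_devices: mapping of device name -> callable returning the set
          # of people that device currently detects nearby.
          for device_name, detect_people in client_devices.items():
              if person in detect_people():
                  return device_name
          return None

      # Illustrative stand-ins for per-device presence detection results.
      devices = {
          "client_device_206_2": lambda: {"Jack"},
          "client_device_206_4": lambda: {"Jan"},
      }
      print(poll_for_person("Jan", devices))  # -> "client_device_206_4"
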
  • the system may select, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user.
  • the second computing device may be a stationary client device (e.g., a standalone interactive speaker, smart television, desktop computer, etc.) that is deployed in a particular area of an environment.
  • the second computing device may be a mobile client device carried by the second user.
  • the mobile client device may become part of the plurality of computing devices considered by the system by virtue of being part of the same coordinated ecosystem and/or joining the same wireless LAN (or simply being located within a predetermined distance).
  • the system may cause the second computing device identified at block 510 to provide audio or visual output that conveys the message to the second user.
  • For example, the first computing device may cause a recording of the first user's utterance to be forwarded (e.g., streamed) to the second computing device selected at block 510 .
  • the second computing device which may be configured with selected aspects of the present disclosure, may respond by outputting the forwarded recording.
  • a user's utterance may be intended for multiple other users, such as multiple members of the speaker's family, all persons in the area or environment, etc.
  • one or more of the aforementioned machine learning classifiers may be (further) trained to determine whether an utterance contains a message intended for a single recipient or multiple recipients. If the answer is yes, then the system may convey the message to the multiple intended recipients at multiple locations in various ways.
  • the system may simply cause the message to be pushed to all client devices in the area (e.g., all client devices of a coordinated ecosystem and/or all client devices connected to a Wi-Fi LAN), effectively broadcasting the message.
  • Alternatively, the system (e.g., ICLS 140 ) may determine locations of all intended recipients on an individual basis, and output the message on only those client devices that are near each intended recipient.
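  • The broadcast-versus-targeted choice might be sketched as follows; the deliver function, the device_for_person lookup (e.g., one backed by ICLS 140), and the handling of recipients whose locations are unknown are all simplifying assumptions.

      def deliver(message, recipients, device_for_person, all_devices):
          # Target each located recipient individually; fall back to a broadcast
          # if none of the intended recipients can be located.
          targeted = {device_for_person(r) for r in recipients}
          targeted.discard(None)
          targets = targeted if targeted else set(all_devices)
          for device in targets:
              print(f"output on {device}: {message}")

      deliver("Dinner is ready",
              recipients=["Jan", "Jack"],
              device_for_person={"Jan": "client_device_206_4",
                                 "Jack": "client_device_206_2"}.get,
              all_devices=["client_device_206_1", "client_device_206_2",
                           "client_device_206_3", "client_device_206_4"])
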
  • automated assistant 120 may wait until an intended recipient of a message is able to perceive the message (e.g., is within earshot) before causing the message to be conveyed using techniques described herein. For example, suppose a first user conveys a message to an intended recipient but the intended recipient has stepped outside momentarily. In some implementations, the message may be temporarily delayed until the intended recipient is detected by one or more computing devices. The first computing device to detect the recipient upon their return may output the original message.
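  • A minimal sketch of such deferred delivery, assuming a hypothetical on_presence_detected callback invoked whenever a client device detects someone nearby:

      import time

      pending_messages = []  # (recipient, message) pairs awaiting delivery

      def queue_message(recipient, message):
          pending_messages.append((recipient, message))

      def on_presence_detected(device, person):
          # The first device to detect a pending recipient outputs the message.
          still_pending = []
          for recipient, message in pending_messages:
              if recipient == person:
                  print(f"{time.strftime('%H:%M:%S')} {device}: {message}")
              else:
                  still_pending.append((recipient, message))
          pending_messages[:] = still_pending

      queue_message("Jan", "Dinner is ready")
      on_presence_detected("client_device_206_4", "Jan")
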
  • A variety of signals, such as the intended recipient's position coordinate (e.g., Global Positioning System, or “GPS”) obtained from a mobile device they carry, may be used to determine that the intended recipient will not be reachable using intercom-style communication (at least not with any devices on the LAN).
  • In some such situations, the message (e.g., a recording of the speaker's utterance) may instead be forwarded to the recipient's mobile device.
  • automated assistant 120 may determine that the intended recipient is unreachable, and may provide output, e.g., at a client device closest to the speaker (e.g., the device that captured the speaker's utterance) that notifies the speaker that the recipient is unreachable at the moment.
  • automated assistant 120 may prompt the user for permission to forward the message to the recipient's mobile device, e.g., by outputting something like “I can't reach Jan directly right now. Would you like me to send a message to their phone?”
  • Optional blocks 514 - 518 may occur if the second user issues free form natural language input sometime after the first user's message is output at block 512 ; otherwise, they may be omitted.
  • the system may identify a free form natural language input, such as a voice input, that is issued by the second user after the audio or visual output is provided by the second computing device at block 512 .
  • the second user's voice input may include, for instance, a command and/or a search query.
  • the command and/or search query may be, by itself, too ambiguous to properly interpret, as was the case with Jack's utterance in FIG. 4 .
  • the system may analyze the second user's free form natural language input identified at block 514 based at least in part on the free form natural language input received from the first user at block 502 .
  • the first user's utterance may be transcribed and used to provide context to the second user's subsequent request.
  • the system may formulate a response to the second user's natural language input based on the context provided by the first user's original free form natural language input. For example, if the second user's free form natural language input included a search query (such as Jack's query in FIG. 4 ), the system may obtain search results that are responsive to the search query based at least in part on the voice input from the first user.
  • the second user's search query may be disambiguated based on the first user's original utterance, and/or one or more responsive search results may be ranked based on the first user's original utterance.
  • the system may then cause one or more of the plurality of computing devices to provide output indicative of at least some of the search results, as occurred in FIG. 4 .
  • users may preconfigure (e.g., commission) client computing devices in their home, workplace, or in another environment, to be usable to engage in intercom-style communications as described herein.
  • a user may, e.g., using a graphical user interface and/or by engaging in a human-to-computer dialog session with automated assistant 120 , assign a “location” to each stationary client computing device, such as “kitchen,” “dining room,” etc. Consequently, in some such implementations, a user may explicitly invoke automated assistant 120 to facilitate intercom-style communication to a particular location.
  • a user may provide the following voice input to convey a message to another user: “Hey Assistant, tell Oliver in the kitchen that we need more butter.”
  • users may explicitly designate a recipient of a message when they invoke intercom-style communication. If the user does not also specify a location of the recipient, then techniques described herein, e.g., in association with ICLS 140 , may be used automatically to determine a location of the recipient and select which computing device will be used to output the message to the recipient.
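  • Parsing such an explicit invocation (after the invocation phrase itself has been stripped) might be approximated with a single pattern, as in the sketch below; the pattern and the convention that a missing location triggers automatic localization of the recipient are illustrative assumptions.

      import re

      EXPLICIT_INTERCOM = re.compile(
          r"^tell\s+(?P<recipient>\w+)"
          r"(?:\s+in\s+the\s+(?P<location>[\w ]+?))?"
          r"\s+that\s+(?P<message>.+)$",
          re.IGNORECASE,
      )

      def parse_intercom_command(command_text):
          m = EXPLICIT_INTERCOM.match(command_text.strip())
          if not m:
              return None
          return {
              "recipient": m.group("recipient"),
              "location": m.group("location"),  # None -> locate recipient automatically
              "message": m.group("message"),
          }

      print(parse_intercom_command("tell Oliver in the kitchen that we need more butter"))
      print(parse_intercom_command("tell Oliver that we need more butter"))
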
  • a user need not explicitly invoke intercom-style communications. Rather, as described above, various signals and/or data points (e.g., output of a machine learning classifier, location of an intended recipient, etc.) may be considered to determine, without explicit instruction from the user, that the user's message should be conveyed automatically using intercom-style communication.
  • FIG. 6 is a block diagram of an example computing device 610 that may optionally be utilized to perform one or more aspects of techniques described herein.
  • one or more of a client computing device, user-controlled resources engine 130 , and/or other component(s) may comprise one or more components of the example computing device 610 .
  • Computing device 610 typically includes at least one processor 614 which communicates with a number of peripheral devices via bus subsystem 612 .
  • peripheral devices may include a storage subsystem 624 , including, for example, a memory subsystem 625 and a file storage subsystem 626 , user interface output devices 620 , user interface input devices 622 , and a network interface subsystem 616 .
  • the input and output devices allow user interaction with computing device 610 .
  • Network interface subsystem 616 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.
  • User interface input devices 622 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices.
  • use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 610 or onto a communication network.
  • User interface output devices 620 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
  • the display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
  • the display subsystem may also provide non-visual display such as via audio output devices.
  • Use of the term “output device” is intended to include all possible types of devices and ways to output information from computing device 610 to the user or to another machine or computing device.
  • Storage subsystem 624 stores programming and data constructs that provide the functionality of some or all of the modules described herein.
  • the storage subsystem 624 may include the logic to perform selected aspects of the method of FIG. 5 , as well as to implement various components depicted in FIG. 1A .
  • Memory 625 used in the storage subsystem 624 can include a number of memories including a main random access memory (RAM) 630 for storage of instructions and data during program execution and a read only memory (ROM) 632 in which fixed instructions are stored.
  • a file storage subsystem 626 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
  • the modules implementing the functionality of certain implementations may be stored by file storage subsystem 626 in the storage subsystem 624 , or in other machines accessible by the processor(s) 614 .
  • Bus subsystem 612 provides a mechanism for letting the various components and subsystems of computing device 610 communicate with each other as intended. Although bus subsystem 612 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
  • Computing device 610 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 610 depicted in FIG. 6 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 610 are possible having more or fewer components than the computing device depicted in FIG. 6 .
  • In situations in which the systems discussed herein collect personal information about users, or may make use of personal information, users are provided with one or more opportunities to control whether information is collected, whether the personal information is stored, whether the personal information is used, and how the information is collected about the user, stored and used. That is, the systems and methods discussed herein collect, store and/or use user personal information only upon receiving explicit authorization from the relevant users to do so.
  • a user is provided with control over whether programs or features collect user information about that particular user or other users relevant to the program or feature.
  • Each user for which personal information is to be collected is presented with one or more options to allow control over the information collection relevant to that user, to provide permission or authorization as to whether the information is collected and as to which portions of the information are to be collected.
  • users can be provided with one or more such control options over a communication network.
  • certain data may be treated in one or more ways before it is stored or used so that personally identifiable information is removed.
  • a user's identity may be treated so that no personally identifiable information can be determined.
  • a user's geographic location may be generalized to a larger region so that the user's particular location cannot be determined.

Abstract

Techniques are described related to improved intercom-style communication using a plurality of computing devices distributed about an environment. In various implementations, voice input may be received, e.g., at a microphone of a first computing device of multiple computing devices, from a first user. The voice input may be analyzed and, based on the analyzing, it may be determined that the first user intends to convey a message to a second user. A location of the second user relative to the multiple computing devices may be determined, so that, based on the location of the second user, a second computing device may be selected from the multiple computing devices that is capable of providing audio or visual output that is perceptible to the second user. The second computing device may then be operated to provide audio or visual output that conveys the message to the second user.

Description

    BACKGROUND
  • Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “chat bots,” “interactive personal assistants,” “intelligent personal assistants,” “personal voice assistants,” “conversational agents,” etc.). For example, humans (which when they interact with automated assistants may be referred to as “users”) may provide commands, queries, and/or requests using spoken natural language input (i.e. utterances) which may in some cases be converted into text and then processed, and/or by providing textual (e.g., typed) natural language input.
  • In some cases, automated assistants may include automated assistant “clients” that are installed locally on client devices and that are engaged directly by users, as well as cloud-based counterpart(s) that leverage the virtually limitless resources of the cloud to help automated assistant clients respond to users' queries. For example, the automated assistant client may provide, to the cloud-based counterpart(s), an audio recording of the user's query (or a text conversion thereof) and data indicative of the user's identity (e.g., credentials). The cloud-based counterpart may perform various processing on the query to return various results to the automated assistant client, which may then provide corresponding output to the user. For the sakes of brevity and simplicity, the term “automated assistant,” when described herein as “serving” a particular user, may refer to the automated assistant client installed on the particular user's client device and any cloud-based counterpart that interacts with the automated assistant client to respond to the user's queries.
  • Many users may engage automated assistants using multiple devices. For example, some users may possess a coordinated “ecosystem” of computing devices that includes one or more smart phones, one or more tablet computers, one or more vehicle computing systems, one or more wearable computing devices, one or more smart televisions, and/or one or more standalone interactive speakers, among other more traditional computing devices. A user may engage in human-to-computer dialog with an automated assistant using any of these devices (assuming an automated assistant client is installed). In some cases these devices may be scattered around the user's home or workplace. For example, mobile computing devices such as smart phones, tablets, smart watches, etc., may be on the user's person and/or wherever the user last placed them (e.g., at a charging station). Other computing devices, such as traditional desktop computers, smart televisions, and standalone interactive speakers may be more stationary but nonetheless may be located at various places (e.g., rooms) within the user's home or workplace.
  • Techniques exist to enable multiple users (e.g., a family, co-workers, co-inhabitants, etc.) to leverage the distributed nature of a plurality of computing devices to facilitate intercom-style spoken communication between the multiple users. However, these techniques are limited to users issuing explicit commands to convey messages to explicitly-defined computing devices. For example, a first user who wishes to convey a message to a second user at another location out of earshot (e.g., in another room) must first determine where the second user is located. Only then can the first user explicitly invoke an intercom communication channel to a computing device at or near the second user's location, so that the first user can convey a message to the second user at the second user's location. If the first user does not know the second user's location, the first user may be forced to simply cause the message to be broadcast at all computing devices that are available for intercom-style communication. Moreover, if the first user is unaware that the second user is not within earshot (e.g., the first user is cooking and didn't notice the second user leaving the kitchen), the first user may not realize that intercom-style communication is necessary, and may speak the message to an empty room.
  • SUMMARY
  • Techniques are described herein for improved intercom-style communication using a plurality of computing devices distributed about an environment such as a house, an apartment, a place of business, etc. For example, techniques are described herein for enabling determination of location(s) of multiple users within the environment, so that (i) it can be determined automatically whether an intended recipient of a spoken message is within earshot of the speaker, and (ii) a suitable computing device near the intended recipient can be identified and used to output the message so that the intended recipient receives it. Additionally, techniques are described herein for automatically determining whether a user utterance constitutes (a) a command to invoke an automated assistant for normal use; (b) an attempt to convey a spoken message to another user that may potentially require the intercom-style communication described herein; and/or (c) other background noise/conversation that requires no action. Additionally, techniques are described herein for allowing a recipient of an intercom-style message received using disclosed techniques to issue a request (e.g., a search query or other commands to an automated assistant such as ordering pizza, playing a song, etc.) that is processed (e.g., using natural language processing) based at least in part on the initial message conveyed by the speaker.
  • In various implementations, users' locations may be determined within an environment or area by computing devices configured with selected aspects of the present disclosure using various techniques. For example, one or more computing devices may be equipped with various types of presence sensors, such as passive infrared (“PIR”) sensors, cameras, microphones, ultrasonic sensors, and so forth, which can determine whether a user is nearby. These computing devices can come in various forms, such as smart phones, standalone interactive speakers, smart televisions, other smart appliances (e.g., smart thermostats, smart refrigerators, etc.), networked cameras, and so forth. Additionally or alternatively, other types of signals, such as signals emitted by mobile computing devices (e.g., smart phones, smart watches) carried by users, may be detected by other computing devices and used to determine the users' locations (e.g., using time-of-flight, triangulation, etc.). The determination of a user's location within an environment for utilization in various techniques described herein can be contingent on explicit user-provided authorization for such determination. In various implementations, users' locations may be determined “on demand” in response to determining that a user utterance constitutes an attempt to convey a spoken message to another user that may require intercom-style communication. In various other implementations, the users' locations may be determined periodically and/or at other intervals, and the most recently determined locations may be utilized in determining whether an intended recipient of a spoken message is within earshot of a speaker of the spoken message and/or in identifying a suitable computing device near an intended recipient of the spoken message.
  • As one example, a variety of standalone interactive speakers and/or smart televisions may be distributed at various locations in a home. Each of these devices may include one or more sensors (e.g., microphone, camera, PIR sensor, etc.) capable of detecting a nearby human presence. In some embodiments, these devices may simply detect whether a person is present. In other embodiments, these devices may be able to not only detect presence, but distinguish the detected person, e.g., from other known members of a household. Presence signals generated by these standalone interactive speakers and/or smart televisions may be collected and used to determine/track where people are located at a particular point in time. These detected locations may then be used for various purposes in accordance with techniques described herein, such as determining whether an utterance provided by a speaker is likely to be heard by the intended recipient (e.g., whether the speaker and intended recipients are in different rooms or the same room), and/or to select which of the multiple speakers and/or televisions should be used to output the utterance to the intended recipient.
  • In another aspect, techniques are described herein for automatically determining whether a user utterance constitutes (a) a command to invoke an automated assistant for normal use; (b) an attempt to convey a spoken message to another user that may potentially require the intercom-style communication described herein; and/or (c) other background noise/conversation that requires no action. In some implementations, a machine learning classifier (e.g., neural network) may be trained using training examples that comprise recorded utterances (and/or features of recorded utterances) that are classified (labeled) as, for instance, a command to convey a message to another user using an intercom-style communication link, a command to engage in a conventional human-to-computer dialog with an automated assistant, or conversation that is not directed to an automated assistant (e.g., background conversation and/or noise).
  • In some embodiments, speech-to-text (“STT”) may not be performed automatically on every utterance. Instead, the machine learning classifier may be trained to recognize phonemes in the audio recording of the voice input, and in particular to classify the collective phonemes with one of the aforementioned labels. For example, conventional automated assistants are typically invoked using one or more invocation phrases. In some cases, a simple invocation machine learning model (e.g., classifier) is trained to distinguish these invocation phrases from anything else to determine when a user invokes the automated assistant (e.g., to recognize phonemes associated with “Hey, Assistant”). With techniques described herein, the same invocation machine learning model or a different machine learning model may be (further) trained to classify utterances as being intended to convey a message to another user, which may or may not require use of intercom-style communications described herein. In some implementations, such a machine learning model may be used, e.g., in parallel with an invocation machine learning model or after the invocation machine learning model determines that the user is not invoking the automated assistant, to determine whether the user may benefit from using intercom-style communication to cause a remote computing device to convey a message to another user.
  • In some implementations, a machine learning model may be trained, or “customized,” so that it can recognize names spoken by a user and associate those names with particular individuals. For example, an automated assistant may detect a first utterance such as “Jan, can you pass me the salt?” The automated assistant may detect a second utterance, presumably from Jan, such as “Sure, here you go.” From these utterances and the associated phonemes, the automated assistant may learn that when a user makes a request to Jan, it should locate the individual with Jan's voice. Suppose that later, Jan is talking on the phone in a separate room. When the user says something like “Jan, where are my shoes,” the automated assistant may determine from this utterance (particularly, “Jan, . . . ”) that the utterance contains a message for the individual, Jan. The automated assistant may also determine that Jan is probably out of earshot, and therefore the message should be conveyed to Jan as an intercom message. By detecting Jan's voice on a nearby client device, the automated assistant may locate Jan and select the nearby client device to output the speaker's message.
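  • A toy sketch of this kind of name-to-voice association is shown below, under the assumption that a reply following an utterance addressed to a name provides (weak) evidence of that person's voice; the embedding vectors, the update rule, and the helper names are invented for illustration.

      import numpy as np

      name_to_voice = {}  # learned mapping: spoken name -> voice embedding

      def observe_exchange(addressed_name, reply_voice_embedding):
          # An utterance addressed to `addressed_name` was followed by a reply in
          # another voice; treat that voice as evidence for the named person.
          reply = np.asarray(reply_voice_embedding, dtype=float)
          prior = name_to_voice.get(addressed_name)
          name_to_voice[addressed_name] = reply if prior is None else 0.8 * prior + 0.2 * reply

      def locate_person_by_voice(name, device_voice_embeddings):
          # Return the device whose recently heard voice best matches `name`.
          target = name_to_voice.get(name)
          if target is None:
              return None
          return min(device_voice_embeddings,
                     key=lambda d: np.linalg.norm(device_voice_embeddings[d] - target))

      observe_exchange("Jan", [0.9, 0.1, 0.0])
      print(locate_person_by_voice("Jan", {
          "client_device_206_2": np.array([0.10, 0.80, 0.10]),
          "client_device_206_4": np.array([0.85, 0.15, 0.05]),
      }))
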
  • In other implementations, a user may invoke an automated assistant using traditional invocation phrases and then explicitly command the automated assistant to cause some other computing device to output a message to be conveyed to a recipient. The other computing device may be automatically selected based on the recipient's detected location as described above, or explicitly designated by the speaking user.
  • In yet another aspect, techniques are described herein for allowing a recipient of an intercom-style message received using disclosed techniques to leverage context provided in the received intercom message to perform other actions, such as issuing a search query or a command to an automated assistant. For example, after perceiving a conveyed intercom message, the recipient may issue a search query, e.g., at the computing device at which she received the conveyed intercom message or another computing device. Search results may then be obtained, e.g., by an automated assistant serving the second user, that are responsive to the search query. In some implementations, the search results may be biased or ranked based at least in part on content of the originally conveyed intercom message. Additionally or alternatively, in some implementations, the recipient's search query may be disambiguated based at least in part on content of the originally conveyed intercom message.
  • In some implementations in which an initial utterance is used to provide context to downstream requests by a recipient user, to protect privacy, the original speaker's utterance may be transcribed (STT) only if it is determined that the recipient makes a downstream request. If the recipient simply listens to the message and does nothing further, no STT may be performed. In other implementations, the original speaker's utterance may always be processed using STT (e.g., on determination that the utterance is to be conveyed through intercom-style communication), but the resulting transcription may be stored only locally and/or for a limited amount of time (e.g., long enough to give the recipient user ample time to make some downstream request).
  • In some implementations, one or more computing devices may wait until an intended recipient is able to perceive a message (e.g., is within earshot) before conveying the message using techniques described herein. For example, suppose a first user conveys a message to an intended recipient but the intended recipient has stepped outside momentarily. In some implementations, the first computing device to detect the recipient upon their return may output the original message.
  • In some implementations, a method performed by one or more processors is provided that includes: receiving, at a microphone of a first computing device of a plurality of computing devices, from a first user, voice input; analyzing the voice input; determining, based on the analyzing, that the first user intends to convey a message to a second user; determining a location of the second user relative to the plurality of computing devices; selecting, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user; and causing the second computing device to exclusively provide audio or visual output that conveys the message to the second user (e.g., only the second computing device provides the output, to the exclusion of other computing devices).
  • These and other implementations of technology disclosed herein may optionally include one or more of the following features.
  • In various implementations, the analyzing may include applying an audio recording of the voice input as input across a trained machine learning model to generate output, wherein the output indicates that the first user intends to convey the message to the second user. In various implementations, the machine learning model may be trained using a corpus of labelled utterances, and wherein labels applied to the utterances include a first label indicative of a command to convey a message to another user and a second label indicative of a command to engage in a human-to-computer dialog with an automated assistant. In various implementations, labels applied to the utterances may further include a third label indicative of background conversation.
  • In various implementations, the selecting may be performed in response to a determination, based on the location of the second user, that the second user is not within earshot of the first user. In various implementations, the location of the second user may be determined based at least in part on one or more signals generated by a mobile computing device operated by the second user. Persons skilled in the art will appreciate from reading the specification that the concepts and subject matter described herein may ensure that messages are conveyed to, and received by, an intended person in a manner which is efficient for the technical equipment used to convey and receive the messages. This may include the messages being conveyed and delivered for perception by the intended person at an appropriate time, so that the messages can be properly understood by the intended person and there is no requirement for messages to be re-conveyed/re-received by the technical equipment for this purpose. The technical equipment may include the multiple computing devices referred to above, as well as a network over which the messages may be conveyed between the devices. The efficiency in the manner in which, and the times at which, messages are conveyed may result in at least more efficient use of the network between the computing devices and also more efficient use of the computational resources, within the computing devices, which are employed to convey and receive the messages.
  • In various implementations, the location of the second user may be determined based at least in part on one or more signals generated by one or more of the plurality of computing devices other than the first computing device. In various implementations, the one or more signals may include a signal indicative of the second user being detected by one or more of the plurality of computing devices other than the first computing device using passive infrared or ultrasound. In various implementations, the one or more signals may include a signal indicative of the second user being detected by one or more of the plurality of computing devices other than the first computing device using a camera or a microphone.
  • In various implementations, the analyzing may include determining that the voice input includes an explicit command to convey the message to the second user as an intercom message via one or more of the plurality of computing devices. In various implementations, the analyzing may include performing speech-to-text processing on the voice input to generate textual input, and performing natural language processing on the textual input to determine that the user intends to convey the message to the second user.
  • In various implementations, the method may further include: identifying a search query issued by the second user after the audio or visual output is provided by the second computing device; obtaining search results that are responsive to the search query, wherein the obtaining is based at least in part on the voice input from the first user; and causing one or more of the plurality of computing devices to provide output indicative of at least some of the search results.
  • In addition, some implementations include one or more processors of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the aforementioned methods. Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the aforementioned methods.
  • It should be appreciated that all combinations of the foregoing concepts and additional concepts described in greater detail herein are contemplated as being part of the subject matter disclosed herein. For example, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the subject matter disclosed herein.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1A is a block diagram of an example environment in which implementations disclosed herein may be implemented.
  • FIG. 1B schematically depicts one example of how a trained classifier may be applied to generate output based on user utterances and/or locations, in accordance with various implementations.
  • FIGS. 2, 3, and 4 depict example dialogs between various users and automated assistants, including intercom-style communications, in accordance with various implementations.
  • FIG. 5 depicts a flowchart illustrating an example method according to implementations disclosed herein.
  • FIG. 6 illustrates an example architecture of a computing device.
  • DETAILED DESCRIPTION
  • Now turning to FIG. 1A, an example environment in which techniques disclosed herein may be implemented is illustrated. The example environment includes a plurality of client computing devices 106 1-N. Each client device 106 may execute a respective instance of an automated assistant client 118. One or more cloud-based automated assistant components 119, such as a natural language processor 122, may be implemented on one or more computing systems (collectively referred to as a “cloud” computing system) that are communicatively coupled to client devices 106 1-N via one or more local and/or wide area networks (e.g., the Internet) indicated generally at 110. Also, in some embodiments, the plurality of client devices 106 1-N may be communicatively coupled with each other via one or more local area networks (“LANs,” including Wi-Fi LANs, mesh networks, etc.).
  • In some implementations, plurality of client computing devices 106 1-N (also referred to herein simply as “client devices”) may be associated with each other in various ways in order to facilitate performance of techniques described herein. For example, in some implementations, plurality of client computing devices 106 1-N may be associated with each other by virtue of being communicatively coupled via one or more LANs. This may be the case, for instance, where plurality of client computing devices 106 1-N are deployed across a particular area or environment, such as a home, a building, a campus, and so forth. Additionally or alternatively, in some implementations, plurality of client computing devices 106 1-N may be associated with each other by virtue of them being members of a coordinated ecosystem of client devices 106 that are operated by one or more users (e.g., an individual, a family, employees of an organization, other predefined groups, etc.).
  • As noted in the background, an instance of an automated assistant client 118, by way of its interactions with one or more cloud-based automated assistant components 119, may form what appears to be, from the user's perspective, a logical instance of an automated assistant 120 with which the user may engage in a human-to-computer dialog. Two instances of such an automated assistant 120 are depicted in FIG. 1A. A first automated assistant 120A encompassed by a dashed line serves a first user (not depicted) operating first client device 106 1 and includes automated assistant client 118 1 and one or more cloud-based automated assistant components 119. A second automated assistant 120B encompassed by a dash-dash-dot line serves a second user (not depicted) operating another client device 106 N and includes automated assistant client 118 N and one or more cloud-based automated assistant components 119. It thus should be understood that each user that engages with an automated assistant client 118 executing on a client device 106 may, in effect, engage with his or her own logical instance of an automated assistant 120. For the sakes of brevity and simplicity, the term “automated assistant” as used herein as “serving” a particular user will refer to the combination of an automated assistant client 118 executing on a client device 106 operated by the user and one or more cloud-based automated assistant components 119 (which may be shared amongst multiple automated assistant clients 118). It should also be understood that in some implementations, automated assistant 120 may respond to a request from any user regardless of whether the user is actually “served” by that particular instance of automated assistant 120.
  • The client devices 106 1-N may include, for example, one or more of: a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a computing device of a vehicle of the user (e.g., an in-vehicle communications system, an in-vehicle entertainment system, an in-vehicle navigation system), a standalone interactive speaker, a smart appliance such as a smart television, and/or a wearable apparatus of the user that includes a computing device (e.g., a watch of the user having a computing device, glasses of the user having a computing device, a virtual or augmented reality computing device). Additional and/or alternative client computing devices may be provided.
  • In various implementations, one or more of the client computing devices 106 1-N may include one or more presence sensors 105 1-N that are configured to provide signals indicative of detected presence, particularly human presence. Presence sensors 105 1-N may come in various forms. Some client devices 106 may be equipped with one or more digital cameras that are configured to capture and provide signal(s) indicative of movement detected in their fields of view. Additionally or alternatively, some client devices 106 may be equipped with other types of light-based presence sensors 105, such as passive infrared (“PIR”) sensors that measure infrared (“IR”) light radiating from objects within their fields of view. Additionally or alternatively, some client devices 106 may be equipped with presence sensors 105 that detect acoustic (or pressure) waves, such as one or more microphones.
  • Additionally or alternatively, in some implementations, presence sensors 105 may be configured to detect other phenomena associated with human presence. For example, in some embodiments, a client device 106 may be equipped with a presence sensor 105 that detects various types of waves (e.g., radio, ultrasonic, electromagnetic, etc.) emitted by, for instance, a mobile client device 106 carried/operated by a particular user. For example, some client devices 106 may be configured to emit waves that are imperceptible to humans, such as ultrasonic waves or infrared waves, that may be detected by other client devices 106 (e.g., via ultrasonic/infrared receivers such as ultrasonic-capable microphones).
  • Additionally or alternatively, various client devices 106 may emit other types of human-imperceptible waves, such as radio waves (e.g., Wi-Fi, Bluetooth, cellular etc.) that may be detected by one or more other client devices 106 and used to determine an operating user's particular location. In some implementations, Wi-Fi triangulation may be used to detect a person's location, e.g., based on Wi-Fi signals to/from a client device 106. In other implementations, other wireless signal characteristics, such as time-of-flight, signal strength, etc., may be used by various client devices 106, alone or collectively, to determine a particular person's location based on signals emitted by a client device 106 they carry.
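  • As one rough illustration of the signal-strength variant, a person's phone might simply be assumed to be nearest whichever client device reports the strongest received signal; the RSSI values and helper below are hypothetical, and real systems might instead (or additionally) use triangulation or time-of-flight as noted above.

      def estimate_room_from_rssi(rssi_by_device, device_rooms):
          # Strongest (least negative) RSSI in dBm is treated as "nearest".
          if not rssi_by_device:
              return None
          nearest_device = max(rssi_by_device, key=rssi_by_device.get)
          return device_rooms.get(nearest_device)

      print(estimate_room_from_rssi(
          {"client_device_206_1": -71, "client_device_206_4": -48},
          {"client_device_206_1": "kitchen", "client_device_206_4": "living room"}))
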
  • Additionally or alternatively, in some implementations, one or more client devices 106 may perform voice recognition to recognize an individual from their voice. For example, some automated assistants 120 may be configured to match a voice to a user's profile, e.g., for purposes of providing/restricting access to various resources. In some implementations, movement of the speaker may then be tracked, e.g., by one or more other presence sensors that may be incorporated, for instance, in lights, light switches, smart thermostats, security cameras, etc. In some implementations, based on such detected movement, a location of the individual may be predicted, and this location may be assumed to be the individual's location when another individual (i.e., a speaker) provides an utterance with a message for the first individual. In some implementations, an individual may simply be assumed to be in the last location at which he or she engaged with automated assistant 120 , especially if not much time has passed since the last engagement.
  • Each of the client computing devices 106 1-N may operate a variety of different applications, such as a corresponding one of a plurality of message exchange clients 107 1-N. Message exchange clients 107 1-N may come in various forms and the forms may vary across the client computing devices 106 1-N and/or multiple forms may be operated on a single one of the client computing devices 106 1-N. In some implementations, one or more of the message exchange clients 107 1-N may come in the form of a short messaging service (“SMS”) and/or multimedia messaging service (“MMS”) client, an online chat client (e.g., instant messenger, Internet relay chat, or “IRC,” etc.), a messaging application associated with a social network, a personal assistant messaging service dedicated to conversations with automated assistant 120, and so forth. In some implementations, one or more of the message exchange clients 107 1-N may be implemented via a webpage or other resources rendered by a web browser (not depicted) or other application of client computing device 106.
  • As described in more detail herein, automated assistant 120 engages in human-to-computer dialog sessions with one or more users via user interface input and output devices of one or more client devices 106 1-N. In some implementations, automated assistant 120 may engage in a human-to-computer dialog session with a user in response to user interface input provided by the user via one or more user interface input devices of one of the client devices 106 1-N. In some of those implementations, the user interface input is explicitly directed to automated assistant 120. For example, one of the message exchange clients 107 1-N may be a personal assistant messaging service dedicated to conversations with automated assistant 120 and user interface input provided via that personal assistant messaging service may be automatically provided to automated assistant 120. Also, for example, the user interface input may be explicitly directed to automated assistant 120 in one or more of the message exchange clients 107 1-N based on particular user interface input that indicates automated assistant 120 is to be invoked. For instance, the particular user interface input may be one or more typed characters (e.g., @AutomatedAssistant), user interaction with a hardware button and/or virtual button (e.g., a tap, a long tap), an oral command (e.g., “Hey Automated Assistant”), and/or other particular user interface input.
  • In some implementations, automated assistant 120 may engage in a dialog session in response to user interface input, even when that user interface input is not explicitly directed to automated assistant 120. For example, automated assistant 120 may examine the contents of user interface input and engage in a dialog session in response to certain terms being present in the user interface input and/or based on other cues. In many implementations, automated assistant 120 may engage interactive voice response (“IVR”), such that the user can utter commands, searches, etc., and the automated assistant may utilize natural language processing and/or one or more grammars to convert the utterances into text, and respond to the text accordingly. In some implementations, the automated assistant 120 can additionally or alternatively respond to utterances without converting the utterances into text. For example, the automated assistant 120 can convert voice input into an embedding, into entity representation(s) (that indicate entity/entities present in the voice input), and/or other “non-textual” representation and operate on such non-textual representation. Accordingly, implementations described herein as operating based on text converted from voice input may additionally and/or alternatively operate on the voice input directly and/or other non-textual representations of the voice input.
  • Each of the client computing devices 106 1-N and computing device(s) operating cloud-based automated assistant components 119 may include one or more memories for storage of data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network. The operations performed by one or more of the client computing devices 106 1-N and/or by automated assistant 120 may be distributed across multiple computer systems. Automated assistant 120 may be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network.
  • As noted above, in various implementations, each of the client computing devices 106 1-N may operate an automated assistant client 118. In various embodiments, each automated assistant client 118 may include a corresponding speech capture/text-to-speech (“TTS”)/STT module 114. In other implementations, one or more aspects of speech capture/TTS/STT module 114 may be implemented separately from automated assistant client 118.
  • Each speech capture/TTS/STT module 114 may be configured to perform one or more functions: capture a user's speech, e.g., via a microphone (which in some cases may comprise presence sensor 105); convert that captured audio to text (and/or to other representations or embeddings); and/or convert text to speech. For example, in some implementations, because a client device 106 may be relatively constrained in terms of computing resources (e.g., processor cycles, memory, battery, etc.), the speech capture/TTS/STT module 114 that is local to each client device 106 may be configured to convert a finite number of different spoken phrases—particularly phrases that invoke automated assistant 120 and/or intercom-style communication—to text (or to other forms, such as lower dimensionality embeddings). Other speech input may be sent to cloud-based automated assistant components 119, which may include a cloud-based TTS module 116 and/or a cloud-based STT module 117.
  • In some implementations, components that contribute to implementation of intercom-style communication as described herein may intentionally be operated exclusively on one or more client devices 106 that are associated with each other, for instance, by virtue of being on the same LAN. In some such implementations, any machine learning models described elsewhere herein may be trained and/or stored on one or more client devices 106, e.g., behind an Internet firewall, so that training data and other information generated by or associated with the machine learning models may be maintained in privacy. And in some such implementations, the cloud-based STT module 117, cloud-based TTS module 116, and/or cloud-based aspects of natural language processor 122 may not be involved in invocation of intercom-style communications.
  • Cloud-based STT module 117 may be configured to leverage the virtually limitless resources of the cloud to convert audio data captured by speech capture/TTS/STT module 114 into text (which may then be provided to natural language processor 122). Cloud-based TTS module 116 may be configured to leverage the virtually limitless resources of the cloud to convert textual data (e.g., natural language responses formulated by automated assistant 120) into computer-generated speech output. In some implementations, TTS module 116 may provide the computer-generated speech output to client device 106 to be output directly, e.g., using one or more speakers. In other implementations, textual data (e.g., natural language responses) generated by automated assistant 120 may be provided to speech capture/TTS/STT module 114, which may then convert the textual data into computer-generated speech that is output locally.
  • Automated assistant 120 (and in particular, cloud-based automated assistant components 119) may include a natural language processor 122, the aforementioned TTS module 116, the aforementioned STT module 117, and other components, some of which are described in more detail below. In some implementations, one or more of the engines and/or modules of automated assistant 120 may be omitted, combined, and/or implemented in a component that is separate from automated assistant 120. And as noted above, in some implementations, to protect privacy, one or more of the components of automated assistant 120, such as natural language processor 122, speech capture/TTS/STT module 114, etc., may be implemented at least in part on client devices 106 (e.g., to the exclusion of the cloud). In some such implementations, speech capture/TTS/STT module 114 may be sufficiently configured to perform selected aspects of the present disclosure to enable intercom-style communication, while in some cases leaving other, non-intercom-related natural language processing aspects to cloud-based components when suitable.
  • In some implementations, automated assistant 120 generates responsive content in response to various inputs generated by a user of one of the client devices 106 1-N during a human-to-computer dialog session with automated assistant 120. Automated assistant 120 may provide the responsive content (e.g., over one or more networks when separate from a client device of a user) for presentation to the user as part of the dialog session. For example, automated assistant 120 may generate responsive content in response to free-form natural language input provided via one of the client devices 106 1-N. As used herein, free-form input is input that is formulated by a user and that is not constrained to a group of options presented for selection by the user.
  • As used herein, a “dialog session” may include a logically-self-contained exchange of one or more messages between a user and automated assistant 120 (and in some cases, other human participants). Automated assistant 120 may differentiate between multiple dialog sessions with a user based on various signals, such as passage of time between sessions, change of user context (e.g., location, before/during/after a scheduled meeting, etc.) between sessions, detection of one or more intervening interactions between the user and a client device other than dialog between the user and the automated assistant (e.g., the user switches applications for a while, the user walks away from then later returns to a standalone voice-activated product), locking/sleeping of the client device between sessions, change of client devices used to interface with one or more instances of automated assistant 120, and so forth.
  • Natural language processor 122 of automated assistant 120 processes natural language input generated by users via client devices 106 1-N and may generate annotated output for use by one or more other components of automated assistant 120. For example, the natural language processor 122 may process natural language free-form input that is generated by a user via one or more user interface input devices of client device 106 1. The generated annotated output includes one or more annotations of the natural language input and optionally one or more (e.g., all) of the terms of the natural language input.
  • In some implementations, the natural language processor 122 is configured to identify and annotate various types of grammatical information in natural language input. For example, the natural language processor 122 may include a part of speech tagger configured to annotate terms with their grammatical roles. For example, the part of speech tagger may tag each term with its part of speech such as “noun,” “verb,” “adjective,” “pronoun,” etc. Also, for example, in some implementations the natural language processor 122 may additionally and/or alternatively include a dependency parser (not depicted) configured to determine syntactic relationships between terms in natural language input. For example, the dependency parser may determine which terms modify other terms, subjects and verbs of sentences, and so forth (e.g., a parse tree)—and may make annotations of such dependencies.
  • In some implementations, the natural language processor 122 may additionally and/or alternatively include an entity tagger (not depicted) configured to annotate entity references in one or more segments such as references to people (including, for instance, literary characters, celebrities, public figures, etc.), organizations, locations (real and imaginary), and so forth. In some implementations, data about entities may be stored in one or more databases, such as in a knowledge graph (not depicted). In some implementations, the knowledge graph may include nodes that represent known entities (and in some cases, entity attributes), as well as edges that connect the nodes and represent relationships between the entities. For example, a "banana" node may be connected (e.g., as a child) to a "fruit" node, which in turn may be connected (e.g., as a child) to "produce" and/or "food" nodes. As another example, a restaurant called "Hypothetical Café" may be represented by a node that also includes attributes such as its address, type of food served, hours, contact information, etc. The "Hypothetical Café" node may in some implementations be connected by an edge (e.g., representing a child-to-parent relationship) to one or more other nodes, such as a "restaurant" node, a "business" node, a node representing a city and/or state in which the restaurant is located, and so forth.
  • The entity tagger of the natural language processor 122 may annotate references to an entity at a high level of granularity (e.g., to enable identification of all references to an entity class such as people) and/or a lower level of granularity (e.g., to enable identification of all references to a particular entity such as a particular person). The entity tagger may rely on content of the natural language input to resolve a particular entity and/or may optionally communicate with a knowledge graph or other entity database to resolve a particular entity.
  • In some implementations, the natural language processor 122 may additionally and/or alternatively include a coreference resolver (not depicted) configured to group, or “cluster,” references to the same entity based on one or more contextual cues. For example, the coreference resolver may be utilized to resolve the term “there” to “Hypothetical Café” in the natural language input “I liked Hypothetical Café last time we ate there.”
  • In some implementations, one or more components of the natural language processor 122 may rely on annotations from one or more other components of the natural language processor 122. For example, in some implementations the entity tagger may rely on annotations from the coreference resolver and/or dependency parser in annotating all mentions of a particular entity. Also, for example, in some implementations the coreference resolver may rely on annotations from the dependency parser in clustering references to the same entity. In some implementations, in processing a particular natural language input, one or more components of the natural language processor 122 may use related prior input and/or other related data outside of the particular natural language input to determine one or more annotations.
  • In various implementations, cloud-based automated assistant components 119 may include an intercom communication analysis service (“ICAS”) 138 and/or an intercom communication location service (“ICLS”) 140. In other implementations, services 138 and/or 140 may be implemented separately from cloud-based automated assistant components 119, e.g., on one or more client devices 106 and/or on another computer system (e.g., in the so-called “cloud”).
  • In various implementations, ICAS 138 may be configured to determine, based on a variety of signals and/or data points, how and/or when to facilitate intercom-style communication between multiple users using multiple client devices 106. For example, in various implementations, ICAS 138 may be configured to analyze voice input provided by a first user at a microphone of a client device 106 of a plurality of associated client devices 106 1-N. In various implementations, ICAS 138 may analyze the first user's voice input and determine, based on the analysis, that the voice input contains a message intended for a second user.
  • Various techniques may be employed as part of the analysis to determine whether the first user intended to convey a message to the second user. In some implementations, an audio recording of the first user's voice input may be applied as input across a trained machine learning classifier to generate output. The output may indicate that the first user's voice input contained a message intended for the second user. Various types of machine learning classifiers (or more generally, "models") may be trained to provide such output, including but not limited to various types of neural networks (e.g., feed-forward, convolutional, etc.).
  • In some implementations, labeled phonemes of users' utterances may be used to train a machine learning model such as a neural network to learn embeddings of utterances into lower dimensionality representations. These embeddings, which may include lower dimensionality representations of the original phonemes, may then be used (e.g., as input for the trained model) to identify when a user intends to use the intercom-style communication described herein, and/or when a user's utterance contains a message intended for another person. For example, labeled utterances may be embedded into reduced dimensionality space, e.g., such that they are clustered into groups associated with intercom-style communication and non-intercom-style communication. A new, unlabeled utterance may then be embedded, and may be classified based on which cluster its embedding is nearest (e.g., in Euclidean space).
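  • By way of illustration only, the following is a minimal Python sketch of the nearest-cluster classification described above. The embed_utterance() function, the 16-dimensional embedding size, and the label names are assumptions standing in for a trained embedding model; the sketch is not a definitive implementation of the disclosed techniques.
```python
import numpy as np

# Hypothetical encoder standing in for a learned embedding model that maps
# phoneme/audio features to a lower-dimensionality vector.
def embed_utterance(features: np.ndarray) -> np.ndarray:
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((features.shape[-1], 16))
    return features @ projection

def build_centroids(labeled_embeddings: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    # One centroid per label, e.g., "intercom" vs. "not_intercom".
    return {label: embs.mean(axis=0) for label, embs in labeled_embeddings.items()}

def classify(embedding: np.ndarray, centroids: dict[str, np.ndarray]) -> str:
    # Assign the label whose centroid is nearest in Euclidean space.
    return min(centroids, key=lambda label: np.linalg.norm(embedding - centroids[label]))
```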
  • In some implementations, a neural network (or other classifier) may be trained using training data in the form of a corpus of labelled user utterances (in which case the training is “supervised”). Labels applied to the corpus of utterances may include, for instance, a first label indicative of an utterance that contains a message intended for another user, a second label indicative of a command to engage in a human-to-computer dialog with automated assistant 120, and/or a third label indicative of background noise (which may be ignored). The labelled training examples may be applied as input to an untrained neural network. Differences between the output of the untrained (or not fully-trained) neural network and the labels—a.k.a. error—may be determined and used with techniques such as back propagation, stochastic gradient descent, objective function optimization, etc., to adjust various weights of one or more hidden layers of the neural network to reduce the error.
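  • The following is a minimal sketch of such supervised training using PyTorch, assuming utterances have already been reduced to fixed-size feature vectors and labeled with the three classes mentioned above. The network size, feature dimensionality, and optimizer settings are illustrative assumptions, not parameters specified by the disclosure.
```python
import torch
import torch.nn as nn

# Three illustrative labels: 0 = message for another user,
# 1 = command to engage the automated assistant, 2 = background noise.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(100, 64)        # stand-in utterance feature vectors
labels = torch.randint(0, 3, (100,))   # stand-in labels for the corpus

for _ in range(10):                    # a few passes for illustration
    optimizer.zero_grad()
    logits = model(features)
    loss = loss_fn(logits, labels)     # error between predictions and labels
    loss.backward()                    # back propagation of the error
    optimizer.step()                   # weight update reduces the error
```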
  • As noted in the background, machine learning classifiers such as neural networks may already be trained to recognize (e.g., classify) phonemes or other audio characteristics of utterances that are intended to invoke automated assistant 120. In some implementations, the same classifier may be further trained both to recognize (e.g., classify) explicit invocation of automated assistant 120 and to determine whether an utterance contains a message intended for a second user. In other implementations, separate machine learning classifiers may be used for each of these two tasks, e.g., one after the other or in parallel.
  • In addition to determining that a captured (recorded) utterance contains a message intended for another user, it may also be determined whether intercom-style communication is warranted, e.g., based on respective locations of the speaker and the intended recipient. In various implementations, ICLS 140 may determine a location of the intended recipient relative to the plurality of client devices 106 1-N, e.g., using presence sensor(s) 105 associated with one or more of the client devices 106 1-N. For example, ICLS 140 may determine which client device 106 is nearest the intended recipient, and/or which room the intended recipient is in (which in some cases may be associated with a client device deployed in that room). Based on the location of the intended recipient determined by ICLS 140, in various implementations, ICAS 138 may select, from the plurality of client devices 106 1-N, a second client device 106 that is capable of providing audio or visual output that is perceptible to the intended recipient. For example, if the intended recipient was last detected walking into a particular area, then a client device 106 nearest that area may be selected.
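  • A simple sketch of the device-selection step might look as follows, assuming recipient locations and device deployments are available as plain dictionaries; the device identifiers and room names below are hypothetical.
```python
# Sketch only: pick the client device deployed in the room where the intended
# recipient was last detected, or None if the recipient is not tracked.
def select_output_device(recipient: str,
                         last_seen_room: dict[str, str],
                         device_room: dict[str, str]) -> str | None:
    room = last_seen_room.get(recipient)
    if room is None:
        return None                      # recipient not currently tracked
    for device, device_location in device_room.items():
        if device_location == room:
            return device                # device deployed in the recipient's room
    return None

# Example: Jan was last detected in the living room.
device = select_output_device(
    "Jan",
    last_seen_room={"Jan": "living room"},
    device_room={"206_1": "kitchen", "206_4": "living room"},
)
```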
  • In some implementations, ICLS 140 may be provided, e.g., as part of cloud-based automated assistant components 119 and/or separately therefrom. In other implementations, ICAS 138 and ICLS 140 may be implemented together in a single model or engine. In various implementations, ICLS 140 may be configured to track locations of persons within an area of interest, such as within a home, a workplace, a campus, etc., based on signals provided by, for example, presence sensors 105 integral with a plurality of client devices 106 1-N that are distributed throughout the area. Based on these tracked locations, ICLS 140 and/or ICAS 138 may be configured to facilitate intercom-style communication between persons in the area using the plurality of client devices 106 1-N as described herein.
  • In some implementations, ICLS 140 may create and/or maintain a list or database of persons located in a particular area, and/or their last known locations relative to a plurality of client devices 106 1-N deployed in the area. In some implementations, this list/database may be updated, e.g., in real time, as persons are detected by different client devices as having moved to different locations. ICLS 140 may drop a particular person from the list/database if, for example, that person is not detected in the overall area for some predetermined time interval (e.g., one hour) and/or if the person is last detected passing through an ingress or egress area (e.g., front door, back door, etc.). In other implementations, ICLS 140 may update the list/database periodically, e.g., every few minutes, hours, etc.
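  • One possible (purely illustrative) realization of such a list/database is sketched below; the one-hour expiry and the egress-area handling follow the examples above, but the exact policy is an assumption.
```python
import time

class PresenceRegistry:
    """Sketch of a last-known-location registry with stale-entry expiry."""

    def __init__(self, expiry_seconds: float = 3600.0):
        self.expiry_seconds = expiry_seconds
        self._entries: dict[str, tuple[str, float]] = {}  # person -> (location, timestamp)

    def update(self, person: str, location: str) -> None:
        if location in ("front door", "back door"):
            # Last seen passing through an egress area: stop tracking.
            self._entries.pop(person, None)
        else:
            self._entries[person] = (location, time.monotonic())

    def location_of(self, person: str) -> str | None:
        entry = self._entries.get(person)
        if entry is None:
            return None
        location, seen_at = entry
        if time.monotonic() - seen_at > self.expiry_seconds:
            del self._entries[person]                     # stale entry dropped
            return None
        return location
```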
  • In some implementations, ICAS 138 and/or ICLS 140 (and more generally, automated assistant 120) may be configured to distinguish between different people using signals from presence sensors 105, rather than simply detect presence of a generic person. For example, suppose a client device 106 includes a microphone as a presence sensor 105. Automated assistant 120 may be configured to use a variety of speaker recognition and/or voice recognition techniques to determine not only that someone is present nearby, but who is present. These speaker recognition and/or voice recognition techniques may include but are not limited to hidden Markov models, Gaussian mixture models, frequency estimation, trained classifiers, deep learning, pattern matching algorithms, matrix representation, vector quantization, decision trees, etc.
  • If a person near a microphone-equipped client device 106 does not happen to be speaking, then other techniques may be employed to identify the person. Suppose a client device 106 includes, as a presence sensor 105, a camera and/or a PIR sensor. In some implementations, a machine learning visual recognition classifier may be trained using labelled training data captured by such a presence sensor 105 to recognize the person visually. In some implementations, a user may cause the visual recognition classifier to be trained by invoking a training routine at one or more camera/PIR sensor-equipped client devices 106. For example, a user may stand in a field of view of presence sensor 105 and invoke automated assistant 120 with a phrase such as “Hey Assistant, I am Jan and this is what I look like.” In some implementations, automated assistant 120 may provide audible or visual output that prompts the user to move around to various positions within a field of view of presence sensor 105, while presence sensor 105 captures one or more snapshots of the user. These snapshots may then be labelled (e.g., with “Jan”) and used as labelled training examples for supervised training of the visual recognition classifier. In other implementations, labelled training examples for visual recognition may be generated automatically, e.g., without the user being aware. For example, when the user is in a field of view of presence sensor 105, a signal (e.g., radio wave, ultrasonic) emitted by a mobile client device 106 carried by the user may be analyzed, e.g., by automated assistant 120, to determine the user's identity (and hence, a label) for snapshots captured by presence sensor 105.
  • And in yet other implementations, other types of cues besides audio and/or visual cues may be employed to distinguish users from one another. For example, radio, ultrasonic, and/or other types of wireless signals (e.g., infrared, modulated light, etc.) emitted by client devices 106 carried by users may be analyzed, e.g., by automated assistant 120, to discern an identity of a nearby user. In some implementations, a user's mobile client device 106 may include a network identifier, such as "Jan's Smartphone," that may be used to identify the user.
  • Referring now to FIG. 1B, an example data flow is depicted schematically to demonstrate one possible way in which a trained machine learning classifier may be applied to analyze user utterances and determine, among other things, whether to employ intercom-style communication. In FIG. 1B, a phoneme classifier 142 (which may be a component of automated assistant 120) may be trained such that one or more utterances and one or more person locations may be applied across phoneme classifier 142 as input. Phoneme classifier 142 may then generate, as output, a classification of the utterance(s). In FIG. 1B, these classifications include “invoke assistant,” “convey message,” and “background noise,” but additional and/or alternative labels are possible.
  • Conventional phoneme classifiers already exist that detect explicit invocation phrases such as “Hey, Assistant,” “OK Assistant,” etc. In some implementations, phoneme classifier 142 may include the same functionality such that when an input utterance includes such an invocation phrase, the output of phoneme classifier 142 is “invoke assistant.” Once automated assistant 120 is invoked, the user may engage in human-to-computer dialog with automated assistant 120 as is known in the art.
  • However, in some implementations, phoneme classifier 142 may be further trained to recognize other phonemes that signal a user intent to convey a message to another user. For example, users may often use phrases such as “Hey, <name>” to get another person's attention. More generally, phoneme classifier 142 may operate to match custom phrases, words, etc. Additionally or alternatively, to get another person's attention, it may be common to first speak the other person's name, sometimes in a slightly elevated volume and/or with particular intonations, or to use other types of intonations. In various implementations, phoneme classifier 142 may be trained to recognize such phonemes and generate output such as “convey message” to signal a scenario in which intercom-style communication may potentially be warranted. In various implementations, a separate intonation model may optionally be separately trained to recognize utterances that seek communication with another person (e.g., to differentiate such utterances from casual utterances) and generate output that indicates the presence of such utterances (e.g., a likelihood that such an utterance is present). The outputs from the phoneme classifier and the intonation model, for a given user utterance, may be collectively considered in determining if intercom-style communication may be warranted.
  • In some implementations, one or more person locations may be provided, e.g., by ICLS 140, as input to phoneme classifier 142. These person locations may be used, in addition to or instead of the utterance(s), to determine whether intercom-style communication is warranted. For example, if the recipient location is sufficiently near (e.g., within earshot of) a speaker's location, that may influence phoneme classifier 142 to produce output such as “background noise,” even if the utterance contains a message intended for another. On the other hand, suppose the intended recipient's location is out of earshot of the speaker's location. That may influence phoneme classifier 142 to produce output such as “convey message,” which may increase a likelihood that intercom-style communication is employed. Additionally or alternatively, a two-step approach may be implemented in which it is first determined whether a speaker's utterance contains a message intended for another user, and it is then determined whether the other user is within earshot of the speaker. If the answer to both questions is yes, then intercom-style communication may be implemented to convey the message to the intended recipient.
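  • The two-step approach may be summarized with the following sketch, in which the utterance classification and the earshot test are placeholders for the models and location signals discussed above; the room-adjacency set is a stand-in assumption.
```python
# Assumed adjacency pairs treated as "within earshot" for illustration only.
EARSHOT_ROOMS = {("kitchen", "kitchen"), ("kitchen", "dining room")}

def should_relay(utterance_label: str,
                 speaker_room: str,
                 recipient_room: str | None) -> bool:
    # Step 1: does the utterance contain a message for another user?
    if utterance_label != "convey message":
        return False
    # Step 2: is the intended recipient out of earshot of the speaker?
    if recipient_room is None:
        return False                     # recipient location unknown; handled separately
    within_earshot = (speaker_room == recipient_room or
                      (speaker_room, recipient_room) in EARSHOT_ROOMS)
    return not within_earshot
```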
  • Referring now to FIG. 2, a home floorplan is depicted that includes a plurality of rooms, 250-262. A plurality of client devices 206 1-4 are deployed throughout at least some of the rooms. Each client device 206 may implement an instance of automated assistant client 118 configured with selected aspects of the present disclosure and may include one or more input devices, such as microphones, that are capable of capturing utterances spoken by a person nearby. For example, a first client device 206 1 taking the form of a standalone interactive speaker is deployed in room 250, which in this example is a kitchen. A second client device 206 2 taking the form of a so-called “smart” television (e.g., a networked television with one or more processors that implement an instance of automated assistant client 118) is deployed in room 252, which in this example is a den. A third client device 206 3 taking the form of an interactive standalone speaker is deployed in room 254, which in this example is a bedroom. A fourth client device 206 4 taking the form of another interactive standalone speaker is deployed in room 256, which in this example is a living room.
  • While not depicted in FIG. 2, the plurality of client devices 206 1-4 may be communicatively coupled with each other and/or other resources (e.g., the Internet) via one or more wired or wireless LANs (e.g., 110 2 in FIG. 1A). Additionally, other client devices—particularly mobile devices such as smart phones, tablets, laptops, wearable devices, etc.—may also be present, e.g., carried by one or more persons in the home and may or may not also be connected to the same LAN. It should be understood that the configuration of client devices depicted in FIG. 2 and elsewhere in the Figures is just one example; more or fewer client devices 106 may be deployed across any number of other rooms and/or areas other than a home.
  • In the example of FIG. 2, a first user, Jack, is in the kitchen 250 when he utters the question, “Hey Hon, do you know where the strainer is?” Perhaps unbeknownst to Jack, his wife, Jan, is not in kitchen 250, but rather is in living room 256, and therefore likely did not hear Jack's question. First client device 206 1, which as noted above is configured with selected aspects of the present disclosure, may detect Jack's utterance. A recording of the utterance may be analyzed using techniques described above to determine that Jack's utterance contains a message intended for Jan. First client device 206 1 also may determine, e.g., based on information shared amongst all of the plurality of client devices 206 1-4, that Jan is in living room 256 (or at least nearest fourth client device 206 4). For example, client device 206 4 may have detected, e.g., using one or more integral presence sensors (e.g., 105 in FIG. 1A), that Jan is in living room 256.
  • Based on Jan's detected location and/or on attribute(s) of Jack's utterance (which in some implementations may be classified using a trained machine learning model as described above), first client device 206 1 may determine that Jack intended his message for Jan and that Jan is out of earshot of Jack. Consequently, first client device 206 1 may push (over one or more of the aforementioned LANs) a recording of Jack's utterance (or in some cases, transcribed text of Jack's utterance) to the client device nearest Jan, which in this example is fourth client device 206 4. On receiving this data, fourth client device 206 4 may, e.g., by way of automated assistant 120 executing at least in part on fourth client device 206 4, audibly output Jack's message to Jan as depicted in FIG. 2, thus effecting intercom-style communication between Jack and Jan.
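  • As a rough illustration of the push step, the sketch below forwards a recorded utterance to the selected device over HTTP. The /intercom/play endpoint, the header names, and the device address are assumptions; the disclosure does not prescribe a particular transport.
```python
import requests

def push_message(device_address: str, audio_bytes: bytes, sender: str) -> bool:
    # Hypothetical endpoint on the recipient-side device that plays a recording.
    response = requests.post(
        f"http://{device_address}/intercom/play",
        data=audio_bytes,
        headers={"Content-Type": "audio/wav", "X-Intercom-Sender": sender},
        timeout=5,
    )
    return response.ok

# e.g., push_message("192.168.1.24", recording, sender="Jack")
```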
  • In the example of FIG. 2 (and in similar examples described elsewhere herein), Jack's question is output to Jan audibly using fourth client device 206 4, which as noted above is a standalone interactive speaker. However, this is not meant to be limiting. In various implementations, Jack's message may be conveyed to Jan using other output modalities. For example, in some implementations in which a mobile client device (not depicted) carried by Jan is connected to the Wi-Fi LAN, that mobile device may output Jack's message, either as an audible recording or as a textual message that is conveyed to Jan visually, e.g., using an application such as message exchange client 107 executing on Jan's mobile client device.
  • In various implementations, recordings and/or STT transcriptions of utterances that are exchanged between client devices 106 to facilitate intercom communication may be used for a variety of additional purposes. In some embodiments, they may be used to provide context to downstream human-to-computer dialogs between user(s) and automated assistant 120. For example, in some scenarios, a recorded utterance and/or its STT transcription may be used to disambiguate a request provided to an instance of automated assistant 120, whether that request be from the user who originally provided the utterance, an intended recipient of the utterance, or even another user who engages automated assistant 120 subsequent to an intercom-style communication involving a plurality of client devices 106.
  • FIG. 3 depicts the same home and distribution of client devices 206 1-4 as was depicted in FIG. 2. In FIG. 3, Jan (still in living room 256) speaks the utterance, “Hey Jack, you should leave soon to pick up Bob from the airport.” It may be determined, e.g., by ICLS 140, that Jack is in another room, out of earshot from Jan. For example, ICLS 140 may determine, e.g., based on a signal provided by an onboard camera and/or PIR sensor of a “smart” thermostat 264, that Jack is located in den 252. Based on that determination, and/or a determination that Jan's utterance has been classified (e.g., using one of the aforementioned machine learning models) as a message intended for Jack, a client device near Jack's detected location, such as client device 206 2, may be identified to output Jan's utterance. In some implementations, Jan's recorded utterance may be pushed from another computing device near Jan that recorded it, such as client device 206 4, to client device 206 2 identified near Jack and output audibly (or visually since client device 206 2 is a smart television with display capabilities).
  • FIG. 4 demonstrates an example follow up scenario to that depicted in FIG. 3. After receiving Jan's conveyed message via client device 206 2, Jack says “OK Assistant—when is the next tram leaving?” Without additional information, this request, or search query, may be too ambiguous to answer, and automated assistant 120 may be required to solicit disambiguating information from Jack. However, using techniques described herein, automated assistant 120 may disambiguate Jack's request based on Jan's original utterance to determine that the tram to the airport is the one Jack is interested in. Additionally or alternatively, automated assistant 120 could simply retrieve normal results for all nearby trams, and then rank those results based on Jack's utterance, e.g., so that the tram to the airport is ranked highest. Whichever the case, in FIG. 4, automated assistant 120 provides audio output at client device 206 2 of “Next tram to the airport leaves in 10 minutes.”
  • FIG. 5 is a flowchart illustrating an example method 500 according to implementations disclosed herein. For convenience, the operations of the flow chart are described with reference to a system that performs the operations. This system may include various components of various computer systems, such as one or more components of computing systems that implement automated assistant 120. Moreover, while operations of method 500 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.
  • At block 502, the system may receive, at an input device of a first computing device of a plurality of computing devices, from a first user, free form natural language input. In many implementations, this free form natural language input may come in the form of voice input, i.e., an utterance from the first user, though this is not required. It should be understood that this voice input need not necessarily be directed by the first user at automated assistant 120, and instead may include any utterance provided by the first user that is captured and/or recorded by a client device configured with selected aspects of the present disclosure.
  • At block 504, the system may analyze the voice input. Various aspects (e.g., phonemes) of the voice input may be analyzed, including but not limited to intonation, volume, recognized phrases, etc.
  • In some implementations, the system may analyze other signals in addition to the voice input. These other signals may include, for instance, a number of people in an environment such as a house. For instance, if only one person is present, no intercom capabilities may be utilized. If only two people are present, then the location of the other person may automatically be determined to be the location at which intercom output should be provided. If more than two people are present, then the system may attempt various techniques (e.g., voice recognition, facial recognition, wireless signal recognition, etc.) to attempt to distinguish people from each other.
  • At block 506, the system may determine, based on the analyzing, that the first user intends to convey a message to a second user (e.g., that the voice input contains a message intended for the second user). As described previously, automated assistant 120 may employ various techniques, such as a classifier trained on labeled training examples in the form of recorded utterances, to analyze the voice input and/or determine whether the first user's voice input is a command to invoke automated assistant 120 (e.g., “Hey, Assistant”) to engage in further human-to-computer dialog, an utterance intended to convey a message to the second user (or multiple other users), or other background noise. In some implementations, in addition to or instead of a trained machine learning model, a rules-based approach may be implemented. For example, one or more simple IVR grammars may be defined, e.g., using technologies such as voice extensible markup language, or “VXML,” that are designed to match utterances that are intended to convey messages between users.
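  • A toy rules-based matcher of the kind alluded to above might look like the following; the household roster and the attention-getting pattern are illustrative and are not an actual IVR grammar format such as VXML.
```python
import re

# Hypothetical household roster; in practice this would come from user accounts.
HOUSEHOLD = {"jan", "jack", "oliver"}
ATTENTION_PATTERN = re.compile(r"^\s*(?:hey|hi)[,!]?\s+(\w+)\b[,!]?\s*(.*)$", re.IGNORECASE)

def match_intercom_intent(transcript: str) -> tuple[str, str] | None:
    """Return (recipient, message body) if the transcript looks like an
    attention-getting phrase addressed to a known household member."""
    match = ATTENTION_PATTERN.match(transcript)
    if match and match.group(1).lower() in HOUSEHOLD:
        return match.group(1), match.group(2)
    return None

# match_intercom_intent("Hey Jan, dinner is ready") -> ("Jan", "dinner is ready")
```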
  • At block 508 (which may occur after blocks 502-506 or on an ongoing basis), the system may determine a location of the second user relative to the plurality of computing devices. In some embodiments, ICLS 140 may maintain a list or database of people in an area such as a home or workplace and their last-known (i.e., last-detected) locations. In some such implementations, the system may simply consult this list or database for the second user's location. In other implementations, the system may actively poll a plurality of client devices in the environment to seek out the second user, e.g., on an as-needed basis (e.g., when it is determined that the first user's utterance contains a message intended to be conveyed to the second user). This may cause the client devices to activate presence sensors (105 in FIG. 1A) so that they can detect whether someone (e.g., the second user) is nearby.
  • At block 510, the system may select, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user. In some implementations, the second computing device may be a stationary client device (e.g., a standalone interactive speaker, smart television, desktop computer, etc.) that is deployed in a particular area of an environment. In other implementations, the second computing device may be a mobile client device carried by the second user. In some such implementations, the mobile client device may become part of the plurality of computing devices considered by the system by virtue of being part of the same coordinated ecosystem and/or joining the same wireless LAN (or simply being located within a predetermined distance).
  • At block 512, the system may cause the second computing device identified at block 510 to provide audio or visual output that conveys the message to the second user. For example, in some implementations in which ICAS 138 and/or ICLS 140 are cloud-based, one or the other may cause a recording of the first user's utterance to be forwarded (e.g., streamed) to the second computing device selected at block 510. The second computing device, which may be configured with selected aspects of the present disclosure, may respond by outputting the forwarded recording.
  • While examples described herein have included a first user attempting to convey a message to a single other user, this is not meant to be limiting. In various implementations, a user's utterance may be intended for multiple other users, such as multiple members of the speaker's family, all persons in the area or environment, etc. In some such implementations, one or more of the aforementioned machine learning classifiers may be (further) trained to determine whether an utterance contains a message intended for a single recipient or multiple recipients. If the message is intended for multiple recipients, then the system may convey it to the multiple intended recipients at multiple locations in various ways. In some simple implementations, the system may simply cause the message to be pushed to all client devices in the area (e.g., all client devices of a coordinated ecosystem and/or all client devices connected to a Wi-Fi LAN), effectively broadcasting the message. In other implementations, the system (e.g., ICLS 140) may determine locations of all intended recipients on an individual basis, and output the message on only those client devices that are near each intended recipient.
  • In some implementations, automated assistant 120 may wait until an intended recipient of a message is able to perceive the message (e.g., is within earshot) before causing the message to be conveyed using techniques described herein. For example, suppose a first user conveys a message to an intended recipient but the intended recipient has stepped outside momentarily. In some implementations, the message may be temporarily delayed until the intended recipient is detected by one or more computing devices. The first computing device to detect the recipient upon their return may output the original message. In some implementations, a variety of signals, such as an intended recipient's position coordinates (e.g., Global Positioning System, or "GPS," coordinates) obtained from a mobile device they carry, may be used to determine that the intended recipient will not be reachable using intercom-style communication (at least not with any devices on the LAN). In some implementations, the message (e.g., a recording of the speaker's utterance) may be forwarded to the recipient's mobile device. In other implementations, automated assistant 120 may determine that the intended recipient is unreachable, and may provide output, e.g., at a client device closest to the speaker (e.g., the device that captured the speaker's utterance) that notifies the speaker that the recipient is unreachable at the moment. In some such implementations, automated assistant 120 may prompt the user for permission to forward the message to the recipient's mobile device, e.g., by outputting something like "I can't reach Jan directly right now. Would you like me to send a message to their phone?"
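  • The deferred-delivery behavior could be sketched as follows, where presence callbacks and the play callable are assumptions about how a client device would surface detections and audio output.
```python
from collections import defaultdict

class PendingMessages:
    """Sketch: hold recordings for recipients who are temporarily unreachable."""

    def __init__(self):
        self._queue: dict[str, list[bytes]] = defaultdict(list)

    def hold(self, recipient: str, recording: bytes) -> None:
        # Recipient not currently detected by any device; keep the recording.
        self._queue[recipient].append(recording)

    def on_presence_detected(self, recipient: str, play) -> None:
        # Called by whichever device first detects the recipient's return;
        # `play` is that device's audio-output callable.
        for recording in self._queue.pop(recipient, []):
            play(recording)
```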
  • Optional blocks 514-518 may or may not occur if the second user issues free form natural language input sometime after the first user's message is output at block 512. At block 514, the system may identify a free form natural language input, such as a voice input, that is issued by the second user after the audio or visual output is provided by the second computing device at block 512. The second user's voice input may include, for instance, a command and/or a search query. In some embodiments, the command and/or search query may be, by itself, too ambiguous to properly interpret, as was the case with Jack's utterance in FIG. 4.
  • At block 516, the system may analyze the second user's free form natural language input identified at block 514 based at least in part on the free form natural language input received from the first user at block 502. In other words, the first user's utterance may be transcribed and used to provide context to the second user's subsequent request. At block 518, the system may formulate a response to the second user's natural language input based on the context provided by the first user's original free form natural language input. For example, if the second user's free form natural language input included a search query (such as Jack's query in FIG. 4), the system may obtain search results that are responsive to the search query based at least in part on the voice input from the first user. For example, the second user's search query may be disambiguated based on the first user's original utterance, and/or one or more responsive search results may be ranked based on the first user's original utterance. The system may then cause one or more of the plurality of computing devices to provide output indicative of at least some of the search results, as occurred in FIG. 4.
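  • As one possible illustration of the context-based ranking at blocks 516-518, the sketch below re-ranks candidate results by term overlap with the first user's earlier utterance; real implementations would likely use richer relevance signals.
```python
def rank_by_context(results: list[str], context_utterance: str) -> list[str]:
    # Rank results by how many terms they share with the earlier utterance.
    context_terms = set(context_utterance.lower().split())

    def overlap(result: str) -> int:
        return len(context_terms & set(result.lower().split()))

    return sorted(results, key=overlap, reverse=True)

results = ["Tram 10 to the airport leaves in 10 minutes",
           "Tram 7 to downtown leaves in 4 minutes"]
ranked = rank_by_context(results, "you should leave soon to pick up Bob from the airport")
# ranked[0] is the airport tram, mirroring the FIG. 4 example.
```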
  • In various implementations, users may preconfigure (e.g., commission) client computing devices in their home, workplace, or in another environment, to be usable to engage in intercom-style communications as described herein. For example, in some implementations, a user may, e.g., using a graphical user interface and/or by engaging in a human-to-computer dialog session with automated assistant 120, assign a “location” to each stationary client computing device, such as “kitchen,” “dining room,” etc. Consequently, in some such implementations, a user may explicitly invoke automated assistant 120 to facilitate intercom-style communication to a particular location. For example, a user may provide the following voice input to convey a message to another user: “Hey Assistant, tell Oliver in the kitchen that we need more butter.”
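  • For illustration, a single regular expression can pull the recipient, location, and message out of an explicit invocation phrased like the example above; a production system would presumably rely on natural language processor 122 rather than a fixed template.
```python
import re

# Hypothetical template: "tell <recipient> in the <location> that <message>".
EXPLICIT_PATTERN = re.compile(
    r"tell\s+(?P<recipient>\w+)\s+in\s+the\s+(?P<location>[\w ]+?)\s+that\s+(?P<message>.+)",
    re.IGNORECASE,
)

def parse_explicit_intercom(command: str) -> tuple[str, str, str] | None:
    match = EXPLICIT_PATTERN.search(command)
    if match is None:
        return None
    return match.group("recipient"), match.group("location"), match.group("message")

# parse_explicit_intercom("Hey Assistant, tell Oliver in the kitchen that we need more butter")
# -> ("Oliver", "kitchen", "we need more butter")
```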
  • More generally, in some implementations, users may explicitly designate a recipient of a message when they invoke intercom-style communication. If the user does not also specify a location of the recipient, then techniques described herein, e.g., in association with ICLS 140, may be used to automatically determine a location of the recipient and select which computing device will be used to output the message to the recipient. However, as described above, a user need not explicitly invoke intercom-style communications. Rather, various signals and/or data points (e.g., output of a machine learning classifier, location of an intended recipient, etc.) may be considered to determine, without explicit instruction from the user, that the user's message should be conveyed automatically using intercom-style communication.
  • FIG. 6 is a block diagram of an example computing device 610 that may optionally be utilized to perform one or more aspects of techniques described herein. In some implementations, one or more of a client computing device, user-controlled resources engine 130, and/or other component(s) may comprise one or more components of the example computing device 610.
  • Computing device 610 typically includes at least one processor 614 which communicates with a number of peripheral devices via bus subsystem 612. These peripheral devices may include a storage subsystem 624, including, for example, a memory subsystem 625 and a file storage subsystem 626, user interface output devices 620, user interface input devices 622, and a network interface subsystem 616. The input and output devices allow user interaction with computing device 610. Network interface subsystem 616 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.
  • User interface input devices 622 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 610 or onto a communication network.
  • User interface output devices 620 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computing device 610 to the user or to another machine or computing device.
  • Storage subsystem 624 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 624 may include the logic to perform selected aspects of the method of FIG. 5, as well as to implement various components depicted in FIG. 1A.
  • These software modules are generally executed by processor 614 alone or in combination with other processors. Memory 625 used in the storage subsystem 624 can include a number of memories including a main random access memory (RAM) 630 for storage of instructions and data during program execution and a read only memory (ROM) 632 in which fixed instructions are stored. A file storage subsystem 626 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 626 in the storage subsystem 624, or in other machines accessible by the processor(s) 614.
  • Bus subsystem 612 provides a mechanism for letting the various components and subsystems of computing device 610 communicate with each other as intended. Although bus subsystem 612 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
  • Computing device 610 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 610 depicted in FIG. 6 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 610 are possible having more or fewer components than the computing device depicted in FIG. 6.
  • In situations in which certain implementations discussed herein may collect or use personal information about users (e.g., user data extracted from other electronic communications, information about a user's social network, a user's location, a user's time, a user's biometric information, and a user's activities and demographic information, relationships between users, etc.), users are provided with one or more opportunities to control whether information is collected, whether the personal information is stored, whether the personal information is used, and how the information is collected about the user, stored and used. That is, the systems and methods discussed herein collect, store and/or use user personal information only upon receiving explicit authorization from the relevant users to do so.
  • For example, a user is provided with control over whether programs or features collect user information about that particular user or other users relevant to the program or feature. Each user for which personal information is to be collected is presented with one or more options to allow control over the information collection relevant to that user, to provide permission or authorization as to whether the information is collected and as to which portions of the information are to be collected. For example, users can be provided with one or more such control options over a communication network. In addition, certain data may be treated in one or more ways before it is stored or used so that personally identifiable information is removed. As one example, a user's identity may be treated so that no personally identifiable information can be determined. As another example, a user's geographic location may be generalized to a larger region so that the user's particular location cannot be determined.
  • While several implementations have been described and illustrated herein, a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein may be utilized, and each of such variations and/or modifications is deemed to be within the scope of the implementations described herein. More generally, all parameters, dimensions, materials, and configurations described herein are meant to be exemplary, and the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific implementations described herein. It is, therefore, to be understood that the foregoing implementations are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, implementations may be practiced otherwise than as specifically described and claimed. Implementations of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.

Claims (20)

What is claimed is:
1. A method comprising:
accessing a trained machine learning model, wherein the machine learning model is trained, using a corpus of labeled voice inputs, to predict whether voice inputs are indicative of background conversation that should be ignored, or are indicative of a user intent to convey a message to one or more other users;
receiving, at a microphone of a first computing device of a plurality of computing devices, from a first user, voice input;
analyzing the voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate output, wherein the output indicates that the first user intends to convey a message to the one or more other users;
determining, based on the analyzing, that the first user intends to convey the message to the one or more other users; and
causing one or more other computing devices of the plurality of computing devices to provide audio or visual output that conveys the message to the one or more other users.
2. The method of claim 1, further comprising:
receiving, at the microphone of the first computing device, an additional voice input;
analyzing the additional voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate additional output, wherein the additional output indicates that the additional voice input is indicative of background noise that should be ignored;
ignoring the additional voice input in response to the additional output indicating that the additional voice input is indicative of background noise that should be ignored.
3. The method of claim 1, wherein the machine learning model is trained using a corpus of labeled voice inputs, and wherein labels applied to the voice inputs include:
a first label indicative of a user intent to convey a message to one or more other users; and
a second label indicative of background conversation between multiple users.
4. The method of claim 3, wherein the labels applied to the voice inputs further include a third label indicative of a user intent to engage in a human-to-computer dialog with an automated assistant.
5. The method of claim 1, further comprising:
determining a location of a second user of the one or more users relative to the plurality of computing devices; and
selecting, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user;
wherein the causing includes causing the second computing device to provide the audio or visual output that conveys the message to the second user.
6. The method of claim 1, wherein the causing comprises broadcasting the message to the one or more other users using all of the plurality of computing devices.
7. The method of claim 1, wherein the analyzing includes performing speech-to-text processing on an audio recording of the voice input to generate, as the data indicative of the audio recording, textual input, wherein the textual input is applied as input across the trained machine learning model.
8. A system comprising one or more processors and memory operably coupled with the one or more processors, wherein the memory stores instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform the following operations:
accessing a trained machine learning model, wherein the machine learning model is trained, using a corpus of labeled voice inputs, to predict whether voice inputs are indicative of background conversation that should be ignored, or are indicative of a user intent to convey a message to one or more other users;
receiving, at a microphone of a first computing device of a plurality of computing devices, from a first user, voice input;
analyzing the voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate output, wherein the output indicates that the first user intends to convey a message to the one or more other users;
determining, based on the analyzing, that the first user intends to convey the message to the one or more other users; and
causing one or more other computing devices of the plurality of computing devices to provide audio or visual output that conveys the message to the one or more other users.
9. The system of claim 8, further comprising:
receiving, at the microphone of the first computing device, an additional voice input;
analyzing the additional voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate additional output, wherein the additional output indicates that the additional voice input is indicative of background noise that should be ignored;
ignoring the additional voice input in response to the additional output indicating that the additional voice input is indicative of background noise that should be ignored.
10. The system of claim 8, wherein the machine learning model is trained using a corpus of labeled voice inputs, and wherein labels applied to the voice inputs include:
a first label indicative of a user intent to convey a message to one or more other users; and
a second label indicative of background conversation between multiple users.
11. The system of claim 10, wherein the labels applied to the voice inputs further include a third label indicative of a user intent to engage in a human-to-computer dialog with an automated assistant.
12. The system of claim 8, wherein the operations further comprise:
determining a location of a second user of the one or more users relative to the plurality of computing devices; and
selecting, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user;
wherein the causing includes causing the second computing device to provide the audio or visual output that conveys the message to the second user.
13. The system of claim 8, wherein the causing comprises broadcasting the message to the one or more other users using all of the plurality of computing devices.
14. The system of claim 8, wherein the analyzing includes performing speech-to-text processing on an audio recording of the voice input to generate, as the data indicative of the audio recording, textual input, wherein the textual input is applied as input across the trained machine learning model.
15. At least one non-transitory computer-readable medium comprising instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform the following operations:
accessing a trained machine learning model, wherein the machine learning model is trained, using a corpus of labeled voice inputs, to predict whether voice inputs are indicative of background conversation that should be ignored, or are indicative of a user intent to convey a message to one or more other users;
receiving, at a microphone of a first computing device of a plurality of computing devices, from a first user, voice input;
analyzing the voice input, wherein the analyzing includes applying data indicative of an audio recording of the voice input as input across the trained machine learning model to generate output, wherein the output indicates that the first user intends to convey a message to the one or more other users;
determining, based on the analyzing, that the first user intends to convey the message to the one or more other users; and
causing one or more other computing devices of the plurality of computing devices to provide audio or visual output that conveys the message to the one or more other users.
16. The at least one non-transitory computer-readable medium of claim 15, further comprising instructions for:
receiving, at the microphone of the first computing device, an additional voice input;
analyzing the additional voice input, wherein the analyzing includes applying data indicative of an audio recording of the additional voice input as input across the trained machine learning model to generate additional output, wherein the additional output indicates that the additional voice input is indicative of background noise that should be ignored; and
ignoring the additional voice input in response to the additional output indicating that the additional voice input is indicative of background noise that should be ignored.
17. The at least one non-transitory computer-readable medium of claim 15, wherein the machine learning model is trained using a corpus of labeled voice inputs, and wherein labels applied to the voice inputs include:
a first label indicative of a user intent to convey a message to one or more other users; and
a second label indicative of background conversation between multiple users.
18. The at least one non-transitory computer-readable medium of claim 17, wherein the labels applied to the voice inputs further include a third label indicative of a user intent to engage in a human-to-computer dialog with an automated assistant.
19. The at least one non-transitory computer-readable medium of claim 15, further comprising instructions for:
determining a location of a second user of the one or more users relative to the plurality of computing devices; and
selecting, from the plurality of computing devices, based on the location of the second user, a second computing device that is capable of providing audio or visual output that is perceptible to the second user;
wherein the causing includes causing the second computing device to provide the audio or visual output that conveys the message to the second user.
20. The at least one non-transitory computer-readable medium of claim 15, wherein the causing comprises broadcasting the message to the one or more other users using all of the plurality of computing devices.
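
The independent claims above (1, 8, and 15) recite the same pipeline from method, system, and storage-medium perspectives: encode a captured utterance, apply it across a model trained on utterances labeled as a message intended for other users, background conversation, or a request directed at the automated assistant, and then broadcast, ignore, or hand the input off accordingly, optionally targeting a device near the intended recipient. The following Python sketch is illustrative only and is not the claimed implementation: it assumes speech-to-text has already produced a transcript (claims 7 and 14), uses a scikit-learn text classifier as a stand-in for the trained machine learning model, and the label names, example corpus, device tables, and helper functions are hypothetical.

# Illustrative sketch of the classify-and-route flow described in the claims.
# Assumptions: transcripts already produced by speech-to-text; scikit-learn
# stands in for the trained model; labels, corpus, and helpers are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Three labels mirroring claims 3-4: message intent, background talk, assistant dialog.
INTENT_MESSAGE, BACKGROUND, ASSISTANT = "message", "background", "assistant"

# Tiny hypothetical labeled corpus; a real system would train on many utterances.
corpus = [
    ("tell the kids dinner is ready", INTENT_MESSAGE),
    ("let everyone know we leave in five minutes", INTENT_MESSAGE),
    ("so anyway she said she might come by later", BACKGROUND),
    ("yeah I know right that movie was great", BACKGROUND),
    ("what's the weather tomorrow", ASSISTANT),
    ("set a timer for ten minutes", ASSISTANT),
]
texts, labels = zip(*corpus)
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

def nearest_device(user: str, devices: dict[str, str], locations: dict[str, str]) -> str | None:
    """Pick a device in the same room as the user (cf. claim 5); hypothetical lookup tables."""
    room = locations.get(user)
    return next((device for device, device_room in devices.items() if device_room == room), None)

def route(transcript: str, devices: dict[str, str], locations: dict[str, str]) -> str:
    """Classify a transcript and decide whether to ignore, hand off, or convey it."""
    label = model.predict([transcript])[0]
    if label == BACKGROUND:
        return "ignored as background conversation"
    if label == ASSISTANT:
        return "handed off to the automated assistant"
    # Message intent: target a device near the recipient, else broadcast to all (cf. claim 6).
    target = nearest_device("recipient", devices, locations)
    if target is not None:
        return f"message played on {target}"
    return "message broadcast on all devices: " + ", ".join(devices)

if __name__ == "__main__":
    devices = {"kitchen-speaker": "kitchen", "bedroom-display": "bedroom"}
    locations = {"recipient": "bedroom"}
    print(route("tell the kids dinner is ready", devices, locations))
    print(route("yeah I know right that movie was great", devices, locations))

In this sketch the classifier operates on transcript text; the claims equally cover applying data indicative of the raw audio recording across the model, in which case the tf-idf stand-in would be replaced by an acoustic model.
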
US16/114,494 2017-09-12 2018-08-28 Intercom-style communication using multiple computing devices Abandoned US20190079724A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/114,494 US20190079724A1 (en) 2017-09-12 2018-08-28 Intercom-style communication using multiple computing devices

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/702,164 US10083006B1 (en) 2017-09-12 2017-09-12 Intercom-style communication using multiple computing devices
US16/114,494 US20190079724A1 (en) 2017-09-12 2018-08-28 Intercom-style communication using multiple computing devices

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/702,164 Continuation US10083006B1 (en) 2017-09-12 2017-09-12 Intercom-style communication using multiple computing devices

Publications (1)

Publication Number Publication Date
US20190079724A1 true US20190079724A1 (en) 2019-03-14

Family

ID=63556613

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/702,164 Active US10083006B1 (en) 2017-09-12 2017-09-12 Intercom-style communication using multiple computing devices
US16/114,494 Abandoned US20190079724A1 (en) 2017-09-12 2018-08-28 Intercom-style communication using multiple computing devices

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US15/702,164 Active US10083006B1 (en) 2017-09-12 2017-09-12 Intercom-style communication using multiple computing devices

Country Status (6)

Country Link
US (2) US10083006B1 (en)
EP (1) EP3622510B1 (en)
JP (1) JP6947852B2 (en)
KR (1) KR102314096B1 (en)
CN (1) CN110741433B (en)
WO (1) WO2019055372A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108476348B (en) * 2015-12-10 2021-07-02 麦克赛尔株式会社 Electric appliance communication system, portable terminal and electric appliance communication cooperation method
CN108446151B (en) * 2017-02-14 2022-01-25 阿里巴巴集团控股有限公司 Communication method and device
WO2018161014A1 (en) * 2017-03-03 2018-09-07 Orion Labs Phone-less member of group communication constellations
US11024303B1 (en) * 2017-09-19 2021-06-01 Amazon Technologies, Inc. Communicating announcements
US10706845B1 (en) 2017-09-19 2020-07-07 Amazon Technologies, Inc. Communicating announcements
US11373091B2 (en) * 2017-10-19 2022-06-28 Syntiant Systems and methods for customizing neural networks
US11271629B1 (en) * 2018-02-27 2022-03-08 Amazon Technologies, Inc. Human activity and transition detection
US11810567B2 (en) * 2018-04-09 2023-11-07 Maxell, Ltd. Speech recognition device, speech-recognition-device coordination system, and speech-recognition-device coordination method
JP7233868B2 (en) * 2018-08-08 2023-03-07 キヤノン株式会社 Learning system for information processing device, information processing device, control method for information processing device, and program
US11347940B2 (en) * 2018-10-16 2022-05-31 Soco, Inc. Asynchronous role-playing system for dialog data collection
TWI684874B (en) * 2018-10-18 2020-02-11 瑞軒科技股份有限公司 Smart speaker and operation method thereof
US11107468B2 (en) * 2019-03-27 2021-08-31 Lenovo (Singapore) Pte. Ltd. Apparatus, method, and program product for context-based communications
WO2020246640A1 (en) * 2019-06-05 2020-12-10 엘지전자 주식회사 Artificial intelligence device for determining location of user and method therefor
US11064297B2 (en) * 2019-08-20 2021-07-13 Lenovo (Singapore) Pte. Ltd. Microphone position notification
US11184298B2 (en) * 2019-08-28 2021-11-23 International Business Machines Corporation Methods and systems for improving chatbot intent training by correlating user feedback provided subsequent to a failed response to an initial user intent
US20210110646A1 (en) * 2019-10-11 2021-04-15 Citrix Systems, Inc. Systems and methods of geolocating augmented reality consoles
CN112839103B (en) * 2020-06-19 2022-12-16 支付宝(杭州)信息技术有限公司 Message processing method, device and system and electronic equipment
CN111988426B (en) * 2020-08-31 2023-07-18 深圳康佳电子科技有限公司 Communication method and device based on voiceprint recognition, intelligent terminal and storage medium
US11580959B2 (en) * 2020-09-28 2023-02-14 International Business Machines Corporation Improving speech recognition transcriptions
US11756357B2 (en) 2020-10-14 2023-09-12 1Ahead Technologies Access management system
CN112316424B (en) * 2021-01-06 2021-03-26 腾讯科技(深圳)有限公司 Game data processing method, device and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7643985B2 (en) * 2005-06-27 2010-01-05 Microsoft Corporation Context-sensitive communication and translation methods for enhanced interactions and understanding among speakers of different languages
FR2963132A1 (en) * 2010-07-23 2012-01-27 Aldebaran Robotics HUMANOID ROBOT HAVING A NATURAL DIALOGUE INTERFACE, METHOD OF USING AND PROGRAMMING THE SAME
US9641954B1 (en) 2012-08-03 2017-05-02 Amazon Technologies, Inc. Phone communication via a voice-controlled device
US9401153B2 (en) * 2012-10-15 2016-07-26 Digimarc Corporation Multi-mode audio recognition and auxiliary data encoding and decoding
WO2014078948A1 (en) * 2012-11-22 2014-05-30 Perch Communications Inc. System and method for automatically triggered synchronous and asynchronous video and audio communications between users at different endpoints
US9271111B2 (en) 2012-12-14 2016-02-23 Amazon Technologies, Inc. Response endpoint selection
CN106030697B (en) * 2014-02-26 2019-10-25 三菱电机株式会社 On-vehicle control apparatus and vehicle-mounted control method
US9355547B2 (en) * 2014-05-22 2016-05-31 International Business Machines Corporation Identifying a change in a home environment

Patent Citations (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6021181A (en) * 1997-02-24 2000-02-01 Wildfire Communications, Inc. Electronic voice mail message handling system
US5999105A (en) * 1998-04-30 1999-12-07 Gordon; Gary M. Multiple sensory message center apparatus
US7124167B1 (en) * 2000-01-19 2006-10-17 Alberto Bellotti Computer based system for directing communications over electronic networks
US20080261564A1 (en) * 2000-08-29 2008-10-23 Logan James D Communication and control system using location aware devices for audio message storage and transmission operating under rule-based control
US20050097623A1 (en) * 2003-10-31 2005-05-05 Tecot Edward M. Multimedia presentation resumption within an environment of multiple presentation systems
US7440895B1 (en) * 2003-12-01 2008-10-21 Lumenvox, Llc. System and method for tuning and testing in a speech recognition system
US20060123053A1 (en) * 2004-12-02 2006-06-08 Insignio Technologies, Inc. Personalized content processing and delivery system and media
US20060262943A1 (en) * 2005-04-29 2006-11-23 Oxford William V Forming beams with nulls directed at noise sources
US20060251263A1 (en) * 2005-05-06 2006-11-09 Microsoft Corporation Audio user interface (UI) for previewing and selecting audio streams using 3D positional audio techniques
US20120093344A1 (en) * 2009-04-09 2012-04-19 Ntnu Technology Transfer As Optimal modal beamformer for sensor arrays
US20110173244A1 (en) * 2009-08-26 2011-07-14 Olson John V State filter
US20130124984A1 (en) * 2010-04-12 2013-05-16 David A. Kuspa Method and Apparatus for Providing Script Data
US20130006973A1 (en) * 2011-06-28 2013-01-03 Microsoft Corporation Summarization of Conversation Threads
US20130054863A1 (en) * 2011-08-30 2013-02-28 Allure Energy, Inc. Resource Manager, System And Method For Communicating Resource Management Information For Smart Energy And Media Resources
US20130144616A1 (en) * 2011-12-06 2013-06-06 At&T Intellectual Property I, L.P. System and method for machine-mediated human-human conversation
US20140074481A1 (en) * 2012-09-12 2014-03-13 David Edward Newman Wave Analysis for Command Identification
US20150310859A1 (en) * 2012-11-02 2015-10-29 Nuance Communications, Inc. Method and Apparatus For Passive Data Acquisition In Speech Recognition and Natural Language Understanding
US9047054B1 (en) * 2012-12-20 2015-06-02 Audible, Inc. User location-based management of content presentation
US20140278387A1 (en) * 2013-03-14 2014-09-18 Vocollect, Inc. System and method for improving speech recognition accuracy in a work environment
US20150319407A1 (en) * 2014-05-05 2015-11-05 Cloudtalk Llc Intercom system utilizing wi-fi
US20150331666A1 (en) * 2014-05-15 2015-11-19 Tyco Safety Products Canada Ltd. System and Method for Processing Control Commands in a Voice Interactive System
US20150373477A1 (en) * 2014-06-23 2015-12-24 Glen A. Norris Sound Localization for an Electronic Call
US20140324203A1 (en) * 2014-07-14 2014-10-30 Sonos, Inc. Zone Group Control
US20160077710A1 (en) * 2014-09-16 2016-03-17 Google Inc. Continuation of playback of media content by different output devices
US20160119742A1 (en) * 2014-10-28 2016-04-28 Comcast Cable Communications, Llc Low energy network
US20160155443A1 (en) * 2014-11-28 2016-06-02 Microsoft Technology Licensing, Llc Device arbitration for listening devices
US20160241976A1 (en) * 2015-02-12 2016-08-18 Harman International Industries, Incorporated Media content playback system and method
US20160373269A1 (en) * 2015-06-18 2016-12-22 Panasonic Intellectual Property Corporation Of America Device control method, controller, and recording medium
US20160379638A1 (en) * 2015-06-26 2016-12-29 Amazon Technologies, Inc. Input speech quality matching
US20170045866A1 (en) * 2015-08-13 2017-02-16 Xiaomi Inc. Methods and apparatuses for operating an appliance
US20170116483A1 (en) * 2015-08-31 2017-04-27 Deako, Inc. Occupancy Sensing Apparatus Network
US20170245125A1 (en) * 2016-02-18 2017-08-24 Vivint, Inc. Event triggered messaging
US20180160652A1 (en) * 2016-06-08 2018-06-14 Terry Lee Torres Programmable Training System for Pets
US10319365B1 (en) * 2016-06-27 2019-06-11 Amazon Technologies, Inc. Text-to-speech processing with emphasized output audio
US20180047394A1 (en) * 2016-08-12 2018-02-15 Paypal, Inc. Location based voice association system
US20190074008A1 (en) * 2016-10-19 2019-03-07 Sonos, Inc. Arbitration-Based Voice Recognition
US20180211658A1 (en) * 2017-01-20 2018-07-26 Essential Products, Inc. Ambient assistant device
US20180233139A1 (en) * 2017-02-14 2018-08-16 Microsoft Technology Licensing, Llc Intelligent digital assistant system
US20180240028A1 (en) * 2017-02-17 2018-08-23 International Business Machines Corporation Conversation and context aware fraud and abuse prevention agent
US20180288104A1 (en) * 2017-03-30 2018-10-04 Intel Corporation Methods, systems and apparatus to enable voice assistant device communication
US20180358009A1 (en) * 2017-06-09 2018-12-13 International Business Machines Corporation Cognitive and interactive sensor based smart home solution
US20190089934A1 (en) * 2017-09-20 2019-03-21 Google Llc Systems and Methods of Presenting Appropriate Actions for Responding to a Visitor to a Smart Home Environment

Cited By (125)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
US11954405B2 (en) 2015-09-08 2024-04-09 Apple Inc. Zero latency digital assistant
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
US10971139B2 (en) 2016-02-22 2021-04-06 Sonos, Inc. Voice control of a media playback system
US10970035B2 (en) 2016-02-22 2021-04-06 Sonos, Inc. Audio response playback
US11514898B2 (en) 2016-02-22 2022-11-29 Sonos, Inc. Voice control of a media playback system
US11513763B2 (en) 2016-02-22 2022-11-29 Sonos, Inc. Audio response playback
US11405430B2 (en) 2016-02-22 2022-08-02 Sonos, Inc. Networked microphone device control
US11863593B2 (en) 2016-02-22 2024-01-02 Sonos, Inc. Networked microphone device control
US11556306B2 (en) 2016-02-22 2023-01-17 Sonos, Inc. Voice controlled media playback system
US11212612B2 (en) 2016-02-22 2021-12-28 Sonos, Inc. Voice control of a media playback system
US11947870B2 (en) 2016-02-22 2024-04-02 Sonos, Inc. Audio response playback
US11006214B2 (en) 2016-02-22 2021-05-11 Sonos, Inc. Default playback device designation
US11832068B2 (en) 2016-02-22 2023-11-28 Sonos, Inc. Music service selection
US11184704B2 (en) 2016-02-22 2021-11-23 Sonos, Inc. Music service selection
US10847143B2 (en) 2016-02-22 2020-11-24 Sonos, Inc. Voice control of a media playback system
US11750969B2 (en) 2016-02-22 2023-09-05 Sonos, Inc. Default playback device designation
US11726742B2 (en) 2016-02-22 2023-08-15 Sonos, Inc. Handling of loss of pairing between networked devices
US11736860B2 (en) 2016-02-22 2023-08-22 Sonos, Inc. Voice control of a media playback system
US11133018B2 (en) 2016-06-09 2021-09-28 Sonos, Inc. Dynamic player selection for audio signal processing
US11545169B2 (en) 2016-06-09 2023-01-03 Sonos, Inc. Dynamic player selection for audio signal processing
US11184969B2 (en) 2016-07-15 2021-11-23 Sonos, Inc. Contextualization of voice inputs
US11664023B2 (en) 2016-07-15 2023-05-30 Sonos, Inc. Voice detection by multiple devices
US11531520B2 (en) 2016-08-05 2022-12-20 Sonos, Inc. Playback device supporting concurrent voice assistants
US11641559B2 (en) 2016-09-27 2023-05-02 Sonos, Inc. Audio playback settings for voice interaction
US11516610B2 (en) 2016-09-30 2022-11-29 Sonos, Inc. Orientation-based playback device microphone selection
US10873819B2 (en) 2016-09-30 2020-12-22 Sonos, Inc. Orientation-based playback device microphone selection
US11727933B2 (en) 2016-10-19 2023-08-15 Sonos, Inc. Arbitration-based voice recognition
US11308961B2 (en) 2016-10-19 2022-04-19 Sonos, Inc. Arbitration-based voice recognition
US11183181B2 (en) 2017-03-27 2021-11-23 Sonos, Inc. Systems and methods of multiple voice services
US11837237B2 (en) 2017-05-12 2023-12-05 Apple Inc. User-specific acoustic models
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US11380322B2 (en) 2017-08-07 2022-07-05 Sonos, Inc. Wake-word detection suppression
US11900937B2 (en) 2017-08-07 2024-02-13 Sonos, Inc. Wake-word detection suppression
US11500611B2 (en) 2017-09-08 2022-11-15 Sonos, Inc. Dynamic computation of system response volume
US11080005B2 (en) 2017-09-08 2021-08-03 Sonos, Inc. Dynamic computation of system response volume
US11646045B2 (en) 2017-09-27 2023-05-09 Sonos, Inc. Robust short-time fourier transform acoustic echo cancellation during audio playback
US11769505B2 (en) 2017-09-28 2023-09-26 Sonos, Inc. Echo of tone interferance cancellation using two acoustic echo cancellers
US11302326B2 (en) 2017-09-28 2022-04-12 Sonos, Inc. Tone interference cancellation
US11538451B2 (en) 2017-09-28 2022-12-27 Sonos, Inc. Multi-channel acoustic echo cancellation
US10891932B2 (en) 2017-09-28 2021-01-12 Sonos, Inc. Multi-channel acoustic echo cancellation
US11288039B2 (en) 2017-09-29 2022-03-29 Sonos, Inc. Media playback system with concurrent voice assistance
US11893308B2 (en) 2017-09-29 2024-02-06 Sonos, Inc. Media playback system with concurrent voice assistance
US11175888B2 (en) 2017-09-29 2021-11-16 Sonos, Inc. Media playback system with concurrent voice assistance
US11451908B2 (en) 2017-12-10 2022-09-20 Sonos, Inc. Network microphone devices with automatic do not disturb actuation capabilities
US10880650B2 (en) 2017-12-10 2020-12-29 Sonos, Inc. Network microphone devices with automatic do not disturb actuation capabilities
US11676590B2 (en) 2017-12-11 2023-06-13 Sonos, Inc. Home graph
US11343614B2 (en) 2018-01-31 2022-05-24 Sonos, Inc. Device designation of playback and network microphone device arrangements
US11689858B2 (en) 2018-01-31 2023-06-27 Sonos, Inc. Device designation of playback and network microphone device arrangements
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US11797263B2 (en) 2018-05-10 2023-10-24 Sonos, Inc. Systems and methods for voice-assisted media content selection
US11175880B2 (en) 2018-05-10 2021-11-16 Sonos, Inc. Systems and methods for voice-assisted media content selection
US11715489B2 (en) 2018-05-18 2023-08-01 Sonos, Inc. Linear filtering for noise-suppressed speech detection
US10847178B2 (en) 2018-05-18 2020-11-24 Sonos, Inc. Linear filtering for noise-suppressed speech detection
US11792590B2 (en) 2018-05-25 2023-10-17 Sonos, Inc. Determining and adapting to changes in microphone performance of playback devices
US10959029B2 (en) 2018-05-25 2021-03-23 Sonos, Inc. Determining and adapting to changes in microphone performance of playback devices
US11197096B2 (en) 2018-06-28 2021-12-07 Sonos, Inc. Systems and methods for associating playback devices with voice assistant services
US11696074B2 (en) 2018-06-28 2023-07-04 Sonos, Inc. Systems and methods for associating playback devices with voice assistant services
US11482978B2 (en) 2018-08-28 2022-10-25 Sonos, Inc. Audio notifications
US11563842B2 (en) 2018-08-28 2023-01-24 Sonos, Inc. Do not disturb feature for audio notifications
US11432030B2 (en) 2018-09-14 2022-08-30 Sonos, Inc. Networked devices, systems, and methods for associating playback devices based on sound codes
US10878811B2 (en) 2018-09-14 2020-12-29 Sonos, Inc. Networked devices, systems, and methods for intelligently deactivating wake-word engines
US11778259B2 (en) 2018-09-14 2023-10-03 Sonos, Inc. Networked devices, systems and methods for associating playback devices based on sound codes
US11551690B2 (en) 2018-09-14 2023-01-10 Sonos, Inc. Networked devices, systems, and methods for intelligently deactivating wake-word engines
US11790937B2 (en) 2018-09-21 2023-10-17 Sonos, Inc. Voice detection optimization using sound metadata
US11024331B2 (en) 2018-09-21 2021-06-01 Sonos, Inc. Voice detection optimization using sound metadata
US11031014B2 (en) 2018-09-25 2021-06-08 Sonos, Inc. Voice detection optimization based on selected voice assistant service
US10811015B2 (en) 2018-09-25 2020-10-20 Sonos, Inc. Voice detection optimization based on selected voice assistant service
US11727936B2 (en) 2018-09-25 2023-08-15 Sonos, Inc. Voice detection optimization based on selected voice assistant service
US11790911B2 (en) 2018-09-28 2023-10-17 Sonos, Inc. Systems and methods for selective wake word detection using neural network models
US11100923B2 (en) 2018-09-28 2021-08-24 Sonos, Inc. Systems and methods for selective wake word detection using neural network models
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11501795B2 (en) 2018-09-29 2022-11-15 Sonos, Inc. Linear filtering for noise-suppressed speech detection via multiple network microphone devices
US11899519B2 (en) 2018-10-23 2024-02-13 Sonos, Inc. Multiple stage network microphone device with reduced power consumption and processing load
US11200889B2 (en) 2018-11-15 2021-12-14 Sonos, Inc. Dilated convolutions and gating for efficient keyword spotting
US11741948B2 (en) 2018-11-15 2023-08-29 Sonos Vox France Sas Dilated convolutions and gating for efficient keyword spotting
US11881223B2 (en) * 2018-12-07 2024-01-23 Sonos, Inc. Systems and methods of operating media playback systems having multiple voice assistant services
US11557294B2 (en) 2018-12-07 2023-01-17 Sonos, Inc. Systems and methods of operating media playback systems having multiple voice assistant services
US11183183B2 (en) * 2018-12-07 2021-11-23 Sonos, Inc. Systems and methods of operating media playback systems having multiple voice assistant services
US20230215433A1 (en) * 2018-12-07 2023-07-06 Sonos, Inc. Systems and methods of operating media playback systems having multiple voice assistant services
US11538460B2 (en) 2018-12-13 2022-12-27 Sonos, Inc. Networked microphone devices, systems, and methods of localized arbitration
US11132989B2 (en) 2018-12-13 2021-09-28 Sonos, Inc. Networked microphone devices, systems, and methods of localized arbitration
US11540047B2 (en) 2018-12-20 2022-12-27 Sonos, Inc. Optimization of network microphone devices using noise classification
US11159880B2 (en) 2018-12-20 2021-10-26 Sonos, Inc. Optimization of network microphone devices using noise classification
US11646023B2 (en) 2019-02-08 2023-05-09 Sonos, Inc. Devices, systems, and methods for distributed voice processing
US11315556B2 (en) 2019-02-08 2022-04-26 Sonos, Inc. Devices, systems, and methods for distributed voice processing by transmitting sound data associated with a wake word to an appropriate device for identification
US20200302937A1 (en) * 2019-03-22 2020-09-24 Honda Motor Co., Ltd. Agent system, server device, method of controlling agent system, and computer-readable non-transient storage medium
CN111726772A (en) * 2019-03-22 2020-09-29 本田技研工业株式会社 Intelligent system, control method thereof, server device, and storage medium
US10997977B2 (en) * 2019-04-30 2021-05-04 Sap Se Hybrid NLP scenarios for mobile devices
US11798553B2 (en) 2019-05-03 2023-10-24 Sonos, Inc. Voice assistant persistence across multiple network microphone devices
US20220093087A1 (en) * 2019-05-31 2022-03-24 Huawei Technologies Co.,Ltd. Speech recognition method, apparatus, and device, and computer-readable storage medium
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11854547B2 (en) 2019-06-12 2023-12-26 Sonos, Inc. Network microphone device with command keyword eventing
US11200894B2 (en) 2019-06-12 2021-12-14 Sonos, Inc. Network microphone device with command keyword eventing
US11501773B2 (en) 2019-06-12 2022-11-15 Sonos, Inc. Network microphone device with command keyword conditioning
US11361756B2 (en) 2019-06-12 2022-06-14 Sonos, Inc. Conditional wake word eventing based on environment
US11551669B2 (en) 2019-07-31 2023-01-10 Sonos, Inc. Locally distributed keyword detection
US11714600B2 (en) 2019-07-31 2023-08-01 Sonos, Inc. Noise classification for event detection
US11710487B2 (en) 2019-07-31 2023-07-25 Sonos, Inc. Locally distributed keyword detection
US11354092B2 (en) 2019-07-31 2022-06-07 Sonos, Inc. Noise classification for event detection
US11862161B2 (en) 2019-10-22 2024-01-02 Sonos, Inc. VAS toggle based on device orientation
US11189286B2 (en) 2019-10-22 2021-11-30 Sonos, Inc. VAS toggle based on device orientation
US20220301567A1 (en) * 2019-12-09 2022-09-22 Google Llc Relay Device For Voice Commands To Be Processed By A Voice Assistant, Voice Assistant And Wireless Network
US11869503B2 (en) 2019-12-20 2024-01-09 Sonos, Inc. Offline voice control
US11200900B2 (en) 2019-12-20 2021-12-14 Sonos, Inc. Offline voice control
US11562740B2 (en) 2020-01-07 2023-01-24 Sonos, Inc. Voice verification for media playback
US11556307B2 (en) 2020-01-31 2023-01-17 Sonos, Inc. Local voice data processing
US11308958B2 (en) 2020-02-07 2022-04-19 Sonos, Inc. Localized wakeword verification
US11763831B2 (en) * 2020-03-19 2023-09-19 Yahoo Japan Corporation Output apparatus, output method and non-transitory computer-readable recording medium
US20210358511A1 (en) * 2020-03-19 2021-11-18 Yahoo Japan Corporation Output apparatus, output method and non-transitory computer-readable recording medium
US11928390B2 (en) * 2020-04-17 2024-03-12 Harman International Industries, Incorporated Systems and methods for providing a personalized virtual personal assistant
US20230325146A1 (en) * 2020-04-17 2023-10-12 Harman International Industries, Incorporated Systems and methods for providing a personalized virtual personal assistant
US20210350810A1 (en) * 2020-05-11 2021-11-11 Apple Inc. Device arbitration for digital assistant-based intercom systems
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11810578B2 (en) * 2020-05-11 2023-11-07 Apple Inc. Device arbitration for digital assistant-based intercom systems
US11308962B2 (en) 2020-05-20 2022-04-19 Sonos, Inc. Input detection windowing
US11482224B2 (en) 2020-05-20 2022-10-25 Sonos, Inc. Command keywords with input detection windowing
US11694689B2 (en) 2020-05-20 2023-07-04 Sonos, Inc. Input detection windowing
US11727919B2 (en) 2020-05-20 2023-08-15 Sonos, Inc. Memory allocation for keyword spotting engines
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11698771B2 (en) 2020-08-25 2023-07-11 Sonos, Inc. Vocal guidance engines for playback devices
US11551700B2 (en) 2021-01-25 2023-01-10 Sonos, Inc. Systems and methods for power-efficient keyword detection
US11961519B2 (en) 2022-04-18 2024-04-16 Sonos, Inc. Localized wakeword verification

Also Published As

Publication number Publication date
EP3622510A1 (en) 2020-03-18
US10083006B1 (en) 2018-09-25
JP6947852B2 (en) 2021-10-13
EP3622510B1 (en) 2022-04-13
WO2019055372A1 (en) 2019-03-21
KR20200007011A (en) 2020-01-21
CN110741433A (en) 2020-01-31
KR102314096B1 (en) 2021-10-19
JP2020532757A (en) 2020-11-12
CN110741433B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
EP3622510B1 (en) Intercom-style communication using multiple computing devices
US11443120B2 (en) Multimodal entity and coreference resolution for assistant systems
US11810557B2 (en) Dynamic and/or context-specific hot words to invoke automated assistant
US9293133B2 (en) Improving voice communication over a network
US11423885B2 (en) Utilizing pre-event and post-event input streams to engage an automated assistant
US11789695B2 (en) Automatic adjustment of muted response setting
US10861453B1 (en) Resource scheduling with voice controlled devices
US20210264910A1 (en) User-driven content generation for virtual assistant
US20240127799A1 (en) Processing continued conversations over multiple devices

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOGLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FEUZ, SANDRO;MILLIUS, SEBASTIAN;ALTHAUS, JAN;SIGNING DATES FROM 20170908 TO 20170911;REEL/FRAME:046724/0022

Owner name: GOOGLE LLC, CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:046959/0566

Effective date: 20170929

STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION