CN117099063A - Robotic computing device with adaptive user interaction - Google Patents

Robotic computing device with adaptive user interaction

Info

Publication number
CN117099063A
CN117099063A (Application CN202180096667.4A)
Authority
CN
China
Prior art keywords
computing device
user
robotic
robotic computing
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180096667.4A
Other languages
Chinese (zh)
Inventor
Victor Carbune
Matthew Sharifi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/533,873 external-priority patent/US20230158683A1/en
Application filed by Google LLC filed Critical Google LLC
Priority claimed from PCT/US2021/063125 external-priority patent/WO2023091160A1/en
Publication of CN117099063A publication Critical patent/CN117099063A/en
Pending legal-status Critical Current

Landscapes

  • Manipulator (AREA)

Abstract

The embodiments set forth herein relate to robotic computing devices that are capable of performing certain operations, such as communication between users in a public space, according to certain preferences of the users. When interacting with a particular user, the robotic computing device is able to perform operations at a preferred location relative to the particular user based on explicit or implicit preferences of the particular user. For example, certain types of operations can be performed at a first location within a room, while other types of operations can be performed at a second location within the room. When the operation involves following or guiding the user, parameters for driving the robotic computing device can be selected based on the user's preferences and/or the context in which the robotic computing device interacts with the user (e.g., whether the context indicates a degree of urgency).

Description

Robotic computing device with adaptive user interaction
Background
Most computing devices that facilitate interactions between automated assistants and users are unable to autonomously navigate to various destinations without manual control by the user. This can limit the ability of certain automated assistants to provide assistance with certain tasks that may involve navigating toward and/or away from a user. For example, a user who has their automated assistant render certain audio content may only be able to perceive that audio content from a limited set of locations. This can be because the audio content is rendered via a separate speaker device and/or must be rendered by other computing devices that are manually positioned by the user (typically in proximity to an electrical outlet).
In some cases, when a user requests that an automated assistant perform a particular action that may require moving between geographic locations (e.g., between different rooms within a house), certain tasks can be delegated to devices that are capable of performing the action. However, such devices often cannot handle arbitrary actions. For example, an autonomous vacuum cleaner may be able to initiate a default vacuuming operation at the request of an automated assistant, but may not be able to perform other, more particular vacuuming-related operations. This can be the result of the automated assistant and/or the autonomous vacuum cleaner not having a mechanism for translating the requests that the user submits to the automated assistant. This can be particularly inefficient when a robotic home device has the interfaces (e.g., speakers, sensors, etc.) necessary to fulfill a particular assistant request, but lacks a mechanism for translating such requests into executable operations that the robotic home device can perform.
Disclosure of Invention
The embodiments set forth herein relate to robotic devices capable of performing various tasks that can involve the robotic device navigating to users and/or communicating certain information between users for specific purposes. For example, the robotic device can operate in a user's home and receive a spoken utterance from the user, such as "Robot, ask Eli if he is ready for school." In this case, the user can be the father of the person "Eli," who is the subject of the user's query. In response to the spoken utterance, and with prior permission from the household occupants, the robotic device is able to navigate from the living room in which the robotic device and the user are initially located and maneuver toward Eli's room. In some embodiments, the robotic device is able to determine a likely location of Eli by accessing data that can correlate certain locations in the home with certain titles (e.g., "Eli's Room"). Alternatively or additionally, the robotic device can determine a likely location of Eli based on previous interactions between the robotic device and Eli. For example, during the most recent interaction between the robotic device and Eli, the other user, Eli, may have been located in an office area of the home. Based on this determination, and in response to the spoken utterance from the father, the robotic device is able to navigate to the office area of the home to find the other user, Eli.
In some implementations, the robotic computing device can determine a likely location of the other user, Eli, by accessing interaction data that can indicate where the other user has recently interacted with another device. For example, just before the user provides the aforementioned spoken utterance, the other user may have interacted with a standalone display device in the kitchen of the user's home. Based at least on the standalone display device having a label (such as "kitchen display") in household map data, the household map data is able to provide a correlation between the standalone display device and the kitchen of the home. Alternatively or additionally, a semantic tag for a room can be inferred from features of the room, e.g., labeled "kitchen" based on the room being the only room in the house having a dishwasher and a microwave oven (e.g., determined using one or more sensors and one or more object recognition techniques). In some implementations, the robotic computing device or the standalone display device can determine that the previous interaction involved the other user, Eli, using one or more user authentication techniques (e.g., voice authentication, facial recognition, etc.). Thus, using this data, the robotic computing device can determine that the other user has recently interacted with a particular device (e.g., the kitchen display) and that the particular device is located in the kitchen of the home. Based on this determination, the robotic computing device is able to respond to the aforementioned spoken utterance from the user by navigating toward the kitchen of the home to further initiate a dialog with the other user.
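As a rough illustration of this kind of location inference, the sketch below (Python) picks the room of the device the target user most recently interacted with, provided the interaction is recent enough to be a useful signal. The `DeviceInteraction` structure, the `DEVICE_TO_ROOM` mapping, and the recency window are hypothetical assumptions, not elements of this disclosure.

```python
from dataclasses import dataclass
import time

@dataclass
class DeviceInteraction:
    user_id: str       # authenticated user (e.g., via voice/face verification)
    device_label: str  # user-assigned label, e.g., "kitchen display"
    timestamp: float   # seconds since epoch

# Hypothetical map data correlating device labels with semantic room tags.
DEVICE_TO_ROOM = {
    "kitchen display": "kitchen",
    "Eli's speaker": "Eli's room",
}

def infer_likely_room(target_user: str,
                      interactions: list[DeviceInteraction],
                      max_age_s: float = 15 * 60) -> str | None:
    """Return the room of the device the target user interacted with most
    recently, if that interaction is recent enough to be meaningful."""
    now = time.time()
    recent = [i for i in interactions
              if i.user_id == target_user and now - i.timestamp <= max_age_s]
    if not recent:
        return None
    latest = max(recent, key=lambda i: i.timestamp)
    return DEVICE_TO_ROOM.get(latest.device_label)
```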
When the robotic device has identified the location of the other user who is the subject of the spoken utterance, the robotic device is able to issue an output via an output interface (e.g., a display interface, an audio interface, etc.) of the robotic device, with the other user's prior permission. For example, the output directed to the other user can be a familiar audio output, such as "Hi Eli, your father kindly wants to know if you are ready for school." When the robotic device has issued the audio output and/or has identified Eli's location, the robotic device can activate one or more input interfaces to receive input from Eli. For example, Eli may have overheard the user asking the robotic device to ask whether Eli is ready for school, and thus may provide a responsive input to the robotic device before the robotic device has an opportunity to render the audio output. In such cases, the robotic device will be able to capture the input from Eli because one or more microphones of the robotic device will have been preemptively activated before reaching the other user's location. Otherwise, when the other user, Eli, provides input after the robotic device renders the audible output, the other user can provide a responsive input such as "Tell him I'm almost ready."
When the robotic device has received the responsive input from the other user, the robotic device can cause the responsive input to be processed and also begin navigating back to the user who provided the initial spoken utterance. In some implementations, the robotic device can access a speech processing module that can process audio data and/or text data using one or more trained machine learning models. For example, the one or more trained machine learning models can include a transformer neural network and/or another language model that can be used to convert the responsive input into a meaningful output. For example, the robotic device can cause audio input data corresponding to the spoken utterance from the user and/or the other spoken utterance from the other user to be processed to generate audio output data. The audio output data can characterize a conversational natural language output, such as "Eli has kindly indicated that he is almost ready to go." In this way, rather than exclusively providing a word-by-word rendition of what was stated in response, a more natural type of conversation can be created between the robotic device and each user. This can allow for more easily understood interactions between users and their robotic devices and reduce instances in which the robotic device is asked to repeat what another user has already stated to the robotic device. Thus, certain resources, such as battery life and processing bandwidth, can be preserved.
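A minimal sketch of relaying a response conversationally rather than verbatim might look like the following. The `generate` callable stands in for a trained language model such as the transformer mentioned above; the prompt wording and the template fallback are illustrative assumptions rather than part of this disclosure.

```python
def paraphrase_relayed_reply(requester: str, subject: str, reply_text: str,
                             generate=None) -> str:
    """Compose a conversational relay of `reply_text` instead of repeating it
    word for word. `generate` is a placeholder for a trained language model;
    without one, a simple template is used as a fallback."""
    if generate is not None:
        prompt = (f"{requester} asked about {subject}. {subject} replied: "
                  f"\"{reply_text}\". Relay this to {requester} conversationally.")
        return generate(prompt)
    return f"{subject} has kindly indicated: {reply_text}"

# Usage with the template fallback:
print(paraphrase_relayed_reply("Dad", "Eli", "I'm almost ready"))
```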
In some implementations, the robotic computing device can assist the user in identifying a particular device and/or a location of a particular device without the user explicitly asking the robotic computing device to find the particular device. In some cases, the robotic computing device is able to perform such operations when the particular device is rendering a notification that the user may not acknowledge because the user is not sufficiently close to the particular device and/or the device is operating in a silent mode. For example, when an incoming call is received by a cellular telephone in a user's home, the cellular telephone can be operating in a silent mode. Although the cellular telephone may vibrate in the silent mode, the user may not be able to determine that the cellular telephone is vibrating when the user and the cellular telephone are in different rooms. However, the robotic computing device is able to receive a notification that the cellular telephone is receiving an incoming call via a local area network (e.g., Wi-Fi) and with the advance permission of the user.
In response to receiving the notification, the robotic computing device can render an output such as "Sir, your phone is ringing on silent." The output can be rendered when the robotic computing device is within a threshold distance of the user and/or after the robotic computing device navigates to the user in response to the notification. When the user hears the output from the robotic computing device, the user can respond with a spoken utterance such as "Oh thanks, I thought my phone was right here." The spoken utterance can be captured by the robotic computing device via an audio interface of the robotic computing device, and can be converted into audio data that can be processed at the robotic computing device and/or another particular computing device (e.g., a network device such as a server) with the prior permission of the user. The audio data can be processed to determine whether the user is willing to receive assistance from the robotic computing device, even though assistance from the robotic computing device was not explicitly requested.
For example, the audio data and/or other data can be processed using one or more heuristic processes and/or one or more trained machine learning models (e.g., transformer neural network models, convolutional neural networks, recurrent neural networks, and/or other models). In some implementations, the robotic computing device can employ a neural network-based sequence classification model to determine, with the user's prior permission, whether the user exhibits an inquisitive tone and/or an amount of uncertainty. For example, the audio data can be processed to generate a metric for the amount of uncertainty that the user may exhibit about the subject matter embodied in their spoken utterance and/or the output from the robotic computing device. When the metric satisfies a particular threshold, the audio data can be further processed to determine information and/or identify an operation that can assist the user in resolving their uncertainty and/or query. For example, an operation of identifying the location of the cellular telephone can be determined by the robotic computing device to help resolve the user's detected uncertainty.
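The thresholding step described above could be sketched as follows, assuming a hypothetical `uncertainty_score` already produced by a sequence-classification model; the score range, threshold value, and offer text are illustrative.

```python
def maybe_offer_assistance(uncertainty_score: float,
                           threshold: float = 0.7) -> str | None:
    """Return a proactive offer only when the estimated uncertainty is high
    enough. `uncertainty_score` is assumed to come from a model scoring an
    inquisitive tone / uncertainty in the range [0, 1]."""
    if uncertainty_score >= threshold:
        return "If you'd like, I can take you to your phone."
    return None

# Usage: a high score triggers the offer, a low score triggers nothing.
print(maybe_offer_assistance(0.85))  # -> offer string
print(maybe_offer_assistance(0.20))  # -> None
```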
Based on this determination, the robotic computing device can offer to perform the operation by rendering another output such as "If you'd like, I can take you to your phone." In some implementations, the user can agree to allow the robotic computing device to take them to the cellular telephone by providing an explicit responsive input such as "Sure." Alternatively or additionally, the user can indicate their approval of the robotic computing device taking them to the cellular telephone by exhibiting body language and/or other features that indicate that the user is willing to be directed to the cellular telephone by the robotic computing device. For example, in response to the robotic computing device providing the other output, the user can stand up from where they are sitting and walk toward the robotic computing device. In some implementations, audio data and/or image data captured by the robotic computing device, with the user's prior permission, can be processed to determine whether the user exhibits an affirmative and/or approving response to the offer (i.e., the other output) from the robotic computing device. In some implementations, the audio data and/or the image data can be processed using one or more of the same or different trained machine learning models that are used to process spoken utterances from the user. For example, with the user's prior permission, a plurality of images captured as the user rises from their seat can be processed to determine that the user's trajectory is toward the robotic computing device. Based on this determination, the robotic computing device can conclude that the user has exhibited a desire to be directed by the robotic computing device to the cellular telephone.
In some implementations, the speed or acceleration of the robotic computing device, and/or the urgency with which the robotic computing device reacts, can be based on one or more features of the context in which the robotic computing device has initialized a maneuver. For example, the type of notification received by the robotic computing device from another computing device can indicate the urgency of the notification and thus provide a basis for the travel speed of the robotic computing device. For example, when the robotic computing device attempts to find a cellular telephone that is ringing, the robotic computing device can operate according to a first speed. However, when the robotic computing device attempts to find the cellular telephone in response to an incoming text message, the robotic computing device is able to operate according to a second speed that is lower than the first speed. Alternatively or additionally, the robotic computing device can establish the speed of maneuvering to a particular location based on one or more different factors, such as: urgency detected in the user's voice, the content of the particular notification, the source of the particular notification (e.g., the robotic computing device may maneuver faster when a spouse sends the text message than when an acquaintance sends the text message), the time of day the notification was received, the application that is the source of the notification (e.g., the robotic computing device may maneuver faster when a delivery notification is received from a shopping application than when a social media application provides a notification), and/or any other data source that can be a basis for the robotic computing device maneuvering to a different location.
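One plausible way to combine such contextual factors into a travel speed is a simple additive heuristic like the sketch below. The feature names, weights, and speed bounds are illustrative assumptions; a deployed system might instead learn such a mapping from feedback.

```python
from dataclasses import dataclass

@dataclass
class NotificationContext:
    kind: str             # e.g., "incoming_call", "text_message"
    sender_relation: str  # e.g., "spouse", "acquaintance", "unknown"
    source_app: str       # e.g., "phone", "shopping", "social_media"

# Illustrative weights only.
BASE_SPEED = {"incoming_call": 1.0, "text_message": 0.6}        # m/s
RELATION_BOOST = {"spouse": 0.3, "acquaintance": 0.0, "unknown": -0.1}
APP_BOOST = {"phone": 0.2, "shopping": 0.1, "social_media": -0.1}

def select_drive_speed(ctx: NotificationContext,
                       min_speed: float = 0.3, max_speed: float = 1.5) -> float:
    """Combine context features into a clamped drive speed."""
    speed = (BASE_SPEED.get(ctx.kind, 0.5)
             + RELATION_BOOST.get(ctx.sender_relation, 0.0)
             + APP_BOOST.get(ctx.source_app, 0.0))
    return max(min_speed, min(max_speed, speed))

# A ringing phone from a spouse yields a higher speed than a social media ping.
print(select_drive_speed(NotificationContext("incoming_call", "spouse", "phone")))
print(select_drive_speed(NotificationContext("text_message", "unknown", "social_media")))
```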
In some implementations, the robotic computing device is able to maneuver to a particular location within a particular room depending on one or more characteristics of the operation being performed by the robotic computing device and/or the context in which the robotic computing device is performing the operation. For example, a user who is preparing dinner for guests can ask a robotic computing device that is in the kitchen of the user's home to play music. In response to the explicit request (e.g., provided via a spoken utterance such as "Play music."), the robotic computing device can maneuver to a particular location in the living room of the home to play music while the guests are at the home. In some implementations, the particular location can be learned by the robotic computing device and/or explicitly identified by the user (e.g., "When guests are here, play music at this particular location in the living room.").
Further, in various embodiments, the particular location can be specific to the operation being performed. For example, when performing a "play music" operation, the robotic computing device can maneuver to a first location in a room; when performing a "read me the news" operation, the robotic computing device can maneuver to a different, second location in the room; and when performing a "stream video from a smart camera" operation, the robotic computing device is able to maneuver to yet another, different, third location in the room. In some implementations, a user can specify a location within a room for performing an operation or type of operation by providing descriptive natural language input (e.g., "Whenever you play videos, play them 1 meter in front of the north side of the couch."). Alternatively or additionally, the user can specify a location within the room for performing an operation or type of operation by repositioning themselves to the desired location (e.g., "Whenever you tell me the news, tell me the news right here" [the user walks to the desired location and stands there]). The robotic computing device can then capture image data (and/or other sensor data) with prior permission from the user, and process the image data to determine coordinates of the location where the user is standing and/or facing, and/or an inferred semantic tag. Alternatively or additionally, the user can specify locations within the room for performing operations or types of operations by interacting with a graphical user interface (GUI) of the robotic computing device or a separate device. For example, a first user can interact with the GUI to annotate a map of their home to specify where certain types of operations should be performed (e.g., audibly rendering the news). In some implementations, the map can be a graphical representation of a home or other structure, with semantic tags generated by processing data from one or more sensors of the robotic computing device. Alternatively or additionally, a second user can also interact with an instance of the GUI to specify where certain other types of operations should be performed (e.g., facilitating a video call).
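Stored per-operation locations of this kind could be represented with a simple lookup structure such as the following sketch; the operation names, coordinates, and map frame are hypothetical illustrations, not part of this disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoomLocation:
    room: str   # inferred or user-assigned semantic tag, e.g., "living room"
    x: float    # coordinates in the robot's map frame (meters)
    y: float

# Hypothetical annotations, e.g., collected from a GUI map or descriptive utterances.
OPERATION_LOCATIONS: dict[str, RoomLocation] = {
    "play_music": RoomLocation("living room", 2.1, 0.8),
    "read_news": RoomLocation("living room", 4.0, 1.5),
    "stream_camera": RoomLocation("living room", 0.5, 3.2),
}

def location_for_operation(operation: str) -> RoomLocation | None:
    """Look up where a given type of operation should be performed, if specified."""
    return OPERATION_LOCATIONS.get(operation)

# Usage: an unannotated operation simply returns None, leaving placement to defaults.
print(location_for_operation("read_news"))
print(location_for_operation("video_call"))
```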
In various implementations, preferences regarding where the robotic computing device performs certain operations and/or types of operations can be stored as user-specific preferences. For example, a first user can direct the robotic computing device to perform music-type operations at a first location of the living room in the home, and a second user can direct the robotic computing device to perform music-type operations at a second location of the living room that is different from the first location. Thus, in response to the first user requesting that the robotic computing device "play music in the living room," the robotic computing device can verify that the first user is providing the request and relocate to the first location within the living room according to the first user's specified preferences.
For example, before guests arrive, the user can provide instructions to the robotic computing device to stay in the living room while performing any audio and/or video rendering. The user can provide these instructions via a spoken utterance such as "Please only play music there, next to the couch, when the guests arrive." The user can optionally point to the location (e.g., when "there" is spoken), and the robotic computing device can capture image data of the user pointing (with prior permission from the user) to determine the precise location pointed to by the user. For example, geographic layout data characterizing the user's home can be compared to an estimated trajectory of the user's pointing finger in order to determine the preferred "music" location pointed to by the user. Alternatively or additionally, the user can interact with the GUI to annotate a map of their home to specify where the user prefers certain types of operations to be performed. Thereafter, before the guests arrive, the robotic computing device may follow the user around the home (playing or not playing music) as the user prepares for the guests' arrival. When the guests arrive, the robotic computing device can maneuver to the location adjacent to the couch in the living room and play music, or wait for the user to explicitly request that the robotic computing device play music (e.g., "Play some music.").
In some implementations, the request from the user can be processed to generate preference data that can be utilized in responding to subsequent requests from the user. For example, when the user has different guests over the next weekend, the robotic computing device can respond to a request to play music by maneuvering to the location adjacent to the couch in the living room. In some implementations, user authentication can be performed by the robotic computing device before certain preferences for certain operations are implemented. For example, voice verification and/or facial recognition can be performed in response to a user requesting that music be played. When the requesting user does not correspond to any user with particular location preferences for playing music, the robotic computing device can select a location based on one or more heuristic processes and/or one or more trained machine learning models. For example, a user who has not provided explicit preferences for the locations of certain operations may have certain preferences inferred based on previous interactions between the user and the robotic computing device.
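The verification-then-preference flow could be sketched as below, where `requesting_user_id` stands in for the output of voice and/or facial verification and the default location stands in for a heuristic- or model-derived fallback; all identifiers and location strings are illustrative.

```python
def resolve_music_location(requesting_user_id: str | None,
                           user_preferences: dict[str, str],
                           default_location: str = "living room: near couch") -> str:
    """Pick a playback location for a 'play music' request.

    `requesting_user_id` is None when the requester could not be matched to a
    known user; `user_preferences` maps verified user IDs to stored location
    preferences.
    """
    if requesting_user_id is not None and requesting_user_id in user_preferences:
        return user_preferences[requesting_user_id]
    return default_location

# Usage: a recognized user gets their stored preference, a guest gets the fallback.
prefs = {"user_a": "living room: first location",
         "user_b": "living room: second location"}
print(resolve_music_location("user_a", prefs))
print(resolve_music_location(None, prefs))
```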
In some implementations, the location of the robotic computing device can be dynamic for certain operations and/or for certain preferences of the user. For example, a user can provide a request for the robotic computing device to issue an alarm after a certain duration (e.g., 10 minutes). In response, the robotic computing device may not initially follow the user, but may initiate a "tracking" operation when the duration of time remaining before the alarm reaches a particular value. In some implementations, the particular value can be based on one or more characteristics of the alarm request, the reason for the requested alarm, the estimated location of the user relative to the robotic computing device, and/or the context in which the user requested the alarm. In some implementations, the user can provide a request to the robotic computing device that can cause the robotic computing device to track (e.g., follow) the user without the user explicitly requesting that the robotic computing device follow the user. Such behavior can be learned over time through interactions with the robotic computing device and/or based on explicit requests from the user. For example, the user can explicitly request that the robotic computing device track the user in some instances, or stop tracking the user in other instances. Such instances can provide feedback data that can be used to train the robotic computing device to track and/or not track the user in certain situations without the user explicitly providing a request for the robotic computing device to track the user.
For example, a user can request that the robotic computing device play music, and in response, the robotic computing device can initiate a tracking operation and a music rendering operation. In some implementations, the tracking operation can be initiated in certain contexts when it is determined that there is no other audio device in the vicinity of the user and/or when it is determined that the user is not heading toward an audio device. For example, when a user walks through a hallway and provides a spoken utterance such as "Play some music," the robotic computing device can determine that the hallway has no audio devices. Alternatively or additionally, the robotic computing device can determine that the user is walking through the hallway toward a room without an audio device. In response to the spoken utterance, the robotic computing device can initiate rendering of the music and also initiate a tracking operation so that the user can hear the music as they walk through the hallway and into the room. Otherwise, when it is determined that the room has an audio device, the robotic computing device can initiate playing the music while the user is in the hallway, but delegate playing the music to the audio device in the room once the user enters the room. Thereafter, when the user is in the room, the robotic computing device can optionally remain outside the room and cease the tracking operation, based at least on determined user preferences.
In some implementations, characteristics of the tracking operation can depend on the particular user providing the request for the robotic computing device to perform the action, the action being performed, and/or the type of action being performed. For example, the robotic computing device can track the user at a distance "x" when performing an operation to play music, but at a different distance "y" when facilitating a video call or an audio call. Alternatively or additionally, the robotic computing device can track the user at a particular speed according to the preferences of the particular user and/or the action or type of action being performed. In some implementations, characteristics of the tracking operation can be based on where the robotic computing device is located and/or whether the robotic computing device is located in a room with a particular inferred semantic tag. For example, when in a "kitchen," the robotic computing device can perform the tracking operation at a certain distance and/or speed, but when in a "garage," can perform it at a different distance and a different speed.
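Such per-action, per-room tracking characteristics could be captured with a small lookup table like the sketch below; the distances, speeds, and keys are illustrative assumptions, and per-user overrides could be layered on top.

```python
from dataclasses import dataclass

@dataclass
class TrackingParams:
    follow_distance_m: float
    max_speed_mps: float

# Illustrative defaults keyed by (action type, inferred room tag).
TRACKING_TABLE = {
    ("play_music", "kitchen"): TrackingParams(1.5, 0.8),
    ("play_music", "garage"):  TrackingParams(2.5, 0.5),
    ("video_call", "kitchen"): TrackingParams(1.0, 0.6),
}

def tracking_params(action: str, room_tag: str,
                    default: TrackingParams = TrackingParams(2.0, 0.7)) -> TrackingParams:
    """Select follow distance and speed from the action type and room tag."""
    return TRACKING_TABLE.get((action, room_tag), default)

# Usage: an unlisted combination falls back to the default parameters.
print(tracking_params("play_music", "garage"))
print(tracking_params("read_news", "bedroom"))
```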
The above description is provided as an overview of some embodiments of the present disclosure. Further descriptions of those embodiments and other embodiments are described in more detail below.
Other embodiments may include a non-transitory computer-readable storage medium storing instructions executable by one or more processors (e.g., Central Processing Units (CPUs), Graphics Processing Units (GPUs), and/or Tensor Processing Units (TPUs)) to perform a method such as one or more of the methods described above and/or elsewhere herein. Other embodiments may include a system of one or more computers comprising one or more processors operable to execute stored instructions to perform methods such as one or more of the methods described above and/or elsewhere herein.
It should be appreciated that all combinations of the foregoing concepts and additional concepts described in more detail herein are considered a part of the subject matter disclosed herein. For example, all combinations of claimed subject matter appearing at the end of this disclosure are considered part of the subject matter disclosed herein.
Drawings
FIGS. 1A and 1B illustrate views of a user interacting with a robotic computing device capable of inferring whether the user is willing to be directed to a device that is providing a notification, without an explicit request.
Fig. 2A, 2B, and 2C illustrate views of a user interacting with a robotic computing device capable of communicating between users and interacting with the users at learned preferred locations.
FIG. 3 illustrates a system that operates a robotic computing device to facilitate communication between users and is able to travel to a particular location to complete an operation without an explicit request from the users.
Fig. 4 illustrates a method for operating a robotic computing device to relay user inputs and/or user responses from a first user to a second user.
FIG. 5 illustrates a method for operating a robotic computing device to autonomously assign semantic tags to regions of a space or structure to further ensure that specific actions are performed in specific regions according to user preferences.
FIG. 6 is a block diagram of an exemplary computer system.
Detailed Description
Fig. 1A and 1B illustrate a view 100 and a view 120 of a user 102 interacting with a robotic computing device 104 that is able to infer whether the user is willing to be directed to a device that is providing a notification. Alternatively or additionally, the robotic computing device can maneuver to the device and/or a specific location according to drive operation parameters selected based on one or more features of the context in which the notification is provided. For example, when the cellular telephone is operating in a silent mode (e.g., a vibrate-only mode), the robotic computing device 104 can determine that the user 102 has received a text message at a separate device, such as a cellular telephone. Each of the cellular telephone and the robotic computing device 104 can have access to a local area network, which the robotic computing device 104, located in a room 106 with the user 102, can access wirelessly. In some implementations, and with the advance permission of the user 102, the robotic computing device 104 can determine whether the user 102 has acknowledged the text message and/or whether the user 102 is located in an area where the user 102 can detect receipt of the text message. When the robotic computing device 104 predicts that the user 102 has not acknowledged the text message and/or is not located in an area where the text message may be detected, the robotic computing device 104 can render an audible output 108, such as "You received a text message from Julian." Alternatively or additionally, the robotic computing device 104 can proactively render the audible output 108 when the cellular telephone receives the text message.
In some implementations, the robotic computing device 104 can generate a prediction as to whether the user 102 knows where the cellular telephone is located, with the prior permission of the user 102. The prediction can be based on, for example, a spoken utterance 110 indicating that the user 102 is not aware of the location of the cellular telephone. The spoken utterance 110 can include content such as "I don't even know where my phone is." Alternatively or additionally, image data and/or audio data can be processed to determine whether the user 102 knows where the cellular telephone is, given prior permission from the user 102. For example, in response to the audible output 108, the user 102 can look around for their cellular telephone (e.g., looking to their left and right), which can be an indication that they do not know where their cellular telephone is located. Based on one or more of these contextual features, the robotic computing device 104 can determine that the user 102 does not know where their cellular telephone is located and provide another audible output 112, such as "I can show you if you'd like."
In some implementations, the robotic computing device 104 can infer that the user is interested in being directed to their cellular telephone without the user 102 providing an input that explicitly directs the robotic computing device 104 to direct them to the cellular telephone. For example, a feature 114 of the context can include the user 102 rising from their seat after the other audible output 112 from the robotic computing device 104. The feature 114 can be a positive indication that the user 102 is willing to be directed to their cellular telephone by the robotic computing device 104. For example, as shown in view 120 of Fig. 1B, the robotic computing device 104 can leave the room 106 and enter another room 126 to further direct the user 102 to a device 124, such as their cellular telephone. In this way, the user 102 does not have to explicitly converse with their assistant device in order to obtain certain benefits. This can preserve computing resources of the robotic computing device 104 and/or other devices because less processing and storage can be consumed prior to initializing performance of an operation.
Fig. 2A, 2B, and 2C illustrate a view 200, a view 220, and a view 240 of a user 202 interacting with a robotic computing device 204 that is capable of communicating between users and interacting with the users at learned preferred locations. For example, the user 202 can provide a spoken utterance 208, such as "Go see if Jimmy cleaned his room." The spoken utterance can be directed to the robotic computing device 204, which can be located in a room 206 with the user 202. In response to detecting the spoken utterance 208, the robotic computing device 204 can render an audible output 210, such as "Ok, I'll go to his room and check." In some implementations, the robotic computing device 204 can determine that the spoken utterance 208 embodies one or more requests for the robotic computing device 204 to perform one or more operations. The one or more operations can include determining a location of "Jimmy's" room and determining whether the room is "cleaned."
In some implementations, the robotic computing device 204 can determine the location of Jimmy's room by communicating with other computing devices that are within a threshold distance of the robotic computing device 204 and/or that are connected to a common network with the robotic computing device 204. For example, in response to determining the requested operation, the robotic computing device 204 can cause one or more devices within the room 206 and outside of the room 206 to render one or more different types of output (e.g., visual, audio, antenna signal, etc.) that can be detected by the robotic computing device 204. The robotic computing device 204 can determine, based on these outputs, that one or more devices correspond to "Jimmy's" room. Alternatively or additionally, the robotic computing device 204 can determine a relative distance of the robotic computing device 204 from one or more smart devices based on one or more signal metrics (e.g., signal quality), optionally detected over a time window or duration. In some implementations, the user-specified name of a particular device can be "Jimmy's speaker" and/or "Jimmy's smart light," and can thus provide evidence that the location of the particular device corresponds to Jimmy's room. Alternatively or additionally, one or more signal metrics associated with communication between the particular device and the robotic computing device 204 can indicate the relative distance and/or relative positioning of the particular device and the robotic computing device 204.
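A rough sketch of combining user-assigned device labels with proximity scores derived from such signal metrics is shown below; the label pattern, the scoring, and the example values are hypothetical assumptions, not elements of this disclosure.

```python
import re

def score_room_candidates(device_labels: dict[str, float],
                          person_name: str) -> str | None:
    """Pick the device most likely associated with `person_name`'s room.

    `device_labels` maps user-assigned device labels (e.g., "Jimmy's speaker")
    to a proximity score derived from signal metrics averaged over a short
    time window (higher = closer).
    """
    pattern = re.compile(rf"\b{re.escape(person_name)}'s\b", re.IGNORECASE)
    candidates = {label: score for label, score in device_labels.items()
                  if pattern.search(label)}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Usage: among devices whose labels reference "Jimmy", the nearest one anchors the search.
print(score_room_candidates(
    {"Jimmy's speaker": 0.82, "Jimmy's smart light": 0.41, "kitchen display": 0.9},
    "Jimmy"))
```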
In some implementations, such tags for devices can be identified via a smart home graph that can be accessed by the robotic computing device 204, an assistant application, and/or any other application or module that can be associated with the user 202. Alternatively or additionally, the robotic computing device 204 can determine that Jimmy has previously been located and/or is currently located in a particular room, with prior permission from the one or more users. For example, historical interaction data accessible to the robotic computing device 204 can indicate that most interactions between the robotic computing device 204 and the other user (i.e., Jimmy) occur in another room 226, as shown in Fig. 2B. Based on this determination, and absent conflicting data (e.g., data indicating that the room 226 is not Jimmy's room), the robotic computing device 204 can navigate to the room 226 to fulfill the request from the user 202.
When the robotic computing device 204 reaches the other room 226, the robotic computing device 204 can use one or more interfaces of the robotic computing device 204 and/or interact with one or more other devices within the other room 226 to collect data about the other room 226. For example, the robotic computing device 204 can utilize one or more cameras to capture one or more images of the other room 226. The one or more images can be processed using one or more trained machine learning models to determine whether the other room 226 should be classified as "clean." In some implementations, the robotic computing device 204 can position itself in the other room 226 based on the preferences of the user 202 and/or another user 222. For example, a first location preference for the robotic computing device 204 within the other room 226 can correspond to a location where the robotic computing device 204 should be placed to collect data. Alternatively or additionally, a second location preference for the robotic computing device 204 within the other room 226 can correspond to another location where the robotic computing device 204 should be placed to interact with the other user 222.
In some implementations, the robotic computing device 204 can infer location preferences based on the frequency with which the user engages the robotic computing device 204 at certain locations within the room. Alternatively or additionally, the robotic computing device 204 can determine the user's positioning preferences based on explicit instructions from the user. For example, the user can provide an explicit request to the robotic computing device 204 to facilitate visual output and/or video calls only at certain preferred distances and/or specific portions of the room. Alternatively or additionally, the user can provide an explicit request to the robotic computing device 204 to facilitate audio output and/or phone calls only at other distances and/or another area of the room. These instructions can be received and generated as preference data that can be utilized by the robotic computing device 204 in subsequent interactions with one or more users.
In accordance with the foregoing examples, the robotic computing device 204 can enter the other room 226 and optionally render an output 228 for the other user 222 to further fulfill the request from the user 202. For example, the other output 228 can be, "Have you cleaned your room?" This can be a solicitation of information that can assist the robotic computing device 204 in fulfilling the request from the user 202. In response, the other user 222 (i.e., Jimmy) can provide a spoken utterance 230, such as "Yeah, I'm working on it." This information can optionally be used in combination with other data generated by the robotic computing device 204 to fulfill the request from the user 202. For example, upon receiving the information and/or data, the robotic computing device 204 can navigate to the user 202, as shown in Fig. 2C.
In some implementations, the robotic computing device 204 can navigate to a particular location 246 within the room 206 based on the preferences of the user 202. For example, the user 202 can prefer that, when the robotic computing device 204 enters the room 206 to provide an audible response, the robotic computing device 204 provide the response at the particular location 246. Alternatively or additionally, the user 202 can have a history of providing a spoken command (e.g., "Come over here") in combination with a pointing gesture (e.g., a pointing motion toward the particular location 246) to direct the robotic computing device 204 to the particular location 246. Based on these historical interactions, the robotic computing device 204 can determine that the user 202 prefers to hear responses from the robotic computing device 204 at the particular location 246 within the room 206. In some embodiments, the preferred distance and/or preferred location can vary from room to room, and/or for different operations and/or different types of output. For example, when the robotic computing device 204 is in the bedroom of the user 202, the user 202 can prefer that the robotic computing device 204 provide "news" from the foot of the bed, but facilitate audio telephone calls from one side of the bed.
When the robotic computing device 204 reaches the particular location 246 for fulfilling the request, the robotic computing device 204 can render an audible output 242, such as "The room looks clean and Jimmy says he's working on it." The audible output 242 can be generated using one or more language models (e.g., recurrent neural networks, transformer network models, etc.). In some implementations, the audible output 242 can include content that is different from the content provided by the other user 222, but that conveys a similar conclusion. Alternatively or additionally, the audible output 242 can include data generated based on the content of the response from the other user 222 and data generated using one or more interfaces (e.g., cameras) of the robotic computing device 204. In this way, the rendered content from the robotic computing device 204 can embody natural language content characterizing the data and information obtained to further fulfill the request from the user 202.
FIG. 3 illustrates a system 300 that operates a robotic computing device to facilitate communication between users and is capable of traveling to a particular location to complete an operation without an explicit request from the users. The system 300 can include a computing device 302, which can be a robotic computing device that includes one or more applications for allowing the robotic computing device to interface with a user. For example, the robotic computing device can include an automated assistant 304. The automated assistant 304 can operate as part of an assistant application that is provided at one or more other computing devices and/or server devices. A user can interact with the automated assistant 304 via an assistant interface 320, which can be a microphone, a camera, a touch screen display, a user interface, and/or any other device that can provide an interface between the user and an application. For example, a user can initialize the automated assistant 304 by providing a verbal, textual, and/or graphical input to the assistant interface 320 to cause the automated assistant 304 to initialize one or more actions (e.g., provide data, control a peripheral device, access an agent, generate an input and/or an output, etc.). Alternatively, the automated assistant 304 can be initialized based on processing of context data 336 using one or more trained machine learning models.
The context data 336 can characterize one or more features of an environment that the automated assistant 304 can access, and/or one or more features of a user who is predicted to intend to interact with the automated assistant 304. The computing device 302 can include a display device, which can be a display panel that includes a touch interface for receiving touch inputs and/or gestures to allow a user to control the applications 334 of the computing device 302 via the touch interface. In some implementations, the computing device 302 may not have a display device, such that it provides an audible user interface output without providing a graphical user interface output. In addition, the computing device 302 can provide a user interface, such as a microphone, for receiving spoken natural language inputs from a user. In some implementations, the computing device 302 can include a touch interface and can be devoid of a camera, but can optionally include one or more other sensors.
The computing device 302 and/or other third party client devices are capable of communicating with the server device over a network, such as the internet. Additionally, the computing device 302 and any other computing devices can communicate with each other over a Local Area Network (LAN), such as a Wi-Fi network. The computing device 302 is capable of offloading computing tasks to a server device in order to save computing resources at the computing device 302. For example, the server device can host the automated assistant 304, and/or the computing device 302 can transmit input received at the one or more assistant interfaces 320 to the server device. However, in some implementations, the automated assistant 304 can be hosted at the computing device 302, and various processes that can be associated with automated assistant operations can be performed at the computing device 302.
In various implementations, all or less than all aspects of the automated assistant 304 can be implemented on the computing device 302. In some of those implementations, aspects of the automated assistant 304 are implemented via the computing device 302 and can interface with a server device that may implement other aspects of the automated assistant 304. The server device can optionally serve a plurality of users and their associated assistant applications via multiple threads. In implementations where all or less than all aspects of the automated assistant 304 are implemented via the computing device 302, the automated assistant 304 can be an application that is separate from the operating system of the computing device 302 (e.g., installed "on top" of the operating system), or can alternatively be implemented directly by the operating system of the computing device 302 (e.g., considered an application of, but integral with, the operating system).
In some implementations, the automated assistant 304 can include an input processing engine 306, which input processing engine 306 can utilize a plurality of different modules to process inputs and/or outputs of the computing device 302 and/or server device. For example, the input processing engine 306 can include a voice processing engine 308 that can process audio data received at the assistant interface 320 to identify text embodied in the audio data. For example, audio data can be transmitted from the computing device 302 to a server device in order to preserve computing resources at the computing device 302. Additionally or alternatively, the audio data can be specifically processed at the computing device 302.
The process for converting the audio data to text can include a voice recognition algorithm, which can employ neural networks and/or statistical models to identify groups of audio data corresponding to words or phrases. The text converted from the audio data can be parsed by a data parsing engine 310 and made available to the automated assistant 304 as textual data that can be used to generate and/or identify command phrases, intents, actions, slot values, and/or any other content specified by the user. In some implementations, output data provided by the data parsing engine 310 can be provided to a parameter engine 312 to determine whether the user provided an input that corresponds to a particular intent, action, and/or routine capable of being performed by the automated assistant 304 and/or an application or agent that is capable of being accessed via the automated assistant 304. For example, assistant data 338 can be stored at the server device and/or the computing device 302, and can include data that defines one or more actions capable of being performed by the automated assistant 304, as well as parameters necessary to perform the actions. The parameter engine 312 can generate one or more parameters for an intent, action, and/or slot value, and provide the one or more parameters to an output generation engine 314. The output generation engine 314 can use the one or more parameters to communicate with an assistant interface 320 for providing an output to a user, and/or communicate with one or more applications 334 for providing an output to the one or more applications 334.
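The flow through these engines could be summarized with a sketch like the following, in which each callable is a placeholder for the corresponding module (308, 310, 312, 314) rather than an actual implementation of it.

```python
def handle_audio_input(audio_data: bytes, speech_to_text, parse, fill_params, render):
    """Minimal flow through the stages of the input processing engine.

    The callables stand in for the voice processing engine, data parsing
    engine, parameter engine, and output generation engine described above;
    their concrete implementations are not specified here.
    """
    text = speech_to_text(audio_data)      # voice processing engine 308
    intent, slots = parse(text)            # data parsing engine 310
    params = fill_params(intent, slots)    # parameter engine 312
    return render(intent, params)          # output generation engine 314
```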
In some implementations, the automated assistant 304 can be an application that can be installed "on top of" the operating system of the computing device 302 and/or can itself form part of (or the entirety of) the operating system of the computing device 302. The automated assistant application includes, and/or has access to, on-device voice recognition, on-device natural language understanding, and on-device fulfillment. For example, on-device voice recognition can be performed using an on-device voice recognition module that processes audio data (detected by the microphone(s)) using an end-to-end voice recognition machine learning model stored locally at the computing device 302. The on-device voice recognition generates recognized text for a spoken utterance (if any) present in the audio data. Also, for example, an on-device natural language understanding (NLU) module can be used to perform on-device NLU, which processes the recognized text generated using the on-device voice recognition, and optionally the context data, to generate NLU data.
The NLU data can include an intent that corresponds to the spoken utterance and, optionally, parameters (e.g., slot values) for the intent. On-device fulfillment can be performed using an on-device fulfillment module that utilizes the NLU data (from the on-device NLU), and optionally other local data, to determine action(s) to take to resolve the intent of the spoken utterance (and optionally the parameters for the intent). This can include determining local and/or remote responses (e.g., answers) to the spoken utterance, interaction(s) with locally installed application(s) to perform based on the spoken utterance, command(s) to transmit to Internet of Things (IoT) device(s) (directly or via corresponding remote system(s)) based on the spoken utterance, and/or other resolution action(s) to perform based on the spoken utterance. The on-device fulfillment can then initiate local and/or remote performance/execution of the determined action(s) to resolve the spoken utterance.
In various embodiments, remote voice processing, remote NLU, and/or remote fulfillment can at least selectively be utilized. For example, recognized text can at least selectively be transmitted to remote automated assistant component(s) for remote NLU and/or remote fulfillment. For instance, the recognized text can optionally be transmitted for remote performance in parallel with on-device performance, or in response to failure of on-device NLU and/or on-device fulfillment. However, on-device voice processing, on-device NLU, on-device fulfillment, and/or on-device execution can be prioritized, at least due to the latency reductions they provide when resolving a spoken utterance (because no client-server round trip is needed to resolve the spoken utterance). Further, on-device functionality may be the only functionality that is available in situations with no or limited network connectivity.
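An on-device-first arrangement with a remote fallback could be sketched as follows; the callables and the bare try/except are placeholders for the on-device and remote components described above, not a prescribed implementation.

```python
def fulfill(utterance_audio: bytes, on_device_nlu, on_device_fulfill, remote_fulfill):
    """Prefer on-device NLU/fulfillment for lower latency; fall back to remote.

    The callables stand in for the on-device and remote automated assistant
    components described above; error handling is reduced to a simple
    try/except for illustration.
    """
    try:
        nlu_data = on_device_nlu(utterance_audio)
        return on_device_fulfill(nlu_data)
    except Exception:
        # On-device processing failed (or lacks a needed capability): defer to
        # the remote components, when network connectivity allows.
        return remote_fulfill(utterance_audio)
```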
In some implementations, the computing device 302 can include one or more applications 334, which can be provided by a third-party entity that is different from the entity that provides the computing device 302 and/or the automated assistant 304. An application state engine of the automated assistant 304 and/or the computing device 302 can access application data 330 to determine one or more actions capable of being performed by the one or more applications 334, as well as a state of each of the one or more applications 334 and/or a state of a respective device that is associated with the computing device 302. A device state engine of the automated assistant 304 and/or the computing device 302 can access device data 332 to determine one or more actions capable of being performed by the computing device 302 and/or the one or more devices that are associated with the computing device 302. Furthermore, the application data 330 and/or any other data (e.g., the device data 332) can be accessed by the automated assistant 304 to generate the context data 336, which can characterize a context in which a particular application 334 and/or device is executing, and/or a context in which a particular user is accessing the computing device 302, accessing an application 334, and/or any other device or module.
While one or more applications 334 are executing at the computing device 302, the device data 332 can characterize a current operating state of each application 334 executing at the computing device 302. Furthermore, the application data 330 can characterize one or more features of an executing application 334, such as content of one or more graphical user interfaces being rendered at the direction of the one or more applications 334. Alternatively or additionally, the application data 330 can characterize an action schema, which can be updated by a respective application and/or by the automated assistant 304 based on a current operating state of the respective application. Alternatively or additionally, one or more action schemas for one or more applications 334 can remain static, but can be accessed by the application state engine in order to determine a suitable action to initialize via the automated assistant 304.
The computing device 302 can further include an assistant invocation engine 322, which can use one or more trained machine learning models to process the application data 330, the device data 332, the context data 336, and/or any other data that is accessible to the computing device 302. The assistant invocation engine 322 can process this data to determine whether to wait for the user to explicitly speak an invocation phrase to invoke the automated assistant 304, or to consider the data to be indicative of an intent by the user to invoke the automated assistant, in lieu of requiring the user to explicitly speak the invocation phrase. For example, the one or more trained machine learning models can be trained using instances of training data that are based on scenarios in which the user is in an environment where multiple devices and/or applications are exhibiting various operating states. The instances of training data can be generated to capture training data that characterizes contexts in which the user invokes the automated assistant and other contexts in which the user does not invoke the automated assistant. When the one or more trained machine learning models are trained according to these instances of training data, the assistant invocation engine 322 can cause the automated assistant 304 to detect, or limit detecting of, spoken invocation phrases from the user based on features of a context and/or an environment.
In some implementations, the assistant invocation engine 322 can process data generated using the one or more assistant interfaces 320 to determine whether a user has expressed a willingness to benefit from an operation of the automated assistant 304 and/or the robotic computing device. For example, data captured via the one or more assistant interfaces 320 can be processed by the assistant invocation engine 322 to determine whether a spoken utterance, non-verbal gesture, disfluency, and/or other motion of the user can be considered an invocation of the robotic computing device. Alternatively or additionally, the data can be processed to determine a context in which the user provided such an input and/or motion. Based on the determined context and/or expression of the user, the robotic computing device can determine whether the user is willing to allow the robotic computing device to perform a particular operation. The operation can be, for example, providing information to the user and/or directing the user to a particular location, even though the user did not provide an explicit request for the robotic computing device to do so.
In some implementations, the system 300 can include a drive parameter engine 316, which can determine one or more parameters for maneuvering the robotic computing device in certain contexts and/or based on certain data. For example, data that provides a basis for a particular operation to be performed by the robotic computing device can be processed by the drive parameter engine 316 to determine how to maneuver the robotic computing device while performing the particular operation. For instance, the application data 330, the device data 332, and/or the context data 336 can be processed to determine whether there is any urgency and/or time constraint associated with a particular request from a user. In some implementations, this can be determined using one or more heuristic processes and/or one or more trained machine learning models. Alternatively or additionally, data associated with the request can be processed by the drive parameter engine 316 to generate an embedding that can be mapped to a latent space, where a distance to a point and/or region in the latent space can indicate whether the request is urgent. This processing by the drive parameter engine 316 can be used to determine drive parameters such as speed, acceleration, travel time, power limits, and/or any other parameters that can be associated with driving the robotic computing device.
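A minimal sketch of how such processing could translate an urgency estimate into drive parameters is shown below, assuming a request embedding and an "urgent" reference point in the latent space are already available; the centroid, scale, and numeric limits are hypothetical.

```python
import math
from typing import Dict, Sequence


def urgency_from_embedding(embedding: Sequence[float],
                           urgent_centroid: Sequence[float],
                           scale: float = 1.0) -> float:
    """Map distance in the latent space to an urgency score in (0, 1]: the closer
    the request embedding is to the 'urgent' region, the higher the score."""
    distance = math.dist(embedding, urgent_centroid)
    return math.exp(-distance / scale)


def select_drive_parameters(urgency: float) -> Dict[str, float]:
    """Interpolate between a conservative driving profile and a fast one."""
    slow = {"max_speed_m_s": 0.4, "max_accel_m_s2": 0.3}
    fast = {"max_speed_m_s": 1.2, "max_accel_m_s2": 0.9}
    return {key: slow[key] + urgency * (fast[key] - slow[key]) for key in slow}
```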
For example, the application data 330 can indicate that the user has requested to be directed to a location of a device that has provided an emergency notification. The drive parameter engine 316 can process the application data 330 to determine that the application is exhibiting a particular state, and that the particular state is predicted to be especially urgent relative to other notifications and/or other application states. In some implementations, this determination can be based on whether the application state has a temporal quality (e.g., the phone is ringing and an important contact is calling, so there is only a certain amount of time to answer the phone call). Based on this determination, the drive parameter engine 316 can identify a speed parameter for controlling one or more motors of the robotic computing device when fulfilling the request to be directed to the location of the device. In some implementations, the system 300 can include a layout detection engine 318 that can allow the robotic computing device to determine the relative positioning of rooms and/or other features within a space or structure in which the robotic computing device is located. For example, the layout detection engine 318 can be utilized by the robotic computing device when attempting to identify a location of a device, user, room, and/or other feature of the space and/or structure in response to input from a user.
In some implementations, the layout detection engine 318 can cause other devices to provide outputs for assisting in determining the current location of the robotic computing device relative to other portions of the space in which the robotic computing device is located. For example, the layout detection engine 318 can process the application data 330 to determine that a device in a particular room has a user-defined tag (e.g., "laundry speaker") that can indicate a title of the particular room (e.g., "laundry"). When the robotic computing device is directed into a particular room, the layout detection engine 318 can cause the device to provide an output (e.g., illuminate a light or display, render audio, transmit antenna signals, etc.). The layout detection engine 318 can identify one or more different characteristics (e.g., signal metrics) of the output (e.g., signal quality, amplitude, audio frequency, light frequency, etc.) from one or more different devices received over a time window or duration to determine the relative positioning of a particular room with respect to the current positioning of the robotic computing device.
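For illustration, a simple way to realize this comparison is sketched below under assumed data structures: each labeled device is observed to have some detected amplitude during the probe window, and the room label of the strongest detection is treated as the robot's likely current room. The threshold and mappings are hypothetical.

```python
from typing import Mapping, Optional


def nearest_room_from_probe(
    observed_amplitudes: Mapping[str, float],   # device label -> amplitude detected by the robot
    device_room_labels: Mapping[str, str],      # device label -> room title, e.g. "laundry"
    min_amplitude: float = 0.1,
) -> Optional[str]:
    """Return the room label of the device whose probe output was detected most
    strongly, or None when no output exceeded the detection threshold."""
    best_device = None
    best_amplitude = min_amplitude
    for device, amplitude in observed_amplitudes.items():
        if amplitude > best_amplitude:
            best_device, best_amplitude = device, amplitude
    return device_room_labels.get(best_device) if best_device else None
```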
For example, the robotic computing device can determine whether the robotic computing device is co-located with one or more devices having semantic tags associated with a room specified by, or inferred from, a user request. Based on this determination, the layout detection engine 318 can determine how to maneuver the robotic computing device from the current location of the robotic computing device to the location of the desired room. In some implementations, a public knowledge graph and/or a personal knowledge graph can be used to map locations within a structure (e.g., a home or business). The public knowledge graph can be generated based on previous interactions between one or more users or individuals and one or more other applications. Alternatively or additionally, the personal knowledge graph can be generated based on previous interactions between the user and one or more applications (e.g., an assistant application and/or an IoT application).
In some implementations, the system 300 can include a location preference engine 326, which location preference engine 326 can determine a preferred location of the robotic computing device for performing certain operations, interacting with particular users, and/or otherwise locating the robotic computing device. The location preference engine 326 can process the data using one or more heuristic processes and/or a trained machine learning model to determine whether the current location of the robotic computing device is suitable to fulfill the request from the user. In some implementations, the preferred location can be requested directly by the user and/or inferred from one or more previous interactions with one or more users. For example, a user can explicitly or indirectly request that the robotic computing device perform certain operations (e.g., facilitate a phone call or video call) at a first location within the room, and perform other operations (e.g., play music) at a second location within the room. These preferred locations can be different for different users and/or different rooms.
For example, when a first user receives a voice call via the robotic computing device, the robotic computing device may navigate to a first location in the room. However, when a second user receives a voice call, the robotic computing device may navigate to a second location in the room such that the second user is able to receive the voice call via the robotic computing device. In some implementations, the location preference engine 326 can determine a dynamic location for the robotic computing device. For example, the first user can prefer that the robotic computing device track the first user at a distance x (where x is any distance value) while playing music. However, the second user can prefer that the robotic computing device track the second user at a different distance y (where y is any distance value) when rendering news stories and teleconferences.
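A toy sketch of such per-user, per-operation preferences is shown below; the users, operations, coordinates, and distances are hypothetical examples used only to illustrate how the location preference engine 326 might store and look up preferences.

```python
from typing import Dict, Tuple, Union

# Hypothetical preference store: a fixed room position for some operation types,
# a preferred follow distance for others.
PREFERENCES: Dict[Tuple[str, str], Dict[str, Union[float, Tuple[float, float]]]] = {
    ("user_1", "voice_call"): {"location": (2.0, 1.5)},
    ("user_2", "voice_call"): {"location": (4.0, 3.0)},
    ("user_1", "play_music"): {"follow_distance_m": 1.5},
    ("user_2", "news_briefing"): {"follow_distance_m": 2.5},
}


def preference_for(user_id: str, operation: str) -> Dict[str, Union[float, Tuple[float, float]]]:
    """Fall back to a default follow distance when no explicit or inferred preference exists."""
    return PREFERENCES.get((user_id, operation), {"follow_distance_m": 2.0})
```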
Fig. 4 illustrates a method 400 for operating a robotic computing device to relay user inputs and/or user responses from a first user to a second user. The robotic computing device is capable of relaying messages between locations at different speeds depending on the type of input explicitly provided to, and/or inferred by, the robotic computing device. The method 400 can be performed by one or more computing devices, applications, and/or any other apparatus or module that can be associated with an automated assistant. The method 400 can include an operation 402 of determining whether a spoken utterance and/or other user input has been provided to the robotic computing device. The robotic computing device can be a computing device capable of maneuvering to various locations within a user's home and connecting to one or more different networks to send and receive data. When a spoken utterance or other user input is detected at the robotic computing device, the method 400 can proceed to operation 404. Otherwise, the robotic computing device can continue to determine whether the user has provided an input.
Operation 404 can include determining whether the user has directed the robotic computing device to perform an operation that may require travel. Operations that may require travel can include relaying a message to another user who may not be within a distance at which the robotic computing device can project audible sound and/or visual content. For example, the spoken utterance from the first user can include a request that the robotic computing device provide an audible message to a second user located in a different room than the first user. The spoken utterance from the first user can be, for example, "Tell Phoenix that I need to leave because Sherri just pulled up to the house." The spoken utterance can embody a request for the robotic computing device to perform at least an operation of maneuvering to a location of the second user (e.g., "Phoenix"), and another operation of providing a message to the second user (e.g., "Sherri has just pulled up to the house, so Mark is going to leave now."). Upon determining that a request for the robotic computing device to travel to another location has been received, the method 400 can proceed from operation 404 to operation 406. Otherwise, the method 400 can proceed from operation 404 to operation 410.
Operation 406 can include determining whether the request from the user has an increased relative importance. The increased relative importance can refer to the priority or severity of the request relative to other requests submitted by the user and/or one or more other users. For example, in some implementations, the sound characteristics (e.g., cadence, words per second, etc.) of the spoken utterance can be recognized and processed to determine whether the spoken utterance is intended to have increased relative importance. In some implementations, an operation request corresponding to a temporal event can indicate the relative importance of the request. For example, a request associated with an environment having at least a particular probability of not changing for a threshold duration can be considered to have a relative importance that is not increased. However, different requests associated with environments having at least a particular probability of changing within a threshold duration can be considered to have increased relative importance.
When the robotic computing device receives the spoken utterance "Tell Phoenix that I need to leave because Sherri just pulled up to the house," the spoken utterance can be determined to embody a request of increased relative importance. The determination can be based at least in part on one or more heuristic processes and/or one or more trained machine learning models. For example, the robotic computing device and/or another computing device can determine that the request embodied in the spoken utterance characterizes a temporal event (e.g., the user needs to leave), and/or is provided in a tone that indicates urgency (e.g., a cadence of the user's voice, such as "...I...need...to...leave..."). Based on historical interactions between the user and the robotic computing device, and/or one or more other applications, the robotic computing device can determine that the cadence is different from a typical cadence of the user and indicates a sense of urgency.
When it is determined that the user has provided a request of increased relative importance, the method 400 can proceed to operation 408. Otherwise, the method 400 may proceed to operation 412. Operation 408 can include causing the robotic computing device to travel according to first drive parameters. For example, the first drive parameters can include, but are not limited to, an amount of acceleration, an amount of speed, and/or an amount of energy for fulfilling the request from the user. When the first drive parameters are selected, the amount of energy consumed and/or the speed of the robotic computing device can be increased relative to values that would be selected for a request of non-increased relative importance. For example, operation 412 can include causing the robotic computing device to travel according to second drive parameters, which can be different from the first drive parameters. The second drive parameters can characterize speed settings and/or acceleration settings that can be lower than those corresponding to the first drive parameters. When the first drive parameters and/or the second drive parameters are used to control a travel operation of the robotic computing device, the method 400 can proceed from either operation 408 or operation 412 to operation 410.
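The branch between operations 408 and 412 can be summarized with the following sketch, in which the urgency heuristic and the numeric drive values are illustrative assumptions rather than values from the embodiments above.

```python
from typing import Dict

FIRST_DRIVE_PARAMETERS: Dict[str, float] = {"max_speed_m_s": 1.2, "max_accel_m_s2": 0.9}
SECOND_DRIVE_PARAMETERS: Dict[str, float] = {"max_speed_m_s": 0.5, "max_accel_m_s2": 0.3}


def has_increased_importance(mentions_temporal_event: bool,
                             cadence_deviation: float,
                             cadence_threshold: float = 0.3) -> bool:
    """Heuristic stand-in: a time-bound event or an atypical speaking cadence
    marks the request as having increased relative importance."""
    return mentions_temporal_event or cadence_deviation > cadence_threshold


def drive_parameters_for_request(increased_importance: bool) -> Dict[str, float]:
    """Operation 408 uses the faster first parameters; operation 412 uses the second."""
    return FIRST_DRIVE_PARAMETERS if increased_importance else SECOND_DRIVE_PARAMETERS
```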
Operation 410 can include causing the robotic computing device to perform the requested operation. Operation 410 can be performed when the robotic computing device arrives at a destination and/or is en route to the corresponding destination. The requested operation can be, for example, rendering an output to another user, identifying a location of another computing device, retrieving information available at a destination, and/or any other operation that can be requested to be performed by the computing device. In furtherance of the foregoing example, the robotic computing device is capable of rendering an audible output to the second user (i.e., Phoenix), such as "Sherri has just pulled up to the house, so Mark is going to leave now." In some implementations, the output from the robotic computing device can be different from the spoken utterance from the first user, but can convey the information provided by the first user. From operation 410, the method 400 can proceed to operation 414, where it is determined whether the first user, the second user, or another user is predicted to provide an additional request to the robotic computing device.
The robotic computing device can determine whether a user is predicted to provide an additional request to the robotic computing device based on one or more direct and/or indirect gestures performed by the user. For example, with the user's prior permission, the robotic computing device can determine that the user has gazed at the device, directed their voice toward the device, moved toward the device, and/or performed a gesture indicating that the user is interested in providing an additional request to the robotic computing device. In furtherance of the foregoing example, operation 414 can be performed with respect to the first user and/or the second user. For example, when the robotic computing device reaches the second user and provides the audible output, the robotic computing device can determine, with prior permission from the second user, whether the second user is predicted to provide an input to the robotic computing device. Alternatively or additionally, operation 414 can be performed when the robotic computing device returns to the first user after the audible output has been provided to the second user. When it is predicted that a user will provide additional input to the robotic computing device, the method 400 can proceed to operation 416. Otherwise, the method 400 can return to operation 402 for determining whether an input has been provided to the robotic computing device.
Operation 416 can include causing the robotic computing device to track or otherwise maneuver with the user according to the type of predicted input. For example, the robotic computing device can maneuver with the second user when the robotic computing device has predicted that the second user will provide an input in response to the audible output. In some implementations, the robotic computing device can maneuver with the second user for a duration corresponding to the type of predicted input and/or a confidence score of the input prediction. For example, when the robotic computing device predicts an upcoming input with a first confidence score, the robotic computing device can track the user for a first duration. However, when the robotic computing device predicts another upcoming input with a second confidence score that is greater than the first confidence score, the robotic computing device can track the user for a second duration that is longer than the first duration. Thereafter, the method 400 can return to operation 402 for determining whether the user has provided an input to the robotic computing device.
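One simple way to realize this confidence-dependent tracking is sketched here; the duration bounds and the linear scaling are assumptions used only for illustration.

```python
def tracking_duration_s(confidence: float,
                        min_duration_s: float = 5.0,
                        max_duration_s: float = 30.0) -> float:
    """Scale how long the robot keeps maneuvering with the user by the confidence
    score of the prediction that further input is coming (clamped to [0, 1])."""
    confidence = max(0.0, min(1.0, confidence))
    return min_duration_s + confidence * (max_duration_s - min_duration_s)
```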
Fig. 5 illustrates a method 500 for operating a robotic computing device to autonomously assign semantic tags to one or more regions within a space or structure occupied by one or more users and/or robotic computing devices. Semantic tags can be assigned to certain locations to further ensure that the robotic computing device performs certain actions in certain areas according to user preferences. The method 500 can be performed by any computing device, application, and/or any apparatus or module capable of interacting with a robotic computing device. The method 500 can include an operation 502 of determining whether a region of space occupied by a robotic computing device is not associated with a semantic tag. For example, a robotic computing device can operate in a household occupied by one or more users (e.g., a mother user and a daughter user), and the household can include various different areas (e.g., different portions of a living room, kitchen, office, bedroom, etc.). Operation 502 can be initiated when the robotic computing device is located in a kitchen of a home, either before the user provides input to the robotic computing device or in response to the user providing input to the robotic computing device. In this way, each semantic tag can assist the robotic computing device in performing a particular action that can be performed more efficiently using information related to positioning at and/or near the robotic computing device. For example, when a first user instructs the robotic computing device to communicate with a second user in the home, the robotic computing device can access semantic tags of different parts of the home to predict the location of the second user if pre-approved by the second user, and optionally, thereafter predict the location of the first user.
When the robotic computing device determines that the space occupied by, or near, the robotic computing device is not associated with a semantic tag, the method 500 can proceed from operation 502 to operation 504. Otherwise, the robotic computing device can continue to determine whether the space at or near the robotic computing device is associated with a semantic tag. Operation 504 can include causing a first set of smart devices to transmit one or more first outputs during a first time window. The first set of smart devices can be selected for transmitting the one or more first outputs based on the first set of smart devices having assigned tags with relevant content (e.g., "kitchen counter speakers", "refrigerator smart display", etc.). Alternatively or additionally, the first set of smart devices can include certain devices located at one or more specific locations on a map generated over time using data from one or more sensors of the robotic computing device. For example, when the robotic computing device maneuvers to different locations in the first user's home, the robotic computing device can capture (with prior permission from the user in the home) data characterizing certain locations of certain devices in the home. The data can be used to associate the devices with locations on a map generated by the robotic computing device and/or one or more other devices.
From operation 504, the method 500 can proceed to operation 506, which operation 506 can include causing the second set of smart devices to transmit one or more second outputs during a second time window. In some implementations, the one or more first outputs and the one or more second outputs can include an audio output and/or a visual output. For example, the one or more first outputs can include one or more characteristics that are the same as or different from one or more other characteristics of the one or more second outputs. For example, the one or more first outputs can embody one or more frequencies (e.g., audio and/or visual) that are different from one or more other frequencies (e.g., audio and/or visual) embodied by the one or more second outputs. In some implementations, the second set of smart devices can be selected for transmitting the one or more second outputs based on the robotic computing device determining that the second set of smart devices is located in a different room and/or a different spatial region than the first set of smart devices.
From operation 506, the method 500 can proceed to operation 508, which operation 508 can include processing sensor data generated by one or more sensors of the robotic computing device. In some implementations, the first time window and the second time window can be at least partially overlapping or non-overlapping durations. For example, the sensor data captured during the first time window can have the same or different time stamps as other sensor data captured during the second time window. In some implementations, sensor data captured by the robotic computing device and/or one or more other computing devices can be processed to determine whether the first set of smart devices and/or the second set of smart devices are co-located with the robotic computing device. For example, the amplitude of the output detected during the first time window can be compared to the amplitude of another detected output during the second time window. The comparison can indicate whether the first set of smart devices or the second set of smart devices are co-located at or near an area occupied by the robotic computing device. Alternatively or additionally, one or more characteristics of the detected output can be determined and compared to a first set of characteristics and/or a second set of characteristics. When the first set of characteristics is detected as embodied in an output detected by the robotic computing device, the robotic computing device can determine that the first set of smart devices are co-located at or near the robotic computing device. Alternatively or additionally, when the second set of characteristics is detected as embodied in the output detected by the robotic computing device, and optionally the second set of characteristics is not embodied, the robotic computing device can determine that the second set of smart devices are co-located at or near the robotic computing device.
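The co-location test of operation 508 can be illustrated with the following sketch, which compares what the robot detected during (or at the characteristic frequency of) each probe and attributes the stronger detection to the nearby device set; the detection floor is an assumed value.

```python
from typing import Optional


def colocated_device_set(first_probe_amplitude: float,
                         second_probe_amplitude: float,
                         detection_floor: float = 0.05) -> Optional[str]:
    """Return "first" or "second" depending on which probe was detected more
    strongly, or None when neither probe exceeded the detection floor."""
    if max(first_probe_amplitude, second_probe_amplitude) < detection_floor:
        return None
    return "first" if first_probe_amplitude >= second_probe_amplitude else "second"
```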
In some implementations, the detected characteristics can include an amplitude of the output, a frequency of the output, a change in the amplitude of the output (e.g., as compared to a set amplitude), a change or modulation of the frequency of the output (e.g., as compared to a set frequency), and/or any other characteristic that can be associated with a rendered output. In some implementations, the audio and/or visual output can be rendered at one or more frequencies that may not be visually and/or audibly detectable by a natural person and/or a person who is not aided by extrinsic features. In some implementations, the output rendered by one set of devices can include a combination of one or more different outputs (e.g., outputs rendered by different interface modalities) that can be distinguished from other outputs rendered by another set of devices.
When the robotic computing device is determined to be co-located with a particular set of smart devices (e.g., the first set and/or the second set), the method 500 can proceed from operation 510 to operation 512. Operation 512 can include generating a semantic tag for the area within the space currently occupied by the robotic computing device. In some implementations, a set of smart devices can include, but is not limited to, smart lights, smart televisions, smart thermostats, smart speakers, and/or any other devices that can be associated with a user and that can be controlled via a separate device and/or application. In some implementations, the semantic tag can be based on existing descriptors of rooms within the space or structure that were previously assigned to the detected set of smart devices. For example, in response to explicit user input to the automated assistant and/or another device or application, the user may have assigned a descriptor (e.g., "Sam's room [device type]") to the detected set of smart devices. Alternatively or additionally, the detected set of smart devices can be assigned descriptors that can be generated based on processing data using one or more heuristic processes and/or one or more machine learning models. When an area occupied by the robotic computing device is assigned a semantic tag, the semantic tag can be stored in association with a location on a generated map, which the robotic computing device can utilize to maneuver between locations within the space or structure.
When a user provides a verbal input that is, for example, synonymous with a semantic tag, the robotic computing device can associate the verbal input with the area to which the semantic tag is assigned. In this way, the robotic computing device is able to map the user's home using existing data and/or other information, without the user having to explicitly identify areas within the home to the robotic computing device. This can allow the robotic computing device to fulfill certain requests with less information explicitly provided by the user, thereby conserving computing resources of the robotic computing device. For example, when a semantic tag (e.g., "Sherri's room") is assigned to an area of a map, the robotic computing device can, in some cases, navigate to the area when the semantic tag (or a synonymous term and/or a portion of the semantic tag) is identified in a natural language input (e.g., "Go tell Sherri that dinner is ready") provided to the robotic computing device. When the robotic computing device determines that no set of smart devices is co-located with the robotic computing device, the method 500 can proceed from operation 510 to operation 514. Operation 514 can include causing the robotic computing device to be relocated to a different area of the space (e.g., a home) occupied by the user and/or the robotic computing device, and optionally performing operation 502 again.
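For illustration, the following sketch shows how an inferred semantic tag stored with a map area could later be matched against a natural language input to resolve a navigation target; the tag store, the naive word-matching rule, and the coordinate representation are assumptions.

```python
from typing import Mapping, Optional, Tuple


def resolve_area_from_utterance(
    utterance: str,
    area_tags: Mapping[str, Tuple[float, float]],  # semantic tag -> map coordinates
) -> Optional[Tuple[float, float]]:
    """Return the map coordinates of the first tagged area whose tag words all
    appear in the utterance, or None when nothing matches."""
    text = utterance.lower()
    for tag, coordinates in area_tags.items():
        if all(word in text for word in tag.lower().split()):
            return coordinates
    return None
```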
Fig. 6 is a block diagram 600 of an example computer system 610. Computer system 610 typically includes at least one processor 614 that communicates with a number of peripheral devices via a bus subsystem 612. These peripheral devices may include a storage subsystem 624 (including, for example, a memory 626 and a file storage subsystem 626), a user interface output device 620, a user interface input device 622, and a network interface subsystem 616. Input and output devices allow users to interact with computer system 610. Network interface subsystem 616 provides an interface to external networks and couples to corresponding interface devices in other computer systems.
User interface input devices 622 may include a keyboard, a pointing device (such as a mouse, trackball, touch pad, or tablet), a scanner, a touch screen incorporated into a display, audio input devices (such as a voice recognition system or microphone), and/or other types of input devices. In general, the term "input device" is intended to include all possible types of devices and ways to input information into computer system 610 or onto a communication network.
The user interface output device 620 may include a display subsystem, a printer, a facsimile machine, or a non-visual display (such as an audio output device). The display subsystem may include a Cathode Ray Tube (CRT), a flat panel device such as a Liquid Crystal Display (LCD), a projection device, or some other mechanism for creating a visual image. The display subsystem may also provide for non-visual displays, such as via an audio output device. In general, the term "output device" is intended to include all possible types of devices and ways to output information from computer system 610 to a user or to another machine or computer system.
Storage subsystem 624 stores programming and data structures that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 624 may include logic to perform one or more of the selected aspects of the method 400, the method 500, and/or the system 300, the robotic computing device 104, the robotic computing device 204, the automated assistant, and/or any other applications, devices, apparatuses, and/or modules discussed herein.
These software modules are typically executed by the processor 614 alone or in combination with other processors. The memory 626 used in the storage subsystem 624 can include a number of memories including a main Random Access Memory (RAM) 630 for storing instructions and data during program execution and a Read Only Memory (ROM) 632 in which fixed instructions are stored. File storage subsystem 626 may provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive, and associated removable media, CD-ROM drive, optical drive, or removable media cartridge. Modules implementing the functionality of particular embodiments may be stored by file storage subsystem 626 in storage subsystem 624, or in other machines accessible to processor 614.
Bus subsystem 612 provides a mechanism for letting the various components and subsystems of computer system 610 communicate with each other as intended. Although bus subsystem 612 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple buses.
Computer system 610 can be of different types including a workstation, a server, a computing cluster, a blade server, a server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computer system 610 depicted in FIG. 6 is intended only as a specific example for purposes of illustrating some embodiments. Many other configurations of computer system 610 are possible with more or fewer components than the computer system depicted in FIG. 6.
Where the systems described herein collect or may utilize personal information about a user (or often referred to herein as a "participant"), the user may be provided with an opportunity to control whether programs or features collect user information (e.g., information about the user's social network, social actions or activities, profession, user preferences, or the user's current geographic location), or whether and/or how content that may be more relevant to the user is received from a content server. Moreover, certain data may be processed in one or more ways to remove personal identification information prior to storage or use. For example, the identity of the user may be processed such that personal identity information of the user cannot be determined, or the user's geolocation may be generalized where geolocation information is obtained (such as to a city, zip code, or state level) such that a particular geolocation of the user cannot be determined. Thus, the user may control how information about the user is collected and/or used.
Although several embodiments have been described and illustrated herein, various other components and/or structures for performing functions and/or obtaining results and/or one or more of the advantages described herein may be utilized and each such change and/or modification is considered to be within the scope of the embodiments described herein. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, the embodiments may be practiced otherwise than as specifically described and claimed. Embodiments of the present disclosure relate to each individual feature, system, article, material, kit, and/or method described herein. In addition, if two or more such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, any combination of such features, systems, articles, materials, kits, and/or methods is included within the scope of the present disclosure.
In some implementations, a method implemented by one or more processors is provided that includes determining, by a robotic computing device, that a user has spoken a spoken utterance indicating that the user is uncertain of a location of a particular computing device. The spoken utterance does not embody an explicit request for the robotic computing device to identify the location of the particular computing device. The method can further include causing, by the robotic computing device, an output interface of the robotic computing device to provide an indication to the user that the robotic computing device is capable of determining the location of the particular computing device. The method can further include processing, by the robotic computing device, input data from one or more input interfaces of the robotic computing device to further determine whether the user is willing to allow the robotic computing device to direct the user to the location of the particular computing device. The method can further include, when the robotic computing device has determined that the user is willing to allow the robotic computing device to direct the user to the location of the particular computing device: causing the robotic computing device to communicate with the particular computing device to further estimate a relative positioning of the particular computing device with respect to the robotic computing device, and causing the robotic computing device to maneuver toward the relative positioning of the particular computing device.
These and other embodiments of the technology disclosed herein can optionally include one or more of the following features.
In some implementations, processing the input data to further determine whether the user is willing to allow the robotic computing device to direct the user to the location of the particular computing device includes: processing image data indicative of movement of the user toward the robotic computing device. In some implementations, the input data is devoid of audio data that characterizes an explicit appeal from the user for the robotic computing device to determine the relative positioning of the particular computing device. In some implementations, causing the robotic computing device to maneuver toward the relative positioning of the particular computing device includes: causing the robotic computing device to maneuver toward the relative positioning of the particular computing device at a speed selected based on a state of an application accessible via the particular computing device. In some of these implementations, the application comprises a voice call application, and the state of the application indicates that the user missed a call from a particular contact. In some implementations, causing the robotic computing device to communicate with the particular computing device to further estimate the relative positioning of the particular computing device with respect to the robotic computing device includes: determining a signal metric based on communication between the robotic computing device and the particular computing device, wherein the signal metric is indicative of a relative distance of the particular computing device from the robotic computing device. In some of these implementations, the signal metric includes an audio amplitude of an audio output rendered by the particular computing device.
In some implementations, a method implemented by one or more processors is provided and includes receiving, by a robotic computing device, a spoken utterance from a first user, the first user being located in a space with the robotic computing device and a second user. The method can further include determining, based on the spoken utterance, that the first user has directed the robotic computing device to communicate with the second user. The second user is located at a second user location that is different from the first user location of the first user. The method can further include, in response to the spoken utterance, causing the robotic computing device to maneuver to a second user location and render an output for the second user. The output embodies a natural language query based on the spoken utterance from the first user. The method can further include receiving, by the robotic computing device, a response input from the second user. The response input embodies natural language content responsive to natural language queries embodied in output from the robotic computing device. The method can further include, after the robotic computing device provides the output for the second user, causing the robotic computing device to maneuver to the first user location and render another output for the first user. The other output characterizes the response input from the second user and embodies other natural language content different from the natural language content embodied in the response input from the second user.
These and other embodiments of the technology disclosed herein can optionally include one or more of the following features.
In some implementations, maneuvering the robotic computing device to the second user location includes: a positioning preference associated with the second user is determined. In some versions of these implementations, maneuvering the robotic computing device to the second user location includes maneuvering the robotic computing device to a particular location corresponding to the preferred location indicated by the location preference. The location preference can indicate a preferred location of the robotic computing device when the robotic computing device is in communication with the second user. In some of these versions, the preferred location indicates a preferred distance of the robotic computing device from the second user, and the particular location is at least a preferred distance from the second location of the second user. In some implementations, maneuvering the robotic computing device to the first user location includes: a positioning preference associated with the first user is determined and the robotic computing device is maneuvered to a particular position that corresponds to the preferred position indicated by the positioning preference. The positioning preference can indicate a preferred positioning of the robotic computing device when the robotic computing device renders a particular type of output for the first user. For example, the particular type of output includes an audible output provided via an audio output interface of the robotic computing device or a visual output provided via a display interface of the robotic computing device. For example, a particular type of output can be an audible output with content characterizing a message from another user.
In some implementations, a method implemented by one or more processors is provided that includes determining, at a robotic computing device, that a user has requested the robotic computing device to perform an operation in a particular room located in a space including a plurality of different rooms. The method can further include causing, by the robotic computing device, one or more devices in one or more of the plurality of different rooms to provide one or more respective outputs detectable by the robotic computing device. The method can further include determining, based on the one or more respective outputs, whether the current location of the robotic computing device corresponds to a particular room. The method can further include, when the current location of the robotic computing device does not correspond to a particular room: the method includes repositioning the robotic computing device to the particular room based on the current location of the robotic computing device not corresponding to the particular room, and causing the robotic computing device to perform an operation when the robotic computing device is located in the particular room.
These and other embodiments of the technology disclosed herein can optionally include one or more of the following features.
In some implementations, repositioning the robotic computing device to the particular room includes: determining that a particular portion of the particular room is preferred by the user for performing a particular type of operation corresponding to the operation, and causing the robotic computing device to be repositioned to the particular portion of the particular room. In some implementations, repositioning the robotic computing device to the particular room includes: determining that a particular portion of the particular room is preferred by the user for performing a particular type of operation that does not correspond to the operation, and causing the robotic computing device to be repositioned to a different portion of the particular room. In some implementations, the method can further include, when the current location of the robotic computing device corresponds to the particular room: causing the robotic computing device to identify a portion, within the current room in which the robotic computing device is located, that is a preferred portion for performing the operation. In some of these implementations, causing the robotic computing device to identify the portion of the current room that is the preferred portion of the room for performing the operation includes: determining, based on the user requesting the operation, that the user previously requested the robotic computing device to perform a particular type of operation corresponding to the operation in the preferred portion of the room. In some implementations, the method can further include, when the current location of the robotic computing device does not correspond to the particular room: before the robotic computing device performs the operation, causing the robotic computing device to render an output that solicits the user to confirm that the current location of the robotic computing device is approved for performing the operation. In some implementations, the method can further include, when the current location of the robotic computing device corresponds to the particular room: causing the robotic computing device to identify a relative distance at which to follow the user when performing the operation, when the user relocates to another portion of the particular room in which the robotic computing device is located.
In some implementations, a method implemented by one or more processors is provided that includes determining, by a mobile robotic computing device and based on a map generated based at least in part on sensor observations of the mobile robotic computing device, that the mobile robotic computing device is currently located within a particular area of a structure (e.g., within a room of a home). The method can further include, when the mobile robotic computing device is located within the particular area: causing a first subset of smart devices to each transmit one or more first outputs, and causing a second subset of smart devices to each transmit one or more second outputs. The one or more first outputs are audible and/or visual, and the one or more first outputs are caused to be transmitted during a first time window and/or with one or more first characteristics in response to the first subset of smart devices each being assigned a first semantic tag in a home graph. The one or more second outputs are audible and/or visual, and the one or more second outputs are caused to be transmitted during a second time window and/or with one or more second characteristics in response to the second subset of smart devices each being assigned a second semantic tag in the home graph. The method can further include obtaining sensor data during the transmission of the one or more first outputs and the one or more second outputs. The sensor data is generated by one or more sensors of the mobile robotic computing device. The method can further include determining that the first subset of smart devices is co-located with the robot in the particular area based on an analysis of the sensor data. Determining that the first subset of smart devices is co-located with the robot in the particular area is based on the analysis indicating: (1) a detected output that is during the first time window and/or that matches the one or more first characteristics, and/or (2) a magnitude of the detected output, the detected output being during the first time window and/or matching the one or more first characteristics. The method can further include, in response to determining that the first subset of smart devices is co-located with the robot in the given room: assigning an inferred semantic tag to the particular area. The inferred semantic tag can be the same as, or derived from, the first semantic tag assigned to the first subset of smart devices in the home graph.
These and other embodiments of the technology disclosed herein can optionally include one or more of the following features.
In some implementations, one or more first outputs are transmitted during a first time window and one or more second outputs are transmitted during a second time window. In some versions of these embodiments, determining that the first subset of smart devices is co-located with the robot in the particular area comprises: it is determined that the detected output occurred during the first time window and that there was no detected output occurring during the second time window. In some additional or alternative versions of these embodiments, determining that the first subset of smart devices is co-located with the robot in the particular area includes: it is determined that the magnitude of the detected output occurring during the first time window is greater than the additional magnitude of the additional detected output occurring during the second time window. In some implementations, one or more first outputs have a first characteristic and wherein one or more second outputs have a second characteristic. In some versions of these embodiments, determining that the first subset of smart devices is co-located with the robot in the particular area comprises: it is determined that the detected output matches the first characteristic, and it is determined that there is no detected output matching the second characteristic. In some of these versions, the one or more first characteristics include a first frequency and the one or more second characteristics include a second frequency. For example, the first output can include a visual output and the first frequency can be a first visual frequency; and the second output can include a second visual output and the second frequency can be a second visual frequency. In some implementations, the one or more first outputs have a first characteristic and the one or more second outputs have a second characteristic, and wherein determining that the first subset of smart devices are co-located with the robot in the particular area comprises: it is determined that the magnitude of the first characteristic in the detected output is greater than the additional magnitude of the second characteristic in the detected output. In some versions of these embodiments, the one or more first characteristics include a first frequency and the one or more second characteristics include a second frequency. For example, the first output can include an audible output and the first frequency is a first audible frequency outside of the human hearing range; and the second output can include a second audible output and the second frequency is a second audible frequency outside of the human hearing range.
In some implementations, the first subset of smart devices includes a standalone automated assistant device, and the one or more first outputs include a first audible output via a hardware speaker of the automated assistant device. In some implementations, the first subset of smart devices includes a standalone automated assistant device, and the one or more first outputs include a first visual output via a hardware display of the automated assistant device or via a light emitting diode of the automated assistant device. In some implementations, the first subset of smart devices includes smart lights, smart televisions, and/or smart thermostats. In some implementations, the first semantic tag in the home graph is a first descriptor of a first room within the structure that was previously assigned to a first subset of the smart devices based on a first explicit user input; and/or a second semantic tag in the house graph is a second descriptor of a second room within the structure that was previously assigned to a second subset of the smart devices based on a second explicit user input. In some implementations, assigning the inferred semantic tags to the particular regions includes: inferred semantic tags are automatically assigned to specific areas in the map for use by the mobile robotic device. In some versions of these embodiments, the method can further include, after automatically assigning the inferred semantic tags to specific areas in the map for use by the mobile robotic device: the inferred semantic tags are used to control navigation of the mobile robotic device. In some of these versions, controlling navigation of the mobile robotic device using the inferred semantic tags includes: based on processing the spoken input detected at the one or more microphones of the mobile robotic device, determining that one or more terms of the spoken input match the inferred semantic tags; and based on determining that the one or more terms match the inferred semantic tags, and based on the inferred semantic tags being assigned to a particular area in the map, causing the robot to navigate to the particular area. In some implementations, assigning the inferred semantic tags to the particular regions includes: suggesting to the user in the graphical user interface that the inferred semantic tags be assigned to specific areas in the map for use by the mobile robotic device; and in response to receiving a positive user interface input by the user in response to the suggestion, assigning the inferred semantic tags to specific areas in the map for use by the mobile robotic device.

Claims (41)

1. A method implemented by one or more processors, the method comprising:
determining, by a mobile robotic computing device and based on a map generated based at least in part on sensor observations of the mobile robotic computing device, that the mobile robotic computing device is currently located within a particular region of a structure;
when the mobile robotic computing device is located within the particular area:
causing a first subset of smart devices to each transmit one or more first outputs, wherein the one or more first outputs are audible and/or visual, and wherein the one or more first outputs are caused to be transmitted during a first time window and/or with one or more first characteristics in response to the first subset of smart devices each being assigned a first semantic tag in a home graph;
causing a second subset of the smart devices to each transmit one or more second outputs, wherein the one or more second outputs are audible and/or visual, and wherein the one or more second outputs are caused to be transmitted during a second time window and/or with one or more second characteristics in response to the second subset of smart devices each being assigned a second semantic tag in the home graph; and
Obtaining sensor data during the transmission of the one or more first outputs and the one or more second outputs, wherein the sensor data is generated by one or more sensors of the mobile robotic computing device;
determining that the first subset of smart devices is co-located with the robot in the particular area based on the analysis of the sensor data, wherein determining that the first subset of smart devices is co-located with the robot in the particular area is based on:
the analysis indicates that the detected output is during the first time window and/or matches the one or more first characteristics, and/or
a magnitude of the detected output, the detected output being during the first time window and/or matching the one or more first characteristics; and
in response to determining that the first subset of smart devices is co-located with the robot in a given room:
assigning inferred semantic tags to the particular region, the inferred semantic tags being the same as or derived from the first semantic tags in the home graph assigned to the first subset of smart devices.
2. The method of claim 1, wherein the one or more first outputs are transmitted during the first time window, and wherein the one or more second outputs are transmitted during the second time window, and wherein determining that the first subset of smart devices are co-located with the robot in the particular region comprises:
it is determined that the detected output occurred during the first time window and that there was no detected output occurring during the second time window.
3. The method of claim 1, wherein the one or more first outputs are transmitted during the first time window, and wherein the one or more second outputs are transmitted during the second time window, and wherein determining that the first subset of smart devices are co-located with the robot in the particular region comprises:
determining that the magnitude of the detected output occurring during the first time window is greater than an additional magnitude of an additional detected output occurring during the second time window.
4. The method of claim 1 or claim 2, wherein the one or more first outputs have the first characteristic, and wherein the one or more second outputs have the second characteristic, and wherein determining that the first subset of smart devices are co-located with the robot in the particular region comprises:
It is determined that the detected output matches the first characteristic and that there is no detected output matching the second characteristic.
5. The method of claim 4, wherein the one or more first characteristics comprise a first frequency, and wherein the one or more second characteristics comprise a second frequency.
6. The method according to claim 5,
wherein the first output comprises a visual output and the first frequency is a first visual frequency;
wherein the second output comprises a visual output and the second frequency is a second visual frequency.
7. The method of claim 1 or claim 3, wherein the one or more first outputs have the first characteristic, and wherein the one or more second outputs have the second characteristic, and wherein determining that the first subset of smart devices are co-located with the robot in the particular region comprises:
determining that the magnitude of the first characteristic in the detected output is greater than the additional magnitude of the second characteristic in the detected output.
8. The method of claim 7, wherein the one or more first characteristics comprise a first frequency, and wherein the one or more second characteristics comprise a second frequency.
9. The method according to claim 8,
wherein the first output comprises an audible output and the first frequency is a first audible frequency outside of a human hearing range; and
wherein the second output comprises an audible output and the second frequency is a second audible frequency outside the human hearing range.
10. The method of any preceding claim, wherein the first subset of smart devices comprises a stand-alone automated assistant device and the one or more first outputs comprise a first audible output via a hardware speaker of the stand-alone automated assistant device.
11. The method of any of claims 1-10, wherein the first subset of smart devices comprises a standalone automated assistant device and the one or more first outputs comprise first visual outputs via a hardware display of the automated assistant device or via light emitting diodes of the automated assistant device.
12. The method of any preceding claim, wherein the first subset of smart devices comprises a smart light, a smart television, or a smart thermostat.
13. The method of any preceding claim,
wherein the first semantic tag in the home graph is a first descriptor of a first room within a structure, the first descriptor previously assigned to the first subset of smart devices based on a first explicit user input; and
wherein the second semantic tag in the home graph is a second descriptor of a second room within the structure, the second descriptor previously assigned to the second subset of smart devices based on a second explicit user input.
14. The method of any preceding claim, wherein assigning the inferred semantic tags to the particular region comprises:
the inferred semantic tags are automatically assigned to the particular region in the map for use by the mobile robotic device.
15. The method of claim 14, further comprising, after automatically assigning the inferred semantic tags to the particular region in the map for use by the mobile robotic device:
controlling navigation of the mobile robotic device using the inferred semantic tags.
16. The method of claim 15, wherein using the inferred semantic tags to control navigation of the mobile robotic device comprises:
determining, based on processing spoken input detected at one or more microphones of the mobile robotic device, that one or more terms of the spoken input match the inferred semantic tags; and
based on determining that the one or more terms match the inferred semantic tags, and based on the inferred semantic tags being assigned to the particular region in the map, navigating the robot to the particular region.
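A minimal, hypothetical sketch of the navigation control described in claim 16: terms from an already-transcribed utterance are matched against the semantic tags, and the matching region's coordinates are handed to whatever planner the robot exposes. The tag-to-centroid map and the navigate_to call are illustrative and do not come from the disclosure.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping of semantic tags (explicit and inferred) to region
# centroids in the robot's map frame; the coordinates are made up.
REGION_BY_TAG: Dict[str, Tuple[float, float]] = {
    "kitchen": (1.2, 3.4),
    "living room": (5.0, 0.8),   # e.g., an inferred tag
}

def region_for_utterance(transcript: str) -> Optional[Tuple[float, float]]:
    """Return the centroid of the region whose semantic tag appears in the
    recognized transcript, or None if no tag matches."""
    text = transcript.lower()
    for tag, centroid in REGION_BY_TAG.items():
        if tag in text:
            return centroid
    return None

# Usage: target = region_for_utterance("go to the living room")
# If target is not None, pass it to the robot's own planner, e.g. navigate_to(target),
# where navigate_to is a placeholder for whatever interface the device provides.
```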
17. The method of any of claims 1-13, wherein assigning the inferred semantic tag to the particular region comprises:
suggesting to a user in a graphical user interface that the inferred semantic tags be assigned to the particular region in the map for use by the mobile robotic device; and
in response to receiving a positive user interface input from the user in response to the suggestion, assigning the inferred semantic tags to the particular region in the map for use by the mobile robotic device.
18. A method implemented by one or more processors, the method comprising:
determining, by a robotic computing device, that a user has spoken a spoken utterance indicating that the user is unsure of a location of a particular computing device,
Wherein the spoken utterance does not embody an explicit request for the robotic computing device to identify the location of the particular computing device;
causing, by the robotic computing device, an output interface of the robotic computing device to provide an indication to the user that the robotic computing device is capable of determining the location of the particular computing device;
processing, by the robotic computing device, input data from one or more input interfaces of the robotic computing device to further determine whether the user is willing to allow the robotic computing device to direct the user to the location of the particular computing device; and
when the robotic computing device has determined that the user is willing to allow the robotic computing device to direct the user to the location of the particular computing device:
causing the robotic computing device to communicate with the particular computing device to further estimate a relative positioning of the particular computing device with respect to the robotic computing device, and
causing the robotic computing device to maneuver toward the relative positioning of the particular computing device.
19. The method of claim 18, wherein causing the robotic computing device to maneuver toward the relative positioning of the particular computing device comprises:
causing the robotic computing device to maneuver toward the relative positioning of the particular computing device at a speed selected based on a state of an application accessible via the particular computing device.
20. The method of claim 19, wherein the application comprises a voice call application and the status of the application indicates that the user missed a call from a particular contact.
21. The method of claim 20, wherein causing the robotic computing device to communicate with the particular computing device to further estimate the relative positioning of the particular computing device with respect to the robotic computing device comprises:
determining a signal metric based on communications between the robotic computing device and the particular computing device,
wherein the signal metric indicates a relative distance of the particular computing device from the robotic computing device.
22. The method of claim 21, wherein the signal metric comprises an audio amplitude of an audio output rendered by the particular computing device.
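One plausible reading of the audio-amplitude signal metric in claims 21 and 22, sketched with assumed helpers: the robot repeatedly measures the loudness of a tone rendered by the target device and treats a sustained rise as evidence that the relative distance is shrinking. The RMS measure and the five percent margin are illustrative choices, not specified by the claims.

```python
import numpy as np

def rms_amplitude(frame: np.ndarray) -> float:
    """Root-mean-square amplitude of one captured audio frame."""
    samples = frame.astype(np.float64)
    return float(np.sqrt(np.mean(samples ** 2)))

def is_getting_closer(previous_rms: float, current_rms: float,
                      margin: float = 1.05) -> bool:
    """Heuristic: a clear increase in the RMS of the tone rendered by the
    particular computing device suggests the robot is closing the distance."""
    return current_rms > previous_rms * margin
```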
23. The method of any of claims 18-22, wherein processing the input data to further determine whether the user is willing to allow the robotic computing device to direct the user to the location of the particular computing device comprises:
processing image data indicative of motion of the user toward the robotic computing device.
24. The method of any of claims 18-23, wherein the input data is devoid of audio data characterizing any explicit request from the user for the robotic computing device to determine the relative positioning of the particular computing device.
25. A method implemented by one or more processors, the method comprising:
determining, at a robotic computing device, that a user has requested the robotic computing device to perform an operation in a particular room located in a space comprising a plurality of different rooms;
causing, by the robotic computing device, one or more devices in one or more of the plurality of different rooms to provide one or more respective outputs detectable by the robotic computing device;
determining, based on the one or more respective outputs, whether a current location of the robotic computing device corresponds to the particular room;
when the current location of the robotic computing device does not correspond to the particular room:
repositioning the robotic computing device to the particular room based on the current location of the robotic computing device not corresponding to the particular room, and
causing the robotic computing device to perform the operation when the robotic computing device is located in the particular room.
26. The method of claim 25, wherein repositioning the robotic computing device to the particular room comprises:
determining that a particular portion of the particular room is preferred by the user for performing a particular type of operation corresponding to the operation, and
causing the robotic computing device to be repositioned to the particular portion of the particular room.
27. The method of claim 25, wherein repositioning the robotic computing device to the particular room comprises:
determining that a particular portion of the particular room is preferred by the user for performing a particular type of operation that does not correspond to the operation, and
causing the robotic computing device to be repositioned to a different portion of the particular room.
28. The method of claim 25, further comprising:
when the current location of the robotic computing device corresponds to the particular room:
causing the robotic computing device to identify a portion of a current room in which the robotic computing device is located, the portion being a preferred portion for performing the operation.
29. The method of any of claims 25-28, wherein causing the robotic computing device to identify the portion of the current room as the preferred portion of the room for performing the operation comprises:
determining, based on the user requesting the operation, that the user previously requested the robotic computing device to perform a particular type of operation corresponding to the operation in the preferred portion of the room.
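For claims 26 through 29, a hedged sketch of how a preferred portion of a room might be derived from the user's prior requests; the history-based majority vote is an assumption used only for illustration, not a description of the actual implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional

class LocationPreferences:
    """Tracks where a user previously asked for each type of operation and
    returns the most frequent portion of a room as the preferred one."""

    def __init__(self) -> None:
        self._history: Dict[str, List[str]] = defaultdict(list)

    def record(self, operation_type: str, room_portion: str) -> None:
        self._history[operation_type].append(room_portion)

    def preferred_portion(self, operation_type: str) -> Optional[str]:
        portions = self._history.get(operation_type)
        if not portions:
            return None
        portion, _count = Counter(portions).most_common(1)[0]
        return portion

# Usage (hypothetical operation type and room portion names):
# prefs = LocationPreferences()
# prefs.record("play_music", "reading nook")
# prefs.record("play_music", "reading nook")
# prefs.preferred_portion("play_music")  # -> "reading nook"
```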
30. The method of any of claims 25 to 29, further comprising:
when the current location of the robotic computing device does not correspond to the particular room:
before the robotic computing device performs the operation, causing the robotic computing device to render an output soliciting the user to confirm that the current location of the robotic computing device is approved for performing the operation.
31. The method of any of claims 25 to 30, further comprising:
when the current location of the robotic computing device corresponds to the particular room:
causing the robotic computing device to identify, within a current room in which the robotic computing device is located, a relative distance at which to follow the user while performing the operation when the user relocates to another portion of the particular room.
32. A method implemented by one or more processors, the method comprising:
receiving, by a robotic computing device, a spoken utterance from a first user, the first user being in a space with the robotic computing device and a second user;
determining, based on the spoken utterance, that the first user has directed the robotic computing device to communicate with the second user,
wherein the second user is located at a second user location that is different from a first user location of the first user;
responsive to the spoken utterance, causing the robotic computing device to maneuver to the second user location and render an output for the second user,
wherein the output embodies a natural language query based on the spoken utterance from the first user;
receiving, by the robotic computing device, a response input from the second user,
wherein the response input embodies natural language content responsive to the natural language query embodied in the output from the robotic computing device; and
after the robotic computing device provides the output to the second user, causing the robotic computing device to maneuver to the first user location and render another output for the first user,
wherein the other output characterizes the response input from the second user and embodies natural language content that is different from the natural language content embodied in the response input from the second user.
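The relay flow of claim 32 could be orchestrated roughly as below; every callable passed in (navigation, speech output, listening, query composition, reply summarization) is hypothetical and stands in for whatever components the robotic computing device actually provides.

```python
from typing import Callable, Tuple

def relay_message(
    spoken_utterance: str,
    first_user_location: Tuple[float, float],
    second_user_location: Tuple[float, float],
    navigate_to: Callable[[Tuple[float, float]], None],
    say: Callable[[str], None],
    listen: Callable[[], str],
    compose_query: Callable[[str], str],
    summarize_reply: Callable[[str], str],
) -> None:
    """Carry a question from the first user to the second user, collect the
    reply, and report a paraphrase of the reply back to the first user."""
    query = compose_query(spoken_utterance)   # natural language query based on the utterance
    navigate_to(second_user_location)
    say(query)
    reply = listen()                           # response input from the second user
    navigate_to(first_user_location)
    say(summarize_reply(reply))                # different wording than the raw reply
```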
33. The method of claim 32, wherein causing the robotic computing device to maneuver to the second user location comprises:
determining a positioning preference associated with the second user,
wherein the positioning preference indicates a preferred positioning of the robotic computing device when the robotic computing device is in communication with the second user, and
maneuvering the robotic computing device to a particular location corresponding to the preferred positioning indicated by the positioning preference.
34. The method of claim 33, wherein the preferred positioning indicates a preferred distance of the robotic computing device from the second user, and the particular location is at least the preferred distance from the second user location of the second user.
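A small geometric sketch of the preferred-distance behavior in claims 33 and 34: compute a stopping point that lies exactly the preferred distance from the user along the robot's current approach direction. The 2D map frame and the fallback for a zero-length direction vector are assumptions made for the example.

```python
import math
from typing import Tuple

def standoff_position(robot_xy: Tuple[float, float],
                      user_xy: Tuple[float, float],
                      preferred_distance: float) -> Tuple[float, float]:
    """Return a point on the line from the user toward the robot that is the
    preferred distance away from the user, so the robot stops at least that
    far from them."""
    dx, dy = robot_xy[0] - user_xy[0], robot_xy[1] - user_xy[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:  # degenerate case: robot already at the user; back off along +x
        return (user_xy[0] + preferred_distance, user_xy[1])
    scale = preferred_distance / norm
    return (user_xy[0] + dx * scale, user_xy[1] + dy * scale)
```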
35. The method of any of claims 32-34, wherein maneuvering the robotic computing device to the first user location comprises:
determining a positioning preference associated with the first user,
wherein the positioning preference indicates a preferred positioning of the robotic computing device when the robotic computing device renders a particular type of output for the first user, and
maneuvering the robotic computing device to a particular location corresponding to the preferred positioning indicated by the positioning preference.
36. The method of any of claims 32 to 35, wherein the particular type of output comprises an audible output provided via an audio output interface of the robotic computing device or a visual output provided via a display interface of the robotic computing device.
37. The method of claim 36, wherein the particular type of output is the audible output having content characterizing a message from another user.
38. A computer program comprising instructions which, when executed by one or more processors of a computing system, cause the computing system to perform the method of any preceding claim.
39. A system comprising one or more computing devices configured to perform the method of any of claims 1-37.
40. The system of claim 39, wherein the one or more computing devices comprise a mobile robotic computing device.
41. A computer-readable storage medium storing instructions executable by one or more processors of a computing system to perform the method of any one of claims 1-37.
CN202180096667.4A 2021-11-19 2021-12-13 Robotic computing device with adaptive user interaction Pending CN117099063A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/281,375 2021-11-19
US17/533,873 2021-11-23
US17/533,873 US20230158683A1 (en) 2021-11-19 2021-11-23 Robotic computing device with adaptive user-interaction
PCT/US2021/063125 WO2023091160A1 (en) 2021-11-19 2021-12-13 Robotic computing device with adaptive user-interaction

Publications (1)

Publication Number Publication Date
CN117099063A (en) 2023-11-21

Family

ID=88780277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180096667.4A Pending CN117099063A (en) 2021-11-19 2021-12-13 Robotic computing device with adaptive user interaction

Country Status (1)

Country Link
CN (1) CN117099063A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination