CN117121100A - Enabling natural conversations with soft endpoints for automated assistants - Google Patents


Info

Publication number
CN117121100A
Authority
CN
China
Prior art keywords
user
spoken utterance
output
client device
providing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180096651.3A
Other languages
Chinese (zh)
Inventor
贾克琳·康策尔曼
特雷弗·施特勒曼
乔纳森·布鲁姆
约翰·沙尔克威克
约瑟夫·斯玛尔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/532,819 (US12020703B2)
Application filed by Google LLC filed Critical Google LLC
Priority claimed from PCT/US2021/060987 (WO2023022743A1)
Publication of CN117121100A

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

As part of a dialog session between a user and an automated assistant, embodiments can process an audio data stream capturing a portion of a spoken utterance using a streaming ASR model to generate an ASR output; process the ASR output using an NLU model to generate an NLU output; and cause a fulfillment data stream to be generated based on the NLU output. Further, embodiments can determine, based on processing the audio data stream, audio-based characteristics associated with the portion of the spoken utterance captured in the audio data stream. Based on the audio-based characteristics and/or the NLU output stream, embodiments can determine whether the user has paused providing the spoken utterance or has completed providing the spoken utterance. If the user has paused, embodiments can cause natural dialog output to be provided to the user for presentation.

Description

Enabling natural conversations with soft endpoints for automated assistants
Background
Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as "automated assistants" (also referred to as "chatbots," "interactive personal assistants," "intelligent personal assistants," "personal voice assistants," "conversational agents," etc.). Automated assistants typically rely on a pipeline of components to interpret and respond to spoken utterances (or touch/typed input). For example, an Automatic Speech Recognition (ASR) engine can process audio data corresponding to a spoken utterance of a user to generate ASR output, such as speech hypotheses (i.e., term(s) and/or other word(s)) that are predicted to correspond to the spoken utterance. Further, a Natural Language Understanding (NLU) engine can process the ASR output (or the touch/typed input) to generate NLU output, such as an intent of the user in providing the spoken utterance (or the touch/typed input) and, optionally, slot value(s) for parameter(s) associated with the intent. Further, a fulfillment engine can be used to process the NLU output to generate fulfillment output, such as a structured request to obtain responsive content for the spoken utterance and/or to perform an action in response to the spoken utterance, and a fulfillment data stream can be generated based on the fulfillment output.
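A minimal sketch of such a pipeline of components, using hypothetical function names and placeholder models rather than anything defined in this disclosure, might look like the following:

```python
from dataclasses import dataclass, field


@dataclass
class NLUOutput:
    intent: str                                # e.g. "phone_call"
    slots: dict = field(default_factory=dict)  # parameter name -> slot value


def asr(audio_chunk: bytes) -> str:
    """Placeholder ASR engine: return recognized text for an audio chunk."""
    return "call arnold's"                     # stand-in for a real speech model


def nlu(text: str) -> NLUOutput:
    """Placeholder NLU engine: map recognized text to an intent and slot value(s)."""
    if text.startswith("call "):
        return NLUOutput(intent="phone_call", slots={"callee": text.split(" ", 1)[1]})
    return NLUOutput(intent="unknown")


def fulfill(nlu_out: NLUOutput) -> dict:
    """Placeholder fulfillment engine: build a structured request for the intent."""
    return {"action": nlu_out.intent, "parameters": nlu_out.slots}


print(fulfill(nlu(asr(b"\x00" * 320))))
# {'action': 'phone_call', 'parameters': {'callee': "arnold's"}}
```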
Typically, a dialog session with the automated assistant is initiated by the user providing a spoken utterance, and the automated assistant can respond to the spoken utterance using the aforementioned pipeline of components to generate a response. The user can continue the dialog session by providing additional spoken utterances, and the automated assistant can respond to the additional spoken utterances using the aforementioned pipeline of components to generate additional responses. In other words, these dialog sessions are generally turn-based in that the user takes a turn in the dialog session to provide a spoken utterance and, when the user stops speaking, the automated assistant takes a turn in the dialog session to respond to the spoken utterance. However, from the perspective of the user, these turn-based dialog sessions may be unnatural, because they do not reflect how humans actually converse with one another.
For example, a first person may provide multiple different spoken utterances to convey a single thought to a second person, and the second person can consider each of the multiple different spoken utterances in formulating a response to the first person. In some cases, the first person may pause for varying amounts of time between these multiple different spoken utterances (or for varying amounts of time while providing a single spoken utterance). Notably, the second person may not be able to fully formulate a response to the first person based simply on the first of the multiple different spoken utterances (or a portion thereof), or on each of the multiple different spoken utterances considered in isolation.
Similarly, in these turn-based dialog sessions, the automated assistant may not be able to fully formulate a response to a given spoken utterance (or a portion thereof) of the user without considering the context of the given spoken utterance relative to a plurality of different spoken utterances, or without waiting for the user to complete providing the given spoken utterance. Thus, these turn-based dialog sessions may be prolonged when a user attempts to convey his/her thought to the automated assistant with a single spoken utterance during a single turn of the turn-based dialog session, thereby wasting computational resources. Furthermore, if a user attempts to convey his/her thought to the automated assistant across multiple spoken utterances during a single turn of the turn-based dialog session, the automated assistant may simply fail, also wasting computational resources. For example, when a user pauses at length while attempting to formulate a spoken utterance, the automated assistant may prematurely conclude that the user has finished speaking, process the incomplete spoken utterance, and fail by determining (based on the processing) a meaningless intent conveyed by the incomplete spoken utterance, or fail by determining (based on the processing) an incorrect intent conveyed by the incomplete spoken utterance. Additionally, turn-based dialog sessions can prevent spoken utterances of the user that are provided while an assistant response is being rendered from being meaningfully processed. This can require the user to wait for rendering of the assistant response to complete before providing a further spoken utterance, thereby prolonging the dialog session.
Disclosure of Invention
Embodiments described herein relate to enabling an automated assistant to engage in a natural conversation with a user during a dialog session. Some implementations can process, using a streaming Automatic Speech Recognition (ASR) model, an audio data stream generated by microphone(s) of a client device of the user to generate an ASR output stream. The audio data stream can capture a portion of a spoken utterance of the user that is directed to an automated assistant implemented at least in part at the client device. In addition, the ASR output can be processed using a Natural Language Understanding (NLU) model to generate an NLU output stream. Further, the NLU output can be processed using one or more fulfillment rules and/or one or more fulfillment models to generate a fulfillment data stream. Additionally, audio-based characteristics associated with one or more portions of the spoken utterance can be determined based on processing the audio data stream. The audio-based characteristics associated with portions of the spoken utterance can include, for example, intonation, tone, stress, rhythm, tempo, pitch, elongated syllables, pauses, grammar(s) associated with pauses, and/or other audio-based characteristics that may be derived from processing the audio data stream. Based on the NLU output stream and/or the audio-based characteristics, the automated assistant can determine whether the user has paused providing the spoken utterance or has completed providing the spoken utterance (e.g., a soft endpoint).
In some implementations, in response to determining that the user has paused providing the spoken utterance, the automated assistant can cause natural dialog output to be provided to the user for presentation to indicate that the automated assistant is waiting for the user to complete providing the spoken utterance (and, in various implementations, even if the automated assistant determines that fulfillment of the spoken utterance could already be performed). In some implementations, in response to determining that the user has completed providing the spoken utterance, the automated assistant can cause the fulfillment output to be provided to the user for presentation. Thus, by determining whether the user has paused or has completed providing the spoken utterance, the automated assistant can naturally wait for the user to complete his/her thought based on what the user says and how the user says it, rather than simply responding to the user whenever the user pauses in providing a spoken utterance as in a turn-based dialog session.
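As a rough illustration of the soft-endpoint determination described above, the following sketch combines an NLU confidence value, an audio-derived elongated-syllable signal, and a pause duration into a single paused/complete decision; the thresholds, names, and weighting are hypothetical assumptions, not values taken from this disclosure:

```python
from enum import Enum


class UserState(Enum):
    PAUSED = "paused"
    COMPLETE = "complete"


def classify_soft_endpoint(nlu_confidence: float,
                           has_elongated_syllable: bool,
                           pause_seconds: float) -> UserState:
    """Toy soft-endpoint heuristic: treat low NLU confidence, an elongated
    syllable, or a short pause as a sign the user is still formulating."""
    if has_elongated_syllable or nlu_confidence < 0.6 or pause_seconds < 1.0:
        return UserState.PAUSED
    return UserState.COMPLETE


def respond(state: UserState) -> str:
    # Paused -> provide a back-channel output; complete -> perform fulfillment.
    return "Mmhmm" if state is UserState.PAUSED else "<perform fulfillment>"


print(respond(classify_soft_endpoint(0.9, True, 2.0)))   # Mmhmm
```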
For example, assume that a user has engaged in a dialog session with the automated assistant and provided a spoken utterance of "Call Arnolllld's". As the user provides the spoken utterance, the ASR output stream, the NLU output stream, and the fulfillment data stream can be generated based on processing the audio data stream that captures the spoken utterance. Notably, in this example and upon receiving the spoken utterance, the ASR output stream may include recognized text corresponding to the spoken utterance (e.g., "call Arnold's"), the NLU output stream may include a predicted "call" or "phone call" intent having a slot value of "Arnold" for a callee parameter associated with the predicted "call" or "phone call" intent, and the fulfillment data stream can include an assistant command that, when performed as fulfillment output, causes the client device or an additional client device in communication with the client device to initiate a phone call using a contact entry of the user associated with the entity reference "Arnold". Furthermore, the audio-based characteristics associated with the spoken utterance can be generated based on processing the audio data stream, and can include, for example, an elongated syllable (e.g., as indicated by the "llll" in "Call Arnolllld's") that indicates the user is uncertain of the exact slot value for the callee parameter. Thus, in this example, even though the automated assistant may be able to fulfill the spoken utterance based on the NLU data stream (e.g., by having the client device or an additional client device initiate a phone call using the contact entry "Arnold"), the automated assistant may, based on the audio-based characteristics, determine that the user has paused and refrain from having the spoken utterance fulfilled to provide additional time for the user to complete the spoken utterance.
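In the example above, the elongated-syllable signal is an audio-based characteristic derived from processing the audio data stream; purely for illustration, a toy text-based stand-in (hypothetical, and not the audio-based model contemplated here) could flag repeated letters in the recognized text:

```python
import re


def has_elongated_syllable(transcript: str, min_repeat: int = 3) -> bool:
    """Toy check: a letter repeated min_repeat or more times in a row
    (e.g. "Arnolllld") is treated as an elongated syllable."""
    return re.search(rf"([a-z])\1{{{min_repeat - 1},}}", transcript.lower()) is not None


print(has_elongated_syllable("Call Arnolllld's"))   # True
print(has_elongated_syllable("Call Arnold's"))      # False
```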
Instead, in this example, the automated assistant can determine to provide natural dialog output to the user for presentation. For example, in response to determining that the user has paused in providing the spoken utterance (and optionally after the user has been paused for a threshold duration), the automated assistant can cause a natural dialog output, such as "Mmhmm" or "Uh huhh" (or another vocalized back-channel), to be provided to the user via the speaker(s) of the client device for audible presentation to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance. In some cases, the volume of the natural dialog output provided to the user for audible presentation can be lower than that of other audible output provided to the user for presentation. Additionally or alternatively, in embodiments in which the client device includes a display, the client device can render one or more graphical elements, such as a streaming transcription of the spoken utterance along with an animated ellipsis, to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance. Additionally or alternatively, in embodiments in which the client device includes one or more Light Emitting Diodes (LEDs), the client device can cause one or more of the LEDs to be illuminated to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance. Notably, while the natural dialog output is provided to the user of the client device for audible presentation, one or more automated assistant components (e.g., ASR, NLU, fulfillment, and/or other components) can remain active to continue processing the audio data stream.
In this example, further assume that the user provides a spoken utterance of "Arnold's restaurant" to complete the prior spoken utterance while the natural dialog output is provided for audible presentation, or provides a spoken utterance of "call Arnold's restaurant" after the natural dialog output is provided for audible presentation, where "Arnold's restaurant" is a fictional Italian restaurant. Accordingly, the ASR output stream, the NLU output stream, and the fulfillment data stream can be updated based on the user completing the spoken utterance. In particular, the NLU output stream may still include the predicted "call" or "phone call" intent, but with a slot value of "Arnold's restaurant" (e.g., rather than the contact entry "Arnold") for the callee parameter associated with the predicted "call" or "phone call" intent, and the fulfillment data stream can include an assistant command that, when performed as fulfillment output, causes the client device or an additional client device in communication with the client device to initiate a telephone call with the restaurant associated with the entity reference "Arnold's restaurant". Further, in response to determining that the spoken utterance is complete, the automated assistant can cause the client device or an additional client device in communication with the client device to initiate the telephone call.
In contrast, further assume that, after the natural dialog output is provided for audible presentation (and optionally for a threshold duration after the natural dialog output is provided for audible presentation), the user does not provide any spoken utterance to complete the prior spoken utterance. In this example, the automated assistant can determine additional natural dialog output to be provided to the user for audible presentation. However, the additional natural dialog output can explicitly request that the user of the client device complete the spoken utterance (e.g., "What were you saying?"). In some implementations, and assuming that the user then provides a spoken utterance of "Arnold's restaurant" to complete the prior spoken utterance, the ASR output stream, the NLU output stream, and the fulfillment output stream can be updated, and the automated assistant can cause the spoken utterance to be fulfilled as described above (e.g., by having the client device initiate a telephone call with the restaurant associated with the entity reference "Arnold's restaurant").
In additional or alternative implementations, and assuming that the client device includes a display, the automated assistant can provide a plurality of selectable graphical elements to the user for visual presentation, where each of the selectable graphical elements is associated with a different interpretation of one or more portions of the spoken utterance. In this example, the automated assistant can provide a first selectable graphical element that, when selected, causes the automated assistant to initiate a telephone call with the restaurant "Arnold's restaurant", and a second selectable graphical element that, when selected, causes the automated assistant to initiate a telephone call using the contact entry "Arnold". The automated assistant can then initiate the telephone call based on receiving a user selection of a given one of the selectable graphical elements, or, if the user does not select one of the selectable graphical elements within a threshold duration of the one or more selectable graphical elements being presented, based on NLU metrics associated with the interpretations. For example, in this example, the automated assistant can initiate a telephone call with the restaurant "Arnold's restaurant" if the user does not provide a selection of the one or more selectable graphical elements within 5 seconds, 7 seconds, or any other threshold duration after the one or more selectable graphical elements are provided to the user for presentation.
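A toy sketch of this fall-back behavior follows; the function, the polling loop, and the NLU metric values are hypothetical, and a real implementation would react to UI selection events rather than a function argument:

```python
import time


def choose_interpretation(options, nlu_scores, user_choice=None, timeout_s=5.0):
    """Toy disambiguation: return the user's selection if it arrives before the
    timeout, otherwise fall back to the interpretation with the best NLU metric."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if user_choice is not None:      # a real UI would poll selection events here
            return user_choice
        time.sleep(0.1)
    return max(options, key=lambda opt: nlu_scores[opt])


options = ["call Arnold's restaurant", "call contact 'Arnold'"]
scores = {options[0]: 0.8, options[1]: 0.6}
print(choose_interpretation(options, scores, timeout_s=1.0))   # falls back to the restaurant
```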
As another example, assume that the user has engaged in a dialog session with the automated assistant and provided a spoken utterance of "What's on my calendar forrrr". As the user provides the spoken utterance, the ASR output stream, the NLU output stream, and the fulfillment data stream can be generated based on processing the audio data stream that captures the spoken utterance. Notably, in this example and upon receiving the spoken utterance, the ASR output stream may include recognized text corresponding to the spoken utterance (e.g., "what's on my calendar"), the NLU output stream may include a predicted "calendar" or "calendar lookup" intent having an unknown slot value for a date parameter associated with the predicted "calendar" or "calendar lookup" intent, and the fulfillment data stream can include an assistant command that, when performed as fulfillment output, causes the client device to look up calendar information of the user. Similarly, the audio-based characteristics associated with the spoken utterance can be generated based on processing the audio data stream, and can include, for example, an elongated syllable (e.g., as indicated by the "rrrr" in "What's on my calendar forrrr") that indicates the user is uncertain of the slot value for the date parameter. Thus, in this example, the automated assistant may not be able to fulfill the spoken utterance based on the NLU data stream (e.g., because of the unknown slot value), and, based on the audio-based characteristics of the spoken utterance, the automated assistant may determine that the user has paused and refrain from having the spoken utterance fulfilled to provide additional time for the user to complete the spoken utterance.
Similarly, in this example, the automated assistant can determine to provide natural dialog output to the user for presentation. For example, in response to determining that the user has paused in providing the spoken utterance (and optionally after the user has been paused for a threshold duration), the automated assistant can cause a natural dialog output (such as "Mmhmm" or "Uh huhh") to be provided to the user via the speaker(s) of the client device for audible presentation, and/or can provide one or more other indications that the automated assistant is waiting for the user to complete providing the spoken utterance. However, further assume that, after the natural dialog output is provided for audible presentation (and optionally for a threshold duration after the natural dialog output is provided for audible presentation), the user does not provide any spoken utterance to complete the prior spoken utterance. In this example, the automated assistant may simply infer a slot value of the current date for the unknown date parameter associated with the predicted "calendar" or "calendar lookup" intent, and cause the spoken utterance to be fulfilled by providing the user with calendar information (e.g., audibly and/or visually) for the current date, even though the user did not complete the spoken utterance. In additional or alternative embodiments, the automated assistant can utilize one or more additional or alternative automated assistant components to disambiguate any spoken utterance, confirm fulfillment of any spoken utterance, and/or perform any other action prior to causing any assistant command to be fulfilled.
In various implementations, such as the latter example in which the user initially provided the spoken utterance "What's on my calendar forrrr", and in contrast with the former example in which the user initially provided the spoken utterance "Call Arnolllld's", the automated assistant can determine one or more computational costs associated with causing the spoken utterance to be fulfilled and/or with undoing fulfillment of the spoken utterance if the spoken utterance is fulfilled in error. For example, in the former example, the computational costs associated with fulfilling the spoken utterance can include at least initiating a telephone call using the contact entry "Arnold", and the computational costs associated with undoing fulfillment of the spoken utterance can include at least terminating the telephone call with the contact entry "Arnold", re-engaging in the dialog session with the user, processing an additional spoken utterance, and initiating another telephone call with the restaurant "Arnold's restaurant". Further, in the former example, one or more user costs associated with initiating a telephone call that the user did not intend may be relatively high. In the latter example, in contrast, the computational costs associated with fulfilling the spoken utterance can include at least causing calendar information for the current date to be provided to the user for presentation, and the computational costs associated with undoing fulfillment of the spoken utterance can include causing calendar information for another date specified by the user to be provided to the user for presentation. Further, in the latter example, one or more user costs associated with providing incorrect calendar information to the user may be relatively low. In other words, the computational costs associated with fulfillment (and undoing fulfillment) in the former example are relatively higher than the computational costs associated with fulfillment (and undoing fulfillment) in the latter example. Accordingly, in the latter example, the automated assistant may determine to fulfill the spoken utterance using the inferred date parameter, based on the relatively low computational costs, in an attempt to conclude the dialog session in a quicker and more efficient manner, but may not do so in the former example due to the relatively high computational costs.
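A highly simplified sketch of this kind of cost-based gating is shown below; the cost values, threshold, and function name are illustrative assumptions rather than quantities defined in this disclosure:

```python
def should_fulfill_with_inferred_slot(fulfill_cost: float,
                                      undo_cost: float,
                                      user_cost: float,
                                      threshold: float = 1.0) -> bool:
    """Toy cost gate: only fulfill with an inferred slot value when the combined
    cost of acting plus the cost of undoing a wrong action stays low."""
    return (fulfill_cost + undo_cost + user_cost) <= threshold


# Calendar lookup: cheap to show and cheap to correct -> fulfill with the inferred date.
print(should_fulfill_with_inferred_slot(0.1, 0.1, 0.2))   # True
# Phone call: costly to undo and costly for the user -> keep waiting instead.
print(should_fulfill_with_inferred_slot(0.4, 0.9, 0.8))   # False
```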
One or more technical advantages may be realized by using the techniques described herein. As one non-limiting example, the techniques described herein enable an automated assistant to engage in a natural conversation with a user during a dialog session. For example, the automated assistant can determine whether the user has paused or has completed providing a spoken utterance and adapt the output provided to the user for presentation accordingly, such that the automated assistant is not limited to a turn-based dialog session or to relying on a determination that the user has finished speaking before responding to the user. Thus, when the user engages in these natural conversations, the automated assistant is able to determine when to respond to the user and how to respond to the user. This results in various technical advantages, such as conserving computational resources at the client device, and enables the dialog session to conclude in a quicker and more efficient manner. For example, the number of occurrences of the automated assistant failing can be reduced because the automated assistant can wait for more information from the user before attempting to perform any fulfillment on behalf of the user (even if the automated assistant predicts that fulfillment should be performed). Further, for example, the quantity of user inputs received at the client device can be reduced because the number of occurrences of the user having to repeat himself/herself or re-invoke the automated assistant can be reduced.
As used herein, a "conversation session" may include a logically independent exchange between a user and an automated assistant (and in some cases, other human participants). The automated assistant may differentiate between multiple conversations with the user based on various signals, such as a time lapse between the sessions, a change in user context between the sessions (e.g., positioning, before/during/after a scheduled meeting, etc.), detection of one or more intervening interactions between the user and the client device other than the conversation between the user and the automated assistant (e.g., the user temporarily switches applications, the user leaves and then later returns to a separate voice activated product), a lock/sleep of the client device between the sessions, a change in the client device for interfacing with the automated assistant, etc.
The foregoing description is provided as an overview of only some of the embodiments disclosed herein. These and other embodiments will be described in detail herein.
It should be appreciated that the techniques disclosed herein may be implemented locally on a client device, remotely by a server(s) connected to the client device via one or more networks, and/or both.
Drawings
FIG. 1 depicts a block diagram of an example environment that illustrates aspects of the present disclosure, and in which embodiments disclosed herein can be implemented.
FIG. 2 depicts an exemplary process flow for using the various components of FIG. 1 to demonstrate aspects of the present disclosure, in accordance with various embodiments.
Fig. 3 depicts a flowchart illustrating an exemplary method of determining whether to cause natural dialog output to be provided for presentation to a user in response to determining that the user is pausing providing a spoken utterance and/or determining when to fulfill the spoken utterance, in accordance with various embodiments.
Fig. 4 depicts a flowchart illustrating another exemplary method of determining whether to cause natural dialog output to be provided for presentation to a user in response to determining that the user is pausing providing a spoken utterance and/or determining when to fulfill the spoken utterance, in accordance with various embodiments.
Figs. 5A, 5B, 5C, 5D, and 5E depict various non-limiting examples of determining whether to cause natural dialog output to be provided for presentation to a user in response to determining that the user pauses providing a spoken utterance and/or determining when to fulfill the spoken utterance, according to various embodiments.
FIG. 6 depicts an example architecture of a computing device according to various embodiments.
Detailed Description
Turning now to fig. 1, a block diagram of an example environment is depicted that demonstrates aspects of the present disclosure and in which embodiments disclosed herein can be implemented. The example environment includes a client device 110 and a natural dialog system 180. In some implementations, the natural dialog system 180 can be implemented locally at the client device 110. In additional or alternative embodiments, the natural dialog system 180 can be implemented remotely from the client device 110 as depicted in fig. 1 (e.g., at remote server(s)). In these embodiments, the client device 110 and the natural dialog system 180 may be communicatively coupled to each other via one or more networks 199, such as one or more wired or wireless local area networks ("LANs," including Wi-Fi LANs, mesh networks, Bluetooth, near-field communication, etc.) or wide area networks ("WANs," including the internet).
The client device 110 may be, for example, one or more of: a desktop computer, a laptop computer, a tablet computer, a mobile phone, a computing device of a vehicle (e.g., an in-vehicle communication system, an in-vehicle entertainment system, an in-vehicle navigation system), a standalone interactive speaker (optionally having a display), a smart appliance such as a smart television, and/or a wearable apparatus of the user that includes a computing device (e.g., a watch of the user having a computing device, glasses of the user having a computing device, a virtual or augmented reality computing device). Additional and/or alternative client devices may be provided.
The client device 110 can execute an automated assistant client 114. An instance of the automated assistant client 114 can be an application that is separate from an operating system of the client device 110 (e.g., installed "on top" of the operating system), or can alternatively be implemented directly by the operating system of the client device 110. The automated assistant client 114 can interact with the natural dialog system 180 implemented locally at the client device 110 or remotely from the client device 110 via one or more of the networks 199 as depicted in fig. 1 (e.g., at remote server(s)). The automated assistant client 114 (and optionally by way of its interactions with the remote server(s)) may form what appears to be, from the user's perspective, a logical instance of an automated assistant 115 with which the user can engage in human-to-computer dialogs. An instance of the automated assistant 115 is depicted in fig. 1 and is encompassed by a dashed line that includes the automated assistant client 114 of the client device 110 and the natural dialog system 180. It should thus be understood that a user engaging with the automated assistant client 114 executing on the client device 110 may, in effect, engage with his or her own logical instance of the automated assistant 115 (or a logical instance of the automated assistant 115 that is shared among a household or other group of users). For the sake of brevity and simplicity, the automated assistant 115 as used herein will refer to the automated assistant client 114 implemented locally on the client device 110 and/or remotely from the client device 110 (e.g., at remote server(s) where an instance of the natural dialog system 180 may additionally or alternatively be implemented).
In various implementations, the client device 110 may include a user input engine 111 configured to detect user input provided by a user of the client device 110 using one or more user interface input devices. For example, the client device 110 may be equipped with one or more microphones that generate audio data, such as audio data that captures spoken utterances of the user of the client device 110 or other sounds in the environment of the client device 110. Additionally or alternatively, the client device 110 may be equipped with one or more visual components configured to generate visual data that captures images and/or movements (e.g., gestures) detected in the field of view of one or more of the visual components. Additionally or alternatively, the client device 110 may be equipped with one or more touch-sensitive components (e.g., a keyboard and mouse, a stylus, a touch screen, a touch panel, one or more hardware buttons, etc.) configured to generate one or more signals based on touch input directed to the client device 110.
In various implementations, the client device 110 may include a rendering engine 112 configured to provide content for audible and/or visual presentation to a user of the client device 110 using one or more user interface output devices. For example, the client device 110 may be equipped with one or more speakers such that content can be provided to a user of the client device 110 for audible presentation via the one or more speakers of the client device 110. Additionally or alternatively, the client device 110 may be equipped with a display or projector, such that content can be provided to a user of the client device for visual presentation via the display or projector of the client device 110. In other implementations, the client device 110 may communicate with one or more other computing devices (e.g., via one or more of the networks 199), and user interface input devices and/or user interface output devices of one or more of the other computing devices may be used to detect user input provided by a user of the client device 110 and/or to provide content for audible and/or visual presentation, respectively, to a user of the client device 110. Additionally or alternatively, the client device 110 may be equipped with one or more Light Emitting Diodes (LEDs) capable of emitting light in one or more colors to provide an indication that the automated assistant 115 is processing user input from a user of the client device 110, waiting for the user of the client device 110 to continue providing user input, and/or providing an indication that the automated assistant 115 is performing any other function.
In various implementations, the client device 110 may include one or more presence sensors 113 that are configured to provide, with approval from the corresponding user(s), signals indicative of detected presence, particularly human presence. In some of these implementations, the automated assistant 115 can identify the client device 110 (or another computing device associated with the user of the client device 110) to satisfy a spoken utterance based at least in part on presence of the user at the client device 110 (or at the other computing device associated with the user of the client device 110). The spoken utterance can be satisfied by rendering responsive content at the client device 110 and/or other computing device(s) associated with the user of the client device 110 (e.g., via the rendering engine 112), by causing the client device 110 and/or other computing device(s) associated with the user of the client device 110 to be controlled, and/or by causing the client device 110 and/or other computing device(s) associated with the user of the client device 110 to perform any other action to satisfy the spoken utterance. As described herein, the automated assistant 115 can leverage data determined based on the presence sensors 113 in determining the client device 110 (or other computing device(s)) based on where the user is near or was recently near, and provide corresponding commands to only the client device 110 (or those other computing device(s)). In some additional or alternative implementations, the automated assistant 115 can leverage data determined based on the presence sensors 113 in determining whether any user(s) (any users or specific users) are currently proximal to the client device 110 (or other computing device(s)), and can optionally refrain from providing data to and/or from the client device 110 (or other computing device(s)) based on the user(s) that are proximal to the client device 110 (or other computing device(s)).
The presence sensor 113 may be present in various forms. For example, the client device 110 can detect the presence of a user (e.g., microphone(s), visual component(s), and/or touch-sensitive component(s) described above) using one or more of the user interface input components described above with respect to the user input engine 111. Additionally or alternatively, the client device 110 may be equipped with other types of light-based presence sensors 113, such as passive infrared ("PIR") sensors that measure infrared ("IR") light radiated from objects within its field of view.
Additionally or alternatively, in some embodiments, the presence sensor 113 may be configured to detect other phenomena associated with human presence or device presence. For example, in some embodiments, client device 110 may be equipped with presence sensor 113 that detects various types of wireless signals (e.g., waves such as radio, ultrasonic, electromagnetic, etc.) emitted by other computing devices (e.g., mobile devices, wearable computing devices, etc.) and/or other computing devices that are carried/operated by, for example, a user. For example, the client device 110 may be configured to transmit human-imperceptible waves, such as ultrasonic or infrared waves, that may be detected by other computing device(s) (e.g., via an ultrasonic/infrared receiver such as a microphone supporting ultrasonic waves).
Additionally or alternatively, the client device 110 may emit other types of human-imperceptible waves, such as radio waves (e.g., wi-Fi, bluetooth, cellular, etc.), which may be detected by other computing device(s) (e.g., mobile devices, wearable computing devices, etc.) carried/operated by the user and used to determine a particular location of the user. In some implementations, GPS and/or Wi-Fi triangulation may be used to detect the location of a person, e.g., based on GPS and/or Wi-Fi signals to/from client device 110. In other implementations, the client device 110 may use other wireless signal characteristics, such as time of flight, signal strength, etc., alone or in combination, to determine the location of a particular person based on signals transmitted by other computing device(s) carried/operated by the user.
Additionally or alternatively, in some implementations, the client device 110 may perform speaker identification (SID) to recognize a user from their voice. In some implementations, movement of the speaker may then be determined, for example, by the presence sensors 113 of the client device 110 (and optionally GPS sensors, Soli chips, and/or accelerometers of the client device 110). In some implementations, based on such detected movement, a location of the user may be predicted, and this location may be assumed to be the location of the user when any content is caused to be rendered at the client device 110 and/or other computing device(s) based at least in part on proximity of the client device 110 and/or other computing device(s) to the location of the user. In some implementations, the user may simply be assumed to be in the last location at which he or she engaged with the automated assistant 115, especially if not much time has passed since that last engagement.
Further, the client device 110 and/or the natural dialog system 180 can include one or more memories for storage of data (e.g., software applications, one or more first-party (1P) agents 171, one or more third-party (3P) agents 172, etc.), one or more processors for accessing data and executing the software applications, and/or other components that facilitate communication over one or more of the networks 199, such as one or more network interfaces. In some implementations, one or more of the software applications, the 1P agents 171, and/or the 3P agents 172 can be installed locally at the client device 110, whereas in other implementations one or more of the software applications, the 1P agents 171, and/or the 3P agents 172 can be hosted remotely (e.g., by one or more servers) and can be accessible by the client device 110 over one or more of the networks 199. The operations performed by the client device 110, other computing device(s), and/or by the automated assistant 115 may be distributed across multiple computer systems. The automated assistant 115 may be implemented as, for example, computer programs running on the client device 110 and/or on one or more computers in one or more locations that are coupled to each other through a network (e.g., one or more of the networks 199 of fig. 1).
In some implementations, the operations performed by the automated assistant 115 may be implemented locally at the client device 110 via the automated assistant client 114. As shown in fig. 1, the automated assistant client 114 may include an Automatic Speech Recognition (ASR) engine 120A1, a Natural Language Understanding (NLU) engine 130A1, a fulfillment engine 140A1, and a text-to-speech (TTS) engine 150A1. In some implementations, the operations performed by the automated assistant 115 may be distributed across multiple computer systems, such as when the natural dialog system 180 is implemented remotely from the client device 110 as depicted in fig. 1. In these implementations, in which the natural dialog system 180 is implemented remotely from the client device 110 (e.g., at remote server(s)), the automated assistant 115 may additionally or alternatively utilize an ASR engine 120A2, an NLU engine 130A2, a fulfillment engine 140A2, and a TTS engine 150A2 of the natural dialog system 180.
As described in more detail with respect to fig. 2, each of these engines may be configured to perform one or more functions. For example, the ASR engine 120A1 and/or 120A2 can process, using streaming ASR model(s) stored in machine learning (ML) model database(s) 115A (e.g., a recurrent neural network (RNN) model, a transducer model, and/or any other type of ML model capable of performing ASR), an audio data stream that captures at least a portion of a spoken utterance and that is generated by microphone(s) of the client device 110 to generate an ASR output stream. Notably, the streaming ASR model can be used to generate the ASR output stream as the audio data stream is generated. Further, the NLU engine 130A1 and/or 130A2 can process the ASR output stream, using NLU model(s) stored in the ML model database(s) 115A (e.g., a long short-term memory (LSTM), gated recurrent unit (GRU), and/or any other type of RNN or other ML model capable of performing NLU) and/or grammar-based NLU rule(s), to generate an NLU output stream. Moreover, the fulfillment engine 140A1 and/or 140A2 can generate a set of fulfillment outputs based on a fulfillment data stream that is generated based on the NLU output stream. The fulfillment data stream can be generated using, for example, one or more of the software applications, the 1P agents 171, and/or the 3P agents 172. Lastly, the TTS engine 150A1 and/or 150A2 can process, using TTS model(s) stored in the ML model database(s) 115A, textual data (e.g., text formulated by the automated assistant 115) to generate synthesized speech audio data that includes computer-generated synthesized speech corresponding to the textual data. Notably, the ML model(s) stored in the ML model database(s) 115A can be on-device ML models stored locally at the client device 110, or shared ML models that are accessible to both the client device 110 and/or other systems (e.g., in implementations where the natural dialog system 180 is implemented by remote server(s)).
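A sketch of the streaming aspect described above, in which partial hypotheses are emitted while audio chunks are still arriving, is shown below; the generator functions and the hard-coded partial hypotheses are placeholders rather than a real streaming ASR model:

```python
from typing import Iterator


def audio_stream() -> Iterator[bytes]:
    """Stand-in for microphone chunks that make up the audio data stream."""
    for _ in range(3):
        yield b"\x00" * 320


def streaming_asr(chunks: Iterator[bytes]) -> Iterator[str]:
    """Stand-in streaming ASR: emit a (fake) partial hypothesis per chunk, so the
    ASR output stream is produced while the audio data stream is still arriving."""
    partials = ["call", "call arnold", "call arnold's"]
    for _chunk, partial in zip(chunks, partials):
        yield partial


for hypothesis in streaming_asr(audio_stream()):
    print(hypothesis)   # partial hypotheses are refined as more audio arrives
```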
In various implementations, the ASR output stream can include, for example, a stream of speech hypotheses (e.g., term hypotheses and/or transcription hypotheses) predicted to correspond to spoken utterance(s) of the user of the client device 110 captured in the audio data stream, one or more corresponding predicted values (e.g., probabilities, log-likelihoods, and/or other values) for each speech hypothesis, a plurality of phonemes predicted to correspond to spoken utterance(s) of the user of the client device 110 captured in the audio data stream and/or other ASR output. In some versions of these implementations, the ASR engines 120A1 and/or 120A2 can select one or more of the speech hypotheses as the recognized text corresponding to the spoken utterance (e.g., based on the corresponding predicted value).
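As a small illustration of selecting recognized text from speech hypotheses based on their corresponding predicted values, the following uses a hypothetical data structure and made-up scores:

```python
from dataclasses import dataclass


@dataclass
class SpeechHypothesis:
    text: str
    log_likelihood: float   # corresponding predicted value


def select_recognized_text(hypotheses):
    """Pick the hypothesis with the best predicted value as the recognized text."""
    return max(hypotheses, key=lambda h: h.log_likelihood).text


stream = [SpeechHypothesis("call arnold", -1.2),
          SpeechHypothesis("call arnold's", -0.4),
          SpeechHypothesis("car nolds", -3.1)]
print(select_recognized_text(stream))   # call arnold's
```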
In various embodiments, the NLU output stream can include, for example: a stream of annotated recognized text that includes one or more annotations of the recognized text for one or more (e.g., all) terms included in the ASR output stream; one or more predicted intents determined based on the recognized text for one or more (e.g., all) terms included in the ASR output stream; predicted and/or inferred slot value(s) for corresponding parameter(s) associated with each of the one or more predicted intents determined based on the recognized text for one or more (e.g., all) terms included in the ASR output stream; and/or other NLU output. For example, the NLU engine 130A1 and/or 130A2 may include a part-of-speech tagger (not depicted) configured to annotate terms with their grammatical roles. Additionally or alternatively, the NLU engine 130A1 and/or 130A2 may include an entity annotator (not depicted) configured to annotate entity references in one or more segments of the recognized text, such as references to people (including, for instance, literary characters, celebrities, public figures, etc.), organizations, locations (real and imaginary), and so forth. In some implementations, data about entities can be stored in one or more databases, such as in a knowledge graph (not depicted). In some implementations, the knowledge graph can include nodes that represent known entities (and, in some cases, entity attributes), as well as edges that connect the nodes and represent relationships between the entities. The entity annotator can annotate references to an entity at a high level of granularity (e.g., to enable identification of all references to an entity class such as people) and/or at a lower level of granularity (e.g., to enable identification of all references to a particular entity such as a particular person). The entity annotator can rely on content of the natural language input to resolve a particular entity and/or can optionally communicate with a knowledge graph or other entity database to resolve a particular entity.
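A toy stand-in for the entity-annotation step, in which entity references in recognized text are matched against nodes of a (here, dictionary-based) knowledge graph, is shown below; the entities and attributes are illustrative assumptions only:

```python
# A tiny in-memory stand-in for a knowledge graph: nodes are entities with attributes.
knowledge_graph = {
    "arnold's restaurant": {"type": "restaurant", "cuisine": "italian"},
    "arnold": {"type": "person"},
}


def annotate_entities(text: str) -> dict:
    """Toy entity annotator: tag any span that matches a knowledge-graph node."""
    annotations = {}
    for entity, attributes in knowledge_graph.items():
        if entity in text.lower():
            annotations[entity] = attributes["type"]
    return annotations


print(annotate_entities("Call Arnold's Restaurant"))
# tags both the restaurant entity and the person entity "arnold"
```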
Additionally or alternatively, the NLU engine 130A1 and/or 130A2 may include a coreference resolver (not depicted) configured to group, or "cluster," references to the same entity based on one or more contextual cues. For example, the coreference resolver may be utilized to resolve the term "them" in the natural language input "buy them" to "buy theatre tickets" based on "theatre tickets" being mentioned in a client device notification rendered immediately prior to receiving the input "buy them". In some implementations, one or more components of the NLU engine 130A1 and/or 130A2 may rely on annotations from one or more other components of the NLU engine 130A1 and/or 130A2. For example, in some implementations, the entity annotator may rely on annotations from the coreference resolver in annotating all mentions of a particular entity. Also, for example, in some embodiments, the coreference resolver may rely on annotations from the entity annotator in clustering references to the same entity.
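A minimal sketch of the coreference idea from the example above, assuming the most recently surfaced entity (e.g., from a rendered notification) is available as context; the names and logic are hypothetical:

```python
def resolve_coreference(utterance: str, recent_entities: list) -> str:
    """Toy coreference resolver: replace a pronoun with the most recently
    mentioned entity (e.g. from a notification rendered just before the input)."""
    pronouns = {"them", "it", "those"}
    words = [recent_entities[-1] if w in pronouns and recent_entities else w
             for w in utterance.lower().split()]
    return " ".join(words)


print(resolve_coreference("buy them", ["theatre tickets"]))   # buy theatre tickets
```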
In various embodiments, the fulfillment data stream can include one or more fulfillment outputs generated by one or more of the software applications, the 1P agent 171, and/or the 3P agent 172, for example. One or more structured requests generated based on the NLU output stream can be transmitted to one or more of the software application, 1P agent 171, and/or 3P agent 172, and in response to receiving one or more of the structured requests, one or more of the software application, 1P agent 171, and/or 3P agent 172 can transmit a fulfillment output predicted to satisfy the spoken utterance. Fulfillment engines 140A1 and/or 140A2 can include the fulfillment output received at client device 110 in a set of fulfillment outputs corresponding to the fulfillment data streams. Notably, the fulfillment data stream can be generated when a user of the client device 110 provides a spoken utterance. Further, the fulfillment output engine 164 can select one or more fulfillment outputs from the fulfillment output stream, and the selected one or more of the fulfillment outputs can be provided for presentation to the user of the client device 110 to satisfy the spoken utterance. The one or more fulfillment outputs can include, for example: audible content predicted to be responsive to the spoken utterance and capable of being audibly rendered for presentation to a user of the client device 110 via the speaker(s); visual content predicted to be responsive to the spoken utterance and capable of being visually rendered for presentation to a user of the client device 110 via a display; and/or an assistant command that, when executed, causes the client device 110 and/or other computing devices in communication with the client device 110 (e.g., through one or more of the networks 199) to be controlled in response to the spoken utterance.
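As an illustration of routing structured requests to an agent and folding the returned fulfillment outputs into a fulfillment data stream, the following sketch uses a hypothetical agent, request shape, and command string:

```python
def first_party_call_agent(request: dict) -> dict:
    """Stand-in 1P agent: return a fulfillment output for a structured request."""
    return {"type": "assistant_command",
            "command": f"dial:{request['slots']['callee']}"}


def build_fulfillment_stream(nlu_stream):
    """Send a structured request per NLU output to an agent and yield the
    fulfillment outputs that make up the fulfillment data stream."""
    for nlu_output in nlu_stream:
        request = {"intent": nlu_output["intent"], "slots": nlu_output["slots"]}
        yield first_party_call_agent(request)


nlu_stream = [{"intent": "phone_call", "slots": {"callee": "Arnold's restaurant"}}]
print(list(build_fulfillment_stream(nlu_stream)))
```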
Although fig. 1 is described with respect to a single client device having a single user, it should be understood that this is for purposes of illustration and is not meant to be limiting. For example, one or more additional client devices of the user may also be capable of implementing the techniques described herein. For example, client device 110, one or more additional client devices, and/or any other computing device of a user can form an ecosystem of devices that can employ the techniques described herein. These additional client devices and/or computing devices may communicate with client device 110 (e.g., through one or more of networks 199). As another example, a given client device 110 can be used by multiple users in a shared setting (e.g., a group of users, a family, a shared living space, etc.).
As described herein, the automated assistant 115 can determine whether to cause natural dialog output to be provided to the user for presentation in response to determining that the user has paused providing a spoken utterance and/or determine when to fulfill the spoken utterance. In making these determinations, the automated assistant can utilize the natural dialog engine 160. In various implementations, and as depicted in fig. 1, the natural dialog engine 160 can include an acoustic engine 161, a pause engine 162, a natural dialog output engine 163, a fulfillment output engine 164, and a time engine 165.
In some implementations, the acoustic engine 161 can determine the audio-based characteristics based on processing the audio data stream. In some versions of those implementations, the acoustic engine 161 can process the audio data stream using an audio-based ML model stored in the ML model database(s) 115A to determine the audio-based characteristics. In additional or alternative implementations, the acoustic engine 161 can process the audio data stream using one or more rules to determine the audio-based characteristics. The audio-based characteristics can include, for example, prosodic properties associated with the spoken utterance(s) captured in the audio data stream and/or other audio-based characteristics. The prosodic properties can include, for example, one or more properties of syllables and larger units of speech, including linguistic functions such as intonation, tone, stress, rhythm, tempo, pitch, elongated syllables, pauses, grammar(s) associated with pauses, and/or other audio-based characteristics that may be derived from processing the audio data stream. Moreover, the prosodic properties can provide an indication of, for example: an emotional state; a form (e.g., statement, question, or command); irony; sarcasm; speech rhythm; and/or emphasis. In other words, the prosodic properties are features of speech that are independent of the individual voice characteristics of a given user and can be determined dynamically during the dialog session based on individual spoken utterances and/or a combination of multiple spoken utterances.
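Purely for illustration, the following is a crude, energy-based stand-in for deriving one audio-based characteristic (pause durations) from raw audio samples; an actual implementation would rely on the audio-based ML model or rules referenced above, and the frame length and energy floor here are arbitrary:

```python
import array


def pause_durations(samples, frame_len=160, energy_floor=50.0, rate=16000):
    """Toy acoustic analysis: treat runs of low-energy frames as pauses and report
    their lengths in seconds."""
    pauses, run = [], 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy < energy_floor:
            run += 1                 # still inside a pause
        elif run:
            pauses.append(run)       # pause ended, record its length
            run = 0
    if run:
        pauses.append(run)
    return [r * frame_len / rate for r in pauses]


speech = array.array("h", [1000] * 16000)   # one second of (fake) speech samples
silence = array.array("h", [0] * 16000)     # one second of silence
print(pause_durations(speech + silence))    # [1.0]
```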
In some implementations, the pause engine 162 can determine whether the user of the client device 110 has paused providing the spoken utterance captured in the audio data stream or has completed providing the spoken utterance. In some versions of those implementations, the pause engine 162 can determine that the user of the client device 110 has paused providing the spoken utterance based on processing the audio-based characteristics determined using the acoustic engine 161. For example, the pause engine 162 can process the audio-based characteristics, using an audio-based classification ML model stored in the ML model database(s) 115A, to generate output, and can determine, based on the output generated using the audio-based classification ML model, whether the user of the client device 110 has paused providing the spoken utterance or has completed providing the spoken utterance. The output can include, for example, one or more predicted metrics (e.g., binary values, log likelihoods, probabilities, etc.) that are indicative of whether the user of the client device 110 has paused providing the spoken utterance or has completed providing the spoken utterance. For example, assume that the user of the client device 110 provides a spoken utterance of "Call Arnolllld's", where the "llll" indicates an elongated syllable included in the spoken utterance. In this example, the audio-based characteristics can include an indication that the spoken utterance includes an elongated syllable and, as a result, the output generated using the audio-based classification ML model can indicate that the user has not completed providing the spoken utterance.
In additional or alternative versions of those implementations, the pause engine 162 can determine that the user of the client device 110 has paused providing the spoken utterance based on the NLU data stream generated using the NLU engine 130A1 and/or 130A2. For example, the pause engine 162 can determine, based on the predicted intent(s) and/or the predicted and/or inferred slot value(s) for corresponding parameter(s) associated with the predicted intent(s), whether the user of the client device 110 has paused providing the spoken utterance or has completed providing the spoken utterance. For instance, assume that the user of the client device 110 provides a spoken utterance of "Call Arnolllld's", where the "llll" indicates an elongated syllable included in the spoken utterance. In this example, the NLU data stream can include a predicted "call" intent and a slot value of "Arnold" for an entity parameter. However, in this example, even though the automated assistant 115 may have access to a contact entry associated with the entity "Arnold" (such that the spoken utterance could be fulfilled), the automated assistant 115 may refrain from initiating a call with the entity "Arnold" based on the elongated syllable included in the audio-based characteristics determined based on processing the spoken utterance. In contrast, in this example, had the user provided "Arnold's" without the elongated syllable, and/or had the user provided an explicit command to cause the automated assistant 115 to initiate fulfillment of the spoken utterance (e.g., "call Arnold now", "call Arnold immediately", etc.), the pause engine 162 may have determined that the user of the client device 110 had completed providing the spoken utterance.
In some implementations, the natural dialog output engine 163 can determine natural dialog output to be provided to the user of the client device for presentation in response to determining that the user has paused providing the spoken utterance. In some versions of those implementations, the natural dialog output engine 163 can determine a set of natural dialog outputs and can select, based on NLU metrics associated with the NLU data stream and/or the audio-based characteristics, one or more of the natural dialog outputs from the set of natural dialog outputs (e.g., randomly, or by cycling through the set of natural dialog outputs) to be provided to the user for presentation (e.g., audible presentation via one or more speakers of the client device 110). In some further versions of those implementations, a superset of natural dialog outputs can be stored in one or more databases (not depicted) accessible by the client device 110 (e.g., as textual data that is converted to synthesized speech audio data (e.g., using the TTS engine 150A1 and/or 150A2) and/or as synthesized speech audio data), and the set of natural dialog outputs can be generated from the superset of natural dialog outputs based on the NLU metrics associated with the NLU data stream and/or the audio-based characteristics.
These natural dialog outputs can be utilized to advance the dialog session without necessarily causing the spoken utterance to be fulfilled. For example, the natural dialog output can include an indication that requests that the user confirm that he or she wishes to continue interacting with the automated assistant 115 (e.g., "Are you still there?"). In various embodiments, the natural dialog output engine 163 can utilize one or more language models stored in the ML model database(s) 115A to generate the set of natural dialog outputs. In other implementations, the natural dialog output engine 163 can obtain the set of natural dialog outputs from a remote system (e.g., remote server(s)) and store the set of natural dialog outputs in on-device memory of the client device 110.
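A toy sketch of selecting a natural dialog output from such a set without repeating the same phrase back-to-back is shown below; the phrases and their grouping are illustrative assumptions:

```python
import random

NATURAL_DIALOG_OUTPUTS = {
    "backchannel": ["Mmhmm", "Uh huhh"],
    "prompt_completion": ["What were you saying?", "What did I miss?"],
    "confirm_presence": ["Are you still there?"],
}


def select_natural_dialog_output(kind: str, used: set) -> str:
    """Pick a natural dialog output of the requested kind, avoiding phrases that
    were already used in this dialog session when an alternative exists."""
    candidates = ([o for o in NATURAL_DIALOG_OUTPUTS[kind] if o not in used]
                  or NATURAL_DIALOG_OUTPUTS[kind])
    choice = random.choice(candidates)
    used.add(choice)
    return choice


used_outputs = set()
print(select_natural_dialog_output("backchannel", used_outputs))
print(select_natural_dialog_output("backchannel", used_outputs))   # the other phrase
```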
In some implementations, the fulfillment output engine 164 can select one or more fulfillment outputs from the fulfillment output stream to provide to the user of the client device for presentation in response to determining that the user has completed providing the spoken utterance, or in response to determining that the user has not completed providing the spoken utterance, but should still fulfill the spoken utterance (e.g., as described with respect to fig. 5C). Although 1P agent 171 and 3P agent 172 are depicted in FIG. 1 as being implemented through one or more of networks 199, it should be understood that this is for purposes of illustration and is not meant to be limiting. For example, one or more of 1P agents 171 and/or 3P agents 172 can be implemented locally at client device 110, and NLU output streams can be transmitted to one or more of 1P agents 171 and/or 3P agents 172 via an Application Programming Interface (API), and fulfillment outputs from one or more of 1P agents 171 and/or 3P agents 172 can be obtained via the API and incorporated into the fulfillment data stream. Additionally or alternatively, one or more of 1P agents 171 and/or 3P agents 172 can be implemented remotely from client device 110 (e.g., at 1P server(s) and/or 3P server(s), respectively), and NLU output streams can be transmitted to one or more of 1P agents 171 and/or 3P agents 172 via one or more of networks 199, and fulfillment outputs from one or more of 1P agents 171 and/or 3P agents 172 can be obtained via one or more of networks 199 and incorporated into the fulfillment data streams.
For example, fulfillment output engine 164 can select one or more fulfillment outputs from the fulfillment data stream based on NLU metrics associated with the NLU data stream and/or fulfillment metrics associated with the fulfillment data stream. The NLU metrics can be, for example, probabilities, log likelihoods, binary values, etc., that indicate how confident NLU engines 130A1 and/or 130A2 are that the predicted intent(s) correspond to the actual intent of the user in providing the spoken utterance(s) captured in the audio data stream, and/or how confident they are that the inferred and/or predicted slot value(s) for the parameter(s) associated with the predicted intent(s) correspond to the actual slot value(s) for those parameter(s). The NLU metrics can be generated when NLU engines 130A1 and/or 130A2 generate the NLU output stream and can be included in the NLU output stream. The fulfillment metrics can be, for example, probabilities, log-likelihoods, binary values, etc., that indicate how confident fulfillment engines 140A1 and/or 140A2 are that the fulfillment output(s) correspond to the fulfillment intended by the user. The fulfillment metrics can be generated when one or more of the software applications, 1P agents 171, and/or 3P agents 172 generate the fulfillment output and can be incorporated into the fulfillment data stream, and/or can be generated when the fulfillment engines 140A1 and/or 140A2 process the fulfillment data received from one or more of the software applications, 1P agents 171, and/or 3P agents 172 and can be incorporated into the fulfillment data stream.
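As a non-limiting illustration, the following sketch shows one way NLU metrics and fulfillment metrics might accompany fulfillment candidates in the fulfillment data stream, and how one or more fulfillment outputs could be selected based on those metrics. The field names, the combined score, and the 0.6 floor are hypothetical assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FulfillmentCandidate:
    description: str           # e.g., "initiate call to contact 'Arnold'"
    nlu_metric: float          # confidence that the parse matches the user's actual intent
    fulfillment_metric: float  # confidence that this output satisfies the request


def select_fulfillment_outputs(
    candidates: list[FulfillmentCandidate],
    min_metric: float = 0.6,
    top_k: int = 1,
) -> list[FulfillmentCandidate]:
    """Keep candidates whose metrics clear a floor, ranked by a combined score."""
    kept = [c for c in candidates if min(c.nlu_metric, c.fulfillment_metric) >= min_metric]
    kept.sort(key=lambda c: c.nlu_metric * c.fulfillment_metric, reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    stream = [
        FulfillmentCandidate("call contact 'Arnold'", nlu_metric=0.55, fulfillment_metric=0.9),
        FulfillmentCandidate("call 'Arnold's Restaurant'", nlu_metric=0.85, fulfillment_metric=0.9),
    ]
    print([c.description for c in select_fulfillment_outputs(stream)])
    # -> ["call 'Arnold's Restaurant'"]
```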
In some implementations, and in response to determining that the user has paused providing the spoken utterance, the time engine 165 can determine a duration of the pause in providing the spoken utterance and/or a duration of any subsequent pauses. The automated assistant 115 can cause the natural conversation output engine 163 to utilize one or more of these pause durations in selecting the natural conversation output provided to the user of the client device 110 for presentation. For example, assume that the user of client device 110 provides a spoken utterance of "call Arnolllld's", where "llll" indicates an elongated syllable included in the spoken utterance. Further assume that it is determined that the user has paused providing the spoken utterance. In some implementations, a natural dialog output (e.g., an audibly rendered "mmhmm", etc.) may be provided for presentation to the user in response to determining that the user of the client device 110 has paused providing the spoken utterance. However, in other embodiments, the natural dialog output may be provided for presentation to the user in response to the time engine 165 determining that a threshold duration of time has elapsed since the user first paused. Further assume that the user of client device 110 does not continue providing the spoken utterance in response to the natural dialog output being provided for presentation. In this example, an additional natural dialog output may be provided for presentation to the user in response to the time engine 165 determining that an additional threshold duration has elapsed since the user first paused (or since the natural dialog output was provided to the user for presentation). Thus, in providing the additional natural dialog output for presentation to the user, the natural dialog output engine 163 can select a different natural dialog output that requests the user of the client device 110 to complete the spoken utterance (e.g., "sorry, what was that?", "what did I miss?", etc.) or requests the user of the client device 110 to provide particular slot value(s) for the predicted intent (e.g., "who did you want to call?", "how many people is the reservation for?", etc.).
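As a non-limiting illustration, the following sketch shows how the pause durations tracked by a time engine could gate whether any natural conversation output is provided and how a later, more explicit prompt could be selected if the user still has not resumed. The threshold values and example phrases are hypothetical assumptions for illustration.

```python
def output_for_pause_durations(
    seconds_since_pause_started: float,
    seconds_since_last_output: float | None,
    first_threshold_s: float = 2.0,
    additional_threshold_s: float = 3.0,
) -> str | None:
    """Return the next natural conversation output to render, or None to stay silent."""
    if seconds_since_last_output is None:
        # No natural conversation output has been given yet for this pause.
        if seconds_since_pause_started >= first_threshold_s:
            return "Mmhmm"                        # gentle "still listening" cue
        return None                               # keep waiting silently
    if seconds_since_last_output >= additional_threshold_s:
        return "Who did you want to call?"        # explicitly prompt for the missing slot value
    return None


if __name__ == "__main__":
    print(output_for_pause_durations(1.0, None))  # None: pause is still too short
    print(output_for_pause_durations(2.5, None))  # "Mmhmm"
    print(output_for_pause_durations(6.0, 3.5))   # "Who did you want to call?"
```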
In various implementations, and while the automated assistant 115 is waiting for the user of the client device 110 to complete the spoken utterance, the automated assistant 115 can optionally cause one or more fulfillment outputs in the set of fulfillment outputs to be partially fulfilled. For example, based on one or more of the fulfillment outputs included in the set of fulfillment outputs, the automated assistant 115 can establish a connection with one or more of the software applications, the 1P agent 171, the 3P agent 172, and/or an additional computing device in communication with the client device 110 (e.g., via one or more of the networks 199), such as another client device associated with the user of the client device 110, a smart networked device, etc.; can cause synthesized speech audio data including synthesized speech to be generated (but not yet audibly rendered); can cause graphical content to be generated (but not yet visually rendered); and/or can perform any other partial fulfillment of one or more of the fulfillment outputs. Accordingly, the latency in causing the fulfillment output to be provided for presentation to the user of client device 110 can be reduced.
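As a non-limiting illustration, the following sketch shows partial fulfillment while waiting for the user to finish: a connection is warmed up and synthesized speech is pre-generated, but nothing user-visible is rendered until the utterance is complete. The PartialFulfillment structure and the stubbed helpers are hypothetical assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PartialFulfillment:
    connection_ready: bool = False
    synthesized_audio: bytes | None = None
    rendered: bool = False


def open_connection(endpoint: str) -> bool:
    # Placeholder: e.g., warm up an RPC channel to a 1P/3P agent or smart device.
    return True


def synthesize_speech(text: str) -> bytes:
    # Placeholder for a TTS call; the audio is generated but not played yet.
    return text.encode("utf-8")


def prepare_partial_fulfillment(agent_endpoint: str, likely_response_text: str) -> PartialFulfillment:
    """Perform the expensive steps now, without rendering anything to the user."""
    partial = PartialFulfillment()
    partial.connection_ready = open_connection(agent_endpoint)
    partial.synthesized_audio = synthesize_speech(likely_response_text)
    return partial


def commit(partial: PartialFulfillment, audio_sink: Callable[[bytes], None]) -> None:
    """Called only once the utterance is complete; latency is lower because the
    connection and the synthesized speech already exist."""
    if partial.synthesized_audio is not None:
        audio_sink(partial.synthesized_audio)
    partial.rendered = True


if __name__ == "__main__":
    partial = prepare_partial_fulfillment("agent://reservations", "Okay, booking a table.")
    commit(partial, audio_sink=lambda audio: print(f"rendering {len(audio)} bytes of audio"))
```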
Turning now to fig. 2, an example process flow is depicted that uses the various components of fig. 1 to demonstrate aspects of the present disclosure. The ASR engines 120A1 and/or 120A2 can process the audio data stream 201A using the streaming ASR model stored in the ML model database(s) 115A to generate an ASR output stream 220. NLU engines 130A1 and/or 130A2 can process the ASR output stream 220 using the NLU model stored in the ML model database(s) 115A to generate an NLU output stream 230. In some implementations, NLU engines 130A1 and/or 130A2 can additionally or alternatively process non-audio data stream 201B when generating the NLU output stream 230. The non-audio data stream 201B can include visual data streams generated by visual component(s) of the client device 110, touch input streams provided by a user via touch-sensitive component(s) of the client device 110, typed input streams provided by a user via touch-sensitive component(s) or peripheral devices (e.g., a mouse and keyboard) of the client device 110, and/or any other non-audio data generated by any other user interface input device of the client device 110. In some implementations, 1P agent(s) 171 can process the NLU output stream 230 to generate 1P fulfillment data 240A. In additional or alternative embodiments, 3P agent(s) 172 can process the NLU output stream 230 to generate 3P fulfillment data 240B. Fulfillment engines 140A1 and/or 140A2 can generate a fulfillment data stream 240 based on the 1P fulfillment data 240A and/or the 3P fulfillment data 240B (and optionally other fulfillment data generated based on one or more software applications accessible at the client device 110 processing the NLU output stream 230). Further, the acoustic engine 161 can process the audio data stream 201A to generate audio-based characteristics 261 associated with the audio data stream 201A, such as audio-based characteristics 261 of one or more spoken utterances (or portions thereof) included in the audio data stream 201A.
Pause engine 162 can process NLU output stream 230 and/or audio-based characteristics 261 to determine whether the user of the client device has paused providing the spoken utterance captured in audio data stream 201A or has completed providing the spoken utterance captured in audio data stream 201A, as shown at block 262. The automated assistant 115 can determine whether to provide a natural dialog output or a fulfillment output based on the block 262 indicating whether the user has paused providing the spoken utterance or has completed providing the spoken utterance. For example, assume that the automated assistant 115 determines, based on the indication at block 262, that the user has paused providing the spoken utterance. In this example, the automated assistant 115 can cause the natural conversation output engine 163 to select the natural conversation output 263, and the automated assistant 115 can cause the natural conversation output 263 to be provided for presentation to the user of the client device 110. In contrast, assume that the automated assistant 115 determines, based on the indication at block 262, that the user has completed providing the spoken utterance. In this example, the automated assistant 115 can cause the fulfillment output engine 164 to select one or more fulfillment outputs 264, and the automated assistant 115 can cause the one or more fulfillment outputs 264 to be provided for presentation to the user of the client device 110. In some implementations, the automated assistant 115 can consider the duration of the one or more pauses 265 determined by the time engine 165 to determine whether to cause the natural conversation output 263 to be provided for presentation to the user of the client device 110 or to cause the one or more fulfillment outputs 264 to be provided for presentation to the user of the client device 110. In these embodiments, the natural conversation output 263 and/or the one or more fulfillment outputs 264 can be adapted based on the duration of the one or more pauses. Although specific functionalities and embodiments are described with respect to fig. 1 and 2, it should be understood that this is for purposes of illustration and is not meant to be limiting. For example, additional functionality and embodiments are described below with respect to fig. 3, 4, 5A-5E, and 6.
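As a non-limiting illustration, the following sketch mirrors the data flow of fig. 2 for a single audio chunk: the ASR, NLU, fulfillment, and acoustic streams are updated, and the pause decision (block 262) routes to either a natural conversation output (263) or the fulfillment output(s) (264). The engine callables and the toy stand-ins in the example are hypothetical.

```python
from typing import Any, Callable


def process_chunk(
    audio_chunk: bytes,
    asr: Callable[[bytes], str],
    nlu: Callable[[str], dict],
    fulfill: Callable[[dict], list],
    acoustic: Callable[[bytes], dict],
    classify_pause: Callable[[dict, dict], str],
    natural_output: Callable[[], str],
) -> tuple[str, Any]:
    asr_out = asr(audio_chunk)               # ASR output stream 220
    nlu_out = nlu(asr_out)                   # NLU output stream 230
    fulfillment_out = fulfill(nlu_out)       # fulfillment data stream 240
    audio_features = acoustic(audio_chunk)   # audio-based characteristics 261

    status = classify_pause(nlu_out, audio_features)  # pause-or-complete decision (block 262)
    if status == "paused":
        return "natural_conversation_output", natural_output()   # 263
    if status == "complete":
        return "fulfillment_output", fulfillment_out              # 264
    return "keep_listening", None


if __name__ == "__main__":
    # Toy stand-ins just to show the control flow end to end.
    result = process_chunk(
        b"...",
        asr=lambda audio: "call Arnolllld's",
        nlu=lambda text: {"intent": "call", "callee": "Arnold"},
        fulfill=lambda parse: ["initiate call to contact 'Arnold'"],
        acoustic=lambda audio: {"elongated_syllable": True},
        classify_pause=lambda parse, feats: "paused" if feats["elongated_syllable"] else "complete",
        natural_output=lambda: "Mmhmm",
    )
    print(result)  # -> ('natural_conversation_output', 'Mmhmm')
```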
Turning now to fig. 3, a flow chart illustrating an example method 300 of determining whether to have a natural dialog output provided for presentation to a user in response to determining that the user pauses providing a spoken utterance and/or determining when to fulfill the spoken utterance is depicted. For convenience, the operations of method 300 are described with reference to a system performing the operations. The system of method 300 includes one or more processors, memory, and/or other component(s) of a computing device(s) (e.g., client device 110 of fig. 1 and 5A-5E, computing device 610 of fig. 6, one or more servers, and/or other computing devices). Moreover, although the operations of method 300 are shown in a particular order, this is not intended to be limiting. One or more operations may be reordered, omitted, and/or added.
At block 352, the system processes an audio data stream, which includes a portion of a spoken utterance of the user and is directed to the automated assistant, using the streaming ASR model to generate an ASR output stream. The audio data stream can be generated by microphone(s) of the user's client device and during a dialog session with an automated assistant implemented at least in part at the client device. In some implementations, the system can process the audio data stream in response to determining that the user has invoked the automated assistant via one or more particular words and/or phrases (e.g., hotwords, such as "hey assistant", "assistant", etc.), actuation of one or more buttons (e.g., software and/or hardware buttons), one or more gestures captured by visual component(s) of the client device (which, when detected, invoke the automated assistant), and/or by any other means. At block 354, the system processes the ASR output stream using the NLU model to generate an NLU output stream. At block 356, the system causes a fulfillment data stream to be generated based on the NLU output stream. At block 358, the system determines audio-based characteristics associated with the portion of the spoken utterance captured in the audio data based on processing the audio data stream. The audio-based characteristics can include, for example, one or more prosodic attributes associated with the portion of the spoken utterance (such as intonation, tone, stress, rhythm, tempo, pitch, pause, and/or other prosodic attributes) and/or other audio-based characteristics that can be determined based on processing the audio data stream. The operations of blocks 352-358 are described in more detail herein (e.g., with respect to fig. 1 and 2).
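As a non-limiting illustration, the following sketch derives two simple audio-based characteristics from a chunk of audio samples: the length of trailing silence and the length of the final voiced stretch, with a crude heuristic flag for a possibly elongated syllable. The frame size, energy floor, and 0.6-second heuristic are hypothetical assumptions for illustration, not values from the disclosure.

```python
import numpy as np


def audio_characteristics(samples: np.ndarray, sample_rate: int,
                          frame_ms: int = 20, energy_floor: float = 1e-3) -> dict:
    """Compute trailing-silence length and the length of the final voiced stretch."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))       # per-frame energy
    voiced = rms > energy_floor

    # Trailing silence: consecutive unvoiced frames at the end of the chunk.
    trailing_silence = 0
    for is_voiced in voiced[::-1]:
        if is_voiced:
            break
        trailing_silence += 1

    # Final voiced run: how long the last voiced stretch lasted before the silence.
    final_voiced_run = 0
    for is_voiced in voiced[: len(voiced) - trailing_silence][::-1]:
        if not is_voiced:
            break
        final_voiced_run += 1

    return {
        "trailing_silence_s": trailing_silence * frame_ms / 1000,
        "final_voiced_run_s": final_voiced_run * frame_ms / 1000,
        # A very long final voiced stretch may indicate an elongated syllable.
        "possible_elongated_syllable": final_voiced_run * frame_ms / 1000 > 0.6,
    }


if __name__ == "__main__":
    sr = 16000
    # One second of (fake) voiced audio followed by half a second of silence.
    chunk = np.concatenate([0.1 * np.ones(sr), np.zeros(sr // 2)])
    print(audio_characteristics(chunk, sr))
    # -> trailing_silence_s=0.5, final_voiced_run_s=1.0, possible_elongated_syllable=True
```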
At block 360, the system determines whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on the NLU output stream and/or audio-based characteristics associated with portions of the spoken utterance captured in the audio data. In some implementations, the system can process the audio-based characteristics using the audio-based classification ML model to generate an output, and the system can determine whether the user has paused providing the spoken utterance or completed providing the spoken utterance based on the output generated using the audio-based classification ML model. The output generated using the audio-based classification ML model can include one or more predictive metrics (e.g., binary values, probabilities, log-likelihoods, and/or other metrics) that indicate whether the user has paused providing the spoken utterance or has completed providing the spoken utterance. For example, assume that the output includes a first probability of 0.8 associated with a user having paused providing a prediction of a spoken utterance and a second probability of 0.6 associated with a user having completed providing a prediction of a spoken utterance. In this example, the system can determine that the user has paused providing the spoken utterance based on the predictive metric. In additional or alternative implementations, the system can process or analyze the NLU output stream to determine whether the user has paused providing the spoken utterance or has completed providing the spoken utterance. For example, if the system determines that the NLU metric(s) associated with the predicted intent(s) and/or the inferred and/or predicted slot value(s) of the corresponding parameter(s) associated with the predicted intent(s) fail to meet the NLU metric threshold, or if the system determines that the slot value(s) of the corresponding parameter(s) associated with the predicted intent(s) is unknown, the automated assistant may determine that the user has paused providing the spoken utterance. Notably, in various embodiments, the system can determine whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on both the audio-based characteristics and the NLU data stream. For example, if the system determines that the spoken utterance can be fulfilled based on the NLU data stream, but the audio-based characteristics indicate that the user has paused providing the spoken utterance, the system may determine that the user has paused providing the spoken utterance, as any additional portion of the spoken utterance that may be provided by the user may change the extent to which the user expects the spoken utterance to be fulfilled.
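As a non-limiting illustration, the following sketch shows how the predictive metrics from an audio-based classification ML model could be combined with NLU completeness at block 360. The example probabilities mirror the 0.8/0.6 values discussed above; the decision order and the NLU metric threshold are hypothetical assumptions.

```python
def classify_endpoint(pause_prob: float, complete_prob: float,
                      nlu_slots_known: bool, nlu_metric: float,
                      nlu_metric_threshold: float = 0.7) -> str:
    """Return "paused" or "complete" from audio-based metrics plus NLU completeness."""
    # If the parse is incomplete or low-confidence, treat the silence as a pause.
    if not nlu_slots_known or nlu_metric < nlu_metric_threshold:
        return "paused"
    # Even a fulfillable parse is treated as a pause when the audio model says
    # the user is more likely mid-thought than finished.
    if pause_prob > complete_prob:
        return "paused"
    return "complete"


if __name__ == "__main__":
    print(classify_endpoint(pause_prob=0.8, complete_prob=0.6,
                            nlu_slots_known=True, nlu_metric=0.9))  # -> "paused"
```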
At the iteration of block 360, if the system determines that the user has completed providing the spoken utterance, the system can proceed to block 362. At block 362, the system causes the automated assistant to initiate fulfillment of the spoken utterance. For example, the system can select one or more fulfillment outputs from the fulfillment data stream that are predicted to satisfy the spoken utterance, and cause the one or more fulfillment outputs to be provided for presentation to the user via the client device or an additional computing device in communication with the client device. As described above with respect to fig. 1, the one or more fulfillment outputs can include, for example: audible content predicted to be responsive to the spoken utterance and capable of being audibly rendered for presentation to a user of the client device via the speaker(s); visual content predicted to be responsive to the spoken utterance and capable of being visually rendered for presentation to a user of the client device via the display; and/or an assistant command that, when executed, causes the client device and/or other computing devices in communication with the client device to be controlled in response to the spoken utterance. The system can return to block 352 and perform additional iterations of the method 300 of fig. 3.
At the iteration of block 360, if the system determines that the user has paused providing spoken utterances, the system can proceed to block 364. At block 364, the system determines natural dialog output provided to the user for audible presentation. Further, at block 366, the system can cause natural dialog output to be provided to the user for audible presentation. The natural dialog output can be selected from a set of natural dialog outputs stored in an on-device memory of the client device based on NLU metrics associated with the NLU data stream and/or the audio-based characteristics. In some implementations, one or more of the natural conversation outputs included in the set of natural conversation outputs can correspond to text data. In these implementations, text data associated with the selected natural dialog output can be processed using the TTS model to generate synthesized speech audio data including synthesized speech corresponding to the selected natural dialog output, and the synthesized speech audio data can be audibly rendered for presentation to the user via speaker(s) of the client device or the additional computing device.
In additional or alternative implementations, one or more of the natural dialog outputs included in the set of natural dialog outputs can correspond to synthesized speech audio data including synthesized speech corresponding to the selected natural dialog output, and the synthesized speech audio data can be audibly rendered for presentation to the user via the speaker(s) of the client device or the additional computing device. Notably, in various embodiments, when the natural conversation output is provided to the user for audible presentation, the volume at which the natural conversation output is played back can be lower than the volume of other outputs that are audibly rendered for presentation to the user. Further, in various implementations, one or more automated assistant components (e.g., ASR engines 120A1 and/or 120A2, NLU engines 130A1 and/or 130A2, and/or fulfillment engines 140A1 and/or 140A2) can remain active while the natural dialog output is provided to the user for audible presentation, such that the automated assistant can continue to process the audio data stream.
At block 368, the system determines whether to fulfill the spoken utterance after having the natural dialog output provided to the user for audible presentation. In some implementations, the system can determine to fulfill the spoken utterance in response to determining that the user has completed providing the spoken utterance after having the natural dialog output provided to the user for audible presentation. In these implementations, the ASR output stream, the NLU output stream, and the fulfillment data stream can be updated based on the user completing the provision of the spoken utterance. In additional or alternative implementations, the system can determine to fulfill the spoken utterance, even if the user has not completed providing the spoken utterance, in response to determining that the spoken utterance can be fulfilled based on the portion of the spoken utterance and based on one or more costs associated with having the automated assistant initiate fulfillment of the spoken utterance (e.g., as described in more detail with respect to fig. 5C).
At the iteration of block 368, if the system determines to fulfill the spoken utterance after having the natural dialog output provided to the user for audible presentation, the system proceeds to block 362 to cause the automated assistant to initiate fulfillment of the spoken utterance, as described above. At the iteration of block 368, if the system determines that the spoken utterance is not to be fulfilled after having the natural dialog output provided to the user for audible presentation, the system returns to block 364. At this subsequent iteration of block 364, the system can determine an additional natural dialog output to be provided to the user for audible presentation. Notably, the additional natural dialog output selected at this subsequent iteration of block 364 may be different from the natural dialog output selected at the previous iteration of block 364. For example, the natural dialog output selected at the previous iteration of block 364 may be provided to the user as an indication that the automated assistant is still listening and waiting for the user to complete the spoken utterance (e.g., "Mmhmm", "okay", "uh huh", etc.). However, the additional natural dialog output selected at this subsequent iteration of block 364 may be provided to the user as an indication that the automated assistant is still listening and waiting for the user to complete the spoken utterance, but may also more explicitly prompt the user to complete the spoken utterance or provide specific input (e.g., "are you still there?", etc.). The system can continue performing iterations of blocks 364-368 until the system determines to fulfill the spoken utterance at an iteration of block 368, at which point the system proceeds to block 362 to cause the automated assistant to initiate fulfillment of the spoken utterance, as described above.
In various implementations, the one or more predictive metrics indicating whether the user has paused providing the spoken utterance or has completed providing the spoken utterance can be used to determine whether and/or when to provide the natural dialog output to the user for audible presentation. For example, assume that the output generated using the audio-based classification ML model includes a first probability of 0.8 associated with a prediction that the user has paused providing the spoken utterance and a second probability of 0.6 associated with a prediction that the user has completed providing the spoken utterance. Further assume that the first probability of 0.8 satisfies a pause threshold indicating that the system is highly confident that the user has paused providing the spoken utterance. Thus, at the first iteration of block 364, the system can cause a voice return channel to be utilized as the natural dialog output (e.g., "uh huh"). Further, at the second iteration of block 364, the system can cause another voice return channel to be utilized as the natural dialog output (e.g., "mmhmm" or "I'm here") because the system is highly confident that the user has paused providing the spoken utterance. In contrast, assume that the output generated using the audio-based classification ML model includes a first probability of 0.5 associated with a prediction that the user has paused providing the spoken utterance and a second probability of 0.4 associated with a prediction that the user has completed providing the spoken utterance. Further assume that the first probability of 0.5 fails to satisfy the pause threshold indicating that the system is highly confident that the user has paused providing the spoken utterance. Thus, at the first iteration of block 364, the system can still cause a voice return channel to be utilized as the natural dialog output (e.g., "uh huh"). However, at the second iteration of block 364, rather than causing another voice return channel to be utilized as the natural conversation output, the system may request that the user confirm a predicted intent determined based on processing of the spoken utterance (e.g., "did you want to call someone?"). Notably, in determining the natural dialog output to be provided to the user for audible presentation, the system can randomly select a given natural dialog output from the set of natural dialog outputs, cycle through the set of natural dialog outputs when selecting the given natural dialog output, or determine the natural dialog output in any other manner.
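As a non-limiting illustration, the following sketch shows how the pause-confidence threshold could shape successive natural conversation outputs: with high confidence, another voice return channel is used, while with lower confidence the user is asked to confirm the predicted intent. The phrases, the 0.7 threshold, and the function name are hypothetical assumptions for illustration.

```python
def next_natural_output(pause_prob: float, iteration: int,
                        predicted_intent: str, pause_threshold: float = 0.7) -> str:
    """Pick the natural conversation output for a given pause iteration."""
    if iteration == 0:
        return "Uh huh"                           # first reaction is a simple voice return channel
    if pause_prob >= pause_threshold:
        return "Mmhmm, I'm here"                  # still confident the user is mid-thought
    # Not confident it is a pause: ask the user to confirm the predicted intent instead.
    return f"Did you want to {predicted_intent}?"


if __name__ == "__main__":
    print(next_natural_output(0.8, 1, "call someone"))  # -> "Mmhmm, I'm here"
    print(next_natural_output(0.5, 1, "call someone"))  # -> "Did you want to call someone?"
```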
Although fig. 3 is described herein without regard to timing considerations in having the natural dialog output provided to the user for audible presentation, it should be understood that this is for illustrative purposes. In various embodiments, and as described below with reference to fig. 4, the system may cause instances of the natural conversation output to be provided to the user for audible presentation only based on various time thresholds. For example, in the method 300 of fig. 3, the system may cause an initial instance of the natural dialog output to be provided to the user for audible presentation in response to determining that a first threshold duration of time has elapsed since the user paused providing the spoken utterance. Further, in the method 300 of fig. 3, the system may cause a subsequent instance of the natural conversation output to be provided to the user for audible presentation in response to determining that a second threshold duration has elapsed since the initial instance of the natural conversation output was provided to the user for audible presentation. In this example, the first threshold duration and the second threshold duration may be the same or different, and may correspond to any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.).
Turning now to fig. 4, a flow diagram is depicted that illustrates another example method 400 of determining whether to cause natural dialog output to be provided for presentation to a user in response to determining that the user pauses providing a spoken utterance and/or determining when to fulfill the spoken utterance. For convenience, the operations of method 400 are described with reference to a system performing the operations. The system of method 400 includes one or more processors, memory, and/or other component(s) of computing device(s) (e.g., client device 110 of fig. 1 and 5A-5E, computing device 610 of fig. 6, one or more servers, and/or other computing devices). Moreover, although the operations of method 400 are shown in a particular order, this is not intended to be limiting. One or more operations may be reordered, omitted, and/or added.
At block 452, the system receives an audio data stream that includes a portion of the user spoken utterance and is directed to an automated assistant. The audio data stream can be generated by microphone(s) of the user's client device and during a conversation session with an automated assistant implemented at least in part at the client device. At block 454, the system processes the audio data stream. The system can process the audio data stream in the same or similar manner as described above with respect to operation blocks 352-358 of method 300 of fig. 3.
At block 456, the system determines whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on the NLU output stream and/or audio-based characteristics, associated with the portion of the spoken utterance captured in the audio data, determined based on processing the audio data stream at block 454. The system can make this determination in the same or similar manner as described above with respect to the operation of block 360 of the method 300 of fig. 3. At the iteration of block 456, if the system determines that the user has completed providing the spoken utterance, the system can proceed to block 458. At block 458, the system causes the automated assistant to initiate fulfillment of the spoken utterance in the same or similar manner as described above with respect to the operation of block 362 of the method 300 of fig. 3. The system returns to block 452 and performs additional iterations of the method 400 of fig. 4. At the iteration of block 456, if the system determines that the user has paused providing the spoken utterance, the system can proceed to block 460.
At block 460, the system determines whether the pause in the user providing the spoken utterance satisfies an N threshold, where N is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.). At the iteration of block 460, if the system determines that the pause in the user providing the spoken utterance fails to satisfy the N threshold, the system returns to block 454 and continues processing the audio data stream. At the iteration of block 460, if the system determines that the pause in the user providing the spoken utterance satisfies the N threshold, the system proceeds to block 462. At block 462, the system determines a natural dialog output to be provided to the user for audible presentation. At block 464, the system causes the natural dialog output to be provided to the user for audible presentation. The system can perform the operations of blocks 462 and 464 in the same or similar manner as described above with respect to the operations of blocks 364 and 366 of the method 300 of fig. 3, respectively. In other words, in implementations that utilize one or more aspects of the method 400 of fig. 4, and in contrast with the method 300 of fig. 3, the system may wait N seconds after the user first pauses providing the spoken utterance before having the natural dialog output provided to the user for audible presentation.
At block 466, the system determines whether an M threshold is satisfied after having the natural dialog output provided to the user for audible presentation, where M is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.). At the iteration of block 466, if the system determines that the pause in the user providing the spoken utterance satisfies the M threshold, the system returns to block 462. Similar to the description above with respect to fig. 3, at this subsequent iteration of block 462, the system can determine an additional natural dialog output to be provided to the user for audible presentation, and the additional natural dialog output selected at this subsequent iteration of block 462 can be different from the natural dialog output selected at the previous iteration of block 462. In other words, the system can determine the natural dialog output selected at the previous iteration of block 462 to encourage the user to complete providing the spoken utterance, while the system can determine the additional natural dialog output selected at the subsequent iteration of block 462 to explicitly request that the user complete providing the spoken utterance. At the iteration of block 466, if the system determines that the pause in the user providing the spoken utterance fails to satisfy the M threshold, the system proceeds to block 468.
At block 468, the system determines whether to fulfill the spoken utterance after having the natural dialog output provided to the user for audible presentation. In some implementations, the system can determine to fulfill the spoken utterance in response to determining that the user has completed providing the spoken utterance after having the natural dialog output (and/or any additional natural dialog output) provided to the user for audible presentation. In these implementations, the ASR output stream, the NLU output stream, and the fulfillment data stream can be updated based on the user completing the provision of the spoken utterance. In additional or alternative implementations, the system can determine to fulfill the spoken utterance, even if the user has not completed providing the spoken utterance, in response to determining that the spoken utterance can be fulfilled based on the portion of the spoken utterance and based on one or more costs associated with having the automated assistant initiate fulfillment of the spoken utterance (e.g., as described in more detail with respect to fig. 5C).
At the iteration of block 468, if the system determines to fulfill the spoken utterance after having the natural dialog output provided to the user for audible presentation, the system proceeds to block 458 to cause the automated assistant to initiate the fulfillment of the spoken utterance, as described above. At the iteration of block 468, if the system determines that the spoken utterance is not to be fulfilled after having the natural dialog output (and/or any additional natural dialog output) provided to the user for audible presentation, the system returns to block 462. Subsequent iterations of block 462 are described above. The system can continue performing iterations of blocks 462-468 until the system determines to fulfill the spoken utterance at the iteration of block 468, and the system proceeds to block 458 to cause the automated assistant to initiate the fulfillment of the spoken utterance, as described above.
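As a non-limiting illustration, the following sketch captures the timing behavior of the method 400 of fig. 4: the system waits N seconds after the user first pauses before providing an initial natural conversation output, and then re-prompts after each M-second window in which the user does not resume, fulfilling the utterance once the user completes it. The callables, polling interval, and the 2.0/3.0-second defaults are hypothetical assumptions for illustration.

```python
import time
from typing import Callable


def run_pause_loop(
    user_resumed: Callable[[], bool],
    select_output: Callable[[int], str],
    render_output: Callable[[str], None],
    initiate_fulfillment: Callable[[], None],
    n_seconds: float = 2.0,
    m_seconds: float = 3.0,
) -> None:
    pause_start = time.monotonic()
    # Wait silently until the initial pause has lasted N seconds (block 460).
    while time.monotonic() - pause_start < n_seconds:
        if user_resumed():
            initiate_fulfillment()
            return
        time.sleep(0.05)

    iteration = 0
    while True:
        # Provide a natural conversation output (blocks 462/464); later
        # iterations can escalate to a more explicit prompt.
        render_output(select_output(iteration))
        window_start = time.monotonic()
        # Give the user up to M seconds to resume after each output (block 466).
        while time.monotonic() - window_start < m_seconds:
            if user_resumed():
                initiate_fulfillment()
                return
            time.sleep(0.05)
        iteration += 1


if __name__ == "__main__":
    start = time.monotonic()
    run_pause_loop(
        user_resumed=lambda: time.monotonic() - start > 2.5,  # user resumes after ~2.5 s
        select_output=lambda i: "Mmhmm" if i == 0 else "What was the rest?",
        render_output=print,
        initiate_fulfillment=lambda: print("fulfilling the completed utterance..."),
    )
```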
Turning now to fig. 5A-5E, various non-limiting examples of determining whether to have a natural dialog output provided for presentation to a user in response to determining that the user has paused providing a spoken utterance, and/or of determining when to fulfill the spoken utterance, are depicted. The automated assistant can be implemented at least in part at the client device 110 (e.g., the automated assistant 115 described with respect to fig. 1). The automated assistant can utilize a natural dialog system (e.g., the natural dialog system 180 described with respect to fig. 1) to determine natural dialog outputs and/or fulfillment outputs to be provided in furtherance of a dialog session between the automated assistant and the user 101 of the client device 110. The client device 110 depicted in fig. 5A-5E may include various user interface components including, for example, microphone(s) for generating audio data based on spoken utterances and/or other audible input, speaker(s) for audibly rendering synthesized speech and/or other audible output, and a display 190 for receiving touch input and/or visually rendering transcriptions and/or other visual output. Although the client device 110 depicted in fig. 5A-5E is a standalone interactive speaker with a display 190, it should be understood that this is for purposes of illustration and is not meant to be limiting.
For example, and with particular reference to fig. 5A, assume that the user 101 of the client device 110 provides a spoken utterance 552A1 of "Assistant, call Arnolllld's", and then pauses for N seconds as shown by 552A2, where N is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.). In this example, the automated assistant can cause the audio data stream capturing the spoken utterance 552A1 and the pause indicated by 552A2 to be processed using a streaming ASR model to generate an ASR output stream. In addition, the automated assistant can cause the ASR output stream to be processed using the NLU model to generate an NLU output stream. Further, the automated assistant can cause the fulfillment data stream to be generated based on the NLU output stream using the software application(s) accessible at the client device 110, the 1P agent(s) accessible at the client device 110, and/or the 3P agent(s) accessible at the client device 110. In this example, and based on processing the spoken utterance 552A1, assume that the ASR output includes recognized text corresponding to the spoken utterance 552A1 captured in the audio data stream (e.g., recognized text corresponding to "call Arnold's"), the NLU data stream includes a predicted "call" or "phone call" intent having a slot value for the called party entity parameter associated with the predicted "call" or "phone call" intent, and the fulfillment data stream includes an assistant command that, when executed, causes the client device 110 to initiate a phone call using a contact entry associated with a friend of the user 101 named "Arnold". Thus, based on processing the spoken utterance 552A1 and not processing any additional spoken utterances, the automated assistant may determine that the spoken utterance 552A1 can be satisfied by having the assistant command executed. However, even though the automated assistant may determine that the spoken utterance 552A1 is able to be fulfilled, the automated assistant may refrain from initiating fulfillment of the spoken utterance.
In some implementations, the automated assistant can process the audio data stream using the audio-based ML model to determine audio-based characteristics associated with the spoken utterance 552A1. In addition, the automated assistant can cause the audio-based characteristics to be processed using the audio-based classification ML model to generate an output indicating whether the user has paused providing the spoken utterance 552A1 or has completed providing the spoken utterance. In the example of fig. 5A, it is assumed that the output generated using the audio-based classification ML model indicates that the user 101 has paused providing the spoken utterance 552A1 (e.g., as indicated by the user providing the elongated syllable in "Arnolllld's"). Thus, in this example, the automated assistant can refrain from initiating fulfillment of the spoken utterance 552A1 based at least on the audio-based characteristics of the spoken utterance 552A1.
In additional or alternative implementations, the automated assistant can determine one or more computational costs associated with fulfillment of the spoken utterance 552A1. The one or more computational costs can include, for example, computational costs associated with performing fulfillment of the spoken utterance 552A1, computational costs associated with undoing performed fulfillment of the spoken utterance 552A1, and/or other computational costs. In the example of fig. 5A, the computational costs associated with performing fulfillment of the spoken utterance 552A1 can include at least initiating a telephone call using the contact entry associated with "Arnold" and/or other costs. Further, the computational costs associated with undoing performed fulfillment of the spoken utterance 552A1 can include at least terminating the telephone call to the contact entry associated with "Arnold", re-initiating the dialog session with the user 101, processing additional spoken utterances, and/or other costs. Thus, in this example, the automated assistant can refrain from initiating fulfillment of the spoken utterance 552A1 based at least on the relatively high computational cost associated with prematurely fulfilling the spoken utterance 552A1.
Accordingly, the automated assistant may determine to provide a natural dialog output 554A, such as the "Mmhmm" shown in fig. 5A, for audible presentation to the user 101 via the speaker(s) of the client device 110 (and optionally in response to determining that the user 101 has paused for N seconds after the spoken utterance 552A1 was provided, as indicated by 552A2). The natural dialog output 554A can be provided to the user 101 for audible presentation to provide an indication that the automated assistant is still listening and waiting for the user 101 to complete providing the spoken utterance 552A1. Notably, in various implementations, while the automated assistant provides the natural dialog output 554A to the user 101 for presentation, the automated assistant components utilized in processing the audio data stream (e.g., ASR engines 120A1 and/or 120A2, NLU engines 130A1 and/or 130A2, fulfillment engines 140A1 and/or 140A2, and/or other automated assistant components, such as the acoustic engine 161 of fig. 1) can remain active at the client device 110. Further, in various embodiments, the natural dialog output 554A can be provided to the user 101 for audible presentation at a lower volume than other audible outputs, to avoid distracting the user 101 from completing the spoken utterance 552A1 and to reflect a more natural conversation between humans.
In the example of fig. 5A, it is further assumed that the user 101 completes the spoken utterance 552A1 by providing a spoken utterance 556A of "call Arnold's Restaurant", a fictional Italian restaurant. Based on the user 101 completing the spoken utterance 552A1 by providing the spoken utterance 556A, the automated assistant can cause the ASR output stream, the NLU output stream, and the fulfillment data stream to be updated. In particular, the automated assistant can determine that the updated NLU data stream still includes the predicted "call" or "phone call" intent, but has a slot value of "Arnold's Restaurant" for the called party entity parameter associated with the predicted "call" or "phone call" intent, rather than the previously predicted "Arnold". Thus, in response to the user 101 completing the spoken utterance 552A1 by providing the spoken utterance 556A, the automated assistant can cause the client device 110 (or an additional client device in communication with the client device 110 (e.g., a mobile device associated with the user 101)) to initiate a telephone call to "Arnold's Restaurant", and optionally cause synthesized speech 558A of "Okay, calling Arnold's Restaurant" to be provided to the user 101 for audible presentation. In these and other ways, the automated assistant can avoid incorrectly and prematurely fulfilling the predicted intent of the user 101 determined based on the spoken utterance 552A1 (e.g., by calling the contact entry "Arnold") and can wait for the user 101 to complete his/her thought, to correctly fulfill the predicted intent of the user 101 determined based on the user 101 completing the spoken utterance 552A1 via the spoken utterance 556A (e.g., by calling the fictional restaurant "Arnold's Restaurant").
As another example, and with particular reference to fig. 5B, again assume that the user 101 of the client device 110 provides a spoken utterance 552B1 of "Assistant, call Arnolllld's", and then pauses for N seconds as shown by 552B2, where N is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.). Similar to fig. 5A, even though the automated assistant may determine that the spoken utterance 552B1 is capable of being fulfilled, the automated assistant may refrain from initiating fulfillment of the spoken utterance 552B1 based on audio-based characteristics associated with the spoken utterance 552B1 and/or based on one or more computational costs associated with performing fulfillment of the spoken utterance 552B1 and/or undoing fulfillment of the spoken utterance 552B1. Further assume that the automated assistant determines to provide a natural dialog output 554B1, such as the "Mmhmm" shown in fig. 5B, and causes the natural dialog output 554B1 to be provided to the user 101 of the client device 110 for audible presentation. However, in the example of fig. 5B, and in contrast with the example of fig. 5A, assume that the user 101 of the client device 110 fails to complete the spoken utterance 552B1 within M seconds of the natural dialog output 554B1 being provided to the user 101 for audible presentation, as shown by 554B2, where M is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.) that may be the same as or different from the N seconds shown by 552B2.
Thus, in the example of fig. 5B, the automated assistant can determine an additional natural dialog output 556B to be provided to the user 101 of the client device 110 for audible presentation. Notably, rather than being a voice return channel (like the natural dialog output 554B1 provided while the automated assistant was waiting for the user 101 to complete the spoken utterance 552B1), the additional natural dialog output 556B can more explicitly indicate that the automated assistant is waiting for the user 101 to complete the spoken utterance 552B1 and/or can request that the user 101 provide specific input to further the dialog session (e.g., as described below with respect to fig. 5C). Further assume that, in the example of fig. 5B, in response to the additional natural dialog output 556B being provided to the user 101 for audible presentation, the user 101 of the client device 110 provides a spoken utterance 558B of "call Arnold's Restaurant" to complete providing the spoken utterance 552B1. Thus, in response to the user 101 completing the provision of the spoken utterance 552B1 by providing the spoken utterance 558B, the automated assistant can cause the client device 110 (or an additional client device in communication with the client device 110 (e.g., the mobile device of the user 101)) to initiate a telephone call to "Arnold's Restaurant", and optionally cause synthesized speech 560B of "Okay, calling Arnold's Restaurant" to be provided to the user 101 for audible presentation. Similar to fig. 5A, the automated assistant can avoid incorrectly and prematurely fulfilling the predicted intent of the user 101 determined based on the spoken utterance 552B1 (e.g., by calling the contact entry "Arnold") and can wait for the user 101 to complete his/her thought, to correctly fulfill the predicted intent of the user 101 when the user completes the provision of the spoken utterance 552B1 via the spoken utterance 558B (e.g., by calling the fictional restaurant "Arnold's Restaurant"), even when the user 101 pauses for a longer duration as in the example of fig. 5B.
As yet another example, and with particular reference to fig. 5C, assume that the user 101 of the client device 110 provides a spoken utterance 552C1 of "Assistant, make a reservation for six people at Arnolllld's Restaurant for tonight", and then pauses for N seconds as shown by 552C2, where N is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.). In this example, and based on processing the spoken utterance 552C1, assume that the ASR output includes recognized text corresponding to the spoken utterance 552C1 captured in the audio data stream (e.g., recognized text corresponding to "make a reservation for six people at Arnold's Restaurant for tonight"), and the NLU data stream includes a predicted "reservation" or "restaurant reservation" intent having a slot value of "Arnold's Restaurant" for a restaurant entity parameter associated with the predicted "reservation" or "restaurant reservation" intent, a slot value of "today's date" for a reservation date parameter associated with the predicted "reservation" or "restaurant reservation" intent, and a slot value of "six" for a party size parameter associated with the predicted "reservation" or "restaurant reservation" intent. Notably, in providing the spoken utterance 552C1, the user 101 of the client device 110 fails to provide a slot value for a time parameter associated with the "reservation" or "restaurant reservation" intent. Thus, based on the NLU data stream, the automated assistant can determine that the user 101 has paused providing the spoken utterance 552C1.
Further assume that the fulfillment data stream includes an assistant command that, when executed, causes the client device 110 to make a restaurant reservation using a restaurant reservation software application accessible at the client device 110 and/or a restaurant reservation agent accessible at the client device 110 (e.g., one of the 1P agent(s) 171 and/or 3P agent(s) 172 of fig. 1). In the example of fig. 5C, and in contrast with the examples of fig. 5A and 5B, based on processing the spoken utterance 552C1 and not processing any additional spoken utterances, the automated assistant may determine that the spoken utterance 552C1 can be satisfied by having the assistant command executed. In this example, the automated assistant can initiate fulfillment of the spoken utterance 552C1 based on an NLU metric associated with the NLU data stream that indicates that the user 101 wants to make a restaurant reservation but has simply not provided a slot value for the time parameter associated with the "reservation" or "restaurant reservation" intent. Thus, the automated assistant can establish a connection with the restaurant reservation software application accessible at the client device 110 and/or the restaurant reservation agent accessible at the client device 110 (e.g., one of the 1P agents 171 and/or 3P agents 172 of fig. 1), and can begin providing the known slot values to begin making the reservation, even though fulfillment of the spoken utterance 552C1 cannot yet be fully performed.
Notably, even though the automated assistant initiates fulfillment of the spoken utterance 552C1, the automated assistant is still able to determine to provide a natural dialog output 554C1, such as the "Uh huhh" shown in fig. 5C, and to have the natural dialog output 554C1 provided to the user 101 of the client device 110 for audible presentation, because the automated assistant determines, based at least on the NLU data stream, that the user 101 has paused providing the spoken utterance 552C1. However, in the example of fig. 5C, and similar to fig. 5B, assume that the user 101 of the client device 110 fails to complete the spoken utterance 552C1 within M seconds of the natural dialog output 554C1 being provided to the user 101 for audible presentation, as shown by 554C2, where M is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.) that may be the same as or different from the N seconds shown by 552C2.
Thus, in the example of fig. 5C, the automated assistant can determine an additional natural dialog output 556C to be provided to the user 101 of the client device 110 for audible presentation. Notably, rather than being a voice return channel (like the natural dialog output 554C1 provided while the automated assistant was waiting for the user 101 to complete the spoken utterance 552C1), the additional natural dialog output 556C can request that the user 101 provide specific input to further the dialog session, such as "What time?". Further assume that, in the example of fig. 5C, in response to the additional natural dialog output 556C being provided to the user for audible presentation, the user 101 of the client device 110 provides a spoken utterance 558C of "7:00 PM" to complete providing the spoken utterance 552C1. Thus, in response to the user 101 completing the spoken utterance 552C1 by providing the spoken utterance 558C, the automated assistant can complete fulfillment of the assistant command using the previously unknown slot value and make the restaurant reservation on behalf of the user 101. In these and other ways, the automated assistant can wait for the user 101 to complete his/her thought by providing the natural dialog output 554C1, and can then prompt the user 101 to complete his/her thought by providing the additional natural dialog output 556C if the user 101 does not do so in response to the natural dialog output 554C1.
As another example, and with particular reference to fig. 5D, assume that the user 101 of the client device 110 provides a spoken utterance 552D1 of "Assistant, what's on my calendar forrrr", and then pauses for N seconds as shown by 552D2, where N is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.). In this example, and based on processing the spoken utterance 552D1, it is assumed that the ASR output includes recognized text corresponding to the spoken utterance 552D1 captured in the audio data stream (e.g., recognized text corresponding to "what's on my calendar for"), and the NLU data stream includes a predicted "calendar" or "calendar look-up" intent with an unknown slot value for the date parameter. In this example, the automated assistant can determine, based on the NLU data stream, that the user 101 has paused providing the spoken utterance 552D1 because the user did not provide the slot value for the date parameter. Additionally or alternatively, in this example, the automated assistant can determine, based on the audio-based characteristics of the spoken utterance 552D1, that the user 101 has paused providing the spoken utterance 552D1, as indicated by the elongated syllable included in the spoken utterance 552D1 (e.g., the "rrrr" when "forrrr" is provided in the spoken utterance 552D1).
Further assume that the fulfillment data stream includes an assistant command that, when executed, causes the client device 110 to look up calendar information for the user 101 using a calendar software application accessible at the client device 110 and/or a calendar agent accessible at the client device 110 (e.g., one of the 1P agent(s) 171 and/or 3P agent(s) 172 of fig. 1). In the example of fig. 5D, and in contrast with the examples of fig. 5A-5C, based on processing the spoken utterance 552D1 and not processing any additional spoken utterances, the automated assistant may determine that the spoken utterance 552D1 can be satisfied by having the assistant command executed. In this example, the automated assistant can initiate fulfillment of the spoken utterance 552D1 based on an NLU metric associated with the NLU data stream that indicates that the user 101 wants to look up one or more calendar entries but has simply not provided a slot value for the date parameter associated with the "calendar" or "calendar look-up" intent. Thus, the automated assistant can establish a connection with the calendar software application accessible at the client device 110 and/or the calendar agent accessible at the client device 110 (e.g., one of the 1P agent(s) 171 and/or 3P agent(s) 172 of fig. 1).
When the automated assistant initiates fulfillment of the spoken utterance 552D1, the automated assistant is still able to determine to provide a natural dialog output 554D1, such as the "Uh huhh" shown in fig. 5D, and to have the natural dialog output 554D1 provided to the user 101 of the client device 110 for audible presentation, because the automated assistant determines, based on the NLU data stream, that the user 101 has paused providing the spoken utterance 552D1. However, in the example of fig. 5D, and similar to fig. 5B and 5C, assume that the user 101 of the client device 110 fails to complete the spoken utterance 552D1 within M seconds of the natural dialog output 554D1 being provided to the user 101 for audible presentation, as shown by 554D2, where M is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.) that may be the same as or different from the N seconds shown by 552D2.
However, in the example of fig. 5D, even though the user 101 may not have completed the spoken utterance 552D1, the automated assistant can determine that fulfillment of the spoken utterance 552D1 is to be performed. The automated assistant may make this determination based on one or more computational costs associated with causing the fulfillment to be performed and/or with undoing any performed fulfillment. In this example, the one or more computational costs can include causing the synthesized speech 556D1 of "You have two calendar entries today ..." to be provided to the user 101 for audible presentation to fulfill the spoken utterance 552D1, and causing other synthesized speech to be provided to the user 101 for audible presentation if the user 101 instead wants calendar information for another day. Thus, the automated assistant can determine to proceed with an inferred current-date value for the date parameter associated with the "calendar" or "calendar look-up" intent and use it to cause fulfillment of the spoken utterance 552D1 to be performed, because the computational cost of doing so is relatively low and doing so may end the dialog session more quickly.
Notably, in various implementations, while the automated assistant provides the synthesized speech 556D1 to the user 101 for presentation, the automated assistant components utilized in processing the audio data stream (e.g., ASR engines 120A1 and/or 120A2, NLU engines 130A1 and/or 130A2, fulfillment engines 140A1 and/or 140A2, and/or other automated assistant components of fig. 1, such as the acoustic engine 161 of fig. 1) can remain active at the client device 110. Thus, in these embodiments, if the user 101 interrupts the automated assistant during the audible presentation of the synthesized speech 556D1 by providing another spoken utterance requesting a date different from the inferred current date, the automated assistant is able to quickly and efficiently adapt the fulfillment of the spoken utterance 552D1 based on the different date provided by the user 101. In additional or alternative embodiments, and after causing the synthesized speech 556D1 to be provided to the user 101 for audible presentation, the automated assistant can cause additional synthesized speech 556D2 (such as "Sorry, did I just interrupt you?" or the like) to be provided to the user 101 for audible presentation. In these and other ways, the automated assistant can balance waiting for the user 101 to complete his/her thought, by providing the natural dialog output 554D1, against ending the dialog session in a quicker and more efficient manner, by fulfilling the spoken utterance 552D1 at a relatively low computational cost.
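A minimal sketch of keeping the speech-processing components active during playback, so that a barge-in can adjust the in-flight fulfillment, is shown below. The thread-and-queue structure and the worker functions are assumptions made for this sketch and do not reflect the actual engine architecture referenced above.

```python
import queue
import threading

# Queue standing in for spoken interruptions detected while synthesized speech plays.
barge_in_events: "queue.Queue[str]" = queue.Queue()

def playback_worker(tts_text: str) -> None:
    # Stand-in for streaming synthesized speech to the device speakers.
    print(f"[assistant] {tts_text}")

def listen_worker() -> None:
    # Stand-in for the ASR/NLU components that remain active during playback.
    barge_in_events.put("no, for tomorrow")  # simulated interrupting utterance

playback = threading.Thread(target=playback_worker,
                            args=("You have two calendar entries today ...",))
listener = threading.Thread(target=listen_worker)
playback.start()
listener.start()
playback.join()
listener.join()

try:
    correction = barge_in_events.get_nowait()
    # Adjust the in-flight fulfillment with the corrected date instead of the inferred one.
    print(f"[assistant] Adjusting fulfillment for: {correction!r}")
except queue.Empty:
    print("[assistant] No barge-in; the inferred current date stands.")
```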
Although the examples of figs. 5A-5D are described with respect to natural dialog outputs being provided to the user 101 for audible presentation, it should be understood that this is for purposes of example and is not meant to be limiting. For example, and referring briefly to fig. 5E, again assume that the user 101 of the client device 110 provides a spoken utterance of "Assistant, call Arnolllld's", and then pauses for N seconds, where N is any positive integer and/or fraction thereof (e.g., 2 seconds, 2.5 seconds, 3 seconds, etc.). In the example of fig. 5E, a streaming transcription 552E of the spoken utterance can be provided to the user for visual presentation via the display 190 of the client device 110. In some implementations, the display 190 of the client device 110 can additionally or alternatively provide one or more graphical elements 191 that indicate that the automated assistant is waiting for the user 101 to complete the spoken utterance, such as an ellipsis appended to the streaming transcription 552E that can be animated on the display 190. Although the graphical element 191 depicted in fig. 5E is an ellipsis appended to the streaming transcription, it should be understood that this is for purposes of example and is not meant to be limiting, and any other graphical element can be provided to the user 101 for visual presentation to indicate that the automated assistant is waiting for the user 101 to complete providing the spoken utterance. In additional or alternative embodiments, one or more LEDs can be illuminated to indicate that the automated assistant is waiting for the user 101 to complete providing the spoken utterance (e.g., as shown by dashed line 192), which may be particularly advantageous when the client device 110 lacks the display 190. Further, it should be understood that the examples of figs. 5A-5E are provided for purposes of illustration only and are not meant to be limiting.
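The choice between a display-based indicator and an LED-based indicator of the waiting state can be sketched as follows. The class and method names are invented for this example; they merely stand in for whatever rendering and LED-driver interfaces a given client device exposes.

```python
from typing import Optional, Protocol

class WaitingIndicator(Protocol):
    def show_waiting(self) -> None: ...

class DisplayEllipsis:
    """Append an animated ellipsis to the streaming transcription (cf. element 191)."""
    def __init__(self, transcription: str) -> None:
        self.transcription = transcription

    def show_waiting(self) -> None:
        print(f"{self.transcription} ...")  # stand-in for animating on the display

class LedRing:
    """Illuminate one or more LEDs on displayless devices (cf. dashed line 192)."""
    def show_waiting(self) -> None:
        print("[LEDs pulsing]")             # stand-in for a hardware LED driver

def indicate_waiting(display: Optional[WaitingIndicator], leds: WaitingIndicator) -> None:
    # Prefer the display when the client device has one; otherwise fall back to LEDs.
    (display or leds).show_waiting()

indicate_waiting(DisplayEllipsis("Assistant, call Arnolllld's"), LedRing())  # display present
indicate_waiting(None, LedRing())                                            # displayless device
```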
Further, in embodiments in which the client device 110 of the user 101 includes the display 190, one or more selectable graphical elements associated with various interpretations of the spoken utterance(s) can be provided to the user for visual presentation as the user provides a spoken utterance. The automated assistant can initiate fulfillment of the spoken utterance(s) based on receiving, from the user 101, a user selection of a given one of the one or more selectable graphical elements, or, responsive to not receiving a user selection from the user 101 within a threshold duration, based on NLU metrics associated with a given one of the one or more selectable graphical elements. For example, in the example of fig. 5A, after receiving the spoken utterance 552A1 of "Assistant, call Arnolld's", a first selectable graphical element can be provided to the user 101 for presentation via the display that, when selected, causes the automated assistant to call a contact entry associated with "Arnold". However, as the user continues to provide the spoken utterance 556A of "call Arnold's restaurant", the one or more selectable graphical elements can be updated to include a second selectable graphical element that, when selected, causes the automated assistant to call the restaurant associated with "Arnold's restaurant". In this example, and assuming that the user 101 does not provide any user selection of the first selectable graphical element or the second selectable graphical element within the threshold duration (relative to when the first selectable graphical element or the second selectable graphical element is presented), the automated assistant is able to initiate a telephone call to the restaurant "Arnold's restaurant" based on the NLU metric associated with initiating a telephone call to the restaurant "Arnold's restaurant" being more indicative of the actual intent of the user 101 than the NLU metric associated with initiating a telephone call to the contact entry "Arnold".
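A minimal sketch of resolving among candidate interpretations, using an explicit selection when one arrives and otherwise falling back to the strongest NLU metric after a threshold duration, is given below. The Interpretation structure, the metric values, and the three-second threshold are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Interpretation:
    label: str         # text shown on the selectable graphical element
    nlu_metric: float  # assumed confidence that this reflects the user's intent

def resolve_interpretation(candidates: List[Interpretation],
                           user_selection: Optional[int],
                           elapsed_s: float,
                           threshold_s: float = 3.0) -> Optional[Interpretation]:
    """Use an explicit tap if one arrived; otherwise, after the threshold,
    fall back to the candidate with the strongest NLU metric."""
    if user_selection is not None:
        return candidates[user_selection]
    if elapsed_s >= threshold_s and candidates:
        return max(candidates, key=lambda c: c.nlu_metric)
    return None  # keep waiting for a selection or more speech

candidates = [
    Interpretation("Call contact: Arnold", nlu_metric=0.42),
    Interpretation("Call business: Arnold's restaurant", nlu_metric=0.87),
]
chosen = resolve_interpretation(candidates, user_selection=None, elapsed_s=3.5)
print(chosen.label if chosen else "still waiting")  # -> "Call business: Arnold's restaurant"
```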
Turning now to fig. 6, a block diagram of an exemplary computing device 610 that may optionally be utilized to perform one or more aspects of the techniques described herein is depicted. In some implementations, one or more of the client device, the cloud-based automated assistant component(s), and/or other component(s) may include one or more components of the example computing device 610.
The computing device 610 typically includes at least one processor 614 that communicates with a number of peripheral devices via a bus subsystem 612. These peripheral devices may include a storage subsystem 624 (including, for example, a memory subsystem 625 and a file storage subsystem 626), a user interface output device 620, a user interface input device 622, and a network interface subsystem 616. Input and output devices allow users to interact with computing device 610. Network interface subsystem 616 provides an interface to external networks and couples to corresponding interface devices among other computing devices.
User interface input devices 622 may include a keyboard, a pointing device (such as a mouse, trackball, touch pad, or tablet), a scanner, a touch screen incorporated into a display, an audio input device (such as a voice recognition system or microphone), and/or other types of input devices. In general, the term "input device" is intended to include all possible types of devices and ways of inputting information into computing device 610 or onto a communication network.
The user interface output device 620 may include a display subsystem, a printer, a facsimile machine, or a non-visual display (such as an audio output device). The display subsystem may include a Cathode Ray Tube (CRT), a flat panel device such as a Liquid Crystal Display (LCD), a projection device, or some other mechanism for creating a visual image. The display subsystem may also provide for non-visual displays, such as via an audio output device. In general, the term "output device" is used to include all possible types of devices and ways to output information from computing device 610 to a user or to another machine or computing device.
Storage subsystem 624 stores programming and data structures that provide the functionality of some or all of the modules described herein. For example, storage subsystem 624 may include logic to perform selected aspects of the methods disclosed herein, as well as to implement the various components depicted in fig. 1 and 2.
These software modules are typically executed by the processor 614 alone or in combination with other processors. The memory 625 used in the storage subsystem 624 can include a number of memories including a main Random Access Memory (RAM) 630 for storing instructions and data during program execution and a Read Only Memory (ROM) 632 in which fixed instructions are stored. File storage subsystem 626 may provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive, and associated removable media, CD-ROM drive, optical drive, or removable media cartridge. Modules implementing the functionality of particular embodiments may be stored by file storage subsystem 626 in storage subsystem 624, or in other machines accessible to processor(s) 614.
Bus subsystem 612 provides a mechanism for letting the various components and subsystems of computing device 610 communicate with each other as intended. Although bus subsystem 612 is shown schematically as a single bus, alternative implementations of bus subsystem 612 may use multiple buses.
Computing device 610 can be of different types including a workstation, a server, a computing cluster, a blade server, a server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 610 depicted in FIG. 6 is intended only as a specific example for purposes of illustrating some embodiments. Many other configurations of computing device 610 are possible with more or fewer components than the computing device depicted in fig. 6.
Where the systems described herein collect or otherwise monitor personal information about a user or may utilize personal and/or monitoring information, the user may be provided with an opportunity to control whether programs or features collect user information (e.g., information about the user's social network, social actions or activities, profession, user preferences, or the user's current geographic location), or to control whether and/or how content that may be more relevant to the user is received from a content server. Moreover, certain data may be processed in one or more ways prior to storage or use in order to remove personal identification information. For example, the identity of the user may be processed such that personal identity information of the user cannot be determined, or the geographic location of the user may be generalized where the geographic location information is obtained (such as a city, zip code, or state level) such that a particular geographic location of the user cannot be determined. Thus, the user may control how information about the user is collected and/or used.
In some implementations, a method implemented by one or more processors is provided and includes: processing the audio data stream using an Automatic Speech Recognition (ASR) model to generate an ASR output stream, the audio data stream generated by one or more microphones of a client device of a user, and the audio data stream capturing a portion of a spoken utterance provided by the user, the portion of the spoken utterance directed to an automated assistant implemented at least in part at the client device; processing the ASR output stream using a Natural Language Understanding (NLU) model to generate an NLU output stream; determining an audio-based characteristic associated with the portion of the spoken utterance based on processing the audio data stream; determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on audio-based characteristics associated with portions of the spoken utterance; and in response to determining that the user has paused providing the spoken utterance, and in response to determining that the automated assistant is capable of initiating fulfillment of the spoken utterance based at least on the NLU output stream: determining a natural dialog output provided to the user for audible presentation, the natural dialog output provided to the user for audible presentation to indicate that the automated assistant is waiting for the user to complete providing the spoken utterance; and causing natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
These and other embodiments of the technology disclosed herein can optionally include one or more of the following features.
In some implementations, causing the natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device may be further responsive to determining that the user has paused providing the spoken utterance for a threshold duration.
In some implementations, determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on the audio-based characteristics associated with the portion of the spoken utterance can include: processing audio-based characteristics associated with portions of the spoken utterance using an audio-based classification Machine Learning (ML) model to generate an output; and determining whether the user has paused or completed providing the spoken utterance based on an output generated using the audio-based classification ML model.
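For illustration, the audio-based classification step can be sketched as a small logistic scorer over prosodic features. The feature names, weights, and bias below are invented stand-ins for a trained audio-based classification ML model and are not taken from the disclosure.

```python
import math
from typing import Dict

# Assumed prosodic features and hand-picked weights standing in for a trained
# audio-based classification model; a production model would be learned from data.
FEATURE_WEIGHTS: Dict[str, float] = {
    "elongated_syllable": 1.8,
    "falling_intonation": -1.2,   # falling pitch tends to signal completion
    "trailing_silence_s": 0.4,
    "speech_rate_drop": 0.9,
}
BIAS = -0.5

def pause_probability(features: Dict[str, float]) -> float:
    """Logistic score: probability that the user has paused (vs. completed)."""
    score = BIAS + sum(FEATURE_WEIGHTS.get(name, 0.0) * value
                       for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

features = {"elongated_syllable": 1.0, "falling_intonation": 0.0,
            "trailing_silence_s": 2.0, "speech_rate_drop": 1.0}
label = "paused" if pause_probability(features) >= 0.5 else "complete"
print(label)  # -> "paused" with these illustrative weights
```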
In some implementations, the method can further include causing a fulfillment data stream to be generated based on the NLU output stream. Determining that the automated assistant is able to initiate fulfillment of the spoken utterance may be further based on the fulfillment data stream. In some versions of these embodiments, the method may further include, in response to determining that the user has completed providing the spoken utterance: the automated assistant is caused to initiate fulfillment of the spoken utterance based on the fulfillment data stream. In additional or alternative versions of these embodiments, the method may further include maintaining activity of one or more automated assistant components utilizing the ASR model while causing natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device. In additional or alternative versions of these embodiments, the method may further comprise: determining, based on the ASR output stream, whether the spoken utterance includes a particular word or phrase; and in response to determining that the spoken utterance includes a particular word or phrase: avoiding determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on audio-based characteristics associated with portions of the spoken utterance; and causing the automated assistant to initiate fulfillment of the spoken utterance based on the fulfillment data stream. In additional or alternative versions of these embodiments, the method may further include determining whether the user has continued to provide the spoken utterance for a threshold duration after causing the natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device; and in response to determining that the user has not continued to provide one or more spoken utterances within the threshold duration: determining whether the automated assistant is capable of initiating fulfillment of the spoken utterance based on the NLU data stream and/or the fulfillment data stream; and in response to determining, based on the fulfillment data stream, that the automated assistant is capable of initiating fulfillment of the spoken utterance: the automated assistant is caused to initiate fulfillment of the spoken utterance based on the fulfillment data stream.
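A minimal sketch of the control flow described in this paragraph, including the bypass when the spoken utterance contains a particular word or phrase and the fallback to fulfillment after the threshold duration elapses, is shown below. The phrase list, function name, and thresholds are assumptions made only for the example.

```python
TURN_YIELDING_PHRASES = ("that's all", "go ahead", "that's it")  # assumed examples

def next_action(asr_text: str, paused: bool, waited_s: float,
                threshold_s: float = 3.0, can_fulfill: bool = True) -> str:
    text = asr_text.lower()
    if any(phrase in text for phrase in TURN_YIELDING_PHRASES):
        return "fulfill"                   # particular phrase: skip soft endpointing
    if not paused:
        return "fulfill"                   # utterance judged complete
    if waited_s < threshold_s:
        return "wait_with_natural_output"  # e.g., respond with "Mmhmm?"
    return "fulfill" if can_fulfill else "ask_user_to_finish"

print(next_action("what's on my calendar forrrr", paused=True, waited_s=1.0))  # wait_with_natural_output
print(next_action("call Arnold's, that's all", paused=True, waited_s=0.0))     # fulfill (bypass)
print(next_action("what's on my calendar forrrr", paused=True, waited_s=4.0))  # fulfill (best effort)
```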
In some embodiments, the method may further comprise: determining whether the user has continued to provide the spoken utterance for a threshold duration after causing natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device; and in response to determining that the user has not continued to provide the spoken utterance: determining an additional natural dialog output provided to the user for audible presentation, the additional natural dialog output provided to the user for audible presentation to request that the user complete providing the spoken utterance; and causing additional natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
In some implementations, the method can further include causing one or more graphical elements to be provided to the user for visual presentation via a display of the client device while causing natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device, the one or more graphical elements being provided to the user for visual presentation to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance. In some versions of these implementations, the ASR output may include a streaming transcription corresponding to a portion of the spoken utterance captured in the audio data stream, and the method may further include causing the streaming transcription to be provided to the user for visual presentation via a display of the client device while causing the natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device, wherein one or more graphical elements are prefixed or appended to the streaming transcription provided to the user for visual presentation via the display of the client device.
In some implementations, the method may further include causing one or more Light Emitting Diodes (LEDs) of the client device to be illuminated while causing natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device, the one or more LEDs being illuminated to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance.
In some implementations, the audio-based characteristics associated with the portion of the spoken utterance can include one or more of: intonation, tone, accent, rhythm, beat, pitch, pause, one or more grammars associated with pause, and elongated syllables.
In some implementations, determining natural dialog output provided to the user for audible presentation may include: maintaining a set of natural conversation outputs in an on-device memory of the client device; and selecting a natural dialog output from the set of natural dialog outputs based on the audio-based characteristics associated with the portion of the spoken utterance.
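A minimal sketch of selecting a natural dialog output from a small on-device set, keyed by the audio-based characteristic that triggered the pause determination, is shown below. The keys and the canned phrases are illustrative assumptions, not the set maintained by any particular implementation.

```python
import random
from typing import Dict, Tuple

# A small on-device set of natural conversation outputs keyed by audio cue.
NATURAL_OUTPUTS: Dict[str, Tuple[str, ...]] = {
    "elongated_syllable": ("Mmhmm", "Uh huh"),
    "long_silence": ("I'm listening", "Take your time"),
    "default": ("Okay", "Go on"),
}

def select_natural_output(audio_cue: str, rng: random.Random) -> str:
    options = NATURAL_OUTPUTS.get(audio_cue, NATURAL_OUTPUTS["default"])
    return rng.choice(options)  # vary phrasing so repeated pauses don't sound canned

print(select_natural_output("elongated_syllable", random.Random(0)))
```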
In some implementations, causing the natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device may include causing the natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device at a lower volume than other outputs provided to the user for audible presentation.
In some implementations, causing natural conversation output to be provided to a user for audible presentation via one or more speakers of a client device may include: processing the natural dialog output using a text-to-speech (TTS) model to generate synthesized speech audio data including the natural dialog output; and causing the synthesized speech audio data to be provided to the user for audible presentation via one or more speakers of the client device.
In some implementations, causing natural conversation output to be provided to a user for audible presentation via one or more speakers of a client device may include: obtaining synthesized speech audio data including natural dialog output from an on-device memory of the client device; and causing the synthesized speech audio data to be provided to the user for audible presentation via one or more speakers of the client device.
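The two alternatives above, generating the audio with a TTS model versus reusing synthesized speech audio data already held in on-device memory, can be sketched together. The cache contents and the synthesize_with_tts stand-in below are placeholders for this example rather than any actual TTS interface.

```python
from typing import Dict, Optional

# Pre-synthesized audio for short, frequent natural conversation outputs kept in
# on-device memory; anything else falls back to on-the-fly TTS.
AUDIO_CACHE: Dict[str, bytes] = {
    "Mmhmm": b"<cached-mmhmm-pcm>",
    "Uh huh": b"<cached-uh-huh-pcm>",
}

def synthesize_with_tts(text: str) -> bytes:
    return f"<tts:{text}>".encode()  # stand-in for invoking a real TTS model

def get_natural_output_audio(text: str) -> bytes:
    cached: Optional[bytes] = AUDIO_CACHE.get(text)
    # Reusing cached audio avoids a TTS round trip for frequent, short outputs.
    return cached if cached is not None else synthesize_with_tts(text)

print(get_natural_output_audio("Mmhmm"))
print(get_natural_output_audio("You have two calendar entries today."))
```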
In some implementations, one or more processors may be implemented locally at a client device of a user.
In some implementations, a method implemented by one or more processors is provided and includes: processing the audio data stream using an Automatic Speech Recognition (ASR) model to generate an ASR output stream, the audio data stream generated by one or more microphones of the client device, and the audio data stream capturing a portion of a spoken utterance of the user, the portion of the spoken utterance directed to an automated assistant implemented at least in part at the client device; processing the ASR output stream using a Natural Language Understanding (NLU) model to generate an NLU output stream; determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based at least on the NLU output stream; and in response to determining that the user has paused providing the spoken utterance and has not completed providing the spoken utterance: determining a natural dialog output provided to the user for audible presentation that is provided to the user for audible presentation to indicate that the automated assistant is waiting for the user to complete providing the spoken utterance; and causing natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
These and other embodiments of the technology disclosed herein can optionally include one or more of the following features.
In some implementations, determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on the NLU output stream can include determining whether the automated assistant is able to initiate fulfillment of the spoken utterance based on the NLU output stream. Determining that the user has paused providing the spoken utterance may include determining, based on the NLU output stream, that the automated assistant cannot initiate fulfillment of the spoken utterance. In some versions of these embodiments, the method may further comprise: determining whether the user has continued to provide the spoken utterance for a threshold duration after causing natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device; and in response to determining that the user has not continued to provide the spoken utterance: determining an additional natural dialog output provided to the user for audible presentation, the additional natural dialog output provided to the user for audible presentation to request that the user complete providing the spoken utterance; and causing additional natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device. In some further versions of these embodiments, the additional natural dialog output provided to the user for audible presentation may request that the additional portion of the spoken utterance include particular data based on the NLU data stream.
In some implementations, a method implemented by one or more processors is provided and includes: processing the audio data stream using an Automatic Speech Recognition (ASR) model to generate an ASR output stream, the audio data stream generated by one or more microphones of the client device, and the audio data stream capturing a portion of a spoken utterance of the user, the portion of the spoken utterance directed to an automated assistant implemented at least in part at the client device; processing the ASR output stream using a Natural Language Understanding (NLU) model to generate an NLU output stream; determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance; and in response to determining that the user has paused providing the spoken utterance and has not completed providing the spoken utterance: determining a natural dialog output provided to the user for audible presentation that is provided to the user for audible presentation to indicate that the automated assistant is waiting for the user to complete providing the spoken utterance; causing natural conversation output to be provided to a user for audible presentation via one or more speakers of a client device; responsive to determining that after causing natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device, the user has not completed providing the spoken utterance within a threshold duration of time: determining whether the automated assistant can initiate fulfillment of the spoken utterance based at least on the NLU data stream; and in response to determining, based on the NLU data stream, that the automated assistant can initiate fulfillment of the spoken utterance: causing the automated assistant to initiate fulfillment of the spoken utterance.
These and other embodiments of the technology disclosed herein may optionally include one or more of the following features.
In some implementations, the method can further include determining an audio-based characteristic associated with the portion of the spoken utterance based on processing the audio data stream. Determining whether the user has paused or completed providing the spoken utterance may be based on audio-based characteristics associated with portions of the spoken utterance.
In some implementations, determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance can be based on the NLU data stream.
In some embodiments, the method may further comprise: responsive to determining, based on the NLU data stream, that the automated assistant cannot initiate fulfillment of the spoken utterance: determining an additional natural dialog output provided to the user for audible presentation, the additional natural dialog output provided to the user for audible presentation to request that the user complete providing the spoken utterance; and causing the additional natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device. In some versions of these embodiments, the additional natural dialog output provided to the user for audible presentation may request that an additional portion of the spoken utterance include particular data based on the NLU data stream.
In some implementations, determining whether the automated assistant is capable of initiating fulfillment of the spoken utterance may be further based on one or more computational costs associated with fulfillment of the spoken utterance. In some versions of these implementations, the one or more computational costs associated with the fulfillment of the spoken utterance may include one or more of: a computational cost associated with performing the fulfillment of the spoken utterance, and a computational cost associated with revoking the performance of the spoken utterance.
In some implementations, the method can further include causing a fulfillment data stream to be generated based on the NLU output stream. Determining that the automated assistant is able to initiate fulfillment of the spoken utterance may be further based on the fulfillment data stream.
In some implementations, a method implemented by one or more processors is provided and includes: receiving an audio data stream, the audio data stream generated by one or more microphones of a client device of a user, and the audio data stream capturing at least a portion of a spoken utterance provided by the user, the at least a portion of the spoken utterance directed to an automated assistant implemented at least in part at the client device; determining an audio-based characteristic associated with the portion of the spoken utterance based on processing the audio data stream; determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on audio-based characteristics associated with portions of the spoken utterance; and in response to determining that the user has paused providing the spoken utterance and has not completed providing the spoken utterance: determining a natural dialog output provided to the user for audible presentation, the natural dialog output provided to the user for audible presentation to indicate that the automated assistant is waiting for the user to complete providing the spoken utterance; and causing natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
Additionally, some implementations include one or more processors (e.g., central processing unit(s) (CPU), graphics processing unit(s) (GPU), and/or tensor processing unit(s) (TPU)) of one or more computing devices, wherein the one or more processors are operable to execute instructions stored in an associated memory, and wherein the instructions are configured to cause performance of any of the methods described above. Some embodiments also include one or more non-transitory computer-readable storage media storing computer instructions executable by one or more processors to perform any of the methods described above. Some embodiments also include a computer program product comprising instructions executable by one or more processors to perform any of the methods described above.

Claims (33)

1. A method implemented by one or more processors, the method comprising:
processing an audio data stream using an Automatic Speech Recognition (ASR) model to generate an ASR output stream, the audio data stream generated by one or more microphones of a user's client device, and the audio data stream capturing a portion of a spoken utterance provided by the user, the portion of the spoken utterance directed to an automated assistant implemented at least in part at the client device;
Processing the ASR output stream using a Natural Language Understanding (NLU) model to generate an NLU output stream;
determining an audio-based characteristic associated with the portion of the spoken utterance based on processing the audio data stream;
determining, based on the audio-based characteristics associated with the portion of the spoken utterance, whether the user has paused providing the spoken utterance or has completed providing the spoken utterance; and
responsive to determining that the user has paused providing the spoken utterance, and responsive to determining, based at least on the NLU output stream, that the automated assistant is capable of initiating fulfillment of the spoken utterance:
determining a natural dialog output to be provided to the user for audible presentation, the natural dialog output to be provided to the user for audible presentation to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance; and
causing the natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
2. The method of claim 1, wherein causing the natural conversation output to be provided to the user via the one or more speakers of the client device for audible presentation is further responsive to determining that the user has paused providing the spoken utterance for a threshold duration.
3. The method of claim 1 or claim 2, wherein determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on the audio-based characteristic associated with the portion of the spoken utterance comprises:
processing the audio-based characteristics associated with the portion of the spoken utterance using an audio-based classification Machine Learning (ML) model to generate an output; and
determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on the output generated using the audio-based classification ML model.
4. The method of any preceding claim, further comprising:
causing a fulfillment data stream to be generated based on the NLU output stream,
wherein determining that the automated assistant is capable of initiating fulfillment of the spoken utterance is further based on the fulfillment data stream.
5. The method of claim 4, further comprising:
in response to determining that the user has completed providing the spoken utterance:
causing the automated assistant to initiate fulfillment of the spoken utterance based on the fulfillment data stream.
6. The method of claim 4, further comprising:
maintaining one or more automated assistant components that utilize the ASR model active while causing the natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
7. The method of claim 4, further comprising:
determining whether the spoken utterance includes a particular word or phrase based on the ASR output stream; and
responsive to determining that the spoken utterance includes the particular word or phrase:
avoiding determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance based on the audio-based characteristic associated with the portion of the spoken utterance; and
causing the automated assistant to initiate fulfillment of the spoken utterance based on the fulfillment data stream.
8. The method of claim 4, further comprising:
determining whether the user has continued to provide the spoken utterance for a threshold duration after causing the natural dialog output to be provided to the user via the one or more speakers of the client device for audible presentation; and
in response to determining that the user has not continued to provide the spoken utterance for the threshold duration:
determining whether the automated assistant is capable of initiating fulfillment of the spoken utterance based on the NLU output stream and/or the fulfillment data stream; and
responsive to determining, based on the fulfillment data stream, that the automated assistant is capable of initiating fulfillment of the spoken utterance:
causing the automated assistant to initiate fulfillment of the spoken utterance based on the fulfillment data stream.
9. The method of any preceding claim, further comprising:
determining whether the user has continued to provide the spoken utterance for a threshold duration after causing the natural dialog output to be provided to the user via the one or more speakers of the client device for audible presentation; and
in response to determining that the user does not continue to provide the spoken utterance:
determining additional natural dialog output to be provided to the user for audible presentation, the additional natural dialog output to be provided to the user for audible presentation to request that the user complete providing the spoken utterance; and
causing the additional natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
10. The method of any preceding claim, further comprising:
causing one or more graphical elements to be provided to the user for visual presentation via a display of the client device while causing the natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device, the one or more graphical elements to be provided to the user for visual presentation to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance.
11. The method of claim 10, wherein the ASR output includes a streaming transcription corresponding to the portion of the spoken utterance captured in the audio data stream, and further comprising:
causing the streaming transcription to be provided to the user via the display of the client device for visual presentation while causing the natural conversation output to be provided to the user via one or more speakers of the client device for audible presentation, wherein the one or more graphical elements are prefixed or appended to the streaming transcription that is provided to the user via the display of the client device for visual presentation.
12. The method of any preceding claim, further comprising:
causing one or more Light Emitting Diodes (LEDs) of the client device to be illuminated while causing the natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device, the one or more LEDs being illuminated to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance.
13. The method of any preceding claim, wherein the audio-based characteristics associated with the portion of the spoken utterance include one or more of: intonation, tone, accent, rhythm, beat, pitch, pause, one or more grammars associated with pause, and elongated syllables.
14. The method of any preceding claim, wherein determining the natural dialog output to be provided to the user for audible presentation comprises:
maintaining a set of natural dialog outputs in an on-device memory of the client device; and
selecting the natural dialog output from the set of natural dialog outputs based on the audio-based characteristics associated with the portion of the spoken utterance.
15. The method of any preceding claim, wherein causing the natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device comprises:
causing the natural conversation output to be provided to the user for audible presentation via the one or more speakers of the client device at a lower volume than other outputs provided to the user for audible presentation.
16. The method of any preceding claim, wherein causing the natural conversation output to be provided to the user for audible presentation via the one or more speakers of the client device comprises:
processing the natural dialog output using a text-to-speech (TTS) model to generate synthesized speech audio data comprising the natural dialog output; and
causing the synthesized speech audio data to be provided to the user for audible presentation via the one or more speakers of the client device.
17. The method of any preceding claim, wherein causing the natural conversation output to be provided to the user for audible presentation via the one or more speakers of the client device comprises:
Obtaining synthesized speech audio data including the natural dialog output from an on-device memory of the client device; and
causing the synthesized speech audio data to be provided to the user for audible presentation via the one or more speakers of the client device.
18. The method of any preceding claim, wherein the one or more processors are implemented locally at the client device of the user.
19. A method implemented by one or more processors, the method comprising:
processing an audio data stream using an Automatic Speech Recognition (ASR) model to generate an ASR output stream, the audio data stream generated by one or more microphones of a client device of a user and capturing a portion of a spoken utterance of the user, the portion of the spoken utterance directed to an automated assistant implemented at least in part at the client device;
processing the ASR output stream using a Natural Language Understanding (NLU) model to generate an NLU output stream;
determining, based at least on the NLU output stream, whether the user has paused providing the spoken utterance or has completed providing the spoken utterance; and
Responsive to determining that the user has paused providing the spoken utterance and has not completed providing the spoken utterance:
determining a natural dialog output to be provided to the user for audible presentation, the natural dialog output to be provided to the user for audible presentation to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance; and
causing the natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
20. The method of claim 19, wherein determining, based on the NLU output stream, whether the user has paused providing the spoken utterance or has completed providing the spoken utterance comprises:
determining whether the automated assistant is capable of initiating fulfillment of the spoken utterance based on the NLU output stream,
wherein determining that the user has paused providing the spoken utterance includes determining, based on the NLU output stream, that the automated assistant cannot initiate fulfillment of the spoken utterance.
21. The method of claim 20, further comprising:
determining whether the user has continued to provide the spoken utterance for a threshold duration after causing the natural dialog output to be provided to the user via the one or more speakers of the client device for audible presentation; and
In response to determining that the user does not continue to provide the spoken utterance:
determining additional natural dialog output to be provided to the user for audible presentation, the additional natural dialog output to be provided to the user for audible presentation to request that the user complete providing the spoken utterance; and
causing the additional natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
22. The method of claim 21, wherein the additional natural dialog output to be provided to the user for audible presentation requests that an additional portion of the spoken utterance include particular data based on the NLU output stream.
23. A method implemented by one or more processors, the method comprising:
processing an audio data stream using an Automatic Speech Recognition (ASR) model to generate an ASR output stream, the audio data stream generated by one or more microphones of a client device of a user and capturing a portion of a spoken utterance of the user, the portion of the spoken utterance directed to an automated assistant implemented at least in part at the client device;
Processing the ASR output stream using a Natural Language Understanding (NLU) model to generate an NLU output stream;
determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance; and
responsive to determining that the user has paused providing the spoken utterance and has not completed providing the spoken utterance:
determining a natural dialog output to be provided to the user for audible presentation, the natural dialog output to be provided to the user for audible presentation to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance;
causing the natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device; and
in response to determining that the user has not completed providing the spoken utterance within a threshold duration after causing the natural dialog output to be provided to the user for audible presentation via the one or more speakers of the client device:
determining, based at least on the NLU output stream, whether the automated assistant is capable of initiating fulfillment of the spoken utterance; and
responsive to determining, based on the NLU output stream, that the automated assistant is capable of initiating fulfillment of the spoken utterance:
Causing the automated assistant to initiate fulfillment of the spoken utterance.
24. The method of claim 23, further comprising:
determining audio-based characteristics associated with the portion of the spoken utterance based on processing the audio data stream,
wherein determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance is based on the audio-based characteristic associated with the portion of the spoken utterance.
25. The method of claim 23 or claim 24, wherein determining whether the user has paused providing the spoken utterance or has completed providing the spoken utterance is based on the NLU output stream.
26. The method of any of claims 23 to 25, further comprising:
responsive to determining, based on the NLU output stream, that the automated assistant cannot initiate fulfillment of the spoken utterance:
determining an additional natural dialog output to be provided to the user for audible presentation, the additional natural dialog output to be provided to the user for audible presentation to request that the user complete providing the spoken utterance; and
causing the additional natural dialog output to be provided to the user for audible presentation via one or more speakers of the client device.
27. The method of claim 26, wherein the additional natural dialog output to be provided to the user for audible presentation requests that an additional portion of the spoken utterance include particular data based on the NLU output stream.
28. The method of any of claims 23-27, wherein determining whether the automated assistant is capable of initiating fulfillment of the spoken utterance is further based on one or more computational costs associated with fulfillment of the spoken utterance.
29. The method of claim 28, wherein the one or more computational costs associated with fulfillment of the spoken utterance include one or more of: a computational cost associated with performing the fulfillment of the spoken utterance, and a computational cost associated with revoking the performance of the spoken utterance.
30. The method of any of claims 23 to 29, further comprising:
causing a fulfillment data stream to be generated based on the NLU output stream,
Wherein determining that the automated assistant is capable of initiating fulfillment of the spoken utterance is further based on the fulfillment data stream.
31. A method implemented by one or more processors, the method comprising:
receiving an audio data stream, the audio data stream generated by one or more microphones of a user's client device, and the audio data stream capturing at least a portion of a spoken utterance provided by the user, the at least a portion of the spoken utterance directed to an automated assistant implemented at least in part at the client device;
determining an audio-based characteristic associated with the portion of the spoken utterance based on processing the audio data stream;
determining, based on the audio-based characteristics associated with the portion of the spoken utterance, whether the user has paused providing the spoken utterance or has completed providing the spoken utterance; and
responsive to determining that the user has paused providing the spoken utterance and has not completed providing the spoken utterance:
determining a natural dialog output to be provided to the user for audible presentation, the natural dialog output to be provided to the user for audible presentation to indicate that the automated assistant is waiting for the user to finish providing the spoken utterance; and
causing the natural conversation output to be provided to the user for audible presentation via one or more speakers of the client device.
32. A system, comprising:
at least one processor; and
a memory storing instructions that, when executed, cause the at least one processor to perform operations corresponding to any of claims 1 to 31.
33. A non-transitory computer-readable storage medium storing instructions that, when executed, cause at least one processor to perform operations corresponding to any of claims 1-31.
CN202180096651.3A 2021-08-17 2021-11-29 Enabling natural conversations with soft endpoints for automated assistants Pending CN117121100A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/233,877 2021-08-17
US17/532,819 US12020703B2 (en) 2021-08-17 2021-11-22 Enabling natural conversations with soft endpointing for an automated assistant
US17/532,819 2021-11-22
PCT/US2021/060987 WO2023022743A1 (en) 2021-08-17 2021-11-29 Enabling natural conversations with soft endpointing for an automated assistant

Publications (1)

Publication Number Publication Date
CN117121100A true CN117121100A (en) 2023-11-24

Family

ID=88798799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180096651.3A Pending CN117121100A (en) 2021-08-17 2021-11-29 Enabling natural conversations with soft endpoints for automated assistants

Country Status (1)

Country Link
CN (1) CN117121100A (en)

Similar Documents

Publication Publication Date Title
US10891952B2 (en) Speech recognition
US20230074406A1 (en) Using large language model(s) in generating automated assistant response(s
US10192550B2 (en) Conversational software agent
US10140988B2 (en) Speech recognition
AU2021463794B2 (en) Using large language model(s) in generating automated assistant response(s)
EP4091161B1 (en) Synthesized speech audio data generated on behalf of human participant in conversation
US20230343324A1 (en) Dynamically adapting given assistant output based on a given persona assigned to an automated assistant
US12057119B2 (en) Contextual suppression of assistant command(s)
US20220366905A1 (en) Enabling natural conversations for an automated assistant
US20230395066A1 (en) Hot-word free pre-emption of automated assistant response presentation
US12020703B2 (en) Enabling natural conversations with soft endpointing for an automated assistant
US20230061929A1 (en) Dynamically configuring a warm word button with assistant commands
US20240312460A1 (en) Enabling natural conversations with soft endpointing for an automated assistant
CN117121100A (en) Enabling natural conversations with soft endpoints for automated assistants
US20240347060A1 (en) Contextual suppression of assistant command(s)
WO2024054271A1 (en) System(s) and method(s) for causing contextually relevant emoji(s) to be visually rendered for presentation to user(s) in smart dictation
EP4356372A1 (en) System(s) and method(s) for causing contextually relevant emoji(s) to be visually rendered for presentation to user(s) in smart dictation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination