CN112071302A

CN112071302A - Synthesized voice selection for computing agents

Info

Publication number: CN112071302A
Application number: CN202010767926.9A
Authority: CN
Inventors: 瓦莱里·尼高; 波格丹·卡普里塔; 罗伯特·斯特茨; 塞苏雷什·克里希纳库马兰; 贾森·布兰特·道格拉斯
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2016-10-03
Filing date: 2017-09-29
Publication date: 2020-12-11
Also published as: JP2022040183A; EP4109375A1; US20230274205A1; JP6882463B2; JP7005694B2; KR20190054174A; CN109804428B; WO2018067404A1; WO2018067403A1; WO2018067402A1; EP3504705A1; EP3504705B1; CN109804428A; JP2019535037A; JP2020173462A; JP7108122B2; CN109844855B; CN109844855A

Abstract

This application relates to synthesized voice selection for computing agents. An example method includes: receiving, by a computing assistant executing at one or more processors, a representation of an utterance spoken at a computing device; selecting an agent from a plurality of agents based on the utterance, wherein the plurality of agents includes one or more first-party agents and a plurality of third-party agents; in response to determining that the selected agent comprises a first-party agent, selecting a reserved voice from a plurality of voices; and outputting synthesized audio data using the selected voice to satisfy the utterance.

Description

Synthesized voice selection for computing agents

Divisional Application Statement

This application is a divisional of Chinese patent application No. 201780061508.4, filed on September 29, 2017.

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 62/403,665, filed October 3, 2016, the entire contents of which are incorporated herein by reference.

Background

Some computing platforms may provide a user interface from which a user may chat, speak, or otherwise communicate with a virtual computing assistant (e.g., also referred to as an "intelligent personal assistant" or simply "assistant") to cause the assistant to output useful information, respond to user requirements, or otherwise perform certain operations to assist the user in completing various real-world or virtual tasks. For example, a computing device may receive speech input (e.g., audio data) corresponding to a user utterance with a microphone. An assistant executing at least in part at a computing device can analyze the speech input and attempt to satisfy the utterance by: outputting useful information based on the utterance, responding to user needs indicated by the utterance, or otherwise performing some operation to assist the user in completing various real-world or virtual tasks in accordance with the utterance.

Disclosure of Invention

In general, the techniques of the present invention may enable a user to communicate with multiple virtual computing agents/assistants. For example, there may be several agents available to a user via a computing device that may be able to respond, at least to some extent, to utterances (e.g., requests, questions, queries, subscriptions, etc.). The agent may respond to the utterance or otherwise converse with the user by at least causing the computing device to output the synthesized audio data. For example, the agent may provide text on which the computing device executes text-to-speech (TTS) to generate synthesized audio data. However, it may be desirable for different agents to use different voices, as opposed to synthesized audio data that is generated for all agents using the same voice. In this way, an adaptive interface is provided in which the output of data is adjusted based on the data itself.

In accordance with one or more techniques of this disclosure, agents may cause the computing device to output synthesized audio data using different voices. For example, a first agent may cause a computing device to output synthesized audio data using a first voice, and a second agent may cause the computing device to output synthesized audio data using a second voice that is different from the first voice. By having different agents use different voices when communicating with a user via a particular computing device, the user can better track which agent the user is communicating with. As a result, the user may avoid having to repeat utterances, the processing of which consumes power and other system resources. In this manner, the techniques of this disclosure may reduce power consumption and/or system resource requirements for agent interactions.

In one example, a method comprises: receiving, by a computing assistant executing at one or more processors, a representation of an utterance spoken at a computing device; selecting an agent from a plurality of agents based on the utterance, wherein the plurality of agents includes one or more first-party agents and a plurality of third-party agents; in response to determining that the selected agent comprises a first-party agent, selecting a reserved voice from a plurality of voices; and outputting, using the selected voice, synthesized audio data for playback by one or more speakers of the computing device to satisfy the utterance.

In another example, an apparatus includes: at least one processor; and at least one memory comprising instructions that, when executed, cause the at least one processor to execute an assistant configured to: receive, from one or more microphones operatively connected to a computing device, a representation of an utterance spoken at the computing device; and select an agent from a plurality of agents based on the utterance, wherein the plurality of agents includes one or more first-party agents and a plurality of third-party agents. The memory further includes instructions that, when executed, cause the at least one processor to: in response to determining that the selected agent comprises a first-party agent, select a reserved voice from a plurality of voices; and output, using the selected voice, synthesized audio data for playback by one or more speakers operatively connected to the computing device to satisfy the utterance.

In another example, a system includes: one or more communication units; at least one processor; and at least one memory including instructions that, when executed, cause the at least one processor to execute an assistant configured to: receive, from one or more microphones operatively connected to a computing device, a representation of an utterance spoken at the computing device; and select an agent from a plurality of agents based on the utterance, wherein the plurality of agents includes one or more first-party agents and a plurality of third-party agents. The memory further includes instructions that, when executed, cause the at least one processor to: in response to determining that the selected agent comprises a first-party agent, select a reserved voice from a plurality of voices; and output, using the selected voice, synthesized audio data for playback by one or more speakers operatively connected to the computing device to satisfy the utterance.

In another example, a system comprises: means for receiving, by a computing assistant executing at one or more processors, a representation of an utterance spoken at a computing device; means for selecting an agent from a plurality of agents based on the utterance, wherein the plurality of agents includes one or more first-party agents and a plurality of third-party agents; means for selecting a reserved voice from a plurality of voices in response to determining that the selected agent comprises a first-party agent; and means for outputting, using the selected voice, synthesized audio data for playback by one or more speakers operatively connected to the computing device to satisfy the utterance.

In another example, a computer-readable storage medium stores instructions that, when executed, cause one or more processors to execute an assistant configured to: receive a representation of an utterance spoken at a computing device; and select an agent from a plurality of agents based on the utterance, wherein the plurality of agents includes one or more first-party agents and a plurality of third-party agents. The storage medium further includes instructions that, when executed, cause the one or more processors to: in response to determining that the selected agent comprises a first-party agent, select a reserved voice from a plurality of voices; and output, using the selected voice, synthesized audio data for playback to satisfy the utterance.

The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the disclosure will be apparent from the description and drawings, and from the claims.

Drawings

Fig. 1 is a conceptual diagram illustrating an example system executing an example virtual assistant in accordance with one or more aspects of the present disclosure.

Fig. 2 is a block diagram illustrating an example computing device configured to execute an example virtual assistant in accordance with one or more aspects of the present disclosure.

Fig. 3 is a block diagram illustrating an example computing system configured to execute an example virtual assistant in accordance with one or more aspects of the present disclosure.

Fig. 4 is a block diagram illustrating an example computing system configured to execute an example third party agent in accordance with one or more aspects of the present disclosure.

Fig. 5 is a flowchart illustrating example operations performed by one or more processors executing an example virtual assistant in accordance with one or more aspects of the present disclosure.

Figs. 6A-6C are flow diagrams illustrating example operations performed by one or more processors to select a virtual agent to perform a task in accordance with one or more aspects of the present disclosure.

FIG. 7 is a flow diagram illustrating example operations performed by one or more processors to facilitate task execution of a plurality of virtual agents in accordance with one or more aspects of the present invention.

FIG. 8 is a flow diagram illustrating example operations performed by one or more processors to select a voice for use in outputting synthesized audio data for text generated by a virtual agent in accordance with one or more aspects of the present invention.

Detailed Description

In general, the techniques of this disclosure may enable a virtual computing assistant (e.g., also referred to as an "intelligent personal assistant" or simply "assistant") to manage multiple agents in response to user input (e.g., to satisfy user utterances or text input). For example, a computing device may receive, via a microphone, speech input (e.g., audio data) corresponding to a user utterance. An agent selection module may analyze the speech input and select an agent from a plurality of agents to satisfy the utterance. The plurality of agents may include one or more first party (1P) agents and one or more third party (3P) agents. A 1P agent may be included within the assistant and/or may share a common publisher with the assistant, the agent selection module, and/or an operating system of the computing device that received the speech input.

To perform the selection, the agent selection module may determine whether to satisfy the utterance using a 1P agent, a 3P agent, or some combination of 1P agents and 3P agents. Where the agent selection module determines to satisfy the utterance at least in part using a 3P agent, the agent selection module may rank one or more 3P agents based on the utterance.
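The ranking step can be illustrated with a minimal sketch. The passage above does not say how 3P agents are scored, so the word-overlap score, the function name `rank_third_party_agents`, and the agent names below are all assumptions made purely for illustration.

```python
def rank_third_party_agents(utterance, agents):
    """Rank 3P agents by word overlap between the utterance and their trigger phrases.

    `agents` maps a hypothetical agent name to a list of trigger phrases.
    The overlap score is an assumed stand-in for whatever ranking is really used.
    """
    words = set(utterance.lower().split())
    scored = []
    for name, phrases in agents.items():
        phrase_words = set(" ".join(phrases).lower().split())
        score = len(words & phrase_words)
        if score > 0:
            # Agents with no overlap are treated as unable to satisfy the utterance.
            scored.append((score, name))
    # Highest score first; ties are broken alphabetically so the order is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


agents = {
    "pizza_agent": ["order pizza"],
    "food_agent": ["order food"],
    "taxi_agent": ["book ride", "call taxi"],
}
ranking = rank_third_party_agents("order a pizza", agents)
```

Here `ranking` lists the agent whose trigger phrases best match the utterance first and omits agents with no match at all.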

The selected agent (a 1P agent, a 3P agent, or some combination of 1P agents and 3P agents) may attempt to satisfy the utterance. For example, the selected agent may perform one or more actions to satisfy the utterance (e.g., output information based on the utterance, respond to user needs indicated by the utterance, or otherwise perform certain operations based on the utterance to help the user complete various real-world or virtual tasks).

In some examples, there may be an indication of the type of agent performing the action. For example, where the one or more actions include "speaking" to the user, 1P agents and 3P agents may use different voices. As one example, all 1P agents may use a reserved voice of the plurality of voices, while 3P agents may use other voices of the plurality of voices but may be prohibited from using the reserved voice. Where the one or more actions include text interaction with the user, the agents may use different identifiers (e.g., "Agent 1: I have made your dinner reservation" and "Agent 2: I have moved $100 from your checking account to your savings account"), a different font for each agent, and so on.
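The reserved-voice rule can be sketched in a few lines. The voice identifiers, the `Agent` record, and the fallback used when a 3P agent asks for a voice it may not use are assumptions for illustration; the passage above specifies only that 1P agents use the reserved voice and 3P agents may not.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical voice identifiers; the text names no specific voices.
RESERVED_VOICE = "reserved"
AVAILABLE_VOICES = ["reserved", "voice_a", "voice_b"]


@dataclass
class Agent:
    name: str
    first_party: bool
    requested_voice: Optional[str] = None  # voice a 3P agent asks for, if any


def select_voice(agent: Agent) -> str:
    """Return the voice used to synthesize audio for the given agent."""
    if agent.first_party:
        # All 1P agents share the reserved voice.
        return RESERVED_VOICE
    # 3P agents may use any other available voice, never the reserved one.
    allowed = [voice for voice in AVAILABLE_VOICES if voice != RESERVED_VOICE]
    if agent.requested_voice in allowed:
        return agent.requested_voice
    # Assumed fallback: the first non-reserved voice.
    return allowed[0]
```

With this rule, a request from a 3P agent for the reserved voice is never honored, so a user hearing the reserved voice can infer that a 1P agent is speaking.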

Throughout this disclosure, examples are described in which a computing device and/or computing system analyzes information (e.g., context, location, communications, contacts, chat sessions, voice sessions, etc.) associated with the computing device and a user of the computing device only if the computing device receives permission from the user to analyze the information. For example, in the situations discussed below, before an assistant executing at a computing device or computing system can collect or make use of information associated with a user, the user may be provided with an opportunity to provide input to control whether the assistant (or other programs or features of the computing device and/or computing system) can collect and make use of user information, or to dictate whether and/or how the computing device and/or computing system may receive content that may be relevant to the user. In addition, certain data may be encrypted and/or treated in one or more ways before it is stored or used by the assistant or the underlying computing device and/or computing system, so that personally identifiable information is removed. For example, a user's identity may be treated so that no personally identifiable information can be determined about the user, or a user's geographic location may be generalized where location information is obtained (e.g., to a city, zip code, or state level, rather than a coordinate location or physical address), so that a particular location of the user cannot be determined. Thus, the user may have control over how information about the user is collected and used by the assistant and by the underlying computing device and computing system that executes the assistant.
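As a rough sketch of the location generalization described above, a precise location record can be reduced to only its coarse fields before it is stored or used. The field names, the `generalize_location` helper, and the sample record are hypothetical.

```python
# Fields retained at each level of generalization (assumed field names).
COARSE_FIELDS = {
    "city": ("city", "state"),
    "zip_code": ("zip_code", "state"),
    "state": ("state",),
}


def generalize_location(location, level="city"):
    """Return a copy of `location` keeping only the fields allowed at `level`.

    Coordinates and street addresses are never retained.
    """
    keep = COARSE_FIELDS[level]
    return {key: location[key] for key in keep if key in location}


precise = {
    "latitude": 39.7817,
    "longitude": -89.6501,
    "street": "1 Example St",
    "city": "Springfield",
    "zip_code": "62701",
    "state": "IL",
}
coarse = generalize_location(precise, level="city")
```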

Fig. 1 is a conceptual diagram illustrating an example system executing an example virtual assistant in accordance with one or more aspects of the present disclosure. The system 100 of FIG. 1 includes an assistant server system 160 in communication with a search server system 180, third party (3P) proxy server systems 170A-170N (collectively, "3P proxy server systems 170"), and a computing device 110 via a network 130. Although system 100 is shown as being distributed among assistant server system 160, 3P proxy server system 170, search server system 180, and computing device 110, in other examples features and techniques attributed to system 100 may be performed internally by local components of computing device 110. Similarly, assistant server system 160 and/or 3P proxy server system 170 may include certain components and perform various techniques that are otherwise attributed to search server system 180 and/or computing device 110 in the following description.

Network 130 represents any public or private communication network, such as a cellular, Wi-Fi, and/or other type of network, for transmitting data between computing systems, servers, and computing devices. Assistant server system 160 may exchange data with computing device 110 via network 130 to provide virtual assistant services that are accessible to computing device 110 when computing device 110 is connected to network 130. Similarly, 3P proxy server system 170 may exchange data with computing device 110 via network 130 to provide virtual agent services that are accessible to computing device 110 when computing device 110 is connected to network 130. Assistant server system 160, computing device 110, and 3P proxy server system 170 may each exchange data with search server system 180 via network 130 to access search services provided by search server system 180.

Network 130 may include one or more hubs, network switches, network routers, or any other network devices operatively coupled to each other to provide for the exchange of information between server systems 160, 170, and 180 and computing device 110. Computing device 110, assistant server system 160, 3P proxy server system 170, and search server system 180 may send and receive data over network 130 using any suitable communication technology. Computing device 110, assistant server system 160, 3P proxy server system 170, and search server system 180 may each be operatively coupled to network 130 using respective network links. The links coupling computing device 110, assistant server system 160, 3P proxy server system 170, and search server system 180 to network 130 may be Ethernet or other types of network connections, and such connections may be wireless and/or wired connections.

Assistant server system 160, 3P proxy server system 170, and search server system 180 represent any suitable remote computing systems, such as one or more desktop computers, laptop computers, mainframes, servers, cloud computing systems, etc., capable of sending and receiving information to and from a network, such as network 130. Assistant server system 160 hosts (or at least provides access to) virtual assistant services. 3P proxy server system 170 hosts (or at least provides access to) virtual agents. Search server system 180 hosts (or at least provides access to) a search service. In some examples, assistant server system 160, 3P proxy server system 170, and search server system 180 represent cloud computing systems that provide access to their respective services through the cloud.

Computing device 110 represents a stand-alone mobile or non-mobile computing device. Examples of computing device 110 include a mobile phone, a tablet computer, a laptop computer, a desktop device, a set-top box, a television, a wearable device (e.g., a computerized watch, computerized eyewear, computerized gloves, etc.), a home automation device or system (e.g., a smart thermostat or home assistant device), a Personal Digital Assistant (PDA), a gaming system, a media player, an electronic book reader, a mobile television platform, a car navigation or infotainment system, or any other type of mobile, non-mobile, wearable, and non-wearable computing device configured to execute or access a virtual assistant and receive information via a network, such as network 130.

Computing device 110 includes user interface device (UID) 112, user interface (UI) module 120, and local assistant module 122A. Modules 120 and 122A may perform the operations described using software, hardware, firmware, or a mixture of hardware, software, and firmware residing in and/or executing at computing device 110. Computing device 110 may execute modules 120 and 122A with multiple processors or multiple devices. Computing device 110 may execute modules 120 and 122A as virtual machines executing on underlying hardware. Modules 120 and 122A may execute as one or more services of an operating system or computing platform. Modules 120 and 122A may execute as one or more executable programs at an application layer of a computing platform.

Computing device 110 may communicate with assistant server system 160, 3P proxy server system 170, and/or search server system 180 via network 130 to access the virtual assistant services provided by assistant server system 160, the virtual agents provided by 3P proxy server system 170, and/or the search services provided by search server system 180. In the course of providing virtual assistant services, assistant server system 160 may communicate with search server system 180 via network 130 to obtain search results for providing a user of the virtual assistant services with information to complete a task. In the course of providing virtual assistant services, assistant server system 160 may also communicate with 3P proxy server system 170 via network 130 to engage one or more of the virtual agents provided by 3P proxy server system 170 to provide additional assistance to the user of the virtual assistant services. In the course of providing the additional assistance, 3P proxy server system 170 may communicate with search server system 180 via network 130 to obtain search results for providing the user of the agent with information to complete a task.

In the example of FIG. 1, assistant server system 160 includes remote assistant module 122B and agent index 124B. Remote assistant module 122B may maintain remote agent index 124B as part of a virtual assistant service that assistant server system 160 provides via network 130 (e.g., to computing device 110). Computing device 110 includes user interface device (UID) 112, user interface (UI) module 120, local assistant module 122A, and agent index 124A. Local assistant module 122A may maintain agent index 124A as part of a virtual assistant service that executes locally at computing device 110. Remote assistant module 122B and local assistant module 122A may be collectively referred to as assistant module 122. Local agent index 124A and remote agent index 124B may be collectively referred to as agent index 124.

Modules 122B, 128Ab-128Nb (collectively, "3P proxy modules 128B"), and 182 may perform the operations described using software, hardware, firmware, or a mixture of hardware, software, and firmware residing in and/or executing at assistant server system 160, 3P proxy server system 170, and search server system 180, respectively. Assistant server system 160, 3P proxy server system 170, and search server system 180 may execute modules 122B, 128B, and 182, respectively, with multiple processors, with multiple devices, as virtual machines executing on underlying hardware, or as one or more services of an operating system or computing platform. In some examples, modules 122B, 128B, and 182 may execute as one or more executable programs at an application layer of the computing platforms of assistant server system 160, 3P proxy server system 170, and search server system 180, respectively.

UID 112 of computing device 110 may function as an input and/or output device for computing device 110. UID 112 may be implemented using various technologies. For example, UID 112 may function as an input device using a presence-sensitive input screen, such as a resistive touch screen, a surface acoustic wave touch screen, a capacitive touch screen, a projected capacitive touch screen, a pressure-sensitive screen, an acoustic pulse recognition touch screen, or another presence-sensitive display technology. In addition, UID 112 may include microphone technology, infrared sensor technology, or other input device technology for receiving user input.

UID 112 may function as an output (e.g., display) device using any one or more display devices, such as a liquid crystal display (LCD), dot matrix display, light emitting diode (LED) display, organic light emitting diode (OLED) display, electronic ink, or similar monochrome or color display capable of outputting visual information to a user of computing device 110. In addition, UID 112 may include speaker technology, haptic feedback technology, or other output device technology for outputting information to a user.

UID 112 may include a presence-sensitive display that may receive tactile input from a user of computing device 110. UID 112 may receive indications of tactile input by detecting one or more gestures from a user (e.g., the user touching or pointing to one or more locations of UID 112 with a finger or stylus pen). UID 112 may present output to a user, for example, at a presence-sensitive display. UID 112 may present the output as a graphical user interface (e.g., user interface 114) that may be associated with functionality provided by computing device 110 and/or services accessed by computing device 110.

For example, UID 112 may present a user interface (e.g., user interface 114) related to a virtual assistant provided by local assistant module 122A and/or remote assistant module 122B, which UI module 120 accesses on behalf of computing device 110. UID 112 may present a user interface related to other features of a computing platform, operating system, application, and/or service executing at or accessible from computing device 110 (e.g., an email, chat, or other electronic messaging application, an internet browser application, a phone application, a mobile or desktop operating system, etc.).

UI module 120 may manage user interactions with UID 112 and other components of computing device 110, including interacting with assistant server system 160 to provide autonomous search results at UID 112. UI module 120 may cause UID 112 to output a user interface, such as user interface 114 (or other example user interfaces), for display as a user of computing device 110 views output and/or provides input at UID 112. UI module 120 and UID 112 may receive one or more indications of input from a user as the user interacts with the user interface, at different times and when the user and computing device 110 are at different locations. UI module 120 and UID 112 may interpret inputs detected at UID 112 and may relay information about the inputs detected at UID 112 to one or more associated platforms, operating systems, applications, and/or services executing at computing device 110, for example, to cause computing device 110 to perform functions.

UI module 120 may receive information and instructions from one or more associated platforms, operating systems, applications, and/or services executing at computing device 110 and/or one or more remote computing systems, such as server systems 160 and 180. In addition, UI module 120 may act as an intermediary between the one or more associated platforms, operating systems, applications, and/or services executing at computing device 110 and various output devices of computing device 110 (e.g., speakers, LED indicators, audio or haptic output devices, etc.) to produce output (e.g., graphics, flashes of light, sounds, haptic responses, etc.) with computing device 110. In some examples, UI module 120 may perform text-to-speech (TTS). For example, when provided with text (e.g., by another module), UI module 120 may synthesize audio data to speak the text (e.g., read the text aloud).

The local assistant module 122A of the computing device 110 and the remote assistant module 122B of the assistant server system 160 may each perform similar functions described herein for automatically executing an assistant configured to select an agent to satisfy user input (e.g., spoken utterances, text inputs, etc.) received from a user of the computing device. Remote assistant module 122B and agent index 124B represent a server-side or cloud implementation of the example virtual assistant, while local assistant module 122A and agent index 124A represent a client-side or local implementation of the example virtual assistant.

Modules 122A and 122B may each include a respective software agent configured to execute as an intelligent personal assistant that can perform tasks or services for a person, such as a user of computing device 110. Modules 122A and 122B may perform these tasks or services based on user input (e.g., detected at UID 112), location awareness (e.g., based on context), and/or the ability to access other information (e.g., weather or traffic conditions, news, stock prices, sports scores, user schedules, transit schedules, retail prices, etc.) from a variety of information sources (e.g., stored locally at computing device 110 or assistant server system 160, or obtained via the search service provided by search server system 180). Modules 122A and 122B may perform artificial intelligence and/or machine learning techniques to automatically identify and complete one or more tasks on behalf of a user.

In some examples, the assistant provided by module 122 is referred to as a first party (1P) assistant and/or a 1P agent. For example, the agent represented by module 122 may share a common publisher and/or a common developer with the operating system of computing device 110 and/or the owner of assistant server system 160. Thus, in some examples, module 122 may have capabilities that are not available to other agents, such as third party (3P) agents. In some examples, the agents represented by module 122 may not all be 1P agents. For example, the agent represented by local assistant module 122A may be a 1P agent, while the agent represented by remote assistant module 122B may be a 3P agent. In some examples, the assistant provided by module 122 may be referred to as a 1P assistant (e.g., a 1P computing assistant), and module 122 may further provide one or more 1P agents (e.g., agents that share a common publisher and/or a common developer with the 1P computing assistant).

As described above, local assistant module 122A may represent a software agent configured to execute as an intelligent personal assistant that can perform tasks or services for a person, such as a user of computing device 110. However, in some examples, it may be desirable for the assistant to use other agents to perform tasks or services for the person. For example, in some scenarios, it may be desirable to facilitate the use of one or more 3P agents to perform tasks or services for the user of computing device 110. As one example, a 3P agent may be able to perform a particular task more efficiently (e.g., using less computing power, fewer system resources, etc.) than the assistant.

In the example of FIG. 1, 3P proxy server system 170 includes remote 3P proxy module 128b. Remote 3P proxy module 128b may perform functions similar to those described below with respect to local 3P proxy module 128a to automatically execute an agent configured to satisfy utterances received from a user of a computing device, such as computing device 110. In other words, remote 3P proxy module 128b represents a server-side or cloud implementation of an example 3P agent, while local 3P proxy module 128a represents a client-side or local implementation of the example 3P agent.

In some examples, each of modules 128a and 128b (collectively, "modules 128") may represent a software agent configured to execute as an intelligent personal assistant that can perform tasks or services for a person, such as a user of computing device 110. In some examples, each of modules 128 may represent a software agent that may be used by the assistant provided by module 122. In some examples, the assistants and/or agents provided by modules 128 are referred to as third party (3P) assistants and/or 3P agents. For example, the assistant and/or agent represented by modules 128 may not share a common publisher with the operating system of computing device 110 and/or the owner of assistant server system 160. Thus, in some examples, the assistant and/or agent represented by modules 128 may lack capabilities that are available to other assistants and/or agents, such as first party (1P) assistants and/or agents.

In some examples, a 3P agent may be configured for use without user involvement. In some examples, some 3P agents may need to be configured prior to use. For example, when installing a smart lighting dimmer in a home, a user may configure a 3P agent provided by the manufacturer of the smart lighting dimmer for use. The configuration process may involve associating the 3P agent with the 1P assistant (e.g., the user may provide the 1P assistant with account information for the 3P agent) and authorizing (e.g., by the user) the 1P assistant to communicate with the 3P agent on behalf of the user.

Search module 182 may perform a search for information determined to be relevant to a search query that search module 182 automatically generates (e.g., based on contextual information associated with computing device 110) or that search module 182 receives from assistant server system 160, 3P agent server system 170, or computing device 110 (e.g., as part of a task that a virtual assistant performs on behalf of a user of computing device 110). The search module 182 may conduct an internet search or a local device search based on the search query to identify information relevant to the search query. After performing the search, the search module 182 may output information (e.g., search results) returned from the search to one or more of the assistant server system 160, the 3P agent server system 170, or the computing device 110.

One or more components of system 100, such as local assistant module 122A and/or remote assistant module 122B, may maintain agent index 124A and/or agent index 124B (collectively, "agent index 124") to store information related to agents available to a person, such as a user of computing device 110. In some examples, the agent index 124 may store, for each agent, the agent description and the capability list in a semi-structured index of agent information. For example, the agent index 124 may contain a single document with information for each available agent. The document included in the agent index 124 for a particular agent may be composed of information provided by the developer of the particular agent. Some example information fields that may be included in a document or that may be used to construct a document include, but are not limited to: a description of the agent, one or more entry points of the agent, a category of the agent, one or more trigger phrases of the agent, a website associated with the agent, an indication of a voice to use when synthesizing audio data based on text generated by the agent, and/or a list of capabilities of the agent (e.g., a list of tasks or types of tasks that the agent is capable of performing). In some examples, one or more information fields may be written in free-form natural language. In some examples, one or more information fields may be selected from a predefined list. For example, the category field may be selected from a predefined set of categories (e.g., games, productivity, communications). In some examples, the entry point of the agent may be a type of device used to interface with the agent (e.g., a cellular phone). In some examples, the entry point of the agent may be a resource address or other parameter of the agent.
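The per-agent document described above can be pictured as a small record type. The following is a minimal sketch only: the class names `AgentDocument` and `AgentIndex`, the field names, the substring-based trigger lookup, and the exact category list are all illustrative assumptions and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed predefined category list, following the examples given in the text.
ALLOWED_CATEGORIES = {"games", "productivity", "communications"}


@dataclass
class AgentDocument:
    """Hypothetical per-agent document built from developer-provided fields."""
    agent_id: str
    description: str                      # free-form natural language
    category: str                         # chosen from a predefined list
    entry_points: List[str] = field(default_factory=list)
    trigger_phrases: List[str] = field(default_factory=list)
    website: Optional[str] = None
    voice: Optional[str] = None           # voice to use for synthesized audio
    capabilities: List[str] = field(default_factory=list)


class AgentIndex:
    """Hypothetical in-memory index holding one document per agent."""

    def __init__(self) -> None:
        self._docs: Dict[str, AgentDocument] = {}

    def add(self, doc: AgentDocument) -> None:
        # Reject categories that are not in the predefined list.
        if doc.category not in ALLOWED_CATEGORIES:
            raise ValueError(f"unknown category: {doc.category}")
        self._docs[doc.agent_id] = doc

    def get(self, agent_id: str) -> AgentDocument:
        return self._docs[agent_id]

    def agents_for_trigger(self, utterance: str) -> List[str]:
        """Return ids of agents with a trigger phrase contained in the utterance."""
        text = utterance.lower()
        return sorted(
            doc.agent_id
            for doc in self._docs.values()
            if any(phrase.lower() in text for phrase in doc.trigger_phrases)
        )
```

A real index would likely use a text-matching system rather than substring checks; the sketch only shows how the listed fields might sit together in one document.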

In some examples, the agent index 124 may store information related to the use and/or performance of available agents. For example, the agent index 124 may include an agent quality score for each available agent. In some examples, the agent quality score may be determined based on one or more of: whether a particular agent is selected more frequently than competing agents, whether the developer of the agent has produced other high quality agents, whether the developer of the agent has a good (or bad) spam score on other user properties, and whether users typically abandon the agent in the course of execution. In some examples, the agent quality score may be represented as a value between 0 and 1, inclusive.

The agent index 124 may provide a mapping between trigger phrases and agents. As described above, a developer of a particular agent may provide one or more trigger phrases associated with the particular agent. In some examples, to improve the quality of agent selection, the local assistant module 122A may expand the provided trigger phrases. For example, the local assistant module 122A may expand a trigger phrase by expanding the structure of the trigger phrase and by expanding the synonyms of its key concepts. With respect to structure expansion, the local assistant module 122A may insert terms that are commonly used in users' natural utterances (e.g., "please," "could you," etc.) between the compounds of the trigger phrase and then permute the compounds of the phrase.

In some cases, the capability concepts of trigger phrases can be expressed as verbs and nouns. Thus, in some examples, local assistant module 122A may examine query logs of a web search, tag the verbs and nouns of each query (e.g., using a natural language framework), and build verb clusters based on the tagged verbs and nouns. Within each cluster, all verbs can be considered to have a similar meaning in the context of the same noun. Thus, using the verb cluster model, the local assistant module 122A may expand the synonyms of verbs in trigger phrases associated with agents and store the results in the agent index 124 (i.e., as alternative trigger phrases for the agents).
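The two kinds of trigger phrase expansion described above, structure expansion and synonym expansion, can be sketched as follows. This is an illustrative sketch under stated assumptions: the filler list, the hand-written `VERB_CLUSTERS` table (which stands in for clusters mined from query logs), the fixed "verb a noun" phrase shape, and both function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, FrozenSet, List, Set

# Hypothetical filler terms commonly found in natural utterances.
FILLERS = ["please", "could you"]

# Hypothetical verb clusters keyed by noun; a real system would derive these
# from tagged query logs rather than hard-coding them.
VERB_CLUSTERS: Dict[str, List[FrozenSet[str]]] = {
    "taxi": [frozenset({"book", "order", "reserve"})],
}


def expand_structure(compounds: List[str]) -> Set[str]:
    """Permute the compounds and optionally insert one filler between them."""
    variants: Set[str] = set()
    for order in permutations(compounds):
        variants.add(" ".join(order))
        for gap in range(1, len(order)):
            for filler in FILLERS:
                variants.add(" ".join(order[:gap] + (filler,) + order[gap:]))
    return variants


def expand_synonyms(verb: str, noun: str) -> Set[str]:
    """Swap the verb for every verb in the same cluster for this noun."""
    phrases = {f"{verb} a {noun}"}
    for cluster in VERB_CLUSTERS.get(noun, []):
        if verb in cluster:
            phrases.update(f"{other} a {noun}" for other in cluster)
    return phrases
```

The expanded phrases would then be stored alongside the developer-provided phrases as alternative triggers for the same agent.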

In some examples, some trigger phrases may also contain variables that represent sets of related data. These data sets may be defined by a schema. The trigger phrases and parameter value sets are fed into a training system for a text matching system. The training system may convert the specified patterns into a set of rules represented in a form suited to efficient online query matching. Local assistant module 122A may also maintain a mapping of text matching system rules to applicable agents.
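One way to picture trigger phrases with variables is as templates compiled into matching rules. The sketch below uses regular expressions purely for illustration; the `$name` variable syntax, the `compile_trigger` function, the `RuleTable` class, and the sample agent id are assumptions, and the disclosure does not state that the text matching system works this way.

```python
import re
from typing import Dict, List, Optional, Pattern, Tuple


def compile_trigger(template: str, value_sets: Dict[str, List[str]]) -> Pattern[str]:
    """Compile a template such as 'book a table at $restaurant' into a regex.

    Each $variable becomes a named group matching any value in its data set.
    """
    parts = []
    for token in template.split():
        if token.startswith("$"):
            name = token[1:]
            alternatives = "|".join(re.escape(v) for v in value_sets[name])
            parts.append(f"(?P<{name}>{alternatives})")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


class RuleTable:
    """Hypothetical mapping of compiled rules to the agents they apply to."""

    def __init__(self) -> None:
        self._rules: List[Tuple[Pattern[str], str]] = []

    def add(self, template: str, value_sets: Dict[str, List[str]], agent_id: str) -> None:
        self._rules.append((compile_trigger(template, value_sets), agent_id))

    def match(self, utterance: str) -> Optional[Tuple[str, Dict[str, str]]]:
        """Return (agent_id, variable bindings) for the first matching rule."""
        for pattern, agent_id in self._rules:
            found = pattern.search(utterance)
            if found:
                return agent_id, found.groupdict()
        return None
```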

One or more components of the system 100, such as the search module 182, may attach metadata about the agent to any associated web site in the web search index. The metadata may include an ID of the agent and an associated agent entry point.

As the user interacts with the agent, one or more components of the system 100 may record details of the interaction to the user's personal history. As described above, logging can be constrained by one or more user controls so that a user can disable logging of agent interactions. In particular, one or more components of system 100 may record details only after receiving explicit authorization from a user.

In operation, local assistant module 122A may receive an indication of user input provided by a user of computing device 110 from UI module 120. As one example, local assistant module 122A can receive an indication of a speech input that corresponds to an utterance provided by a user of computing device 110. As another example, local assistant module 122A may receive an indication of a text input provided by a user of computing device 110 (e.g., at a physical and/or virtual keyboard). In accordance with one or more techniques of this disclosure, the local assistant module 122A may select an agent from a plurality of agents to satisfy the utterance. For example, the local assistant module 122A may determine whether to satisfy the user utterance using a 1P agent (i.e., a 1P agent provided by the local assistant module 122A), a 3P agent (i.e., a 3P agent provided by one of the 3P agent modules 128), or some combination of 1P agents and 3P agents.

Local assistant module 122A can base agent selection on analysis of the utterance. As one example, local assistant module 122A may at least initially select a 1P agent where it is not possible to satisfy an utterance using only a 3P agent. As another example, the local assistant module 122A may identify a task based on the utterance and select an agent from the available agents (e.g., 1P agents and 3P agents) based on a ranking of the agents and/or the ability of the available agents to perform the task. As another example, local assistant module 122A can determine (e.g., based on data included in agent index 124A) whether the speech input includes one or more predetermined trigger phrases associated with a 1P agent or one or more predetermined trigger phrases associated with a 3P agent.

As described above, the local assistant module 122A can base the agent selection on whether the speech input includes one or more predetermined trigger phrases. For example, if the speech input includes one or more predetermined trigger phrases associated with a 1P agent, the local assistant module 122A can select one or more 1P agents to satisfy the utterance. In some examples, when one or more 1P agents are selected, the resulting engagement may be referred to as a 1P experience.

However, if the speech input includes one or more predetermined trigger phrases associated with the 3P agents, the local assistant module 122A can select one or more of the 3P agents to satisfy the utterance. For example, local assistant module 122A may select one of the 3P agents associated with the trigger phrase included in the voice input. To perform 3P agent selection, the local assistant module 122A may rank one or more 3P agents based on the utterance. In some examples, local assistant module 122A may rank all known 3P agents. In some examples, local assistant module 122A may rank a subset of all known 3P agents. For example, local assistant module 122A may rank 3P agents that are preconfigured for use by a user of computing device 110.

As described above, local assistant module 122A may select a 3P agent based on the ranking. For example, the local assistant module 122A may select the 3P agent with the highest ranking to satisfy the utterance. In some examples, the local assistant module 122A may request user input to select a 3P agent to satisfy the utterance, such as when there is a tie in the ranking and/or when the ranking of the highest-ranked 3P agent is less than a ranking threshold. For example, local assistant module 122A may cause UI module 120 to output a user interface requesting that the user select a 3P agent from the top N (e.g., 2, 3, 4, 5, etc.) ranked 3P agents to satisfy the utterance.
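The selection rule described above (pick the top-ranked 3P agent, but fall back to asking the user on a tie or a low top ranking) can be sketched as follows. The function name, the numeric score representation, and the default threshold and N are hypothetical; the disclosure does not specify how rankings are encoded.

```python
from typing import Dict, List, Optional, Tuple


def select_3p_agent(
    scores: Dict[str, float],
    threshold: float = 0.5,   # assumed ranking threshold
    top_n: int = 3,           # assumed number of candidates shown to the user
) -> Tuple[Optional[str], List[str]]:
    """Return (selected_agent, candidates_to_offer_the_user).

    Exactly one of the two is populated: either an agent is selected
    automatically, or the top-N candidates are returned so that a user
    interface can ask the user to choose.
    """
    # Highest score first; agent id breaks ordering ties deterministically.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    if not ordered:
        return None, []

    best_id, best_score = ordered[0]
    tie = len(ordered) > 1 and ordered[1][1] == best_score
    if tie or best_score < threshold:
        return None, [agent_id for agent_id, _ in ordered[:top_n]]
    return best_id, []
```

Returning the candidate list instead of prompting directly keeps the ranking logic separate from the user interface that presents the choice.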

The selected agent (a 1P agent, a 3P agent, or some combination of 1P agents and 3P agents) may attempt to satisfy the utterance. For example, the selected agent may perform one or more actions (e.g., output useful information based on the utterance, respond to user needs indicated by the utterance, or otherwise perform certain operations to help the user complete various real-world or virtual tasks based on the utterance) to satisfy the utterance.

As described above, in some examples, the agents represented by module 122 may not all be 1P agents. For example, the agent represented by local assistant module 122A may be a 1P agent, while the agent represented by remote assistant module 122B may be a 3P agent. In some such examples, the local assistant module 122A may utilize the 3P remote assistant module 122B to perform some (or all) of the selection, identification, ranking, and/or invocation of other 3P agents. In other such examples, the local assistant module 122A may not be able to utilize the 3P remote assistant module 122B to perform some (or all) of the selection, identification, ranking, and/or invocation of other 3P agents and may instead perform such tasks locally.

It should be appreciated that improved operation of one or more of the computing device 110, the assistant server system 160, and the 3P agent server system 170 is obtained in accordance with the above description. As one example, by identifying a preferred agent for performing user-provided tasks, general searches and complex query rewrites may be reduced. This in turn reduces bandwidth and data transfer usage, reduces temporary volatile memory usage, reduces battery consumption, etc. Further, in some embodiments, optimizing device performance and/or minimizing cellular data usage may be a highly weighted feature for ranking agents, such that selecting an agent based on these criteria may provide a direct improvement in device performance and/or a reduction in data usage as needed. As another example, by providing a single assistant/agent (e.g., a 1P assistant) to initially process utterances (e.g., recognize tasks and select agents for performing tasks), the computational load may be reduced. For example, rather than having multiple agents monitor, process, and satisfy incoming utterances, which would consume significant system resources (e.g., CPU cycles, power consumption, etc.), the techniques of this disclosure enable a single assistant to initially process the utterances and invoke the 3P agents as needed. In this way, the techniques of this disclosure achieve the benefits of making multiple agents available to satisfy an utterance without the technical drawbacks of involving multiple agents in each step of utterance processing.

Fig. 2 is a block diagram illustrating an example computing device configured to execute an example virtual assistant in accordance with one or more aspects of the present disclosure. Computing device 210 of fig. 2 is described below as an example of computing device 110 of fig. 1. Fig. 2 shows only one particular example of computing device 210, and many other examples of computing device 210 may be used in other instances and may include a subset of the components included in example computing device 210, or may include additional components not shown in fig. 2.

As shown in the example of fig. 2, computing device 210 includes a user interface device (UID) 212, one or more processors 240, one or more communication units 242, one or more input components 244, one or more output components 246, and one or more storage components 248. UID 212 includes a display component 202, a presence-sensitive input component 204, a microphone component 206, and a speaker component 208. Storage components 248 of computing device 210 include UI module 220, assistant module 222, search module 282, one or more application modules 226, context module 230, and agent index 224.

Communication channel 250 may interconnect each of components 212, 240, 242, 244, 246, and 248 for inter-component communication (physically, communicatively, and/or operatively). In some examples, communication channel 250 may include a system bus, a network connection, an interprocess communication data structure, or any other method for communicating data.

One or more communication units 242 of computing device 210 may communicate with external devices (e.g., assistant server system 160 and/or search server system 180 of system 100 of fig. 1) via one or more wired and/or wireless networks by transmitting and/or receiving network signals over one or more networks (e.g., network 130 of system 100 of fig. 1). Examples of communication unit 242 include a network interface card (e.g., such as an ethernet card), an optical transceiver, a radio frequency transceiver, a GPS receiver, or any other type of device that can send and/or receive information. Other examples of the communication unit 242 may include short wave radio, cellular data radio, wireless network radio, and Universal Serial Bus (USB) controller.

One or more input components 244 of computing device 210 may receive input. Examples of inputs are tactile, text, audio, image, and video inputs. In one example, input components 244 of computing device 210 include a presence-sensitive input device (e.g., touch-sensitive screen, PSD), mouse, keyboard, voice response system, camera, microphone, or any other type of device for detecting input from a human or machine. In some examples, input components 244 may include one or more sensor components, such as one or more location sensors (GPS components, Wi-Fi components, cellular components), one or more temperature sensors, one or more movement sensors (e.g., accelerometers, gyroscopes), one or more pressure sensors (e.g., barometers), one or more ambient light sensors, and one or more other sensors (e.g., infrared proximity sensors, hygrometer sensors, etc.). Other sensors may include heart rate sensors, magnetometers, glucose sensors, olfactory sensors, compass sensors, and step counter sensors, to name a few other non-limiting examples.

One or more output components 246 of computing device 210 may generate output. Examples of outputs are tactile, audio, and video outputs. Output components 246 of computing device 210, in one example, include a presence-sensitive display, a sound card, a video graphics adapter card, a speaker, a Cathode Ray Tube (CRT) monitor, a Liquid Crystal Display (LCD), or any other type of device for generating output to a human or machine.

UID 212 of computing device 210 may be similar to UID 112 of computing device 110 and includes display component 202, presence-sensitive input component 204, microphone component 206, and speaker component 208. Display component 202 can be a screen at which information is displayed by UID 212, while presence-sensitive input component 204 can detect objects at and/or near display component 202. Speaker component 208 may be a speaker from which UID 212 plays audible information, while microphone component 206 may detect audible input provided at and/or near display component 202 and/or speaker component 208.

While shown as an internal component of computing device 210, UID 212 may also represent an external component that shares a data path with computing device 210 for sending and/or receiving inputs and outputs. For example, in one example, UID 212 represents a built-in component of computing device 210 that is located within computing device 210 (e.g., a screen on a mobile phone) and is physically connected to an external package of computing device 210. In another example, UID 212 represents an external component of computing device 210 that is external and physically separate from a packaging or housing of computing device 210 (e.g., a monitor, projector, etc., that shares a wired and/or wireless data path with computing device 210).

As one example range, presence-sensitive input component 204 may detect an object, such as a finger or stylus, within two inches or less of display component 202. Presence-sensitive input component 204 may determine the location (e.g., [x, y] coordinates) of display component 202 at which the object was detected. In another example range, presence-sensitive input component 204 may detect objects six inches or less from display component 202, and other ranges are possible. The presence-sensitive input component 204 may determine the location of display component 202 selected by the user's finger using capacitive, inductive, and/or optical recognition techniques. In some examples, presence-sensitive input component 204 also provides output to the user using tactile, audio, or video stimuli as described with respect to display component 202. In the example of fig. 2, UID 212 may present a user interface (such as graphical user interface 114 of fig. 1).

The speaker component 208 may include speakers built into a housing of the computing device 210, and in some examples, may be speakers built into a set of wired or wireless headphones operatively coupled to the computing device 210. Microphone component 206 may detect audible input occurring at or near UID 212. The microphone component 206 may perform various noise cancellation techniques to remove background noise from the detected audio signal and isolate the user speech from the detected audio signal.

UID 212 of computing device 210 may detect two-dimensional and/or three-dimensional gestures as input from a user of computing device 210. For example, a sensor of UID 212 may detect movement of the user (e.g., moving a hand, arm, pen, stylus, etc.) within a threshold distance of the sensor of UID 212. UID 212 may determine a two-dimensional or three-dimensional vector representation of the motion and associate the vector representation with a gesture input having multiple dimensions (e.g., waving, pinching, clapping, stroking, etc.). In other words, UID 212 may detect multi-dimensional gestures without requiring the user to make gestures at or near the screen or surface on which UID 212 outputs information for display. Instead, UID 212 may detect multi-dimensional gestures performed at or near a sensor that may or may not be located near a screen or surface on which UID 212 outputs information for display.

The one or more processors 240 may implement functions and/or execute instructions associated with the computing device 210. Examples of processor 240 include an application processor, a display controller, a secondary processor, one or more sensor hubs, and any other hardware configured to function as a processor, processing unit, or processing device.

Modules 220, 222, 226, 230, and 282 may be operated by processor 240 to perform various actions, operations, or functions of computing device 210. For example, processor 240 of computing device 210 may retrieve and execute instructions stored by storage component 248 that cause processor 240 to perform the operations of modules 220, 222, 226, 230, and 282. The instructions, when executed by processor 240, may cause computing device 210 to store information within storage component 248.

One or more storage components 248 within computing device 210 may store information for processing during operation of computing device 210 (e.g., computing device 210 may store data accessed by modules 220, 222, 226, 230, and 282 during execution at computing device 210). In some examples, storage component 248 is a temporary memory, meaning that the primary purpose of storage component 248 is not long-term storage. The storage component 248 on the computing device 210 may be configured to store information as volatile memory for short periods of time, and therefore not retain stored content if power is removed. Examples of volatile memory include Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), and other forms of volatile memory known in the art.

In some examples, storage component 248 also includes one or more computer-readable storage media. In some examples, storage component 248 includes one or more non-transitory computer-readable storage media. Storage component 248 may be configured to store a larger amount of information than is typically stored by a volatile memory. Storage component 248 may also be configured to store information for long periods as non-volatile storage space and retain information after power on/off cycles. Examples of non-volatile memory include magnetic hard disks, optical disks, floppy disks, flash memory, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable memory (EEPROM). Storage component 248 may store program instructions and/or information (e.g., data) associated with modules 220, 222, 226, 230, and 282 and agent index 224. Storage component 248 may include a memory configured to store data or other information associated with modules 220, 222, 226, 230, and 282 and agent index 224.

UI module 220 may include all of the functionality of UI module 120 of computing device 110 of fig. 1, and may perform operations similar to UI module 120 for managing a user interface (e.g., user interface 114) provided by computing device 210, e.g., at UID 212, for facilitating interaction between a user of computing device 210 and assistant module 222. For example, UI module 220 of computing device 210 may receive information from assistant module 222 that includes instructions for outputting (e.g., displaying or playing audio) an assistant user interface (e.g., user interface 114). UI module 220 may receive the information from assistant module 222 over communication channel 250 and use the data to generate a user interface. UI module 220 may send display or audible output commands and associated data over communication channel 250 to cause UID 212 to present the user interface at UID 212.

In some examples, UI module 220 may receive an indication of one or more user inputs detected at UID 212 and may output information regarding the user inputs to assistant module 222. For example, UID 212 may detect voice input from a user and send data regarding the voice input to UI module 220.

UI module 220 may send an indication of the voice input to assistant module 222 for further interpretation. Assistant module 222 may determine, based on the voice input, that the detected voice input represents a user request for assistant module 222 to perform one or more tasks.

UI module 220 may be capable of performing text-to-speech (TTS). For example, when text is provided (e.g., by an assistant or agent), UI module 220 may synthesize the audio data to speak the text (e.g., read aloud text). UI module 220 may be capable of performing TTS using a plurality of different voices.
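Choosing among multiple TTS voices could look like the sketch below. It follows the abstract's statement that a reserved voice is selected when the selected agent is a first party agent. Everything else is an assumption made for illustration: the `select_voice` function, the voice names, the rule that a 3P agent is never given the reserved voice, and the fallback to the first non-reserved voice.

```python
from typing import Optional, Sequence

# Hypothetical identifier for the voice reserved for first party agents.
RESERVED_VOICE = "reserved"


def select_voice(
    is_first_party: bool,
    requested_voice: Optional[str],
    available: Sequence[str],
) -> str:
    """Pick the voice used to synthesize audio for an agent's text output."""
    if is_first_party:
        # Per the abstract: a first party agent gets the reserved voice.
        return RESERVED_VOICE

    # Assumption: third party agents may only use non-reserved voices.
    non_reserved = [voice for voice in available if voice != RESERVED_VOICE]
    if not non_reserved:
        raise ValueError("no non-reserved voice available")

    # Honor the voice indicated for the agent (e.g., in the agent index)
    # when it is allowed; otherwise fall back to a default.
    if requested_voice in non_reserved:
        return requested_voice
    return non_reserved[0]
```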

Application module 226 represents all of the various individual applications and services executing at computing device 210 and accessible from computing device 210, which may be accessed by an assistant, such as assistant module 222, to provide information to a user and/or to perform tasks. A user of computing device 210 may interact with a user interface associated with one or more application modules 226 to cause computing device 210 to perform functions. There may be many examples of application modules 226 including a fitness application, a calendar application, a search application, a mapping or navigation application, a transportation service application (e.g., a bus or train tracking application), a social media application, a gaming application, an email application, a chat or messaging application, an internet browser application, or any and all other applications that may be executed at computing device 210.

Search module 282 of computing device 210 may perform integrated search functions on behalf of computing device 210. Search module 282 may be invoked by one or more of UI module 220, application modules 226, and/or assistant module 222 to perform search operations on their behalf. When invoked, the search module 282 may perform search functions, such as generating search queries and performing searches across various local and remote information sources based on the generated search queries. The search module 282 may provide the results of the performed search to the calling component or module. That is, search module 282 may output search results to UI module 220, assistant module 222, and/or application module 226 in response to an invocation command.

Context module 230 may collect context information associated with computing device 210 to define a context of computing device 210. In particular, context module 230 is used primarily by assistant module 222 to define a context of computing device 210 that specifies characteristics of the physical and/or virtual environment of computing device 210 and of a user of computing device 210 at a particular time.

As used throughout this disclosure, the term "context information" is used to describe any information that may be used by context module 230 to define virtual and/or physical environmental characteristics that a computing device and a user of the computing device may experience at a particular time. Examples of contextual information are numerous and may include: sensor information obtained by sensors of computing device 210 (e.g., location sensors, accelerometers, gyroscopes, barometers, ambient light sensors, proximity sensors, microphones, and any other sensors), communication information sent and received by a communication module of computing device 210 (e.g., text-based communications, audible communications, video communications, etc. sent and received by the communication module of computing device 210), and application usage information associated with applications executing at computing device 210 (e.g., application data associated with the applications, internet search history, text communications, voice and video communications, calendar information, social media posts and related information, etc.). Other examples of contextual information include signals and information obtained from a sending device external to computing device 210. For example, context module 230 may receive, via a radio or communication unit of computing device 210, beacon information transmitted from an external beacon located at or near the actual location of the merchant.

The assistant module 222 can include all of the functionality of the local assistant module 122A of the computing device 110 of fig. 1 and can perform similar operations as the local assistant module 122A for providing assistants. In some examples, the assistant module 222 may execute locally (e.g., at the processor 240) to provide assistant functionality. In some examples, assistant module 222 may act as an interface to a remote assistant service accessible to computing device 210. For example, assistant module 222 may be an interface or Application Programming Interface (API) to remote assistant module 122B of assistant server system 160 of fig. 1.

The agent selection module 227 may include functionality to select one or more agents to satisfy a given utterance. In some examples, the agent selection module 227 may be a stand-alone module. In some examples, agent selection module 227 may be included in assistant module 222.

Similar to the agent indexes 124A and 124B of the system 100 of fig. 1, the agent index 224 may store information related to agents, such as 3P agents. In addition to any information provided by context module 230 and/or search module 282, assistant module 222 and/or agent selection module 227 may rely on information stored at agent index 224 to perform assistant tasks and/or select agents to satisfy utterances.

The agent selection module 227 may select one or more agents to satisfy the user utterance. As described above, some utterances (i.e., user requests) cannot be handed off directly to a 3P agent, such as utterances that require special assistance (e.g., assistance from the publisher of the assistant module 222). A situation in which special assistance is needed to satisfy the utterance may be referred to as a 1P experience because the publisher of the assistant module 222 may implement at least a portion (but not necessarily all) of the logic necessary to satisfy the utterance. Two examples of utterances that may require special assistance are home automation requests and overly broad queries.

An example of a home automation utterance is "set my downstairs thermostat to 71". The publisher of the assistant module 222 may, for example, enable users to register their home automation devices with the assistant module 222. To satisfy the above utterance, the assistant module 222 can look up configuration information for the user's home automation devices and then send an appropriate request to the automation system (e.g., the downstairs thermostat) based on those details. The agent selection module 227 may select a 1P agent to satisfy the utterance because special assistance, in the form of knowledge of and/or access to the user's home automation devices, is required to satisfy the utterance. In other words, the process may be implemented as a 1P experience.

An example of an overly broad utterance is "I am bored". There are many possible ways to satisfy such an utterance, but satisfying it adequately depends on the user's preferences at the time. As such, the agent selection module 227 may select a 1P experience. In such a 1P experience, the assistant module 222 may ask the user a series of questions to determine what they want to do. For example, the assistant module 222 may say "Do you feel like a movie or a game?" and, if the user chooses a game, follow up with "Ok, do you like strategy or fantasy games?"

For these and other similar cases, the publisher of the assistant module 222 can provide a collection of internally built 1P solutions. In some examples, a 1P solution may be referred to as a 1P agent. A 1P agent can be associated with (i.e., identified by) a set of 1P trigger phrases identified by the publisher of the assistant module 222.

In general, a 1P experience can follow one of two basic models. In the first model, the publisher of the assistant module 222 handles the complete experience. One such example is the utterance "Ok assistant, where are you made?"

In the second model, the publisher of the assistant module 222 implements some dialog to determine the exact parameters of the task and then transfers control to a 3P agent or API. To continue the "I am bored" example above, if the dialog finds that the user wants to play a strategy game, the agent selection module 227 may invoke an agent that implements such a game. As another example, if the utterance is "book a taxi", the agent selection module 227 may initially select the assistant module 222 (i.e., initially trigger a 1P experience) to ask the user for the necessary information, such as pickup and drop-off locations, time, and taxi class. In some examples, the agent selection module 227 may select a 3P agent capable of booking taxis, and the assistant module 222 may pass the collected information to the selected 3P agent. In some examples, the assistant module 222 may pass the collected information directly to an external API of an entity capable of booking taxis. In either of these ways, the agent selection module 227 may broker a referral to the 3P agent.

In operation, a user of computing device 210 may provide an utterance at UID 212, and UID 212 may generate audio data based on the utterance. Some example utterances include, but are not limited to, "I need a ride to the airport", "tell me a joke", and "show me the recipe for beef wellington". In some cases, the utterance includes an identification of the 3P agent that the user wishes to perform the action, such as "order a small cheese pizza using the GoPizza app". In many other cases, the user does not explicitly mention a 3P agent, in which case it is necessary to identify candidate 3P agents and select a preferred 3P agent from those candidates.

The agent selection module 227 can select one or more agents to satisfy the utterance. For example, the agent selection module 227 may determine whether the utterance includes any trigger phrases associated with a 1P agent or a 3P agent. If the utterance matches a 1P trigger phrase, the agent selection module 227 may execute a 1P experience. For example, the agent selection module 227 can cause the assistant module 222 to satisfy the utterance. If the utterance matches a 3P experience, the agent selection module 227 may send an agent search request to a service engine. The agent search request may contain the user utterance (i.e., audio data corresponding to the utterance), any matching trigger phrases, and a user context (e.g., a unique identifier of the user, a location of the user, etc.) determined by context module 230. For simplicity, the service engine is described as a component of the agent selection module 227; in some examples, the actions attributed to the agent selection module 227 may be performed by a separate service engine.

The agent selection module 227 can consult the agent index 224 based on the utterance and any matching trigger phrases. The agent selection module 227 may identify agent documents in the agent index 224 that match the utterance or the trigger phrases. The agent selection module 227 may rank the identified agent documents (e.g., based on their level of capability to satisfy the utterance). For example, the agent selection module 227 may multiply a text match score by an agent quality score. As described above, the agent quality scores may be stored in the agent index 224. The text match score may be a weighted sum of the matches between text in the utterance and text in the agent document. In some examples, the agent selection module 227 may give high weight to matches in the title field, trigger phrases, or category. In some examples, the agent selection module 227 may give lower weight to matches in the description.
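
The ranking step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the field names, the weight values, the `AgentDocument` type, and the word-overlap matching rule are all hypothetical. Only two ideas are taken from the description: the text match score is a weighted sum over matches, with description matches weighted lower than title, trigger-phrase, or category matches, and the final score is the text match score multiplied by the agent quality score.

```python
from dataclasses import dataclass

# Hypothetical per-field weights: title, trigger phrases, and category
# count for more than the free-text description.
FIELD_WEIGHTS = {
    "title": 3.0,
    "trigger_phrases": 3.0,
    "category": 3.0,
    "description": 1.0,
}


@dataclass
class AgentDocument:
    name: str
    quality_score: float  # assumed to come from the agent index
    fields: dict          # field name -> text


def text_match_score(utterance: str, doc: AgentDocument) -> float:
    """Weighted sum of word overlaps between the utterance and each field."""
    words = set(utterance.lower().split())
    score = 0.0
    for field_name, text in doc.fields.items():
        overlap = len(words & set(text.lower().split()))
        score += FIELD_WEIGHTS.get(field_name, 0.0) * overlap
    return score


def rank_agents(utterance: str, docs: list) -> list:
    """Return (agent name, score) pairs, best first; non-matching agents are dropped."""
    scored = [(d.name, text_match_score(utterance, d) * d.quality_score) for d in docs]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
```

A production system would likely use a proper text retrieval engine instead of raw word overlap; the sketch only shows how the two scores combine.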

In some examples, the agent selection module 227 may also send the utterance through a normal web search (i.e., cause the search module 282 to search the web based on the utterance). In some examples, the agent selection module 227 may perform this web search in parallel with the consultation of the agent index 224.

The agent selection module 227 may analyze the rankings and/or the results from the web search to select an agent to satisfy the utterance. For example, the agent selection module 227 may examine the web results to determine whether any web page results are associated with an agent. If there are web page results associated with an agent, the agent selection module 227 may insert the agent associated with the web page result into the ranked results (if the agent is not already included in the ranked results). The agent selection module 227 may boost the ranking of that agent according to the strength of the web score. In some examples, the agent selection module 227 may then also query a personal history store to determine whether the user has interacted with any of the agents in the result set. If so, the agent selection module 227 may give those agents a boost (i.e., an increased ranking) depending on the strength of the user's history with them.
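
One way to picture the merging and boosting steps above is the following Python sketch. The additive form of the boosts, the `web_weight` and `history_weight` parameters, and the use of an interaction count as the measure of history are assumptions made for illustration; the description states only that web-associated agents may be inserted and boosted and that agents with user history may be boosted.

```python
def merge_and_boost(ranked, web_hits, history_counts,
                    web_weight=0.5, history_weight=0.1):
    """Combine index ranking, web results, and personal history.

    ranked:         dict of agent name -> score from the agent index
    web_hits:       dict of agent name -> web score for agents tied to web results
    history_counts: dict of agent name -> number of past interactions
    Returns a list of (agent name, score) pairs, best first.
    """
    scores = dict(ranked)
    for agent, web_score in web_hits.items():
        # Insert agents that were found only through the web search.
        scores.setdefault(agent, 0.0)
        # Boost in proportion to the strength of the web score.
        scores[agent] += web_weight * web_score
    for agent in scores:
        # Boost agents the user has interacted with before.
        scores[agent] += history_weight * history_counts.get(agent, 0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```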

The agent selection module 227 can select a 3P agent to satisfy the utterance based on the final ranked set of agent results. For example, the agent selection module 227 may select the 3P agent with the highest ranking to satisfy the utterance. In some examples, the agent selection module 227 may request user input to select a 3P agent to satisfy the utterance, for example, when there is a tie in the ranking and/or when the ranking of the highest-ranked 3P agent is below a ranking threshold. For example, the agent selection module 227 may cause UI module 220 to output a user interface (i.e., a selection UI) requesting that the user select a 3P agent from among N (e.g., 2, 3, 4, 5, etc.) moderately ranked 3P agents to satisfy the utterance. In some examples, the N moderately ranked 3P agents may include the top N ranked agents. In some examples, the N moderately ranked 3P agents may include agents other than the top N ranked agents.
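
The decision between selecting the top agent automatically and asking the user can be sketched as follows. The function name, the return convention, and the choice to offer the top N agents in the selection UI are assumptions for this example; as noted above, other choices of N agents are possible.

```python
def choose_agent(ranked, threshold, n=3):
    """Pick the top agent, or fall back to asking the user.

    ranked: list of (agent name, score) pairs sorted best first.
    Returns ("selected", agent), ("ask_user", [candidates]), or ("none", None).
    """
    if not ranked:
        return ("none", None)
    top_agent, top_score = ranked[0]
    tie = len(ranked) > 1 and ranked[1][1] == top_score
    if tie or top_score < threshold:
        # Tie, or the best agent ranks below the threshold: show a selection UI.
        return ("ask_user", [agent for agent, _ in ranked[:n]])
    return ("selected", top_agent)
```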

The agent selection module 227 may examine the attributes of the agents and/or obtain results from various 3P agents, rank those results, and then invoke (i.e., select) the 3P agent that provides the highest-ranked result. For example, if the utterance is "order a pizza," the agent selection module 227 may determine the current location of the user, determine which pizza source is closest to that location, and rank the agent associated with that source highest. Similarly, the agent selection module 227 can poll a plurality of 3P agents for the price of an item and then offer the agent with the lowest price so that the user can complete the purchase. Finally, the agent selection module 227 may first determine that no 1P agent can complete the task, try multiple 3P agents to see whether they can, and, assuming that only one or a few of them can, provide only those agents to the user as options for performing the task.
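
The price-polling variant can be illustrated with a short Python sketch. The per-agent quote callables and the convention that an agent returns `None` when it cannot supply the item are hypothetical; the description does not specify how agents report prices.

```python
def cheapest_offer(item, quote_functions):
    """Poll each agent for a price and return the cheapest offer.

    quote_functions: dict of agent name -> callable(item) that returns a
    price, or None if that agent cannot supply the item (assumed convention).
    Returns (agent name, price), or None if no agent can supply the item.
    """
    quotes = {}
    for agent, quote in quote_functions.items():
        price = quote(item)
        if price is not None:
            quotes[agent] = price  # keep only agents able to fulfil the request
    if not quotes:
        return None
    best = min(quotes, key=quotes.get)
    return best, quotes[best]
```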

The selected agent (a 1P agent, a 3P agent, or some combination of 1P agents and 3P agents) may attempt to satisfy the utterance. For example, the agent selection module 227 may output a request to the entry point of the selected agent, which may be determined by consulting the agent index 224. In an attempt to satisfy the utterance, the selected agent may perform one or more actions (e.g., output useful information based on the utterance, respond to user needs indicated by the utterance, or otherwise perform certain operations to help the user complete various real-world or virtual tasks based on the utterance).

In some examples, there may be an indication of the type of agent (1P versus 3P) that is performing the action. For example, where the one or more actions include "speaking" to the user, 1P agents and 3P agents may use different voices. As one example, the 1P agents may all use a reserved voice of the plurality of voices, and the 3P agents may use other voices of the plurality of voices but may be prohibited from using the reserved voice.
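
The voice policy described above can be sketched in Python as follows. The voice identifiers, the `select_voice` function, and the choice of the first non-reserved voice as a default are hypothetical; the sketch only illustrates that first-party agents receive the reserved voice and that third-party agents are refused it.

```python
RESERVED_VOICE = "voice_reserved"  # hypothetical identifier
AVAILABLE_VOICES = ["voice_reserved", "voice_a", "voice_b"]


def select_voice(agent_type, requested_voice=None):
    """Return the voice to use when synthesizing audio for an agent.

    agent_type: "1P" or "3P".
    requested_voice: optional voice requested by a 3P agent.
    """
    if agent_type == "1P":
        # First-party agents all speak in the reserved voice.
        return RESERVED_VOICE
    if agent_type != "3P":
        raise ValueError(f"unknown agent type: {agent_type!r}")
    if requested_voice == RESERVED_VOICE:
        raise PermissionError("3P agents may not use the reserved voice")
    non_reserved = [v for v in AVAILABLE_VOICES if v != RESERVED_VOICE]
    if requested_voice is None:
        return non_reserved[0]  # assumed default for 3P agents
    if requested_voice not in non_reserved:
        raise ValueError(f"unknown voice: {requested_voice!r}")
    return requested_voice
```

Keeping the check in one function means a user can tell from the voice alone whether a first-party or third-party agent is speaking, regardless of what voice a 3P agent asks for.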

In some examples, the agent selection module 227 may cause the assistant module 222 to request feedback from the user on how well the agent just fulfilled their request. For example, the assistant module 222 may say "You just interacted with the City Transit Schedule agent. In one or two words, how well did it work?" The assistant module 222 may determine a score for the experience and feed the determined score back into the ranking. For example, the assistant module 222 may modify the agent quality score of the agent that fulfilled the request based on the user's feedback. In this manner, the techniques of this disclosure enable the agent selection module 227 to select an agent based on how well the agent has performed in the past.
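
As a rough illustration of feeding user feedback back into the ranking, the sketch below scores a short spoken reply with a tiny keyword lexicon and blends it into the stored quality score with an exponential moving average. Both the lexicon and the averaging rule are assumptions chosen for the example; the description does not say how the experience score is computed or applied.

```python
# Hypothetical keyword lists standing in for real sentiment analysis.
POSITIVE = {"good", "great", "well", "perfect", "excellent"}
NEGATIVE = {"bad", "poorly", "terrible", "awful", "slow"}


def experience_score(feedback):
    """Map a one- or two-word reply to 1.0 (good), 0.0 (bad), or 0.5 (neutral)."""
    words = set(feedback.lower().replace(",", " ").replace(".", " ").split())
    positive = len(words & POSITIVE)
    negative = len(words & NEGATIVE)
    if positive == negative:
        return 0.5
    return 1.0 if positive > negative else 0.0


def update_quality_score(current, feedback, rate=0.2):
    """Blend the new experience score into the stored agent quality score."""
    return (1 - rate) * current + rate * experience_score(feedback)
```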

As described above, in some cases, the utterance includes an identification of the 3P agent that the user wishes to perform the action (e.g., "order a small cheese pizza using the GoPizza app"). In many other cases, the user does not explicitly mention a 3P agent, in which case it is necessary to identify candidate 3P agents and select a preferred 3P agent from those candidates. One complication in this situation is that multiple agents may be able to order a pizza for the user. As such, if there are multiple matching agents, the user may be asked to select one of the matching agents to satisfy the utterance. For example, the agent selection module 227 may output a selection UI that asks the user to choose between a Pizza House agent and a Pizza4U agent.

Fig. 3 is a block diagram illustrating an example computing system configured to execute an example virtual assistant in accordance with one or more aspects of the present disclosure. The assistant server system 360 of fig. 3 is described below as an example of the assistant server system 160 of fig. 1. Fig. 3 shows only one particular example of the assistant server system 360, and many other examples of the assistant server system 360 may be used in other instances and may include a subset of the components included in the exemplary assistant server system 360, or may include additional components not shown in fig. 3.

As shown in the example of fig. 3, the assistant server system 360 includes one or more processors 340, one or more communication units 342, and one or more storage components 348. Storage components 348 include assistant module 322, search module 382, context module 330, and agent index 324.

Processor 340 is similar to processor 240 of computing device 210 of fig. 2. Communication unit 342 is similar to communication unit 242 of computing device 210 of fig. 2. Storage component 348 is similar to storage component 248 of computing device 210 of fig. 2. Communication channel 350 is similar to communication channel 250 of computing device 210 of fig. 2 and may therefore interconnect each of the components 340, 342, and 348 for inter-component communication. In some examples, communication channel 350 may include a system bus, a network connection, an inter-process communication data structure, or any other method for communicating data.

The search module 382 of the assistant server system 360 is similar to the search module 282 of the computing device 210 and may perform integrated search functions on behalf of the assistant server system 360. That is, the search module 382 may perform search operations on behalf of the assistant module 322. In some examples, search module 382 may interface with an external search system, such as search server system 180, to perform search operations on behalf of assistant module 322. When invoked, the search module 382 may perform search functions, such as generating search queries and performing searches across various local and remote information sources based on the generated search queries. The search module 382 may provide the results of the executed search to the calling component or module. That is, the search module 382 may output the search results to the assistant module 322.

Context module 330 of assistant server system 360 is similar to context module 230 of computing device 210. Context module 330 may collect context information associated with computing devices, such as computing device 110 of fig. 1 and computing device 210 of fig. 2, to define a context of the computing device. Context module 330 may be used primarily by assistant module 322 and/or search module 382 to define a context for computing devices that interface and access services provided by assistant server system 160. A context may specify characteristics of a physical and/or virtual environment of a computing device and a user of the computing device at a particular time.

Assistant module 322 may include all of the functionality of local assistant module 122A and remote assistant module 122B of fig. 1 and assistant module 222 of computing device 210 of fig. 2. The assistant module 322 may perform similar operations as the remote assistant module 122B for providing assistant services accessible through the assistant server system 360. That is, the assistant module 322 can act as an interface to a remote assistant service accessible by computing devices in communication with the assistant server system 360 over a network. For example, assistant module 322 can be an interface or API to remote assistant module 122B of assistant server system 160 of fig. 1.

Similar to the agent index 224 of fig. 2, the agent index 324 may store information related to agents, such as 3P agents. In addition to any information provided by context module 330 and/or search module 382, assistant module 322 and/or agent selection module 327 may rely on information stored at agent index 324 to perform assistant tasks and/or select agents to satisfy an utterance.

In general, an agent description and trigger phrases may provide only a relatively small amount of information about an agent. The more information that is available about an agent, the better an agent selection module (e.g., agent selection module 227 and/or agent selection module 327) can select an agent for an applicable user utterance. In accordance with one or more techniques of this disclosure, agent accuracy module 331 may collect additional information about the agents. In some examples, the agent accuracy module 331 can be considered an automated agent crawler. For example, the agent accuracy module 331 may query each agent and store the information it receives. As one example, the agent accuracy module 331 may send a request to an agent's default entry point and receive back a description of the agent's capabilities. The agent accuracy module 331 may store this received information in the agent index 324 (i.e., to improve targeting).

In some examples, the assistant server system 360 may receive inventory information for agents, where applicable. As one example, an agent for an online grocery store may provide a data feed (e.g., a structured data feed) of its products, including descriptions, prices, quantities, and the like, to the assistant server system 360. An agent selection module (e.g., agent selection module 227 and/or agent selection module 327) may access this data as part of selecting an agent to satisfy the user utterance. These techniques may enable the system to better respond to queries such as "order a bottle of prosecco". In this case, the agent selection module may match the utterance to an agent more confidently if the agent has provided its real-time inventory and the inventory indicates that the agent sells prosecco and has it in stock.
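
The effect of an inventory feed on agent matching can be sketched as follows. The feed layout (a list of dictionaries with `description` and `quantity` keys), the substring match, and the multiplicative `boost` factor are assumptions for this example, and the products used in the usage note are invented.

```python
def inventory_boost(utterance, base_score, inventory, boost=2.0):
    """Raise an agent's match score when its feed lists the requested item in stock.

    inventory: list of dicts with at least "description" and "quantity" keys
    (an assumed shape for the structured data feed).
    """
    text = utterance.lower()
    for item in inventory:
        in_stock = item["quantity"] > 0
        if in_stock and item["description"].lower() in text:
            return base_score * boost
    return base_score
```

For instance, with a feed that lists "sparkling water" with a quantity of 4 and "olive oil" with a quantity of 0, the utterance "order sparkling water" would be boosted for that agent while "order olive oil" would not.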

In some examples, the assistant server system 360 may provide an agent directory that users can browse to discover agents they may want to use. The directory may have a description of each agent and a list of its capabilities (in natural language; e.g., "you can use this agent to order a taxi", "you can use this agent to find food recipes"). If the user finds an agent in the directory that they want to use, the user may select the agent, and the agent may be made available to the user. For example, assistant module 322 can add the agent to agent index 224 and/or agent index 324. As such, the agent selection module 227 and/or the agent selection module 327 may select the added agent to satisfy future utterances. In some examples, one or more agents may be added to agent index 224 or agent index 324 without user selection. In some such examples, the agent selection module 227 and/or the agent selection module 327 may be capable of selecting and/or suggesting agents not selected by the user to satisfy the user utterance. In some examples, agent selection module 227 and/or agent selection module 327 may further rank agents based on whether they were selected by the user.

In some examples, one or more agents listed in the agent directory may be free (i.e., offered at no cost). In some examples, one or more agents listed in the agent directory may not be free (i.e., a user may have to pay money or some other consideration to use the agent).

In some examples, the agent directory may collect user reviews and ratings. The collected user reviews and ratings may be used to modify the agent quality scores. As one example, the agent accuracy module 331 may increase the agent quality score of an agent in the agent index 224 or the agent index 324 when the agent receives positive reviews and/or ratings. As another example, the agent accuracy module 331 may decrease the agent quality score of an agent in the agent index 224 or the agent index 324 when the agent receives negative reviews and/or ratings.

Fig. 4 is a block diagram illustrating an example computing system configured to execute an example third-party agent in accordance with one or more aspects of the present disclosure. The 3P agent server system 470 of fig. 4 is described below as an example of a 3P agent server system of the 3P agent server systems 170 of fig. 1. Fig. 4 shows only one particular example of the 3P agent server system 470, and many other examples of the 3P agent server system 470 may be used in other instances and may include a subset of the components included in the example 3P agent server system 470, or may include additional components not shown in fig. 4.

As shown in the example of fig. 4, the 3P agent server system 470 includes one or more processors 440, one or more communication units 442, and one or more storage components 448. Storage components 448 include a 3P agent module 428.

The processor 440 is similar to the processor 340 of the assistant server system 360 of fig. 3. The communication unit 442 is similar to the communication unit 342 of the assistant server system 360 of fig. 3. The storage component 448 is similar to the storage component 348 of the assistant server system 360 of fig. 3. The communication channel 450 is similar to the communication channel 350 of the assistant server system 360 of fig. 3 and may therefore interconnect each of the components 440, 442, and 448 for inter-component communication. In some examples, communication channel 450 may include a system bus, a network connection, an inter-process communication data structure, or any other method for communicating data.

The 3P agent module 428 may include all of the functionality of a local 3P agent module of the local 3P agent modules 128 and of a corresponding remote 3P agent module of the remote 3P agent modules 128 of fig. 1. The 3P agent module 428 may perform operations similar to those of a remote 3P agent module of the remote 3P agent modules 128 to provide a third-party agent accessible via the 3P agent server system 470. That is, the 3P agent module 428 may act as an interface to a remote agent service accessible by computing devices in communication with the 3P agent server system 470 over a network. For example, the 3P agent module 428 may be an interface or API to a remote 3P agent module of the remote 3P agent modules 128 of the 3P agent server system 170 of fig. 1.

In operation, the 3P agent module 428 may be invoked by a user's computing assistant. For example, the 3P agent module 428 may be invoked by an assistant provided by the assistant module 122 of fig. 1 to perform one or more actions to satisfy a user utterance received at the computing device 110. After performing at least some of the actions (e.g., performing one or more elements of a multi-element task), the 3P agent module 428 may send an indication of the actions performed to the calling assistant. For example, if invoked to process an order, the 3P agent module 428 may output one or more details of the order to the assistant.

Fig. 5 is a flow diagram illustrating example operations performed by one or more processors executing an example virtual assistant in accordance with one or more aspects of the present disclosure. Fig. 5 is described below in the context of the system 100 of fig. 1. For example, in accordance with one or more aspects of the present disclosure, the local assistant module 122A may perform operations 502-506 when executed at one or more processors of the computing device 110. In some examples, the remote assistant module 122B may perform operations 502-506 when executed at one or more processors of the assistant server system 160 in accordance with one or more aspects of the present disclosure. For purposes of illustration only, fig. 5 is described below in the context of computing device 110 of fig. 1.

In operation, the computing device 110 may receive an indication of user input that represents a conversation between a user of the computing device and the assistant (502). For example, the user of computing device 110 may provide the utterance "I need a ride to the airport" at UID 112, which is received by local assistant module 122A as voice data.

Computing device 110 may select an agent from a plurality of agents based on user input (504). For example, the local assistant module 122A can determine whether the utterance includes one or more trigger words associated with an agent of the plurality of agents. If the computing device 110 is capable of identifying one or more agents associated with a trigger word included in an utterance, the computing device 110 may rank the identified agents based at least in part on a comparison between information related to the identified agents and text determined from the utterance. The computing device 110 may select an agent of the identified agents to satisfy the utterance based at least in part on the ranking.

In some examples, the plurality of agents may include one or more first-party agents and a plurality of third-party agents. In some such examples, the computing device 110 may determine to select a third-party agent (i.e., determine to bring in a third-party agent) when the user input does not include any trigger words associated with a first-party agent.

The selected agent may determine one or more actions responsive to the user input. In some examples, computing device 110 may perform, at least in part, the one or more actions determined by the selected agent (506). For example, where the utterance is "play a song by Artist A," the selected agent may cause the computing device 110 to play a song by Artist A. In some examples, the one or more actions determined by the selected agent may be performed, at least in part, by a computing device other than computing device 110. For example, where the utterance is "I need a ride to the airport," the selected agent may output a request to a computing device of a transportation company, and the computing device of the transportation company may dispatch a vehicle to transport the user of the computing device 110 to the airport.

In some examples, an agent may refer the user to another agent in the course of interacting with the user. For example, a product search agent may refer the user to a payment agent to arrange payment (e.g., when the product search agent cannot process payments itself). This may be done for the convenience of the user (e.g., so that users can use a common payment interface and/or to enhance security), or in exchange for a fee or other consideration paid by the party receiving the referral (e.g., the publisher of the payment agent may receive some consideration for processing the payment).

Figs. 6A-6C are flow diagrams illustrating example operations performed by one or more processors to select a virtual agent to perform a task according to one or more aspects of the present disclosure. Figs. 6A-6C are described below in the context of the system 100 of fig. 1. For example, in accordance with one or more aspects of the present disclosure, the local assistant module 122A, when executed at one or more processors of the computing device 110, may perform one or more of operations 602-628. In some examples, the remote assistant module 122B may perform one or more of operations 602-628 when executed at one or more processors of the assistant server system 160 in accordance with one or more aspects of the present disclosure. For purposes of illustration only, figs. 6A-6C are described below in the context of computing device 110 of fig. 1.

In operation, the computing device 110 may receive a representation of an utterance spoken at the computing device 110 (602). For example, one or more microphones of UID 112 of computing device 110 may generate audio data representing a user of computing device 110 saying "turn on my basement lights". UID 112 may provide the audio data to the assistant provided by local assistant module 122A and/or remote assistant module 122B of assistant server system 160.

The assistant can identify a task to perform based on the utterance (604). As one example, where the utterance is "turn on my basement lights," the assistant may parse the audio data to determine that the task is to activate the lights in a room called the basement. As another example, where the utterance is "order me a pizza from Pizza Joint delivered home," the assistant may parse the audio data to determine that the task is to place an order for a pizza, from a place called Pizza Joint, for delivery to the user's home address. As another example, where the utterance is "ask Search Company what is the average airspeed of a crow," the assistant may parse the audio data to determine that the task is to perform a web search for the average airspeed of a crow.

The assistant can determine whether the utterance includes any trigger words (e.g., words or phrases) associated with a first-party agent of the plurality of agents (606). For example, the assistant may compare words in the utterance to first-party trigger phrases included in the agent index 124A. If the utterance includes any trigger words associated with the first-party agent ("yes" branch of 606), the assistant may select the first-party agent to perform the task (608) and cause the selected first-party agent to perform the task (610). For example, where the utterance is "ask Search Company what is the average airspeed of a crow" and the agent index 124A indicates that "Search Company" is a trigger word associated with a first-party search agent, the assistant may select the first-party search agent and cause it to perform a web search for the average airspeed of a crow.

If the utterance does not include any trigger words associated with the first-party agent ("no" branch of 606), the assistant may determine whether the utterance includes any trigger words associated with a third-party agent of the plurality of agents (612). For example, the assistant may compare words in the utterance to third-party trigger phrases included in the agent index 124A. If the utterance includes any trigger words associated with a particular third-party agent ("yes" branch of 612), the assistant may select the particular third-party agent to perform the task (608) and cause the particular third-party agent to perform the task (610). For example, where the utterance is "order me a pizza from Pizza Joint delivered home" and the agent index 124A indicates that "order" and "Pizza Joint" are trigger words associated with a particular third-party ordering agent, the assistant may select the particular third-party ordering agent and cause it to create an order for a pizza to be delivered to the user's residence.
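
The two trigger checks (606 and 612) can be sketched in Python as follows. The dictionaries standing in for the agent index, the agent names, and the case-insensitive substring match are assumptions made for the example; the description states only that words in the utterance are compared against first-party and then third-party trigger phrases.

```python
def match_triggers(utterance, first_party_triggers, third_party_triggers):
    """Check first-party triggers before third-party triggers.

    Each triggers argument maps a trigger phrase to an agent name.
    Returns (party, agent name), or (None, None) if nothing matches.
    """
    text = utterance.lower()
    # Operation 606: first-party trigger words take precedence.
    for phrase, agent in first_party_triggers.items():
        if phrase.lower() in text:
            return ("1P", agent)
    # Operation 612: only then look for third-party trigger words.
    for phrase, agent in third_party_triggers.items():
        if phrase.lower() in text:
            return ("3P", agent)
    return (None, None)
```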

The assistant may rank the agents (e.g., based on their capability to perform the task). For example, if the utterance does not include any trigger words associated with a third-party agent ("no" branch of 612), the assistant may determine the capability levels of the first-party agent (616) and the third-party agents (618) to perform the identified task. As one example, to calculate the capability level of the first-party agent, the assistant may calculate a metric that indicates the ability of the assistant to perform the identified task. As another example, the assistant may calculate respective metrics for the respective third-party agents that indicate the ability of each third-party agent to perform the identified task. For example, the assistant may compute a metric for a first 3P agent that indicates the ability of the first 3P agent to perform the identified task, and a metric for a second 3P agent that indicates the ability of the second 3P agent to perform the identified task. In some examples, the metric may be positively correlated with capability, such that higher values indicate greater capability. In some examples, the metric may be negatively correlated with capability, such that lower values indicate greater capability. The metrics may be calculated in various ways. As one example, the metrics may be calculated based on the agent quality scores (either modified or unmodified based on web searches) or other information stored in the agent index 124, as described above.

The assistant may select an agent based on the ranking. For example, the assistant may determine whether the capability level of the first-party agent satisfies a threshold capability level (620). For example, if the metric is positively correlated with capability, the assistant may determine whether the capability level of the first-party agent is greater than or equal to the threshold capability level. If the capability level of the first-party agent satisfies the threshold capability level ("yes" branch of 620), the assistant may select the first-party agent to perform the task (608) and cause the selected first-party agent to perform the task (610).

If the capability level of the first-party agent does not satisfy the threshold capability level ("no" branch of 620), the assistant may determine whether the third-party agent with the greatest capability level (hereinafter "the particular third-party agent") satisfies the threshold capability level (622). If the capability level of the particular third-party agent satisfies the threshold capability level ("yes" branch of 622), the assistant may select the particular third-party agent to perform the task (608) and cause the particular third-party agent to perform the task (610).

As indicated above, in some examples, the assistant may select an agent with a bias toward the first-party agent. For example, by evaluating the first-party agent before evaluating the third-party agents, the assistant may select the first-party agent to perform the task as long as the capability level of the first-party agent satisfies the threshold capability level (even if a third-party agent has a higher capability level than the first-party agent). In other examples, the assistant may select an agent without a bias toward the first-party agent. For example, if the agent with the greatest capability level satisfies the threshold capability level, the assistant may select that agent to perform the task regardless of whether it is a first-party or a third-party agent.
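
The difference between the biased and unbiased selection policies can be made concrete with a short Python sketch. The function name, the `prefer_first_party` flag, the use of the string "1P" to denote the first-party agent, and the assumption that higher values mean greater capability are all choices made for this example.

```python
def select_by_capability(first_party_level, third_party_levels, threshold,
                         prefer_first_party=True):
    """Select an agent by capability level.

    third_party_levels: dict of 3P agent name -> capability level.
    Returns "1P", the name of a 3P agent, or None if no agent qualifies.
    Assumes higher values mean greater capability.
    """
    best_3p = max(third_party_levels, key=third_party_levels.get, default=None)
    if prefer_first_party:
        # Biased policy: the 1P agent wins whenever it clears the threshold.
        if first_party_level >= threshold:
            return "1P"
        if best_3p is not None and third_party_levels[best_3p] >= threshold:
            return best_3p
        return None
    # Unbiased policy: the most capable agent wins, whatever its party.
    candidates = dict(third_party_levels)
    candidates["1P"] = first_party_level
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= threshold else None
```

With a first-party level of 0.7, a third-party agent at 0.9, and a threshold of 0.6, the biased policy returns the first-party agent while the unbiased policy returns the third-party agent.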

If the capability level of the particular third-party agent does not satisfy the threshold capability level ("no" branch of 622), the assistant may determine the capability levels of the unconfigured third-party agents (624) and determine whether the unconfigured third-party agent having the greatest capability level (hereinafter "the particular unconfigured third-party agent") satisfies the threshold capability level (626). If the capability level of the particular unconfigured third-party agent satisfies the threshold capability level ("yes" branch of 626), the assistant may offer to configure the particular unconfigured third-party agent. For example, the assistant may output synthesized audio data to ask the user whether they want to configure the particular unconfigured third-party agent. If the user indicates that they want to configure the particular unconfigured third-party agent (thereby converting it into a particular third-party agent), the assistant may select the particular third-party agent to perform the task (608) and cause the particular third-party agent to perform the task (610).

As described above, some 3P agents may need to be configured (e.g., enabled or activated) before being used by the assistant. In general, it may be desirable for the assistant to select a preconfigured agent to perform the task. However, if no other agent is capable, it may be desirable for the assistant to evaluate an unconfigured agent to perform a task. For example, if the first party agent and any configured third party agents are unable to perform the identified task, the assistant may evaluate the unconfigured agents to perform the identified task.

If the capability level of the particular unconfigured third-party agent does not satisfy the threshold capability level ("no" branch of 626), the assistant may output an indication that the utterance cannot be satisfied (628). For example, the assistant may output synthesized audio data to say that the assistant is "not sure how to help with that".
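The order of checks at 620 through 628 can be sketched as a simple cascade. The sketch below is illustrative only: the function names, the dictionary representation of capability levels, and the `user_agrees_to_configure` callback are assumptions, and the metric is assumed to be positively correlated with capability. The text does not say what happens when the user declines configuration; the sketch treats that case as an utterance that cannot be satisfied.

```python
from typing import Callable, Dict, Optional, Tuple

Scores = Dict[str, float]  # agent name -> hypothetical capability metric


def best_meeting(scores: Scores, threshold: float) -> Optional[str]:
    """Return the highest-scoring agent name if it meets the threshold."""
    if not scores:
        return None
    name = max(scores, key=lambda n: scores[n])
    return name if scores[name] >= threshold else None


def select_agent(
    first_party_score: float,
    configured: Scores,
    unconfigured: Scores,
    threshold: float,
    user_agrees_to_configure: Callable[[str], bool],
) -> Tuple[str, Optional[str]]:
    # 620: does the first-party agent satisfy the threshold?
    if first_party_score >= threshold:
        return ("first_party", None)
    # 622: does the most capable configured third-party agent satisfy it?
    name = best_meeting(configured, threshold)
    if name is not None:
        return ("third_party", name)
    # 624/626: fall back to unconfigured agents and offer configuration.
    name = best_meeting(unconfigured, threshold)
    if name is not None and user_agrees_to_configure(name):
        return ("third_party", name)
    # 628: no agent qualifies (or, by assumption, the user declined).
    return ("cannot_satisfy", None)
```

Passing the user prompt as a callback keeps the cascade testable without any audio input or output.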

FIG. 7 is a flowchart illustrating example operations performed by one or more processors to facilitate task performance by a plurality of virtual agents, in accordance with one or more aspects of the present disclosure. FIG. 7 is described below in the context of system 100 of FIG. 1. For example, in accordance with one or more aspects of the present disclosure, local assistant module 122A, when executed at one or more processors of computing device 110, may perform one or more of operations 702-710. And in some examples, in accordance with one or more aspects of the present disclosure, remote assistant module 122B, when executed at one or more processors of assistant server system 160, may perform one or more of operations 702-710. For purposes of illustration only, FIG. 7 is described below in the context of computing device 110 of FIG. 1.

Some tasks that may be performed by an assistant and/or agent may be considered multi-element tasks. A multi-element task may be a task having elements that may be executed by different agents in order to complete the entire task. While elements of a multi-element task may be executed by multiple agents (e.g., a first agent may execute a first element of a two-element task and a second agent may execute a second element), a single agent may still be able to execute all elements. In some examples, the selection of another agent for performing a subset of the elements of the multi-element task may be considered an element of the multi-element task.

In operation, computing device 110 may receive a representation of an utterance spoken at computing device 110 (702). For example, one or more microphones of UID 112 of computing device 110 may generate audio data representing a user of computing device 110 saying "get me a large cheese pizza delivered to home". UID 112 may provide the audio data to an assistant provided by local assistant module 122A and/or remote assistant module 122B of assistant server system 160.

A first computing agent of the plurality of computing agents may identify a multi-element task to perform based on the utterance (704). For example, where the utterance is "get me a large cheese pizza for delivery," a first computing agent (e.g., an assistant or agent provided by one of the local 3P agent modules 128A of fig. 1) may identify a multi-element task as having the following elements: 1) determining the location of delivery, 2) selecting an agent to order pizza, and 3) processing the order for large cheese pizza.

The first computing agent may perform a first subset of the elements of the multi-element task (706), including selecting a second computing agent to perform a second subset of the elements of the multi-element task (708). For example, the first computing agent may determine the delivery location and select an agent to order the pizza. To determine the delivery location, the first computing agent may ask the user where they want the pizza delivered. For example, the first computing agent may cause computing device 110 to output synthesized audio data asking "where would you like that delivered". The first computing agent may receive the user's reply via one or more microphones of computing device 110. The first computing agent may then select a second computing agent to order the pizza to the provided address. For example, the first computing agent may use the techniques of FIGS. 6A-6C to select the second computing agent. In this example, assuming that the utterance does not include any trigger words for an agent, the first computing agent may select the second computing agent based on the capability levels of the agents to arrange delivery of a pizza to the address. The first computing agent may communicate with the selected second computing agent to cause the second computing agent to process the order for a large cheese pizza.

The first computing agent may receive an indication of an action performed by the second computing agent (710). For example, the first computing agent may receive confirmation from the second computing agent that a large cheese pizza has been ordered and is expected to be delivered to the provided address within a specified time. Where the first computing agent is a first-party agent, the first computing agent may use the indication of the action performed by the second computing agent to monitor the performance of the second computing agent (e.g., to modify an agent quality score of the second computing agent). Determining that a task is a multi-element task and splitting the elements of the task between different agents allows the most appropriate agent to perform any given step of the task. It may also allow elements of a multi-element task to be performed in parallel. In addition, user interaction with computing device 110 is improved. For example, as described above, the user may be guided through the process of ordering a pizza.
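The pizza example above can be sketched as a first agent that handles two elements itself (determining the address and selecting a second agent) and delegates the third. This is a hedged illustration only: `run_pizza_task`, `TaskLog`, and the callback signatures are hypothetical names introduced here, and the callbacks stand in for the audio prompt and for the agent selection of FIGS. 6A-6C.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

OrderFn = Callable[[str, str], str]  # (item, address) -> confirmation text


@dataclass
class TaskLog:
    steps: List[str] = field(default_factory=list)


def run_pizza_task(
    ask_user: Callable[[str], str],
    order_agents: Dict[str, OrderFn],
    pick_agent: Callable[[Dict[str, OrderFn]], str],
) -> TaskLog:
    log = TaskLog()
    # Element 1, performed by the first agent: determine the delivery location.
    address = ask_user("Where would you like that delivered?")
    log.steps.append(f"address={address}")
    # Element 2, performed by the first agent: select the second agent (708).
    chosen = pick_agent(order_agents)
    log.steps.append(f"selected={chosen}")
    # Element 3, performed by the second agent: process the order.
    confirmation = order_agents[chosen]("large cheese pizza", address)
    # 710: the first agent receives an indication of the action performed.
    log.steps.append(f"confirmation={confirmation}")
    return log
```

The returned log is the kind of record a first-party agent could use when monitoring the second agent, for example when adjusting an agent quality score.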

FIG. 8 is a flowchart illustrating example operations performed by one or more processors to select a voice for use in outputting synthesized audio data for text generated by a virtual agent, in accordance with one or more aspects of the present disclosure. FIG. 8 is described below in the context of system 100 of FIG. 1. For example, in accordance with one or more aspects of the present disclosure, local assistant module 122A, when executed at one or more processors of computing device 110, may perform one or more of operations 802-812. And in some examples, in accordance with one or more aspects of the present disclosure, remote assistant module 122B, when executed at one or more processors of assistant server system 160, may perform one or more of operations 802-812. For purposes of illustration only, FIG. 8 is described below in the context of computing device 110 of FIG. 1.

In operation, computing device 110 may receive a representation of an utterance spoken at computing device 110 (802). For example, one or more microphones of UID 112 of computing device 110 may generate audio data representing a user of computing device 110 saying "ask Food Agent what I can substitute for baking powder". UID 112 may provide the audio data to an assistant provided by local assistant module 122A and/or remote assistant module 122B of assistant server system 160.

The assistant may select an agent from a plurality of agents based on the utterance (804). For example, the assistant may use the techniques of FIGS. 6A-6C to select an agent to satisfy the utterance. In examples where the utterance is "ask Food Agent what I can substitute for baking powder" and "Food Agent" is a third-party agent, the assistant may select Food Agent to satisfy the utterance.

The selected agent may respond to the utterance by causing computing device 110 to output synthesized audio data. For example, the selected agent may provide text on which computing device 110 may perform text-to-speech (TTS) to generate the synthesized audio data. However, rather than generating synthesized audio data for all agents using the same voice, it may be desirable for different agents to use different voices. Additionally, it may be desirable for a user to be able to discern whether they are interacting with a first-party agent or a third-party agent.

According to one or more techniques of this disclosure, a first-party agent may output synthesized audio data using a reserved voice of a plurality of voices, while a third-party agent may output synthesized audio data using a voice of the plurality of voices that is different from the reserved voice. As such, the techniques of this disclosure enable a first 3P agent to output synthesized audio data using a different voice than a second 3P agent, while still providing the user with an indication of when they are interacting with a 1P agent (i.e., synthesized audio data that uses the reserved voice). Thus, additional information may be encoded in the audio data output to the user: the voice used for output may convey information about the agent with which the user is interacting. An example of these voice selection techniques is illustrated in FIG. 8 and described below.

The assistant may determine whether the selected agent is a first-party agent (806). In examples where the utterance is "ask Food Agent what I can substitute for baking powder" and Food Agent is selected to satisfy the utterance, the assistant may determine that the selected agent is not a first-party agent.

If the selected agent is a first-party agent ("yes" branch of 806), the selected agent (e.g., the assistant or another 1P agent) may select the reserved voice from the plurality of voices (808) and output synthesized audio data using the selected voice (812). For example, where the utterance is "set my downstairs thermostat to 71" and the selected agent is a 1P agent, the 1P agent may cause computing device 110 to output synthesized audio data, using the reserved voice, saying "setting your downstairs thermostat to seventy-one degrees".

If the selected agent is not a first-party agent ("no" branch of 806), the selected agent may select a non-reserved voice from the plurality of voices (810) and output synthesized audio data using the selected voice (812). For example, where the utterance is "ask Food Agent what I can substitute for baking powder" and the selected agent is the Food Agent 3P agent, Food Agent may cause computing device 110 to output synthesized audio data, using a voice of the plurality of voices that is different from the reserved voice, saying "you can replace one teaspoon with one-quarter teaspoon plus five-eighths teaspoon of cream of tartar".
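The branch at 806 through 810 can be sketched as a small selection function. The sketch is illustrative only: the voice identifiers, the `assigned` mapping, and the policy of giving each 3P agent a stable non-reserved voice that other 3P agents do not use (while the pool lasts) are assumptions made here, since the text only requires that 3P agents use a voice other than the reserved one.

```python
from typing import Dict, Sequence

RESERVED_VOICE = "voice_reserved"  # hypothetical identifier for the 1P voice


def select_voice(
    agent_name: str,
    is_first_party: bool,
    voices: Sequence[str],
    assigned: Dict[str, str],
) -> str:
    # 808: first-party agents always use the reserved voice.
    if is_first_party:
        return RESERVED_VOICE
    # Assumption: a 3P agent keeps the voice it was given earlier.
    if agent_name in assigned:
        return assigned[agent_name]
    # 810: third-party agents choose among the non-reserved voices.
    unreserved = [v for v in voices if v != RESERVED_VOICE]
    if not unreserved:
        raise ValueError("no non-reserved voice available")
    # Prefer a voice no other 3P agent uses; reuse one if the pool runs out.
    unused = [v for v in unreserved if v not in assigned.values()]
    voice = (unused or unreserved)[0]
    assigned[agent_name] = voice
    return voice
```

Whatever assignment policy is used, the invariant that matters is that no 3P agent is ever given the reserved voice, so hearing that voice reliably signals a 1P agent.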

In some examples, the utterance may be satisfied by outputting synthesized audio data to read a list. For example, where the task identified based on the utterance is a search, satisfying the utterance may include outputting synthesized audio data to read a list of search results. In some examples, a single agent may read all elements of the list using a single voice. For example, the first-party agent may read the complete list of search results using the reserved voice. In some examples, a single agent may use different voices when reading different subsets of the elements of a list. For example, the first-party agent may use a non-reserved voice when outputting synthesized audio data representing a first subset of the search results and use the reserved voice when outputting synthesized audio data representing a second subset of the search results. In some examples, multiple agents may use different voices to read different portions of the list. For example, a first agent may use a first voice when outputting synthesized audio data representing a first subset of the search results, and a second agent may use a second voice when outputting synthesized audio data representing a second subset of the search results. An adaptive interface is thus provided in which the output of data is adjusted based on the data itself.
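Reading different subsets of a list in different voices can be sketched as building a readout plan of (voice, text) pairs that a TTS stage would then render. This is a minimal sketch under stated assumptions: the function name, the single split index, and the voice identifiers are hypothetical, and the text does not specify how the subsets are chosen.

```python
from typing import List, Sequence, Tuple


def plan_list_readout(
    results: Sequence[str],
    split_at: int,
    first_voice: str,
    second_voice: str,
) -> List[Tuple[str, str]]:
    """Pair each result with the voice that should read it."""
    # Clamp the split so out-of-range values still yield a valid plan.
    split_at = max(0, min(split_at, len(results)))
    plan = [(first_voice, r) for r in results[:split_at]]
    plan += [(second_voice, r) for r in results[split_at:]]
    return plan
```

Passing the same voice for both arguments covers the case in which a single agent reads the whole list in one voice.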

As described above, the assistant may cause an agent to perform the task (or at least some elements of the task). In some examples, the assistant may cause the selected agent to perform the task by invoking the selected agent. For example, the assistant may send a request to perform the task to the selected agent (e.g., at an entry point of the selected agent, which may be identified from the agent index). In some examples, when caused to perform a task, the selected agent may perform the task locally. For example, when a 3P agent provided by a local 3P agent module of local 3P agent modules 228 of FIG. 2 is invoked to perform a task, that local 3P agent module may execute at processor 240 to perform the task. In some examples, when caused to perform a task, the selected agent may perform the task remotely. For example, when a 3P agent provided by 3P agent module 428 of FIG. 4 is invoked to perform a task, 3P agent module 428 may execute at processor 440 to perform the task. In some examples, when caused to perform a task, the selected agent may perform the task in a manner split between the local and remote environments. For example, when a 3P agent provided by a local 3P agent module of local 3P agent modules 228 of FIG. 2 and/or a corresponding remote 3P agent module of remote 3P agent modules 128 is invoked to perform a task, the agent may execute at one or both of a processor of computing device 110 and a processor of a 3P agent system that includes the corresponding remote 3P agent module of remote 3P agent modules 128.

The following numbered examples may illustrate one or more aspects of the present disclosure:

Example 1. A method, comprising: receiving, by a computing assistant executing at one or more processors, a representation of an utterance spoken at a computing device; selecting an agent from a plurality of agents based on the utterance, wherein the plurality of agents includes one or more first-party agents and a plurality of third-party agents; in response to determining that the selected agent comprises a first-party agent, selecting a reserved voice from a plurality of voices; and outputting, using the selected voice, synthesized audio data for playback by one or more speakers of the computing device to satisfy the utterance.

Example 2. The method of example 1, wherein the utterance comprises a first utterance, the method further comprising: receiving a representation of a second utterance spoken at the computing device; selecting a second agent from the plurality of agents based on the second utterance; in response to determining that the selected second agent comprises a third-party agent, selecting a voice from the plurality of voices that is different from the reserved voice; and outputting synthesized audio data to satisfy the second utterance using the selected voice.

Example 3. The method of any combination of examples 1-2, further comprising: obtaining a plurality of search results based on the utterance; and outputting synthesized audio data representing a first subset of the search results using a voice of the plurality of voices that is different from the reserved voice, wherein outputting the synthesized audio data to satisfy the utterance using the selected voice comprises: outputting synthesized audio data representing a second subset of the search results using the reserved voice.

Example 4. The method of any combination of examples 1-3, wherein the one or more processors are included in the computing device.

Example 5. The method of any combination of examples 1-3, wherein the one or more processors are included in a computing system.

Example 6. A computing device, comprising: at least one processor; and at least one memory including instructions that, when executed, cause the at least one processor to execute an assistant configured to perform the method of any combination of examples 1-3. The computing device may include or be operatively connected to one or more microphones. The one or more microphones may be used to receive a representation of an utterance.

Example 7. A computing system, comprising: one or more communication units; at least one processor; and at least one memory including instructions that, when executed, cause the at least one processor to execute an assistant configured to perform the method of any combination of examples 1-3. The computing system may receive, via the one or more communication units, a representation of an utterance from a computing device.

Example 8. A computing system comprising means for performing the method of any combination of examples 1-3.

Example 9. A computer-readable storage medium storing instructions that, when executed, cause one or more processors to execute an assistant configured to perform the method of any combination of examples 1-3.

In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. The computer readable medium may comprise a computer readable storage medium corresponding to a tangible medium such as a data storage medium or a communication medium including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, the computer-readable medium may generally correspond to (1) a tangible computer-readable storage medium, which is non-transitory, or (2) a communication medium, such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.

By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, Application Specific Integrated Circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor" as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functions described herein may be provided within dedicated hardware and/or software modules. Furthermore, these techniques may be fully implemented in one or more circuits or logic elements.

The techniques of this disclosure may be implemented in various devices or apparatuses, including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require implementation by different hardware units. Rather, as noted above, the various units may be combined in a hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.

Various examples have been described. These and other examples are within the scope of the following claims.

Claims

1. A method, comprising:

receiving, by a computing assistant executing at one or more processors, a representation of an utterance spoken at a computing device;

identifying a task to be performed based on the utterance;

determining a level of capability of a first party agent to perform the task;

determining a capability level of a respective third-party agent of a plurality of third-party agents to perform the task;

in response to determining that the capability level of the first-party agent does not satisfy a threshold capability level, the capability level of a particular third-party agent of the plurality of third-party agents being a maximum of the determined capability levels and the capability level of the particular third-party agent satisfying the threshold capability level, selecting the particular third-party agent to perform the task; and

causing the selected agent to perform the task.

2. The method of claim 1, wherein the plurality of third party agents comprises a plurality of preconfigured third party agents, the method further comprising:

determining a capability level of a respective third-party agent of a plurality of unconfigured third-party agents to perform the task; and

in response to determining that the capability level of the particular one of the plurality of preconfigured third-party agents does not satisfy a threshold capability level, the capability level of the particular one of the plurality of unconfigured third-party agents being the greatest of the determined capability levels of the plurality of unconfigured third-party agents and the capability level of the particular one of the plurality of unconfigured third-party agents satisfying the threshold capability level, selecting the particular one of the plurality of unconfigured third-party agents to perform the task.

3. The method of claim 1, further comprising:

in response to determining that the utterance includes one or more trigger words associated with the first party agent, selecting the first party agent to perform the task; and

in response to determining that the utterance includes one or more trigger words associated with a third-party agent, selecting the third-party agent to perform the task.

4. The method of claim 3, wherein selecting a third-party agent in response to determining that the utterance includes one or more trigger words associated with the third-party agent further comprises:

identifying one or more of the plurality of third-party agents associated with a trigger included in the utterance;

ranking the identified third-party agents based at least in part on a comparison between information related to the identified third-party agents and text determined from the utterance; and

based at least in part on the ranking, selecting an agent of the identified third-party agents to satisfy the utterance.

5. The method of claim 4, wherein the ranking is further based on a predetermined score for each of the identified third-party agents.

6. The method of claim 1, wherein the one or more processors are included in the computing device.

7. The method of claim 1, wherein the one or more processors are included in a computing system.

8. A computing device, comprising:

at least one processor; and

at least one memory comprising instructions that, when executed, cause the at least one processor to perform an assistant configured to:

receiving a representation of an utterance spoken at one or more microphones operatively connected to the computing device;

identifying a task to be performed based on the utterance;

determining a level of capability of a first party agent to perform the task;

causing the selected agent to perform the task.

9. A computing system, comprising:

one or more communication units;

at least one processor; and

receiving, from a computing device via the one or more communication units, a representation of an utterance spoken at the computing device;

identifying a task to be performed based on the utterance;

determining a level of capability of a first party agent to perform the task;

causing the selected agent to perform the task.

10. A computer-readable storage medium storing instructions that, when executed, cause one or more processors to execute an assistant configured to:

receiving a representation of an utterance spoken at the computing device;

identifying a task to be performed based on the utterance;

determining a level of capability of a first party agent to perform the task;

causing the selected agent to perform the task.

11. A method, comprising:

receiving, by a computing assistant executing at one or more processors, a representation of an utterance spoken by a user of a computing device;

selecting an agent from a plurality of agents based on the utterance, wherein the plurality of agents includes one or more first-party agents and a plurality of third-party agents;

in response to determining that the selected agent comprises a third party agent, selecting a voice from a plurality of voices of the third party agent, wherein the selected voice is different from a reserved voice, and wherein the reserved voice is associated with the one or more first party agents; and

outputting, by one or more speakers of the computing device, synthesized audio data of the third-party agent to satisfy the utterance using the selected speech;

after outputting the synthesized audio data using the selected speech to satisfy the utterance:

outputting, by one or more of the speakers of the computing device, a request for feedback from a user of the computing device regarding the third-party agent using the reserved voice associated with the one or more first-party agents, and

receiving, in response to outputting the request for feedback, a representation of a user emotion of the third-party agent; and

updating, in one or more databases, a value that affects whether the third-party agent was selected in response to a future occurrence of the utterance based on the user emotion.

12. The method of claim 11, further comprising:

after updating the value that affects whether the third-party agent was selected in response to a future occurrence of the utterance:

receiving, by the computing assistant, additional representations of the utterances spoken by additional users of one or more additional computing devices; and

selecting the third-party agent of the plurality of third-party agents based at least in part on the updated values.

13. The method of claim 11, further comprising:

prior to receiving a representation of the utterance spoken from a user of the computing device:

receiving, by the computing assistant, a plurality of prior representations of the utterance previously spoken by a plurality of users of a plurality of additional computing devices,

outputting, by one or more speakers of the additional computing device, a corresponding prior instance of synthesized audio data of the third-party agent to satisfy the utterance using the selected speech, and

after outputting the corresponding prior instance of the synthesized audio data using the selected speech to satisfy the utterance:

outputting, by one or more of the speakers of the one or more additional computing devices, a corresponding prior additional request for feedback from each of the plurality of users regarding the third-party agent using the reserved voice associated with the one or more first-party agents, and

receiving, from each of the plurality of users, a corresponding additional representation of prior user emotions of the third party agent in response to outputting a corresponding prior additional request for feedback, and

updating a prior version of the value in one or more of the databases based on the prior user emotion.

14. The method of claim 11, wherein the representation of the user emotion of the third party agent is at least one of: user comments or user ratings.

15. The method of claim 14, wherein updating the value based on the user emotion comprises: updating the value based on whether the representation of the user emotion indicates a positive user emotion or a negative user emotion.

16. The method of claim 11, wherein updating the value that affects whether the third-party agent was selected in response to a future occurrence of the utterance affects a ranking of the third-party agent among the plurality of third-party agents.

17. The method of claim 16, wherein the one or more capabilities of the third-party agent further affect a ranking of the third-party agent among the plurality of third-party agents.

18. The method of claim 17, wherein the one or more capabilities of the third-party agent are registered with the computing assistant when the third-party agent is published.

19. The method of claim 11, further comprising:

determining that the utterance includes a multi-element task to be performed by at least one of the plurality of agents, wherein the multi-element task includes at least a first subset of elements and a second subset of elements;

causing the selected third-party agent to perform the first subset of elements in the multi-element task;

determining that the selected third-party agent is unable to perform the second subset of elements in the multi-element task;

selecting, based on the utterance and based on a determination that the selected third-party agent is unable to perform the second subset of elements, an additional third-party agent to perform the second subset of elements in the multi-element task;

in response to selecting the additional third party agent, selecting an additional voice from the plurality of voices, wherein the selected additional voice is different from the reserved voice; and

causing the additional third-party agent to perform the second subset of elements of the multi-element task,

wherein outputting the synthesized audio data further comprises outputting, by one or more speakers of the computing device, additional synthesized audio data from the additional third-party agent to satisfy the utterance using the selected additional voice, and

wherein outputting the request for feedback further comprises outputting, by one or more of the speakers of the computing device, an additional request for feedback from a user of the computing device regarding the additional third-party agent using the reserved voice associated with the one or more first-party agents.
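
For illustration only, and not as part of the claims, the division of a multi-element task in claim 19 can be sketched in Python as follows. The voice names, the `plan_task` and `feedback_requests` helpers, and the representation of agent capabilities as sets of task elements are all assumptions made for this example.

```python
RESERVED_VOICE = "reserved"            # hypothetical: voice kept for first-party agents
OTHER_VOICES = ["voice_a", "voice_b"]  # hypothetical non-reserved voices


def plan_task(elements, selected, agents):
    """Split task elements between the selected agent and one additional agent.

    `agents` maps an agent name to the set of elements it can perform.
    Returns a list of (agent, elements, voice) assignments.
    """
    first = [e for e in elements if e in agents[selected]]
    second = [e for e in elements if e not in agents[selected]]
    plan = [(selected, first, OTHER_VOICES[0])]
    if second:
        # Pick any other agent able to perform every remaining element.
        additional = next(
            (name for name, caps in agents.items()
             if name != selected and all(e in caps for e in second)),
            None,
        )
        if additional is None:
            raise LookupError("no agent can perform the remaining elements")
        plan.append((additional, second, OTHER_VOICES[1]))
    return plan


def feedback_requests(plan):
    """Request feedback about each agent, always in the reserved voice."""
    return [(RESERVED_VOICE, f"How was {agent}?") for agent, _, _ in plan]
```

In this sketch each third-party agent speaks in its own non-reserved voice, while every request for feedback is spoken in the reserved voice.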

20. The method of claim 19, further comprising:

receiving, in response to outputting the additional request for feedback, an additional representation of an additional user emotion of the additional third-party agent; and

updating, in one or more databases, additional values that affect whether the third-party agent is selected in response to a future occurrence of the utterance based on the additional user emotion.

21. The method of claim 11, wherein the request for feedback from the user of the computing device regarding the third-party agent includes identification of the third-party agent that satisfies the utterance.

22. A computing system, comprising:

at least one processor; and

receiving, by the assistant, a representation of an utterance spoken by a user of a computing device;

outputting, by one or more of the speakers, a request for feedback from the user regarding the third-party agent using the reserved voice associated with the one or more first-party agents, and

23. A non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors to execute an assistant configured to:

receive a representation of an utterance spoken by a user of a computing device;

24. A method, comprising:

receiving, by one or more processors, a representation of an utterance spoken at a computing device;

identifying, by a first computing agent from a plurality of computing agents, a multi-element task to be performed based on the utterance; and

executing, by the first computing agent, a first subset of elements in the multi-element task, wherein executing the first subset of elements comprises selecting a second computing agent from the plurality of computing agents to execute a second subset of elements in the multi-element task, wherein:

the first computing agent is a first-party computing agent and the second computing agent is a third-party computing agent, or the first computing agent is a third-party computing agent and the second computing agent is a first-party computing agent.

25. The method of claim 24, wherein executing the first subset of elements further comprises:

determining, by the first computing agent, that additional information is needed to execute the second subset of elements;

collecting, by the first computing agent, the additional information; and

outputting, by the first computing agent, the collected additional information to the selected second computing agent.

26. The method of claim 25, wherein outputting the collected additional information to the selected second computing agent comprises:

in response to determining, by the first computing agent based on the authorization data store, that the selected second computing agent is authorized to receive the collected additional information, outputting, by the first computing agent, the collected additional information to the selected second computing agent.
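
For illustration only, and not as part of the claims, the authorization check of claim 26 can be sketched in Python as follows. The layout of the authorization data store (a mapping from agent name to the categories of information it may receive), the function name `forward_if_authorized`, and the `send` callback are assumptions made for this example.

```python
def forward_if_authorized(info, recipient, authorization_store, send):
    """Forward collected information only when the recipient is authorized.

    `info` maps an information category to its value. `authorization_store`
    maps an agent name to the set of categories it may receive (an assumed
    layout). `send` is a hypothetical callback that delivers the information.
    Returns True if the information was forwarded, False otherwise.
    """
    allowed = authorization_store.get(recipient, set())
    if set(info) <= allowed:
        send(recipient, info)
        return True
    return False
```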

27. The method of claim 24, wherein selecting the second computing agent from the plurality of computing agents to execute the second subset of elements comprises:

determining a capability level of a respective computing agent of the plurality of agents to execute the second subset of elements; and

in response to determining that the capability level of a particular agent of the plurality of agents is the greatest of the determined capability levels and that the capability level of the particular agent satisfies a threshold capability level, selecting the particular agent as the second computing agent.
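
For illustration only, and not as part of the claims, the selection rule of claim 27 can be sketched in Python as follows. The function name `select_second_agent`, the numeric capability levels, and the reading of "satisfies a threshold" as "greater than or equal to" are assumptions made for this example.

```python
def select_second_agent(capability_levels, threshold):
    """Return the agent with the greatest capability level, or None.

    `capability_levels` maps an agent name to its determined capability
    level for the second subset of elements. The agent is returned only
    if its level also meets the threshold (assumed here to mean >=).
    """
    if not capability_levels:
        return None
    best = max(capability_levels, key=capability_levels.get)
    return best if capability_levels[best] >= threshold else None
```

Requiring both conditions means that no agent is selected when even the most capable candidate falls below the threshold.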

28. The method of claim 24, further comprising:

receiving, by the first computing agent from the second computing agent, an indication of an action performed by the second computing agent.

29. The method of claim 24, wherein the one or more processors are included in the computing device.

30. The method of claim 24, wherein the one or more processors are included in a computing system.

31. A computing device, comprising:

at least one processor; and

at least one memory comprising instructions that, when executed, cause the at least one processor to execute a first computing agent from a plurality of computing agents, the first computing agent configured to:

receive a representation of an utterance spoken at a computing device;

identify a multi-element task to be performed based on the utterance; and

execute a first subset of elements of the multi-element task, wherein executing the first subset of elements comprises selecting a second computing agent from the plurality of computing agents to execute a second subset of elements of the multi-element task.

32. A computing system, comprising:

one or more communication units;

at least one processor; and

at least one memory comprising instructions that, when executed, cause the at least one processor to execute a first computing agent configured to:

identify a multi-element task to be performed based on the utterance; and