CN116802601A - Digital assistant control for application program - Google Patents

Info

Publication number
CN116802601A
Authority
CN (China)
Prior art keywords
application, natural language, utterance, electronic device, digital assistant
Legal status
Pending
Application number
CN202180073163.0A
Other languages
Chinese (zh)
Inventors
K·皮索尔, N·泰勒, L·N·珀金斯, T·R·奥里奥尔, K·S·布里森, C·布雷, H·加斯德诺, J·配克, L·西莫内利
Current Assignee
Apple Inc
Original Assignee
Apple Inc
Application filed by Apple Inc
Priority claimed from PCT/US2021/048036 (WO2022047214A2)

Abstract

Systems and processes for operating a digital assistant are provided. An exemplary method includes, at an electronic device having one or more processors and memory, while an application is open on the electronic device: receiving spoken input including a command; determining whether the command matches at least a portion of metadata associated with an operation of the application; and, in accordance with a determination that the command matches at least the portion of the metadata associated with the operation of the application: associating the command with the operation, storing the association of the command with the operation for subsequent use with the application by the digital assistant, and performing the operation with the application.

Description

Digital assistant control for application program
RELATED APPLICATIONS
The present application relates to U.S. provisional patent application serial No. 63/071,087, entitled "DIGITAL ASSISTANT CONTROL OF APPLICATIONS", filed 8/27/2020, and claims the benefit of U.S. provisional patent application serial No. 63/113,032, entitled "DIGITAL ASSISTANT CONTROL OF APPLICATIONS", filed 11/12/2020, the contents of both of which are incorporated herein by reference in their entirety for all purposes.
Technical Field
The present disclosure relates generally to digital assistants, and more particularly, to enabling digital assistants to understand new commands.
Background
An intelligent automated assistant (or digital assistant) may provide an advantageous interface between a human user and an electronic device. Such assistants may allow a user to interact with a device or system using natural language, in spoken and/or text form. For example, a user may provide a voice input containing a user request to a digital assistant running on an electronic device. The digital assistant may interpret the user's intent from the voice input and operationalize that intent into tasks. The tasks may then be performed by executing one or more services of the electronic device, and relevant output responsive to the user request may be returned to the user.
In some cases, the digital assistant may interact with a new application or receive a new command. The digital assistant may therefore need training before it can interact with the application or process commands to perform one or more tasks as described above. This can be cumbersome and time consuming, creating a hurdle for developers who wish to integrate their applications with digital assistants and for users who want higher-level access to different tasks through the digital assistant.
Disclosure of Invention
Exemplary methods are disclosed herein. An exemplary method includes, at an electronic device having one or more processors and memory, while an application is open on the electronic device: receiving spoken input including a command; determining whether the command matches at least a portion of metadata associated with an operation of the application; and, in accordance with a determination that the command matches at least the portion of the metadata associated with the operation of the application: associating the command with the operation, storing the association of the command with the operation for subsequent use with the application by the digital assistant, and performing the operation with the application.
Exemplary non-transitory computer readable media are disclosed herein. An exemplary non-transitory computer readable storage medium stores one or more programs. The one or more programs include instructions, which when executed by one or more processors of an electronic device, cause the electronic device to, while an application is open on the electronic device: receive spoken input including a command; determine whether the command matches at least a portion of metadata associated with an operation of the application; and, in accordance with a determination that the command matches at least the portion of the metadata associated with the operation of the application: associate the command with the operation, store the association of the command with the operation for subsequent use with the application by the digital assistant, and perform the operation with the application.
An exemplary electronic device is disclosed herein. An exemplary electronic device includes one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for, while an application is open on the electronic device: receiving spoken input including a command; determining whether the command matches at least a portion of metadata associated with an operation of the application; and, in accordance with a determination that the command matches at least the portion of the metadata associated with the operation of the application: associating the command with the operation, storing the association of the command with the operation for subsequent use with the application by the digital assistant, and performing the operation with the application.
An exemplary electronic device includes, while an application is open on the electronic device: means for receiving spoken input including a command; means for determining whether the command matches at least a portion of metadata associated with an operation of the application; and, in accordance with a determination that the command matches at least the portion of the metadata associated with the operation of the application, means for associating the command with the operation, means for storing the association of the command with the operation for subsequent use with the application by a digital assistant, and means for performing the operation with the application.
Determining whether the command matches at least a portion of metadata associated with the operation of the application allows the digital assistant to quickly learn new commands and interface with new applications without requiring a lengthy and laborious registration process. In this way, developers can integrate their applications with the digital assistant more effectively. In addition, developers can release their applications more quickly, without having to determine how an application may need to be modified, or what content of the application is needed, to teach the digital assistant to interact with the application. Further, this approach allows the user to use the digital assistant and the application more efficiently, because the digital assistant can learn over time how to interact with the application and thus presents fewer errors to the user. Accordingly, the efficiency of the electronic device increases and its power requirements decrease, so that overall battery efficiency also increases (e.g., because the user does not need to repeat requests or frequently check for updates to the application).
In addition, associating commands with operations and storing those associations for subsequent use by the digital assistant with the application allows the operations to be performed more efficiently. In particular, the digital assistant may consult the stored associations as spoken input is processed to determine whether the user is invoking a known command, and then perform the associated operation based on the previously made, metadata-based determination. In this way, the digital assistant and the electronic device may respond to subsequent user requests more efficiently, improving the efficiency of the electronic device so that overall battery efficiency is also improved (e.g., by reducing the processing required to perform the operation).
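To illustrate this flow, the following Swift sketch shows one way the matching, storing, and performing steps could fit together; the type names, the metadata representation, and the matching rule are assumptions made for illustration and are not taken from the disclosure:

```swift
import Foundation

// Hypothetical representation of an application operation and its metadata.
struct AppOperation {
    let identifier: String        // e.g. "edit.bold"
    let metadata: Set<String>     // lowercase names, synonyms, ontology terms
    let perform: () -> Void       // callback into the application
}

// Stored command-to-operation associations for subsequent use by the assistant.
final class CommandAssociationStore {
    private var associations: [String: String] = [:]   // command -> operation identifier

    func association(for command: String) -> String? {
        associations[command.lowercased()]
    }

    func store(command: String, operationID: String) {
        associations[command.lowercased()] = operationID
    }
}

/// Matches a spoken command against operation metadata, stores the association
/// for later use, and performs the matching operation. Returns false if no
/// operation of the open application matches the command.
func handle(command: String,
            operations: [AppOperation],
            store: CommandAssociationStore) -> Bool {
    let token = command.lowercased()

    // Reuse a previously stored association if one exists.
    if let knownID = store.association(for: token),
       let operation = operations.first(where: { $0.identifier == knownID }) {
        operation.perform()
        return true
    }

    // Otherwise, determine whether the command matches any operation metadata.
    guard let operation = operations.first(where: { $0.metadata.contains(token) }) else {
        return false
    }
    store.store(command: token, operationID: operation.identifier)  // remember for next time
    operation.perform()
    return true
}
```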
An example method includes, at an electronic device having one or more processors and memory: receiving an utterance from a user; determining a first natural language recognition score for the utterance using a first lightweight natural language model associated with a first application; determining a second natural language recognition score for the utterance using a second lightweight natural language model associated with a second application; determining whether the first natural language recognition score exceeds a predetermined threshold; and, in accordance with a determination that the first natural language recognition score exceeds the predetermined threshold, providing the utterance to a complex natural language model associated with the first application and determining a user intent corresponding to the utterance using the complex natural language model.
An exemplary non-transitory computer readable storage medium stores one or more programs. The one or more programs include instructions, which when executed by one or more processors of an electronic device, cause the electronic device to: receive an utterance from a user; determine a first natural language recognition score for the utterance using a first lightweight natural language model associated with a first application; determine a second natural language recognition score for the utterance using a second lightweight natural language model associated with a second application; determine whether the first natural language recognition score exceeds a predetermined threshold; and, in accordance with a determination that the first natural language recognition score exceeds the predetermined threshold, provide the utterance to a complex natural language model associated with the first application and determine a user intent corresponding to the utterance using the complex natural language model.
An exemplary electronic device includes one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for: receiving an utterance from a user; determining a first natural language recognition score for the utterance using a first lightweight natural language model associated with a first application; determining a second natural language recognition score for the utterance using a second lightweight natural language model associated with a second application; determining whether the first natural language recognition score exceeds a predetermined threshold; and, in accordance with a determination that the first natural language recognition score exceeds the predetermined threshold, providing the utterance to a complex natural language model associated with the first application and determining a user intent corresponding to the utterance using the complex natural language model.
An exemplary electronic device includes: means for receiving an utterance from a user; means for determining a first natural language recognition score for the utterance using a first lightweight natural language model associated with a first application; means for determining a second natural language recognition score for the utterance using a second lightweight natural language model associated with a second application; means for determining whether the first natural language recognition score exceeds a predetermined threshold; and means for, in accordance with a determination that the first natural language recognition score exceeds the predetermined threshold, providing the utterance to a complex natural language model associated with the first application and determining a user intent corresponding to the utterance using the complex natural language model.
Determining a first natural language recognition score for the utterance using a first lightweight natural language model associated with the first application, and determining whether that score exceeds a predetermined threshold, allows the digital assistant to decide whether further processing of the utterance is required for the particular application while reducing processing load and conserving battery power. In particular, the lightweight natural language model is simpler than other natural language recognition models, so fewer resources are used to determine the natural language recognition score than would otherwise be required to determine the user's intent. Applications determined not to be relevant to the utterance can thus be skipped, and no further processing need be performed for them. This further improves the user experience by increasing the accuracy and responsiveness of the digital assistant.
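Purely as an illustration of this gating idea (the disclosure does not specify how the lightweight or complex models are built), a lightweight per-application recognizer could be approximated by simple keyword scoring, with the heavier model consulted only when the score clears the threshold; all names below are hypothetical:

```swift
import Foundation

// A deliberately simple stand-in for a "lightweight" per-application model:
// it scores an utterance by the fraction of application keywords it mentions.
struct LightweightNLModel {
    let appName: String
    let keywords: Set<String>

    func recognitionScore(for utterance: String) -> Double {
        let words = Set(utterance.lowercased().split(separator: " ").map(String.init))
        guard !keywords.isEmpty else { return 0 }
        return Double(words.intersection(keywords).count) / Double(keywords.count)
    }
}

// Stand-in for the heavier, application-specific natural language model.
protocol ComplexNLModel {
    func userIntent(for utterance: String) -> String
}

/// Scores the utterance with each lightweight model and invokes a complex model
/// only for an application whose score exceeds the predetermined threshold.
func routeUtterance(_ utterance: String,
                    lightweightModels: [LightweightNLModel],
                    complexModels: [String: any ComplexNLModel],
                    threshold: Double = 0.3) -> String? {
    for model in lightweightModels {
        let score = model.recognitionScore(for: utterance)
        // Applications whose score does not exceed the threshold are skipped,
        // so no further (expensive) processing is performed for them.
        guard score > threshold, let complex = complexModels[model.appName] else { continue }
        return complex.userIntent(for: utterance)
    }
    return nil   // no application considered relevant to the utterance
}
```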
An example method includes, at an electronic device having one or more processors and memory, receiving an utterance from a user, determining one or more representations of the utterance using a speech recognition model that is at least partially trained with data representing an application, providing the one or more representations of the utterance to a plurality of natural language models, wherein at least one of the plurality of natural language models is associated with the application and registered upon receiving data representing the application from a second electronic device, and determining a user intent of the utterance based on the at least one of the plurality of natural language models and a database including a plurality of operations and objects associated with the application.
An exemplary non-transitory computer readable storage medium stores one or more programs. The one or more programs include instructions, which when executed by one or more processors of an electronic device, cause the electronic device to: receive an utterance from a user; determine one or more representations of the utterance using a speech recognition model that is at least partially trained with data representing an application; provide the one or more representations of the utterance to a plurality of natural language models, wherein at least one of the plurality of natural language models is associated with the application and is registered upon receipt of data representing the application from a second electronic device; and determine a user intent of the utterance based on the at least one of the plurality of natural language models and a database including a plurality of operations and objects associated with the application.
An exemplary electronic device includes one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for: receiving an utterance from a user; determining one or more representations of the utterance using a speech recognition model that is at least partially trained with data representing an application; providing the one or more representations of the utterance to a plurality of natural language models, wherein at least one of the plurality of natural language models is associated with the application and is registered upon receipt of data representing the application from a second electronic device; and determining a user intent of the utterance based on the at least one of the plurality of natural language models and a database including a plurality of operations and objects associated with the application.
An exemplary electronic device includes: means for receiving an utterance from a user; means for determining one or more representations of the utterance using a speech recognition model that is at least partially trained with data representing an application; means for providing the one or more representations of the utterance to a plurality of natural language models, wherein at least one of the plurality of natural language models is associated with the application and is registered upon receipt of data representing the application from a second electronic device; and means for determining a user intent of the utterance based on the at least one of the plurality of natural language models and a database comprising a plurality of operations and objects associated with the application.
Determining the user intent of the utterance based on at least one of the plurality of natural language models and a database including a plurality of operations and objects associated with the application enables the digital assistant to determine different user intents depending on which applications are installed and integrated with the digital assistant. New applications may thus be integrated over time, adding functionality to the digital assistant. This in turn increases the user's enjoyment of the digital assistant and the electronic device, while also improving the power efficiency of the electronic device.
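A rough Swift sketch of how registered application data and a database of operations and objects might back intent determination is shown below; the record format, registration call, and lookup logic are assumptions rather than the disclosed implementation:

```swift
import Foundation

// Hypothetical record of the operations and objects an application registers
// with the assistant when data representing the application is received.
struct AppCapabilityRecord {
    let appName: String
    let operations: Set<String>   // e.g. "insert", "bold"
    let objects: Set<String>      // e.g. "picture", "text"
}

struct UserIntent {
    let appName: String
    let operation: String
    let object: String?
}

final class CapabilityDatabase {
    private var records: [String: AppCapabilityRecord] = [:]

    // Called when data representing an application arrives (e.g. at install time).
    func register(_ record: AppCapabilityRecord) {
        records[record.appName] = record
    }

    /// Resolves a per-application natural language result against the stored
    /// operations and objects for that application.
    func resolveIntent(appName: String, operation: String, object: String?) -> UserIntent? {
        guard let record = records[appName], record.operations.contains(operation) else {
            return nil
        }
        let resolvedObject = object.flatMap { record.objects.contains($0) ? $0 : nil }
        return UserIntent(appName: appName, operation: operation, object: resolvedObject)
    }
}

// Example: a newly installed word processor registers its capabilities, after
// which intents referencing its operations can be resolved.
let database = CapabilityDatabase()
database.register(AppCapabilityRecord(appName: "WordProcessor",
                                      operations: ["insert", "bold", "italic"],
                                      objects: ["picture", "text"]))
let intent = database.resolveIntent(appName: "WordProcessor", operation: "insert", object: "picture")
```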
An example method includes, at an electronic device having one or more processors and memory: receiving a user utterance including a request; determining whether the request includes ambiguous terms; in accordance with a determination that the request includes ambiguous terms, providing the user utterance to a reference resolution model; determining a plurality of relevant reference factors using the reference resolution model; determining a relevant application based on the relevant reference factors; and determining an object to which the request refers based on the relevant application.
An exemplary non-transitory computer readable storage medium stores one or more programs. The one or more programs include instructions, which when executed by one or more processors of an electronic device, cause the electronic device to: receive a user utterance including a request; determine whether the request includes ambiguous terms; in accordance with a determination that the request includes ambiguous terms, provide the user utterance to a reference resolution model; determine a plurality of relevant reference factors using the reference resolution model; determine a relevant application based on the relevant reference factors; and determine an object to which the request refers based on the relevant application.
An exemplary electronic device includes one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for: receiving a user utterance including a request; determining whether the request includes ambiguous terms; in accordance with a determination that the request includes ambiguous terms, providing the user utterance to a reference resolution model; determining a plurality of relevant reference factors using the reference resolution model; determining a relevant application based on the relevant reference factors; and determining an object to which the request refers based on the relevant application.
An exemplary electronic device includes: means for receiving a user utterance including a request; means for determining whether the request includes ambiguous terms; means for providing the user utterance to a reference resolution model in accordance with a determination that the request includes ambiguous terms; means for determining a plurality of relevant reference factors using the reference resolution model; means for determining a relevant application based on the relevant reference factors; and means for determining an object to which the request refers based on the relevant application.
Determining the object to which the request refers based on the relevant application enables the digital assistant to perform tasks related to the user input even when that input is ambiguous. This increases user satisfaction with the device, because less back-and-forth communication between the user and the digital assistant is required before the task requested by the user is performed. In addition, this increases the efficiency of the electronic device, because battery power is conserved by using the disambiguation process to determine the object and provide an associated output without querying the user for more information.
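The following Swift sketch shows one hypothetical way to weigh reference factors and pick the object an ambiguous request refers to; the factors chosen and the selection heuristic are assumptions made for illustration only:

```swift
import Foundation

// Hypothetical reference factors a resolution model might weigh when an
// utterance contains an ambiguous term such as "this" or "it".
struct ReferenceFactors {
    let displayedApp: String?                 // application currently in focus
    let recentlyUsedApps: [String]            // most recent first
    let onScreenObjects: [String: [String]]   // app name -> visible objects
}

/// Picks the most relevant application and the object the ambiguous request
/// most likely refers to. A real model would score many more signals.
func resolveReference(ambiguousTerm: String,
                      factors: ReferenceFactors) -> (app: String, object: String)? {
    // Prefer the displayed application, then fall back to recently used ones.
    let candidateApps = [factors.displayedApp].compactMap { $0 } + factors.recentlyUsedApps
    for app in candidateApps {
        guard let objects = factors.onScreenObjects[app], !objects.isEmpty else { continue }
        // Prefer an on-screen object whose name relates to the ambiguous term,
        // otherwise fall back to the most prominent object in that application.
        let object = objects.first { $0.localizedCaseInsensitiveContains(ambiguousTerm) } ?? objects[0]
        return (app, object)
    }
    return nil   // unresolved; the assistant could ask the user to clarify
}

// Example: "delete it" while a photo app is in focus resolves to the visible photo.
let factors = ReferenceFactors(displayedApp: "Photos",
                               recentlyUsedApps: ["Messages"],
                               onScreenObjects: ["Photos": ["selected photo"]])
let resolved = resolveReference(ambiguousTerm: "it", factors: factors)   // ("Photos", "selected photo")
```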
Drawings
FIGS. 1A-1B depict exemplary systems for various computer-generated reality techniques, including virtual reality and mixed reality.
FIG. 2 depicts an exemplary system for mapping and executing user commands.
FIG. 3 depicts an exemplary link interface of the system for mapping user commands to operations.
FIG. 4 depicts an example input command to be mapped and executed.
FIG. 5 depicts an example input command to be mapped and executed.
FIG. 6 depicts an example input command to be mapped and executed.
FIG. 7 depicts an example input command to be mapped and executed.
FIG. 8 is a flowchart illustrating a process for mapping and executing user commands.
FIG. 9 depicts an exemplary digital assistant for performing natural language processing.
FIG. 10 is a flowchart illustrating a process for performing natural language processing.
FIG. 11 is a flow chart illustrating a process for determining and executing tasks with an integrated application.
FIG. 12 depicts an exemplary digital assistant for resolving ambiguous terms of a user utterance.
FIG. 13 depicts an example view of an electronic device for use with a reference resolution process.
FIG. 14 depicts an example view of an electronic device for use with a reference resolution process.
FIG. 15 is a flowchart illustrating a process for resolving ambiguous terms of a user utterance.
Detailed Description
Various examples of electronic systems and techniques for using such systems in connection with various computer-generated reality techniques are described.
A physical environment (or real environment) refers to the physical world that people can sense and/or interact with without the aid of an electronic system. Physical environments, such as a physical park, include physical articles (or physical objects or real objects), such as physical trees, physical buildings, and physical people. People can directly sense and/or interact with the physical environment, such as through sight, touch, hearing, taste, and smell.
Conversely, a computer-generated reality (CGR) environment refers to a wholly or partially simulated environment that people sense and/or interact with via an electronic system. In CGR, a subset of a person's physical movements, or a representation thereof, is tracked, and in response one or more characteristics of one or more virtual objects simulated in the CGR environment are adjusted in a manner consistent with at least one law of physics. For example, a CGR system may detect a person's head turning and, in response, adjust the graphical content and sound field presented to the person in a manner similar to how such views and sounds would change in a physical environment. In some situations (e.g., for accessibility reasons), adjustments to characteristics of virtual objects in a CGR environment may be made in response to representations of physical motions (e.g., voice commands).
A person may sense and/or interact with a CGR object using any one of their senses, including sight, hearing, touch, taste, and smell. For example, a person may sense and/or interact with audio objects that create a 3D or spatial audio environment providing the perception of point audio sources in 3D space. As another example, an audio object may enable audio transparency, which selectively incorporates ambient sounds from the physical environment with or without computer-generated audio. In some CGR environments, a person may sense and/or interact only with audio objects.
Examples of CGR include virtual reality and mixed reality.
A Virtual Reality (VR) environment (or virtual environment) refers to a simulated environment designed to be based entirely on computer-generated sensory input for one or more senses. The VR environment includes a plurality of virtual objects that a person can sense and/or interact with. For example, computer-generated images of trees, buildings, and avatars representing people are examples of virtual objects. A person may sense and/or interact with virtual objects in the VR environment through a simulation of the presence of the person within the computer-generated environment and/or through a simulation of a subset of the physical movements of the person within the computer-generated environment.
In contrast to a VR environment, which is designed to be based entirely on computer-generated sensory input, a Mixed Reality (MR) environment refers to a simulated environment designed to incorporate sensory input from the physical environment, or a representation thereof, in addition to computer-generated sensory input (e.g., virtual objects). On a virtuality continuum, an MR environment is anywhere between, but not including, a wholly physical environment at one end and a VR environment at the other end.
In some MR environments, computer-generated sensory input may respond to changes in sensory input from the physical environment. In addition, some electronic systems for presenting MR environments may track position and/or orientation with respect to the physical environment to enable virtual objects to interact with real objects (i.e., physical objects from the physical environment or representations thereof). For example, a system may account for motion so that a virtual tree appears stationary with respect to the physical ground.
Examples of MR include augmented reality and augmented virtuality.
An Augmented Reality (AR) environment refers to a simulated environment in which one or more virtual objects are superimposed over a physical environment or representation thereof. For example, an electronic system for presenting an AR environment may have a transparent or translucent display through which a person may directly view the physical environment. The system may be configured to present the virtual object on a transparent or semi-transparent display such that a person perceives the virtual object superimposed over the physical environment with the system. Alternatively, the system may have an opaque display and one or more imaging sensors that capture images or videos of the physical environment, which are representations of the physical environment. The system combines the image or video with the virtual object and presents the composition on an opaque display. A person utilizes the system to indirectly view the physical environment via an image or video of the physical environment and perceive a virtual object superimposed over the physical environment. As used herein, video of a physical environment displayed on an opaque display is referred to as "pass-through video," meaning that the system captures images of the physical environment using one or more image sensors and uses those images when rendering an AR environment on the opaque display. Further alternatively, the system may have a projection system that projects the virtual object into the physical environment, for example as a hologram or on a physical surface, such that a person perceives the virtual object superimposed on top of the physical environment with the system.
An AR environment also refers to a simulated environment in which a representation of a physical environment is transformed by computer-generated sensory information. For example, in providing pass-through video, the system may transform one or more sensor images to impose a selected perspective (e.g., viewpoint) different from the perspective captured by the imaging sensors. As another example, a representation of the physical environment may be transformed by graphically modifying (e.g., enlarging) portions of it, such that the modified portions are representative, but not photorealistic, versions of the originally captured images. As a further example, a representation of the physical environment may be transformed by graphically eliminating or obscuring portions of it.
An augmented virtuality (AV) environment refers to a simulated environment in which a virtual or computer-generated environment incorporates one or more sensory inputs from the physical environment. The sensory inputs may be representations of one or more characteristics of the physical environment. For example, an AV park may have virtual trees and virtual buildings, but a person's face is photorealistically reproduced from images taken of a physical person. As another example, a virtual object may take on the shape or color of a physical object imaged by one or more imaging sensors. As a further example, a virtual object may adopt shadows consistent with the position of the sun in the physical environment.
There are many different types of electronic systems that enable a person to sense and/or interact with various CGR environments. Examples include head-mounted systems, projection-based systems, head-up displays (HUDs), vehicle windshields integrated with display capabilities, windows integrated with display capabilities, displays formed as lenses designed for placement on a human eye (e.g., similar to contact lenses), headphones/earphones, speaker arrays, input systems (e.g., wearable or handheld controllers with or without haptic feedback), smart phones, tablet computers, and desktop/laptop computers. The head-mounted system may have one or more speakers and an integrated opaque display. Alternatively, the head-mounted system may be configured to accept an external opaque display (e.g., a smart phone). The head-mounted system may incorporate one or more imaging sensors for capturing images or video of the physical environment, and/or one or more microphones for capturing audio of the physical environment. The head-mounted system may have a transparent or translucent display instead of an opaque display. The transparent or translucent display may have a medium through which light representing an image is directed to the eyes of a person. The display may utilize digital light projection, OLED, LED, uLED, liquid crystal on silicon, laser scanning light sources, or any combination of these techniques. The medium may be an optical waveguide, a holographic medium, an optical combiner, an optical reflector, or any combination thereof. In one example, a transparent or translucent display may be configured to selectively become opaque. Projection-based systems may employ retinal projection techniques that project a graphical image onto a person's retina. The projection system may also be configured to project the virtual object into the physical environment, for example as a hologram or on a physical surface.
Fig. 1A and 1B depict an exemplary system 100 for use in various computer-generated reality techniques.
In some examples, as shown in fig. 1A, system 100 includes a device 100a. Device 100a includes various components such as a processor 102, RF circuitry 104, memory 106, image sensor 108, orientation sensor 110, microphone 112, position sensor 116, speaker 118, display 120, and touch-sensitive surface 122. These components optionally communicate via a communication bus 150 of the device 100a.
In some examples, elements of system 100 are implemented in a base station device (e.g., a computing device, such as a remote server, mobile device, or laptop computer), and other elements of system 100 are implemented in a Head Mounted Display (HMD) device designed to be worn by a user, where the HMD device is in communication with the base station device. In some examples, the device 100a is implemented in a base station device or an HMD device.
As shown in fig. 1B, in some examples, the system 100 includes two (or more) devices in communication, such as through a wired connection or a wireless connection. The first device 100b (e.g., a base station device) includes a processor 102, RF circuitry 104, and a memory 106. These components optionally communicate via a communication bus 150 of the device 100b. The second device 100c (e.g., a head mounted device) includes various components such as a processor 102, RF circuitry 104, memory 106, image sensor 108, orientation sensor 110, microphone 112, position sensor 116, speaker 118, display 120, and touch-sensitive surface 122. These components optionally communicate via a communication bus 150 of the device 100c.
In some examples, system 100 is a mobile device. In some examples, the system 100 is a Head Mounted Display (HMD) device. In some examples, the system 100 is a wearable HUD device.
The system 100 includes a processor 102 and a memory 106. Processor 102 includes one or more general-purpose processors, one or more graphics processors, and/or one or more digital signal processors. In some examples, memory 106 is one or more non-transitory computer-readable storage media (e.g., flash memory, random access memory) storing computer-readable instructions configured to be executed by processor 102 to perform the techniques described below.
The system 100 includes RF circuitry 104. RF circuitry 104 optionally includes circuitry for communicating with electronic devices and with networks, such as the Internet, intranets, and/or wireless networks such as cellular networks and wireless Local Area Networks (LANs). RF circuitry 104 optionally includes circuitry for communicating using near-field communication and/or short-range communication (such as Bluetooth®).
The system 100 includes a display 120. In some examples, the display 120 includes a first display (e.g., a left-eye display panel) and a second display (e.g., a right-eye display panel), each for displaying images to a respective eye of a user. Corresponding images are displayed on both the first display and the second display. Optionally, the corresponding image comprises representations of the same virtual object and/or the same physical object from different viewpoints, thereby producing a parallax effect that provides the user with a stereoscopic effect of the object on the display. In some examples, the display 120 includes a single display. For each eye of the user, the corresponding images are displayed simultaneously on the first and second areas of the single display. Optionally, the corresponding images comprise representations of the same virtual object and/or the same physical object from different viewpoints, thereby producing a parallax effect that provides the user with a stereoscopic effect of objects on a single display.
In some examples, the system 100 includes a touch-sensitive surface 122 for receiving user inputs, such as tap inputs and swipe inputs. In some examples, display 120 and touch-sensitive surface 122 form a touch-sensitive display.
The system 100 includes an image sensor 108. The image sensor 108 optionally includes one or more visible light image sensors, such as Charge Coupled Device (CCD) sensors, and/or Complementary Metal Oxide Semiconductor (CMOS) sensors operable to obtain an image of a physical object from a real environment. The image sensor also optionally includes one or more Infrared (IR) sensors, such as passive IR sensors or active IR sensors, for detecting infrared light from the real environment. For example, active IR sensors include IR emitters, such as IR point emitters, for emitting infrared light into the real environment. The image sensor 108 also optionally includes one or more event cameras configured to capture movement of physical objects in the real environment. The image sensor 108 also optionally includes one or more depth sensors configured to detect the distance of the physical object from the system 100. In some examples, the system 100 uses a combination of a CCD sensor, an event camera, and a depth sensor to detect the physical environment surrounding the system 100. In some examples, the image sensor 108 includes a first image sensor and a second image sensor. The first image sensor and the second image sensor are optionally configured to capture images of physical objects in the real environment from two different perspectives. In some examples, the system 100 uses the image sensor 108 to receive user input, such as gestures. In some examples, the system 100 uses the image sensor 108 to detect the position and orientation of the system 100 and/or the display 120 in the real environment. For example, the system 100 uses the image sensor 108 to track the position and orientation of the display 120 relative to one or more stationary objects in the real environment.
In some examples, system 100 includes microphone 112. The system 100 uses the microphone 112 to detect sound from the user and/or the user's real environment. In some examples, microphone 112 includes a microphone array (including a plurality of microphones) that optionally operate in series to identify ambient noise or to localize a sound source in the space of the real environment.
The system 100 includes an orientation sensor 110 for detecting the orientation and/or movement of the system 100 and/or display 120. For example, the system 100 uses the orientation sensor 110 to track changes in the position and/or orientation of the system 100 and/or the display 120, such as with respect to physical objects in the real environment. Orientation sensor 110 optionally includes one or more gyroscopes and/or one or more accelerometers.
FIG. 2 depicts an exemplary system 200 for mapping and executing user commands. In some examples, as shown in fig. 2, system 200 includes a digital assistant 201, a link interface 202, and an application program interface 203. In some examples, system 200 is implemented on electronic device 100. In some examples, system 200 is implemented on a device other than electronic device 100 (e.g., a server). In some examples, some of the modules and functions of system 200 are divided into a server portion and a client portion, where the client portion resides on one or more user devices (e.g., electronic device 100) and communicates with the server portion over one or more networks.
In some examples, digital assistant 201 is a digital assistant system. In some examples, the digital assistant system is implemented on a standalone computer system. In some examples, the digital assistant system is distributed across multiple electronic devices. In some examples, some of the modules and functions of the digital assistant are divided into a server portion and a client portion, where the client portion resides on one or more user devices (e.g., device 100) and communicates with the server portion over one or more networks. The various components of the digital assistant system are implemented in hardware, software instructions for execution by one or more processors, firmware (including one or more signal processing integrated circuits and/or application specific integrated circuits), or a combination thereof.
It should be noted that system 200 is only one example, and that system 200 may have more or fewer components than shown, may combine two or more components, or may have different component configurations or arrangements. The various components shown in fig. 2 are implemented in hardware, software instructions for execution by one or more processors, firmware (including one or more signal processing integrated circuits and/or application specific integrated circuits), or a combination thereof.
The system 200 receives spoken input 204 including a command and provides the spoken input 204 to the digital assistant 201. After receiving spoken input 204, digital assistant 201 performs semantic analysis on spoken input 204. In some examples, performing semantic analysis includes performing Automatic Speech Recognition (ASR) on spoken input 204. In particular, the digital assistant 201 may include one or more ASR systems that process spoken input 204 received through an input device (e.g., a microphone) of the electronic device 100. The ASR systems extract representative features from the speech input. For example, a front-end speech preprocessor of the ASR system performs a Fourier transform on the spoken input 204 to extract spectral features that characterize the speech input as a sequence of representative multidimensional vectors.
In addition, each ASR system of the digital assistant 201 includes one or more speech recognition models (e.g., acoustic models and/or language models) and implements one or more speech recognition engines. Examples of speech recognition models include hidden Markov models, gaussian mixture models, deep neural network models, n-gram language models, and other statistical models. Examples of speech recognition engines include dynamic time warping based engines and Weighted Finite State Transducer (WFST) based engines. The extracted representative features of the front-end speech pre-processor are processed using one or more speech recognition models and one or more speech recognition engines to produce intermediate recognition results (e.g., phonemes, phoneme strings, and sub-words), and ultimately text recognition results (e.g., words, word strings, or symbol sequences).
In some examples, performing semantic analysis includes performing natural language processing on spoken input 204. Specifically, after the digital assistant 201 generates recognition results containing text strings (e.g., words or sequences of symbols) via ASR, the input analyzer can infer intent of the spoken input 204. In some examples, digital assistant 201 generates a plurality of candidate text representations of the speech input. Each candidate text representation is a sequence of words or symbols corresponding to the spoken input 204. In some examples, each candidate text representation is associated with a speech recognition confidence score. Based on the speech recognition confidence scores, the digital assistant 201 ranks the candidate text representations and may provide the n best (e.g., highest ranked n) candidate text representations to other modules of the system 200 for further processing.
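As a small illustration of the n-best selection just described (the type and function names below are hypothetical), candidate text representations can be ranked by their speech recognition confidence scores before being handed to downstream modules:

```swift
import Foundation

// A candidate text representation of the spoken input together with its
// speech recognition confidence score.
struct CandidateText {
    let text: String
    let confidence: Double
}

/// Returns the n highest-ranked candidate text representations, the form in
/// which recognition results might be handed to downstream modules.
func nBestCandidates(_ candidates: [CandidateText], n: Int) -> [CandidateText] {
    Array(candidates.sorted { $0.confidence > $1.confidence }.prefix(n))
}

// Example: three hypotheses for the same utterance; the two most confident are kept.
let hypotheses = [
    CandidateText(text: "bold this text", confidence: 0.92),
    CandidateText(text: "bowled this text", confidence: 0.41),
    CandidateText(text: "bold this test", confidence: 0.63),
]
let best = nBestCandidates(hypotheses, n: 2)
```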
In some examples, based on the semantic analysis of the spoken input 204, the digital assistant 201 determines an operation or task corresponding to the command of the spoken input 204 and performs the operation or task. For example, the system 200 may receive the spoken input "How is the weather?" as spoken input 204. Accordingly, the digital assistant 201 may perform semantic analysis on the spoken input 204 and determine, based on the input, a task of providing the current weather. In particular, the digital assistant 201 may have been trained in advance to recognize that the term "weather" is related to determining the current weather, and to perform that task. Further, the digital assistant 201 may identify one or more applications associated with the operation of determining the current weather and invoke one or more of those applications to perform the task. After performing the task, the digital assistant 201 may then provide an output indicating the current weather.
However, in some examples, the digital assistant 201 may not recognize the command of the spoken input 204. In particular, the command of spoken input 204 may be a new command related to an application that the digital assistant 201 does not recognize, a command used in a context that the digital assistant 201 does not recognize, or any other command that the digital assistant 201 has not previously been trained to recognize or interact with. For example, as shown in FIG. 4, the system 200 may receive the spoken input 404 "Bold the text 'hey!'". The digital assistant 201 may process the spoken input to determine that the command is "bold," but may not understand what the command "bold" means or what operation to perform based on the command. Accordingly, the system 200 and digital assistant 201 may determine the operation to be performed for the command "bold" by accessing the link interface 202, as described below.
FIG. 3 illustrates an exemplary link interface 202 according to various examples. The link interface 202 includes a link model 305 for each application installed on the electronic device 100. In some examples, the link interface 202 includes a link model 305 for each application available to the digital assistant 201 (including those applications not installed on the electronic device 100). For example, the link interface 202 may include a link model for applications installed on a server or other networked electronic device with which the digital assistant 201 may interact.
While the present application focuses on how a digital assistant (such as digital assistant 201) can interact with the link interface 202 to satisfy user-provided commands, it should be understood that other interfaces with which a user interacts may utilize the link interface 202 and the information it includes in a similar manner. For example, a graphical user interface of a device may interact with the link interface 202 to determine how to display the operations and sub-operations described below, facilitating interaction with the user without requiring a developer of an application to map each portion of the link interface 202 to specific elements and sub-elements of the user interface. Instead, the process may be automated by automatically generating a user interface that connects the user to the application using the information in the link interface 202.
The link model 305 includes a plurality of operations 306 that can be performed by its associated application. In some examples, the plurality of operations 306 further includes one or more sub-operations 307 of each operation 306. In some examples, the link model 305 includes a plurality of hierarchical links between the related operations 306 and sub-operations 307 of the plurality of operations. For example, as shown in FIG. 3, the sub-operations "bold", "italic", and "underline" are nested under the operation "edit". Thus, these three sub-operations are hierarchically under the editing operation, and are linked to the editing operation. In this way, the link model 305 of the link interface 202 presents various operations associated with the application in a tree-like link model that can be efficiently searched.
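The tree-like structure described above lends itself to a simple recursive representation and depth-first search. The Swift sketch below mirrors the "edit" branch of FIG. 3; the types and the search routine are illustrative assumptions rather than the disclosed implementation:

```swift
import Foundation

// Hypothetical node in a link model: an operation, its metadata, and any
// hierarchically linked sub-operations.
final class LinkNode {
    let name: String
    let metadata: Set<String>
    let children: [LinkNode]

    init(name: String, metadata: Set<String> = [], children: [LinkNode] = []) {
        self.name = name
        self.metadata = metadata
        self.children = children
    }

    /// Depth-first search of the tree for an operation or sub-operation whose
    /// name or metadata matches the given term.
    func find(matching term: String) -> LinkNode? {
        let t = term.lowercased()
        if name.lowercased() == t || metadata.contains(t) { return self }
        for child in children {
            if let hit = child.find(matching: t) { return hit }
        }
        return nil
    }
}

// Part of a word-processing link model, following FIG. 3: the "edit" operation
// with the nested sub-operations "bold", "italic", and "underline".
let wordProcessingModel = LinkNode(name: "root", children: [
    LinkNode(name: "file"),
    LinkNode(name: "insert", metadata: ["add", "embed"]),
    LinkNode(name: "edit", children: [
        LinkNode(name: "bold"),
        LinkNode(name: "italic"),
        LinkNode(name: "underline"),
    ]),
])

let match = wordProcessingModel.find(matching: "bold")   // the "bold" sub-operation
```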
In some examples, operation 306 and sub-operation 307 of link model 305 represent different link classes of link model 305. In some examples, operation 306 and sub-operation 307 represent a link class associated with the container class. For example, an "insert" operation may be associated with a container class link because many different types of objects may be inserted into a document. In some examples, operation 306 and sub-operation 307 represent linked classes associated with a single class. For example, a "delete" operation may be associated with a single class link because a single item is typically deleted from a document.
Operations 306 and sub-operations 307 include active and inactive operations of the associated application. An active operation of the associated application is an operation currently displayed by the application. For example, as shown in fig. 4, the "insert," "bold," "italic," and "underline" operations are being displayed, and thus these operations are active operations. An inactive operation of the associated application is an operation that is not currently displayed by the application. For example, returning to FIG. 4, there may be many other operations not currently shown, including sub-operations of "insert" or the "review" operation shown in FIG. 3. Accordingly, these operations that are not shown are inactive operations. In some examples, as described further below, the digital assistant 201 may interact with the link interface 202 to search active operations, inactive operations, or both, depending on the received command, what the electronic device is displaying, and which applications are currently available.
Each of the operations 306 and sub-operations 307 is associated with one or more pieces of metadata 308, as shown in FIG. 3. In some examples, metadata 308 includes synonyms for the associated operation or sub-operation. For example, as shown in FIG. 3, the metadata for the operation "insert" may include the synonyms "add" and "embed". By associating these terms as metadata for the "insert" operation, the digital assistant need not learn the particular vocabulary required by the word processing application associated with the link model 305. Rather, when the user provides a command that the digital assistant 201 does not recognize, the digital assistant 201 may search the operations 306 and metadata 308 to determine the operation of the word processing application corresponding to the command, as described further below.
In some examples, metadata 308 includes an ontology corresponding to the associated operation or sub-operation. For example, as shown in FIG. 3, the metadata of the sub-operations "bold", "italic", and "underline" includes the ontologies "word processing" and "document editing". As another example, the metadata for the operation "file" may include more, and different, ontologies, because different functions (such as "document creation", "send message", and "print") are typically grouped under the "file" operation of a word processing application. In some examples, metadata 308 includes related operations or sub-operations. In some examples, the related operations or sub-operations may be located in different portions of the link model 305 or in different link models. For example, the metadata for a "review" operation may include the operations "view" and "compare" of the word processing application, which may be used to help "review" a document. In addition, the metadata of the "review" operation may also include the operations "create PDF" and "open PDF" of a PDF creation application, which may also be used to review and/or edit the current document.
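The sketch below shows one hypothetical way such metadata, with synonyms, ontology terms, and related operations, could be represented and matched against a user's wording; the Swift types and the example contents are assumptions for illustration only:

```swift
import Foundation

// Hypothetical metadata attached to a single operation in a link model:
// synonyms, ontology terms, and related operations (possibly in other
// applications' link models).
struct OperationMetadata {
    let operation: String
    let synonyms: Set<String>
    let ontologies: Set<String>
    let relatedOperations: [String]
}

// Entries loosely based on the examples in the text above; the exact contents
// are illustrative.
let insertMetadata = OperationMetadata(
    operation: "insert",
    synonyms: ["add", "embed"],
    ontologies: ["word processing", "document editing"],
    relatedOperations: []
)
let reviewMetadata = OperationMetadata(
    operation: "review",
    synonyms: [],
    ontologies: ["document editing"],
    relatedOperations: ["WordProcessor.view", "WordProcessor.compare",
                        "PDFApp.createPDF", "PDFApp.openPDF"]
)

/// Returns operations whose name, synonyms, or ontology terms match the user's
/// wording, so the assistant need not know the application's exact vocabulary.
func matches(term: String, in entries: [OperationMetadata]) -> [OperationMetadata] {
    let t = term.lowercased()
    return entries.filter {
        $0.operation == t || $0.synonyms.contains(t) || $0.ontologies.contains(t)
    }
}

let hits = matches(term: "add", in: [insertMetadata, reviewMetadata])   // the "insert" entry
```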
In this way, the link model 305, and by extension the link interface 202, may include a broader selection of concepts and terms for the digital assistant 201 to search when attempting to match a user-provided command, as described above with respect to using synonyms in the metadata 308. This may enable more successful matches without the digital assistant 201 or the user having to learn and provide a particular vocabulary. Instead, new applications and new operations may simply be added to the functionality of the digital assistant 201, with minimal additional effort required from developers and reduced frustration for users.
In some examples, the link model 305 of the link interface 202 is provided by an associated application. For example, when an application is installed on an electronic device or otherwise available to link interface 202, the application may provide link model 305 including operation 306, sub-operation 307, and metadata 308 to add to link interface 202. Thus, the link model 305 can be quickly incorporated into the link interface 202 and made accessible to the digital assistant 201. In some examples, each application connected to the link interface 202 provides an associated link model that includes operations, sub-operations, and metadata. Thus, many different applications and their associated link models can be quickly integrated into the link interface 202 and thus into the system 200.
In some examples, the link models 305 of the link interface 202 are updated over time. In some examples, a new link model 305 is added to the link interface 202, such as when a new application becomes available to the digital assistant 201. For example, a new application may be installed on the electronic device, becoming accessible to the digital assistant 201, and a link model associated with that application is added to the link interface 202. In some examples, operations 306 are added to a link model 305 when an application is updated or the functionality of the application changes. In some examples, the metadata 308 of a link model 305 may be updated when new synonyms are determined or an ontology associated with an operation 306 changes.
In some examples, the link model 305 of the link interface 202 is created based on source code provided by a developer of an application associated with the link model. In some examples, the developer provided source code is combined with the source code of the user interface or digital assistant to create the link model 305 of the link interface 202. In some examples, the link model 305 of the link interface 202 is created based on source code associated with a user interface or a digital assistant. Thus, the link model 305 of the link interface 202 may be created using source code provided separately for one or more applications and digital assistants, and may also be created using source code that is combined to include the source code of the applications and digital assistants.
In some examples, the link model 305 of the link interface 202 and the operations of the link model 305 are created from a data file. In some examples, the data file may be downloaded from a developer along with the application update. In some examples, the data file may be provided at the time of installation of the application. For example, a developer may provide application updates as well as additional data files, including new or updated operations annotated with appropriate metadata. The data file may be automatically converted to an appropriate link model for the application and stored in the link interface 202.
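As a sketch of how such a data file might be consumed, the Swift example below decodes a hypothetical annotated list of operations that a link interface could then store as the application's link model; the on-disk format shown is an assumption, not a format defined by the disclosure:

```swift
import Foundation

// Hypothetical on-disk format a developer might ship with an application
// update: operations annotated with metadata and nested sub-operations.
struct OperationEntry: Codable {
    let name: String
    let metadata: [String]?
    let subOperations: [OperationEntry]?
}

let linkModelData = """
[
  { "name": "insert", "metadata": ["add", "embed"] },
  { "name": "edit",
    "subOperations": [
      { "name": "bold" }, { "name": "italic" }, { "name": "underline" }
    ] }
]
""".data(using: .utf8)!

// Converting the data file into in-memory entries that a link interface could
// then store as the application's link model.
do {
    let entries = try JSONDecoder().decode([OperationEntry].self, from: linkModelData)
    print(entries.map(\.name))   // ["insert", "edit"]
} catch {
    print("Could not convert the data file into a link model: \(error)")
}
```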
In some examples, the link model 305 of the link interface 202 is created based on a GUI of the provided application. For example, a developer of an application may provide the GUI it created for the application to be integrated with the link interface 202 and the digital assistant 201. Accordingly, the digital assistant 201 and/or other components of the system may convert various components of the GUI (e.g., selectable buttons, pages, etc.) into corresponding operations, sub-operations, and metadata for storage as a link model of the link interface 202. Thus, the link model 305 of the link interface 202 may be created from various different portions of the application, so that a more complete link model can be created with less effort from the developer.
In some examples, as shown in fig. 4, spoken input 404 is received when an application 405 for word processing is opened (e.g., running) on an electronic device 400. After receiving the spoken input 404, the digital assistant 201 may perform semantic analysis on the spoken input 404 to determine whether a command (e.g., "bold") of the spoken input 404 is recognized by the digital assistant 201. When a command of spoken input 404 is recognized by digital assistant 201, digital assistant 201 determines a task or operation corresponding to the command and performs the task or operation, or prompts application 405 to perform the task or operation.
However, when the digital assistant 201 does not recognize the command of the spoken input 404, as shown in FIG. 4, the digital assistant 201 accesses the link interface 202 to determine an operation corresponding to the unrecognized command. In some examples, accessing the link interface 202 includes accessing a link model corresponding to an application 405 currently open on the electronic device 400. For example, upon receiving the spoken input 404 "Bold the text 'hey!'", the digital assistant 201 can access the link model of the link interface 202 associated with the application 405 because the application 405 is open.
In some examples, the application 405 is open and is the focus of the electronic device 400, as shown in fig. 4. In some examples, the application 405 is open, but the focus of the electronic device 400 is another application or a general user interface. When this occurs, the digital assistant 201 may still access the link model 305 of the link interface 202 associated with the application 405, as described further below. For example, after text is entered, the digital assistant 201 may receive an input returning to the home screen of the user interface of the electronic device 400. Subsequently, the digital assistant 201 may receive the spoken input 404 "Hey, bold the text!" Accordingly, the digital assistant 201 can search the link model 305 associated with the application 405 even though the application 405 is no longer the focus of the electronic device 400. Rather, the digital assistant 201 searches the link model 305 associated with the application 405 because the application 405 continues to be open (e.g., running) on the electronic device 400.
In some examples, the digital assistant 201 may preferentially search the link model associated with the application that is the focus of the electronic device, as described further below. Additionally, in some examples, the digital assistant 201 may search all link models available in the link interface 202 regardless of which applications are open on the electronic device 400 and which application is the focus of the electronic device 400. In some examples, a process associated with the link interface 202 determines which applications are installed and which applications are running and facilitates the connection between the digital assistant 201 and the various applications. Thus, the link interface 202 may send a request to an appropriate application, including starting or launching a particular application after the digital assistant 201 determines that the application is needed.
In some examples, determining the operation corresponding to the command includes searching the operations and sub-operations of the accessed link models of the link interface 202. For example, as shown in FIG. 3, a link model 305 corresponding to the word processing application 405 may include various operations 306 (such as "file," "insert," and "edit") and sub-operations 307 (such as "bold," "italic," and "underline"). Accordingly, the digital assistant 201 may search the operations 306 and the sub-operations 307 to determine whether any of them match the command included in the spoken input 404. Thus, the digital assistant 201 may search the operations 306 and sub-operations 307 for the command "bold" and determine that one of the sub-operations matches the command "bold" of the spoken input 404.
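A minimal sketch of such a search is given below, assuming a hypothetical dictionary layout for the link model; it checks only operation and sub-operation names by exact string comparison, which is a simplification of the matching described in this disclosure.

```python
from typing import Optional, Tuple

# Hypothetical link model for a word-processing application (names illustrative only).
LINK_MODEL = {
    "file":   {"sub_operations": ["open", "save"], "metadata": {"synonyms": []}},
    "insert": {"sub_operations": ["picture", "table"], "metadata": {"synonyms": ["add"]}},
    "edit":   {"sub_operations": ["bold", "italic", "underline"], "metadata": {"synonyms": []}},
}

def find_operation(command: str) -> Optional[Tuple[str, Optional[str]]]:
    """Search operation names, then sub-operation names, for a match with the command."""
    command = command.lower()
    for op_name, op in LINK_MODEL.items():
        if op_name == command:
            return op_name, None
        for sub in op["sub_operations"]:
            if sub == command:
                return op_name, sub            # e.g. ("edit", "bold")
    return None

print(find_operation("bold"))                  # ("edit", "bold")
```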
In some examples, determining the operation corresponding to the command includes searching the operations and sub-operations of the link models of the link interface 202, including link models belonging to applications that are not currently active or open on the electronic device 400. For example, when the digital assistant 201 receives the spoken input 404 of "bold this," the digital assistant 201 may search the link models of the link interface 202, including those belonging to applications that are not open or not currently active. In some examples, the digital assistant 201 may first search the link model belonging to a currently open application (e.g., application 405) and then search the link models of the link interface 202 belonging to currently inactive or unopened applications.
In some examples, determining the operation corresponding to the command includes determining whether the command matches at least a portion of metadata associated with an operation (e.g., a sub-operation) of the link interface 202. For example, as shown in fig. 5, the spoken input 504 "add a picture of a dog" includes the command "add." Accordingly, the digital assistant 201 may search the link models of the link interface 202 for metadata including the word "add." The digital assistant 201 may then recognize that the operation "insert," which is associated with the metadata "add," is capable of executing the command included in the spoken input 504.
In some examples, the digital assistant 201 may first search the operations and sub-operations for the command and then search the metadata associated with the operations for the command. For example, as described above, the digital assistant 201 may search for the operation "add" in the link models of the link interface 202, including the link model corresponding to the application 505 currently open on the electronic device 500. After not finding an operation matching the command "add," the digital assistant 201 may then search the metadata, as described above.
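The two-phase search described above might be sketched as follows; the link-model contents and the ordering of the phases are assumptions for illustration only.

```python
from typing import Optional

# Hypothetical link model: operation names, sub-operation names, and synonym metadata.
LINK_MODEL = {
    "insert": {"sub_operations": ["picture", "table"],
               "metadata": {"synonyms": ["add", "put in"]}},
    "edit":   {"sub_operations": ["bold", "italic"],
               "metadata": {"synonyms": ["format", "change"]}},
}

def resolve_command(command: str) -> Optional[str]:
    command = command.lower()
    # Phase 1: search operation and sub-operation names.
    for op_name, op in LINK_MODEL.items():
        if command == op_name or command in op["sub_operations"]:
            return op_name
    # Phase 2: search metadata (e.g., synonyms) only if phase 1 found nothing.
    for op_name, op in LINK_MODEL.items():
        if command in op["metadata"]["synonyms"]:
            return op_name
    return None

print(resolve_command("add"))   # "insert" (matched via the synonym metadata)
```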
In some examples, the digital assistant 201 may search the metadata associated with the operations for the command at the same time as it searches the operations and/or sub-operations. Thus, the digital assistant 201 can customize the search of the link interface 202 based on available resources, including processing power and time. For example, if it is determined that speed is more important and processing power is not a concern, the digital assistant 201 may search the operations and the metadata associated with the operations simultaneously. Conversely, if it is determined that speed is less important and/or processing power needs to be preserved or used elsewhere, the digital assistant 201 may search only the operations or only the metadata at a time.
In some examples, the digital assistant 201 determines whether the command matches a portion of metadata associated with the operation of an application program open on the electronic device 400. For example, because the spoken input 504 "add a picture of a dog" is received when the word processing application 505 is opened, the digital assistant 201 searches the metadata associated with the operations and sub-operations of the application 505 to determine whether the command "add" matches a portion of the metadata. The digital assistant 201 may then determine that the command "add" matches metadata associated with the operation "insert".
In some examples, determining whether the command matches at least a portion of metadata associated with the operation of the link interface 202 includes determining whether the command matches metadata associated with an active operation of the open application. For example, when spoken input 404 is received and the command "bold" is identified, digital assistant 201 may search for operations of "edit" (including sub-operations "bold", "italics", and "underline") and metadata associated with those operations because those operations are active (e.g., displayed) on electronic device 400.
In some examples, determining whether the command matches at least a portion of metadata associated with the operation of the link interface 202 includes determining whether the command matches metadata associated with an inactive operation of the open application. For example, when spoken input 504 is received and the command "add" is recognized, digital assistant 201 may search for all operations and metadata associated with those operations, including the operation "insert", even if those operations are inactive (e.g., not shown) on electronic device 500.
In some examples, determining whether the command matches at least a portion of metadata associated with the operation of the link interface 202 includes searching for metadata associated with a plurality of open applications. For example, when spoken input 404 is received, several applications (including navigation applications, restaurant reservation applications, and web browsing applications) may be opened in addition to application 405 on electronic device 400. Thus, the digital assistant 201 may search the operation and associated metadata of each of the navigation application, the restaurant reservation application, and the web browsing application to determine whether any of their operations or associated metadata matches the command "bold".
In some examples, determining whether the command matches at least a portion of metadata associated with the operation of the link interface 202 includes determining whether the command matches metadata associated with an application that is a focus of the electronic device. Thus, as in the above example, when several different applications may be open on the electronic device 400, the digital assistant 201 will search for a link model of the link interface 202 associated with the application 405 that is the focus of the electronic device 400 (e.g., currently being displayed).
In some examples, determining whether the command matches at least a portion of metadata associated with the operation of the link interface 202 includes determining whether the command matches metadata associated with an application that is a preferred application for the user. Thus, the digital assistant 201 will search the link model of the link interface 202, which is associated with an application that the user has previously indicated as its preferred application for a particular task.
In some examples, the digital assistant 201 determines an operation corresponding to a command by providing the command to, and receiving the operation from, a machine-learning language understanding model. In some examples, the machine-learning language understanding model is trained using data derived from the metadata of the link model 305, and thus the model is trained to match commands to the operations of the link model 305 based on the data derived from the metadata. Thus, determining the operation corresponding to the command does not merely include searching the link model 305 for operations and metadata. Rather, the digital assistant 201 may use the machine-learning language understanding model to compare received commands to underlying data representing the operations and metadata in order to determine operations that could not be identified through matching or similar means.
In some examples, determining whether the command matches an operation or a portion of metadata associated with an operation of the link interface 202 includes determining a confidence score that represents the degree to which the command matches the operation or the portion of metadata. In some examples, determining the confidence score includes determining a confidence score for each possible application of the plurality of applications associated with the link models 305 of the link interface 202. For example, when the command "add new document" is provided, the digital assistant 201 may determine that the command may match the word processing application's "create word processing document" operation and the PDF application's "create PDF" operation. Thus, the digital assistant 201 determines a confidence score associated with each application based on the received command and the possible operations. The digital assistant 201 can then determine that the confidence score of the word processing application is higher because "document" is included in the input. Thus, the digital assistant 201 can select the word processing application and the operation "create word processing document" as the operation that matches the provided command.
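A toy sketch of per-application confidence scoring is shown below; the word-overlap heuristic and the candidate names are assumptions and not the scoring method of the disclosure.

```python
# Hypothetical candidate operations offered by two applications.
CANDIDATES = {
    "WordProcessor": "create word processing document",
    "PDFReader":     "create pdf",
}

def confidence(command: str, operation: str) -> float:
    """Toy confidence score: fraction of command words found in the operation name."""
    cmd_words = set(command.lower().split())
    op_words = set(operation.lower().split())
    return len(cmd_words & op_words) / len(cmd_words)

command = "add new document"
scores = {app: confidence(command, op) for app, op in CANDIDATES.items()}
best_app = max(scores, key=scores.get)   # "WordProcessor": "document" appears in the input
```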
After finding an operation or a portion of metadata that matches a command of the spoken input 204 (e.g., determining that the command matches at least a portion of metadata associated with the operation), the digital assistant 201 associates the command with the operation and stores this association for subsequent use by the digital assistant 201 through the application. For example, after determining that the operation "bold" matches the command "bold" of the spoken input 404, the digital assistant 201 associates the command "bold" with the operation "bold" and saves this association in the link interface 202 or another database for reference. In this way, when "bold" is provided in a later spoken input, the digital assistant 201 will recognize that the user intends to invoke the "bold" operation of the application 405 and will perform the "bold" operation without repeating the determination described above.
In some examples, the digital assistant 201 also stores a portion of the metadata with the association of the command with the operation. In some examples, the digital assistant 201 stores the portion of metadata that matches the command along with the association of the command with the operation. For example, after determining that the command "add" matches the operation "insert" based on the presence of the synonym "add" in the metadata associated with the "insert" operation, the digital assistant 201 stores an association between the command "add" and the operation "insert" and may further annotate this association with metadata indicating that this connection was established using the synonym. In this way, a developer of the application 505 or another interested party may refer to how the digital assistant 201 is being used and which data stored in the link interface 202 has been used in order to make relevant determinations.
In some examples, the digital assistant 201 stores the association of commands with operations in a link model of the link interface 202 associated with the relevant application. In some examples, the digital assistant 201 stores the association of commands with operations in a separate database, such as a database maintained by the digital assistant 201 for quick access to learned commands. In some examples, the digital assistant 201 stores the association of commands in a database dedicated to frequently used commands, newly learned commands, or recently used commands. For example, the digital assistant 201 may add the commands "bold" and "add" to a list of recently used commands that are accessible for further reference. Similarly, the digital assistant 201 may add the commands "bold" and "add" to the newly learned command list for further reference.
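A sketch of how such an association might be cached and then reused as a fast path is shown below; the storage layout and helper names are hypothetical.

```python
# Learned associations: command -> (application, operation), plus the metadata that
# produced the match, so the link can later be audited (hypothetical layout).
learned_commands = {}

def store_association(command, application, operation, matched_metadata=None):
    learned_commands[command] = {
        "application": application,
        "operation": operation,
        "matched_metadata": matched_metadata,
    }

def lookup(command):
    """Fast path: reuse a stored association instead of searching the link models again."""
    return learned_commands.get(command)

store_association("add", "WordProcessor", "insert", matched_metadata="synonym: add")
print(lookup("add"))   # the stored "insert" operation is returned directly
```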
After storing the determined association, the digital assistant 201 executes the command through the application by accessing the application program interface 203. For example, the digital assistant 201 may send commands to the appropriate application through the application program interface 203. In this way, the digital assistant 201 invokes commands and handles interactions between the user and the application without requiring the user to interact directly with the application. In some examples, the digital assistant 201 provides a prompt confirming the command prior to performing the operation. For example, after the digital assistant 201 determines that the command "add" is associated with the operation "insert," the digital assistant 201 may provide an audio prompt "Do you want to insert a picture?" The user may then provide a positive or negative response to the prompt, and the digital assistant 201 will perform or forgo performing the appropriate operation based on the provided response.
In some examples, the digital assistant 201 may add a command or application to a favorites list associated with a user of the electronic device. In some examples, the digital assistant 201 adds commands and/or applications to the favorites list upon user indication. In some examples, the digital assistant 201 adds a command to the favorites list after the command has been received a predetermined number of times (e.g., 5 times, 10 times, 15 times, 20 times, etc.). In some examples, the digital assistant 201 adds an application to the favorites list after the application has been accessed or opened a predetermined number of times (e.g., 5 times, 10 times, 15 times, 20 times, etc.).
In some examples, digital assistant 201 may determine a plurality of operations that it has previously accessed and metadata associated with each of the plurality of operations, and compile the operations and metadata into a transcript. For example, the digital assistant 201 may save a database of received commands and operations associated therewith for future reference, as described above. Thus, the digital assistant 201 can access this database and compile the data into a transcript that can display which commands were received and which operations the commands correspond to, as well as metadata associated with the operations. In some examples, the digital assistant 201 determines the transcript by recording a function call when an operation or task is performed and the associated code is compiled. Thus, in some examples, transcripts referenced by digital assistant 201 may be created over time as various operations or tasks are continually requested and performed.
In some examples, digital assistant 201 may determine a plurality of operations that have been previously performed based on input detected on a GUI of a device running digital assistant 201. For example, a user may provide input on a GUI of the device to highlight a portion of the screen or select buttons displayed on the screen. These operations may also be recorded in transcripts and/or databases and then referenced by the digital assistant 201 to disambiguate spoken language inputs, as described below. Additionally, the transcript determined by the digital assistant 201 may include a plurality of operations based on the input detected on the GUI and operations performed by the digital assistant 201 based on the spoken input. Thus, the transcript may include various operations performed on the device and requested by the user, which increases the responsiveness of the digital assistant 201 and helps the digital assistant 201 understand the various user requests.
In some examples, the digital assistant 201 may receive ambiguous spoken input and use a transcript to resolve the ambiguous spoken request. In some examples, resolving an ambiguous spoken request using a transcript includes determining an operation to perform based on the transcript and the ambiguous terms in the spoken request. For example, after receiving the spoken input 504 "add a picture of a dog," associating the operation "insert" with the command "add," and inserting the picture into a document, the digital assistant 201 may receive the spoken input "rotate it." The received second spoken input contains the ambiguous term "it" that the digital assistant 201 must resolve in order to execute the provided command "rotate." Thus, the digital assistant 201 may access a transcript of previous operations and determine that "it" likely refers to the picture in the previous command based on various factors, such as when the different inputs were received, whether the received commands or their associated operations are related, and historical interactions between the user and the digital assistant 201.
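A simplified sketch of transcript-based disambiguation follows; the compatibility rule used to resolve "it" is an assumption standing in for the factors listed above.

```python
# Hypothetical transcript of recent operations, most recent last.
transcript = [
    {"command": "bold", "operation": "bold",   "object": "selected text"},
    {"command": "add",  "operation": "insert", "object": "picture of a dog"},
]

def resolve_it(new_command: str):
    """Resolve 'it' to the object of the most recent operation compatible with the command."""
    for entry in reversed(transcript):
        # Assumed compatibility rule: 'rotate' applies to inserted objects such as pictures.
        if new_command == "rotate" and entry["operation"] == "insert":
            return entry["object"]
    return None

print(resolve_it("rotate"))   # "picture of a dog"
```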
In some examples, the system 200 receives the spoken input 204 without an application being open on the electronic device. For example, the digital assistant 201 may receive the spoken input 504 "add a picture of a dog," and the electronic device 500 may then receive an indication to close the application 505. Subsequently, the digital assistant 201 may receive the spoken input "rotate it." When no application is open on the electronic device, the digital assistant 201 determines whether the command of the spoken input (e.g., rotate) matches at least a portion of metadata associated with the operations of a plurality of applications available (e.g., installed) to the digital assistant 201, as described above. In some examples, operations of the link model 305 are annotated as global operations, which are operations that may be invoked or performed even when the associated application is inactive or not running. Thus, the digital assistant 201 can invoke a global operation of the link model 305 based on a matching portion of the metadata.
In some examples, the digital assistant 201 determines which operations and metadata to search by determining a subset of the available applications. In some examples, the subset of applications includes applications listed in the transcript. In some examples, the subset of applications includes applications that are frequently accessed by the digital assistant 201. In some examples, the applications frequently accessed by the digital assistant 201 are applications that are accessed more than a predetermined threshold within a predetermined period of time. For example, applications that are accessed more than 10 times within 5 days may be considered frequently accessed applications. In some examples, the subset of applications includes applications that are marked as favorites by the digital assistant 201, as described above. In some examples, an application marked as a favorite is an application marked as a favorite in a user profile associated with the user providing the spoken input.
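A sketch of assembling such a subset is shown below; the example threshold (more than 10 accesses within 5 days) follows the text above, while the access log, favorites set, and transcript set are hypothetical data structures.

```python
from datetime import datetime, timedelta

# Hypothetical access records and user data.
access_log = {
    "WordProcessor": [datetime.now() - timedelta(hours=h) for h in range(12)],
    "Calculator":    [datetime.now() - timedelta(days=30)],
}
favorites = {"Maps"}
transcript_apps = {"WordProcessor"}

def frequently_accessed(app: str, window_days: int = 5, threshold: int = 10) -> bool:
    """Accessed more than `threshold` times within the last `window_days` days."""
    cutoff = datetime.now() - timedelta(days=window_days)
    return sum(1 for t in access_log.get(app, []) if t >= cutoff) > threshold

# Subset searched first: transcript apps, favorites, and frequently accessed apps.
subset = transcript_apps | favorites | {app for app in access_log if frequently_accessed(app)}
print(subset)   # {"WordProcessor", "Maps"}
```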
After determining the subset of applications, the digital assistant 201 searches the operations and associated metadata of each application in the link interface 202 to determine if any of the operations or associated metadata match the command, as described above with respect to other examples. Thus, the digital assistant 201 can determine when an operation or metadata associated with the operation matches a command, associate the command with the operation, and store the association for future reference.
In some examples, the digital assistant 201 receives another spoken input including the same command after storing the association between the command and the operation. Thus, the digital assistant 201 accesses the stored associations, retrieves the stored operations, and performs the operations in the appropriate application. For example, after associating the command "add" with the operation "insert" in advance, the digital assistant 201 may receive a spoken input "add a chart of this data". Thus, the digital assistant 201 can access the stored associations, retrieve the "insert" operation, and cause the word processing application 505 to insert a chart.
In some examples, the digital assistant 201 receives spoken input that instructs the digital assistant 201 to associate a command with an operation, as shown in fig. 6. For example, the digital assistant 201 may receive the spoken input 604 "Assistant, learn how to add pictures." When the digital assistant 201 receives such spoken input, the digital assistant 201 begins recording the activity performed with the electronic device 600 to determine which operation of the application 605 should be associated with the command "add."
Subsequently, the digital assistant 201 records activity including selecting the insert tab 606 and selecting the picture category 607 under the insert tab. The digital assistant 201 then associates the "insert" operation and the selected category with the command "add" and stores the association, as described above. In some examples, the selection of the various user interface elements is a selection made using voice input. For example, the user may provide the input "select insert" after initiating the recording process. In some examples, the selection of a user interface element is a tap on a touch-sensitive screen of the electronic device 600.
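A sketch of this demonstration-based flow is shown below; the recorded element names and the helper functions are hypothetical.

```python
recorded_activity = []
demonstrations = {}     # command -> recorded sequence of UI operations

def start_recording():
    recorded_activity.clear()

def record_selection(element: str):
    """Called for each user interface element selected during the demonstration."""
    recorded_activity.append(element)

def finish_recording(command: str):
    demonstrations[command] = list(recorded_activity)

# "Assistant, learn how to add pictures."
start_recording()
record_selection("insert tab")        # user selects the insert tab
record_selection("picture category")  # user selects the picture category under it
finish_recording("add")
print(demonstrations["add"])          # ['insert tab', 'picture category']
```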
In some examples, the association of commands with operations and recorded activities is stored in the same manner as the association of commands, operations, and metadata described above. Thus, the digital assistant 201 can access these associations to resolve subsequent commands and perform the required operations.
In some examples, the system 200 and digital assistant 201 are part of a system or electronic device for creating or interacting with VR, MR, or AR environments, as shown in fig. 7. Thus, the electronic device 100 or a similar electronic device may generate a VR environment that includes one or more virtual objects with which the digital assistant 201 may interact based on user input. In some examples, electronic device 100 may generate or receive a view of a virtual environment, including the one or more virtual objects. For example, as shown in fig. 7, the electronic device 100 may receive a view 700 including a virtual drawing 702 and a virtual sofa 703. In some examples, the virtual drawing 702 and the virtual sofa 703 are generated based on user-specified parameters. In some examples, the virtual drawing 702 and the virtual sofa 703 are generated based on predetermined parameters.
Upon interaction with view 700, digital assistant 201 may receive spoken input 701 that includes commands that digital assistant 201 cannot recognize, similar to spoken input 404 and spoken input 504 described above. For example, as shown in fig. 7, the digital assistant 201 may receive spoken input 701 "turn sofa blue", but may not recognize the command "turn". Thus, the digital assistant 201 determines whether the command of the spoken input 701 matches at least a portion of the operation, sub-operation, or metadata of the link model (e.g., link model 305) of the link interface 202 to determine which operation should be performed, as described above.
In some examples, the digital assistant 201 accesses the link models of applications that are open (e.g., running) on the electronic device 100 that provided (e.g., generated or received) the view 700. For example, to generate the view 700 including the virtual drawing 702 and the virtual sofa 703, the electronic device 100 may have an art history application and an interior design application open simultaneously. These applications may allow the electronic device 100 to retrieve and display the data needed for the virtual drawing 702 and the virtual sofa 703, including how the drawing looks, the colors available for the sofa, and so forth. Thus, the digital assistant 201 accesses and searches the link models corresponding to the art history application and the interior design application because these applications are currently open (e.g., in use).
The digital assistant 201 prioritizes these applications even though they are not the focus of the electronic device 100 or the digital assistant 201. That is, in a virtual reality or augmented reality environment, applications that are open (e.g., running) to create the environment and associated objects are typically not themselves displayed by the electronic device or the digital assistant 201. However, the digital assistant 201 is able to recognize that these applications may be relevant to the received spoken input 701 because these applications are actively working to produce the view 700.
In some examples, the digital assistant 201 accesses the link model of an application displayed by the electronic device 100 and/or the digital assistant 201. For example, while the electronic device 100 is providing the view 700, the digital assistant 201 may receive input to open a messaging application and then display the messaging application as part of the view 700. In addition, the digital assistant 201 may receive the spoken input "open my new email" but may not recognize the command "open." Thus, the digital assistant 201 can access the link model associated with the messaging application because the messaging application is currently displayed as part of the view 700.
Similarly, in some examples, the digital assistant 201 accesses the link model of an application that is the focus of the electronic device 100 and/or the digital assistant 201. For example, the electronic device 100 may display a virtual TV as part of the view 700 and cause content from a streaming application to be displayed on the virtual TV. Meanwhile, the electronic device 100 may display a messaging application including several emails. The digital assistant 201 may receive the input "open my new email" when the view 700 of the electronic device 100 is focused on (e.g., facing) the messaging application. Thus, the digital assistant 201 will access and preferentially search the link model associated with the messaging application rather than the link model associated with the streaming application because the messaging application is the focus of the electronic device 100.
After determining which link models to access and search, the digital assistant 201 may compare the command with the operations, sub-operations, and metadata of one or more link models of the link interface 202 to determine if any of the operations, sub-operations, or metadata match the command. Upon determining that one of the operations, sub-operations, or metadata matches the command, the digital assistant 201 may associate the corresponding operation with the command and store the association for further use, as described above.
In some examples, the spoken input 701 may include an ambiguous reference, and the digital assistant 201 may access a transcript to disambiguate it. For example, the digital assistant 201 may receive the spoken input "turn it blue" but may not immediately understand what object "it" refers to. Thus, the digital assistant 201 may review the transcript to determine the most recently taken actions and the objects on which those actions were performed. The digital assistant 201 may then determine that the color of the virtual sofa 703 was recently changed to red, so the user may want to change the color of the virtual sofa 703 from red to blue. Thus, the digital assistant 201 can determine that "it" refers to the virtual sofa 703.
As another example, the digital assistant 201 may receive the spoken input "restore it" but cannot determine what "it" refers to. Thus, the digital assistant 201 may review the transcript to determine that the user recently provided the spoken input "delete sofa," after which the digital assistant 201 deleted the virtual sofa 703 from the view 700. Thus, the digital assistant 201 can determine that the user is likely using "it" to refer to the sofa.
It should be appreciated that accessing the operations, sub-operations, and metadata of the system 200 using the digital assistant 201 in this manner may be advantageous in VR, MR, and AR environments such as those described above, as new virtual objects and physical objects are frequently added to and deleted from views of the electronic device. Thus, the digital assistant 201 can quickly adapt to new objects, new commands, and new applications without requiring the user or developer to provide extensive training, thereby creating a more pleasant and immersive experience for the user.
FIG. 8 is a flow chart illustrating a process for mapping and executing user commands according to various examples. The method 800 is performed at a device (e.g., device 100, 400, 500, 600) having one or more input devices (e.g., a touch screen, a microphone, a camera) and a wireless communication radio (e.g., a Bluetooth connection, a Wi-Fi connection, a mobile broadband connection such as a 4G LTE connection). In some embodiments, the electronic device includes a plurality of cameras. In some embodiments, the electronic device includes only one camera. In some examples, the device includes one or more biometric sensors, optionally including a camera, such as an infrared camera, a thermal imaging camera, or a combination thereof. Some operations in method 800 are optionally combined, the order of some operations is optionally changed, and some operations are optionally omitted.
At block 802, when an application (e.g., application 405, 505, 605) is opened, spoken input (e.g., spoken input 204, 404, 504, 604) including a command is received. In some examples, an application that is open on an electronic device (e.g., device 100, 400, 500, 600) is the focus of the electronic device. In some examples, the application that is open on the electronic device is one of a plurality of open applications.
At block 804, it is determined whether the command matches at least a portion of metadata (e.g., metadata 308) associated with an operation (e.g., operation 306, sub-operation 307) of the application (e.g., application 405, 505, 605). In some examples, determining whether the command matches at least a portion of metadata associated with the operation of the application further includes determining whether the command matches at least a portion of metadata associated with the operation of any of the plurality of open applications.
In some examples, the operations (e.g., operation 306, sub-operation 307) of the application (e.g., application 405, 505, 605) are active operations. In some examples, the operation of the application is one of a plurality of operations, and the plurality of operations includes a plurality of active operations and a plurality of inactive operations. In some examples, the plurality of active operations are operations currently displayed by the application. In some examples, the plurality of inactive operations are operations that are not currently displayed by the application.
In some examples, the plurality of operations (e.g., operation 306, sub-operation 307) are presented in a tree-like link model (e.g., link model 305). In some examples, the tree link model includes a plurality of hierarchical links between related ones of the plurality of operations.
In some examples, each of the plurality of operations (e.g., operation 306, sub-operation 307) is associated with a respective portion of metadata (e.g., metadata 308). In some examples, the metadata includes synonyms of operations. In some examples, the metadata includes an ontology corresponding to the operation.
At block 806, in accordance with a determination that the command matches at least a portion of metadata (e.g., metadata 308) associated with an operation (e.g., operation 306, sub-operation 307) of the application (e.g., application 405, 505, 605), the command is associated with the operation.
At block 808, the association of the command with the operation (e.g., operation 306, sub-operation 307) is stored for subsequent use by the digital assistant (e.g., digital assistant 201) through the application (e.g., application 405, 505, 605). In some examples, in accordance with a determination that the command matches at least a portion of metadata (e.g., metadata 308) associated with the operation of the application, the metadata for the portion is stored with the association of the command with the operation.
At block 810, operations (e.g., operation 306, sub-operation 307) are performed by an application (e.g., application 405, 505, 605).
In some examples, a plurality of operations (e.g., operation 306, sub-operation 307) previously accessed by the digital assistant (e.g., digital assistant 201) are determined. In some examples, respective metadata (e.g., metadata 308) associated with each of the plurality of operations previously accessed by the digital assistant is determined. In some examples, the plurality of operations and the corresponding metadata are compiled into a transcript. In some examples, the transcript is provided to resolve ambiguous requests.
In some examples, spoken input (e.g., spoken input 204, 404, 504, 604) is received without an application (e.g., application 405, 505, 605) being opened on an electronic device (e.g., device 100, 400, 500, 600). In some examples, it is determined whether the command matches at least a portion of metadata (e.g., metadata 308) associated with operations (e.g., operation 306, sub-operation 307) of the plurality of applications. In some examples, the plurality of applications includes applications listed in a transcript. In some examples, the plurality of applications includes applications that are frequently accessed by a digital assistant (e.g., digital assistant 201). In some examples, the plurality of applications includes applications marked as favorites in a user profile associated with a user providing spoken input. In some examples, the operation is an operation previously stored in association with the command and the application.
In some examples, spoken input (e.g., spoken input 204, 404, 504, 604) associating the second command with the second operation (e.g., operation 306, sub-operation 307) is received. In some examples, activity on an electronic device (e.g., device 100, 400, 500, 600) is recorded. In some examples, the recorded activity is stored as a second operation, and an association of the second command with the second operation is stored for subsequent use by a digital assistant (e.g., digital assistant 201).
Fig. 9 depicts an exemplary digital assistant 900 for performing natural language processing. In some examples, as shown in fig. 9, the digital assistant 900 includes a lightweight natural language model 901, a lightweight natural language model 902, a complex natural language model 903, and a complex natural language model 904. In some examples, the digital assistant 900 is implemented on the electronic device 100. In some examples, the digital assistant 900 is implemented on a device other than the electronic device 100 (e.g., a server). In some examples, some of the modules and functions of the digital assistant 900 are divided into a server portion and a client portion, where the client portion resides on one or more user devices (e.g., the electronic device 100) and communicates with the server portion over one or more networks. It should be noted that the digital assistant 900 is only one example, and that the digital assistant 900 may have more or fewer components than shown, may combine two or more components, or may have different component configurations or arrangements. The various components of digital assistant 900 are implemented in hardware, software instructions for execution by one or more processors, firmware (including one or more signal processing integrated circuits and/or application specific integrated circuits), or a combination thereof.
The digital assistant 900 receives an utterance 905 from a user and determines a user intent 908 corresponding to the utterance 905. As described further below, the digital assistant 900 provides the utterance 905 to one or more lightweight natural language models to determine a corresponding natural language recognition score. Based on these natural language recognition scores, the digital assistant 900 determines whether to provide the utterance to a complex natural language model associated with an application that is also associated with a corresponding lightweight natural language model. The complex natural language model may then determine a user intent 908 corresponding to the utterance 905.
In some examples, the utterance 905 from the user is received during an active digital assistant session between the user and the digital assistant 900. For example, the utterance 905 "order me a car to the airport" may be received during a dialogue or communication in which the user asks the digital assistant 900 "what time is it now?" and the digital assistant 900 responds "2:15 pm." Thus, the utterance 905 is received as part of an ongoing communication between the user and the digital assistant 900.
In some examples, the utterance 905 from the user is received outside of an active digital assistant session between the user and the digital assistant 900. Thus, the digital assistant 900 determines whether the utterance 905 is intended for the digital assistant 900 or for another person. In some examples, the digital assistant 900 determines whether the utterance 905 is intended for the digital assistant 900 based on various factors, such as the view and/or orientation of the electronic device 100, the direction in which the user is facing, the gaze of the user, the volume of the utterance 905, the signal-to-noise ratio associated with the utterance 905, and the like. For example, the utterance 905 "order me a car to the airport" may be received from a user while the user is looking at the device 100. Thus, the view of the electronic device 100 may include the user's face, and the volume of the utterance 905 may indicate that the user is facing the electronic device 100. Accordingly, the digital assistant 900 can determine that the user intends to direct the utterance 905 to the digital assistant 900.
In some examples, the utterance 905 includes a trigger phrase. In some examples, the digital assistant 900 determines whether the utterance 905 includes a trigger phrase and initiates a digital assistant session in accordance with determining that the utterance 905 includes the trigger phrase. For example, the utterance 905 may be "Assistant, order me a car to the airport." Thus, the digital assistant 900 determines that the word "assistant" is a trigger phrase and initiates a digital assistant session to interact with the user.
Thus, in some examples, the digital assistant 900 pre-processes the utterance 905 before providing the utterance 905 to the lightweight natural language model 901 and the lightweight natural language model 902, as described further below. In some examples, the preprocessing of the utterance 905 includes determining a start and/or an end of the utterance 905. For example, when the utterance 905 "order me a car to an airport" is received as part of an ongoing conversation between the user and the digital assistant 900, the digital assistant 900 may pre-process the received audio to determine which part of the conversation is the utterance 905. Thus, the digital assistant 900 can determine which portion of the dialog needs to be further processed by the natural language model, as will be discussed in more detail below.
After the digital assistant 900 receives the utterance 905 (and optionally pre-processes it), the digital assistant 900 provides the utterance 905 to the lightweight natural language model 901 and the lightweight natural language model 902. In some examples, the lightweight natural language model 901 is associated with a first application and the lightweight natural language model 902 is associated with a second application that is different from the first application.
Lightweight natural language models 901 and 902 are simplified natural language models that can determine whether further processing of utterance 905 is required. Specifically, after receiving the utterance 905, the lightweight natural language model 901 then determines the natural language recognition score 906 of the utterance 905, and the lightweight natural language model 902 determines the natural language recognition score 907 of the utterance 905. Since each of the lightweight natural language models 901 and 902 is associated with a respective first application and second application, the natural language identification score 906 determined by the lightweight natural language model 901 is associated with the first application and the natural language identification score 907 determined by the lightweight natural language model 902 is associated with the second application.
Thus, the lightweight natural language models 901 and 902 determine whether further processing of the utterance 905 is required to perform a task by the first application or the second application based on the utterance 905. Thus, the lightweight natural language models 901 and 902 are relatively simple models for determining whether the utterance 905 should be provided to a complex natural language model of each application to determine user intent.
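A sketch of this gating flow is shown below; the keyword-overlap scoring functions and the threshold value are placeholders standing in for the lightweight models, not the disclosed implementation.

```python
RELEVANCE_THRESHOLD = 0.25   # hypothetical value

def rideshare_lightweight_score(utterance: str) -> float:
    keywords = {"car", "ride", "airport", "drive"}
    words = set(utterance.lower().replace("?", "").split())
    return len(words & keywords) / len(keywords)

def weather_lightweight_score(utterance: str) -> float:
    keywords = {"weather", "temperature", "rain", "sunny"}
    words = set(utterance.lower().replace("?", "").split())
    return len(words & keywords) / len(keywords)

def route(utterance: str):
    scores = {
        "rideshare": rideshare_lightweight_score(utterance),
        "weather": weather_lightweight_score(utterance),
    }
    # Only applications whose score exceeds the threshold receive the utterance
    # for full (complex) natural language processing.
    return [app for app, s in scores.items() if s > RELEVANCE_THRESHOLD]

print(route("order me a car to the airport"))   # ["rideshare"]
```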
In some examples, determining the natural language recognition score 906 of the utterance 905 includes determining whether the utterance 905 is related to a first application associated with the lightweight natural language model 901. In particular, the lightweight natural language model 901 can analyze particular words or phrases in the utterance 905 that are related to a subject or task of the first application, and determine the natural language recognition score 906 based on the presence of the words or phrases, how the words or phrases are used in the utterance 905, and so on.
For example, when the lightweight natural language model 901 is associated with a carpool application, the lightweight natural language model 901 can analyze words or phrases in the utterance 905 that are related to driving, automobiles, locations, travel, etc., and determine the natural language recognition score 906 based on whether the utterance 905 includes these words, how close together the words are, etc. Thus, when the lightweight natural language model 901 processes the utterance 905 "order me a car to the airport," the lightweight natural language model 901 can determine that the utterance 905 is relevant to the carpool application because of the presence of the words "car" and "airport" and their relative positions in the utterance 905. Thus, the natural language recognition score 906 determined by the lightweight natural language model 901 may be relatively high because the utterance 905 is determined to be relevant to the carpool application.
Conversely, when the utterance 905 is "how is the weather in Germany?", the lightweight natural language model 901 can determine that the utterance 905 is not relevant to the carpool application because the word "weather" is not related to carpooling and, while the mention of "Germany" may be related to travel, nothing else in the utterance 905 is related to carpooling. Thus, the natural language recognition score 906 determined by the lightweight natural language model 901 may be relatively low because the utterance 905 is determined to be irrelevant to the carpool application.
Similarly, in some examples, determining the natural language recognition score 907 of the utterance 905 includes determining whether the utterance 905 is related to a second application that is associated with the lightweight natural language model 902. In particular, the lightweight natural language model 902 may analyze particular words or phrases in the utterance 905 that are related to the subject or task of the second application, and determine the natural language recognition score 907 based on the presence of the words or phrases, how the words or phrases are used in the utterance 905, and so on.
For example, when the lightweight natural language model 902 is associated with a weather application, the lightweight natural language model 902 may analyze words or phrases in the utterance 905 that are related to location, travel, weather, climate, temperature, clouds, etc., and determine the natural language recognition score 907 based on whether the utterance 905 includes these words, how close together the words are, etc. Thus, when the utterance 905 is "how is the weather in Germany?", the lightweight natural language model 902 can determine that the utterance 905 is relevant to the weather application because of the presence of the words "weather" and "Germany" in the utterance 905 and their relative positions. Thus, the natural language recognition score 907 determined by the lightweight natural language model 902 may be relatively high because the utterance 905 is determined to be relevant to the weather application.
Conversely, when the utterance 905 is "order me a car to the airport," the lightweight natural language model 902 may determine that the utterance 905 is not relevant to the weather application because the word "car" is not related to weather and, while the mention of "airport" may be related to a location that has weather, nothing else in the utterance 905 is related to weather. Thus, the natural language recognition score 907 determined by the lightweight natural language model 902 may be relatively low because the utterance 905 is determined to be irrelevant to the weather application.
In some examples, after the natural language identification scores 906 and 907 are determined, the natural language identification scores 906 and 907 are adjusted based on context data associated with the electronic device on which the digital assistant 900 operates (e.g., the electronic device 100). The context data associated with the electronic device includes various characteristics of the electronic device. For example, the context data may indicate a location of the electronic device (e.g., GPS coordinates), whether the electronic device is connected to a network (e.g., a WiFi network), whether the electronic device is connected to one or more other devices (e.g., headphones), and/or a current time, date, and/or workday. If the electronic device is connected to a network or device, the context data may further indicate the name and/or type of the network or device, respectively.
For example, when the utterance 905 "how is the weather in Palo Alto?" is received, the context data may include GPS coordinates indicating that the electronic device 100 is located in San Francisco. Thus, the digital assistant 900 may adjust the natural language identification score 906 associated with the carpool application by increasing the natural language identification score 906 because, given that San Francisco is relatively close to Palo Alto, the user is more likely to be interested in a ride to Palo Alto. Thus, the natural language recognition score 906 is adjusted because the digital assistant 900 recognizes that the utterance 905 is more relevant to the carpool application based on the location of the electronic device 100.
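A sketch of a location-based adjustment is shown below; the planar distance approximation, the 100 km cutoff, and the fixed boost are assumptions used only to illustrate the idea.

```python
import math

def distance_km(a, b):
    """Rough planar approximation of distance in km; adequate for a toy example."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) * 111  # ~111 km per degree

def adjust_score(base_score, device_location, mentioned_location, boost=0.2):
    """Raise the rideshare recognition score when the mentioned destination is nearby."""
    if distance_km(device_location, mentioned_location) < 100:
        return min(1.0, base_score + boost)
    return base_score

san_francisco = (37.77, -122.42)
palo_alto = (37.44, -122.14)
print(adjust_score(0.45, san_francisco, palo_alto))   # ~0.65: nearby, so score raised
```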
In some examples, after the natural language identification scores 906 and 907 are determined, the natural language identification scores 906 and 907 are adjusted based on a view of the electronic device (e.g., electronic device 100) on which the digital assistant 900 is operating. For example, when the utterance 905 is "when do they play next?" and the natural language identification score 907 is associated with a sports score application, the natural language identification score 907 may initially be determined to be relatively low. However, the view of the electronic device 100 may include a poster of the San Francisco Giants. Thus, the digital assistant 900 determines that the utterance 905 is relevant to the sports score application, and the natural language recognition score 907 may be raised because the view includes the poster.
In some examples, the natural language identification scores 906 and 907 are adjusted based on a view of a virtual environment generated by the electronic device 100 or a similar electronic device. For example, when the utterance 905 is "what is playing now?" and the natural language identification score 906 is associated with a media application, the natural language identification score 906 may be raised because the view of the electronic device 100 includes a virtual environment with a virtual television. Thus, the digital assistant 900 can determine that the utterance 905 is relevant to the media application because the view includes the virtual television.
After the lightweight natural language model 901 determines (and optionally adjusts) the natural language identification score 906, the digital assistant 900 determines whether the natural language identification score 906 exceeds a predetermined relevance threshold. In accordance with determining that the natural language recognition score 906 exceeds a predetermined relevance threshold, the digital assistant 900 determines that the utterance 905 is relevant to the first application. Thus, the digital assistant 900 provides the utterance 905 to the complex natural language model 903 to determine a user intent 908 that corresponds to the utterance 905. In contrast, in accordance with a determination that the natural language recognition score 906 does not exceed the predetermined relevance threshold, the digital assistant 900 determines that the utterance 905 is not relevant to the first application and does not provide the utterance 905 to the complex natural language model 903.
For example, when the utterance 905 is "order me a car to the airport," such that the natural language recognition score 906 associated with the carpool application is relatively high, the digital assistant 900 may determine that the natural language recognition score 906 exceeds the predetermined relevance threshold and provide the utterance 905 to the complex natural language model 903 to determine the user intent 908. Alternatively, when the utterance 905 is "how is the weather in Germany?", such that the natural language recognition score 906 associated with the carpool application is relatively low, the digital assistant 900 may determine that the natural language recognition score 906 does not exceed the predetermined relevance threshold and may not provide the utterance 905 to the complex natural language model 903.
Similarly, after the lightweight natural language model 902 determines the natural language identification score 907, the digital assistant 900 determines whether the natural language identification score 907 exceeds a predetermined relevance threshold. In accordance with determining that the natural language recognition score 907 exceeds the predetermined relevance threshold, the digital assistant 900 determines that the utterance 905 is relevant to the second application. Thus, the digital assistant 900 provides the utterance 905 to the complex natural language model 904 to determine a user intent 909 that corresponds to the utterance 905. In contrast, in accordance with a determination that the natural language recognition score 907 does not exceed the predetermined relevance threshold, the digital assistant 900 determines that the utterance 905 is not relevant to the second application and does not provide the utterance 905 to the complex natural language model 904.
For example, when the utterance 905 is "how is the weather in Germany?", such that the natural language recognition score 907 associated with the weather application is relatively high, the digital assistant 900 may determine that the natural language recognition score 907 exceeds the predetermined relevance threshold and provide the utterance 905 to the complex natural language model 904 to determine the user intent 909. Alternatively, when the utterance 905 is "order me a car to the airport," such that the natural language recognition score 907 associated with the weather application is relatively low, the digital assistant 900 may determine that the natural language recognition score 907 does not exceed the predetermined relevance threshold and may not provide the utterance 905 to the complex natural language model 904.
The complex natural language models 903 and 904 are sophisticated natural language models that are capable of performing complete natural language processing on the utterance 905 to determine user intents (e.g., user intents 908 and 909) and tasks associated with the user intents. Thus, when the complex natural language model 903 and the complex natural language model 904 receive the utterance 905, the complex natural language model 903 determines the user intent 908 corresponding to the utterance 905 and the complex natural language model 904 determines the user intent 909 corresponding to the utterance 905. In some examples, the complex natural language models 903 and 904 also determine one or more parameters of the task corresponding to the determined user intent.
For example, when the utterance 905 is "order me a car to an airport," the complex natural language model 903 associated with the carpool application determines that the user intent 908 is to order a car to the airport from their current location. Thus, the complex natural language model 903 determines that the task corresponding to the determined intent is a carpool task, and the parameters of the carpool task include the current location of the user as a starting location and the nearest airport as an ending location.
As another example, when the utterance 905 is "how is the weather in Germany?", the complex natural language model 904 associated with the weather application determines that the user intent 909 is to determine the current weather in Germany. Thus, the complex natural language model 904 determines that the task corresponding to the determined intent is to check the weather, and that the parameters of the task include coordinates of a location within Germany.
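For illustration, the structured result of a complex model might resemble the following sketch; the Intent dataclass and the keyword rules are hypothetical stand-ins for full natural language processing.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Intent:
    task: str
    parameters: Dict[str, str] = field(default_factory=dict)

def complex_rideshare_nlu(utterance: str) -> Intent:
    """Stand-in for full natural language processing by a complex model."""
    text = utterance.lower()
    if "car" in text and "airport" in text:
        return Intent(task="book_ride",
                      parameters={"start": "current location", "end": "nearest airport"})
    return Intent(task="unknown")

print(complex_rideshare_nlu("order me a car to the airport"))
# Intent(task='book_ride', parameters={'start': 'current location', 'end': 'nearest airport'})
```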
In some examples, prior to receiving the utterance 905, the lightweight natural language model 901 and the lightweight natural language model 902 are trained to determine whether an utterance is relevant to the first application and the second application, respectively. In some examples, the lightweight natural language model 901 is trained using a first training data set associated with the first application. In some examples, the first training data set includes a set of utterances related to the first application. For example, when the first application is a carpool application, the first training data set includes utterances such as "where is my car?", "help me get to his home", "arrange a ride home from the airport", "drive to see a movie", "is the car on the road?", etc.
In some examples, the lightweight natural language model 901 is trained by calibrating the natural language recognition score based on a plurality of utterances of the first training data set that are not relevant to the first application. For example, a plurality of utterances that are not relevant to the carpool application (such as "how hot is the weather?") may be included so that the natural language recognition scores produced for such utterances are calibrated to be low.
Thus, when the first training data set is provided to the lightweight natural language model 901, the lightweight natural language model 901 is trained to determine that the set of utterances included in the first training data set, and similar utterances, are relevant to the first application based on the factors described above (including the presence of specific terms or phrases, the placement of those terms, the relationships between terms and phrases, etc.).
Similarly, in some examples, the lightweight natural language model 902 is trained using a second training data set associated with the second application. In some examples, the second training data set includes a set of utterances related to the second application. For example, when the second application is a weather application, the second training data set includes utterances such as "how hot is it?", "what is the temperature outside?", "will next week be sunny?", "what is the weather in Florida?", "tell me the weather forecast for next Tuesday", and so on.
In some examples, the lightweight natural language model 902 is trained by calibrating the natural language recognition score based on a plurality of utterances of the second training data set that are not relevant to the second application. For example, a plurality of utterances that are not relevant to the weather application (such as "where is my car?") may be included in the second training data set so that the lightweight natural language model 902 is calibrated to assign such utterances a relatively low natural language recognition score.
Thus, when the second training data set is provided to the lightweight natural language model 902, the lightweight natural language model 902 is trained to determine that the set of utterances included in the second training data set, and similar utterances, are relevant to the second application based on the factors described above (including the presence of specific terms or phrases, the placement of those terms, the relationships between terms and phrases, etc.).
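The following non-limiting sketch illustrates this style of training in Python; the bag-of-words logistic regression (via scikit-learn), the example utterances, and the function names are assumptions made for illustration only and do not describe the actual lightweight natural language models.

# Illustrative only: train a lightweight relevance scorer for one application
# from utterances that are relevant (label 1) and irrelevant (label 0) to it.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

relevant = ["where is my car?", "arrange a ride home from the airport",
            "get a ride to the movies", "is the car on the way?"]
irrelevant = ["how hot is it outside?", "what is the weather in Florida?"]

texts = relevant + irrelevant
labels = [1] * len(relevant) + [0] * len(irrelevant)

vectorizer = CountVectorizer(ngram_range=(1, 2))   # single words and word pairs
features = vectorizer.fit_transform(texts)

scorer = LogisticRegression()
scorer.fit(features, labels)                        # calibrate on both utterance sets

def recognition_score(utterance: str) -> float:
    # Probability-like score that the utterance is relevant to this application.
    return float(scorer.predict_proba(vectorizer.transform([utterance]))[0, 1])

print(recognition_score("order me a car to the airport"))   # expected to be higher
print(recognition_score("will next week be sunny?"))         # expected to be lower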
Similarly, in some examples, prior to receiving the utterance 905, the complex natural language model 903 and the complex natural language model 904 are trained, using training data sets that include multiple utterances, to determine a user intent, a task associated with the user intent, and parameters of the task. Thus, various utterances (such as "order me a car to the airport" and "what is the weather in Germany?") may be provided during training so that the complex natural language models learn to determine the corresponding user intents, tasks, and parameters.
In some examples, the lightweight natural language model and the complex natural language model are trained on devices other than the electronic device 100. In some examples, the lightweight natural language model and the complex natural language model are trained on a server and then provided to the electronic device 100. In some examples, the lightweight natural language model and the complex natural language model are trained simultaneously. In some examples, the lightweight natural language model and the complex natural language model are trained at different times.
In some examples, because the lightweight natural language model is simpler and performs a less complex determination than the complex natural language model, it requires less training data and fewer parameters to be trained and calibrated successfully. In some examples, the lightweight natural language model includes a logistic regression network or a convolutional neural network, and thus processes individual words or tokens of the utterance in parallel. In this way, the lightweight natural language model processes each word or token of an utterance in the context of neighboring words or tokens, but not in the full context of the utterance. Thus, training the lightweight natural language model is faster and requires less processing than training the complex natural language model.
In some examples, the digital assistant 900 determines whether the natural language recognition score 906 is greater than the natural language recognition score 907 and, in accordance with determining that the natural language recognition score 906 is greater than the natural language recognition score 907, performs the task associated with the user intent 908. For example, when the utterance 905 is "order me a car to the airport" and the natural language recognition score 906 associated with the carpool application is greater than the natural language recognition score 907 associated with the weather application, the digital assistant 900 causes the carpool application to perform the carpool task associated with the user intent 908 (i.e., ordering a ride).
Similarly, in some examples, the digital assistant 900 determines whether the natural language recognition score 907 is greater than the natural language recognition score 906 and, in accordance with determining that the natural language recognition score 907 is greater than the natural language recognition score 906, performs the task associated with the user intent 909. For example, when the utterance 905 is "what is the weather in Germany?" and the natural language recognition score 907 associated with the weather application is greater than the natural language recognition score 906 associated with the carpool application, the digital assistant 900 causes the weather application to perform the task associated with the user intent 909 (i.e., determining the weather of the specified location).
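A minimal routing sketch of this comparison is shown below; the threshold value, the model objects, and their score and parse methods are assumptions for illustration only, not the implementation described above.

# Illustrative only: score the utterance with each application's lightweight
# model, then pass it to the complex model of the best-scoring application
# when that score clears the relevance threshold.
RELEVANCE_THRESHOLD = 0.5     # assumed value

def route_utterance(utterance, lightweight_models, complex_models):
    # lightweight_models / complex_models: dicts keyed by application name.
    scores = {app: model.score(utterance) for app, model in lightweight_models.items()}
    best_app = max(scores, key=scores.get)
    if scores[best_app] < RELEVANCE_THRESHOLD:
        return None                                           # no application is clearly relevant
    user_intent = complex_models[best_app].parse(utterance)   # full natural language processing
    return best_app, user_intent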
In some examples, the digital assistant 900 provides the utterance 905 to the complex natural language model 903 associated with the first application regardless of whether the natural language recognition score 906 exceeds a predetermined relevance threshold. In particular, the digital assistant 900 provides the utterance 905 to the complex natural language model 903 because the first application is active on the electronic device 100. In some examples, the first application is active on the electronic device 100 when the first application is open on the electronic device 100. In some examples, the first application is active on the electronic device 100 when the first application is the focus of the electronic device 100.
Thus, in some examples, in accordance with a determination that the natural language identification score 906 does not exceed a predetermined threshold, the digital assistant 900 determines whether an application associated with the lightweight natural language model 901 is active. In accordance with determining that the application associated with the lightweight natural language model 901 is active, the digital assistant 900 provides the utterance 905 to the complex natural language model 903 associated with the application. The complex natural language model 903 then determines the user intent 908 corresponding to the utterance 905.
For example, when the utterance 905 is "how does the weather in germany? When "and the natural language recognition score 906 does not exceed the predetermined relevance threshold, the digital assistant 900 may determine that the car pool application is the focus of the electronic device 100 and provide the utterance 905 to the complex natural language model 903 for further processing.
In some examples, the digital assistant 900 provides the utterance 905 to all lightweight natural language models available to the digital assistant 900. In some examples, the digital assistant 900 has access to a lightweight natural language model for each application installed on the electronic device 100. Thus, the digital assistant 900 can provide the utterance 905 to a lightweight natural language model of each application installed on the electronic device 100.
In some examples, the digital assistant 900 selects a subset of applications and provides the utterance 905 to the lightweight natural language model of each application in the selected subset of applications. In some examples, an application is selected based on the user's preferences. For example, the user may indicate in the user settings that they prefer the first carpool application over the second carpool application. Thus, the digital assistant 900 can automatically provide the utterance 905 to the lightweight natural language model associated with the first carpool application based on this user setting.
In some examples, an application is selected based on historical interactions between the user and the application. For example, the digital assistant 900 may have offered the user a choice between the first and second carpool applications several times, and the user may have selected the second carpool application each time. Thus, the digital assistant 900 may determine that the user is more likely to select the second carpool application and automatically provide the utterance 905 to the lightweight natural language model associated with the second carpool application based on the user's history of selecting that application.
In some examples, an application is selected based on its popularity. For example, the digital assistant 900 may determine that the first carpool application is selected more frequently by users seeking a ride. Thus, the digital assistant 900 can automatically provide the utterance 905 to the lightweight natural language model associated with the first carpool application because the digital assistant 900 determines that the first carpool application is more popular with most users.
In some examples, an application is selected based on when the application was installed on the electronic device 100. For example, the second carpool application may have been installed within the last day. Accordingly, the digital assistant 900 may determine that the user likely intends the utterance for a carpool application because the user has recently downloaded the new carpool application. Thus, the digital assistant 900 can automatically provide the utterance 905 to the lightweight natural language model associated with the second carpool application because the digital assistant 900 determines that the second carpool application was recently installed.
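The sketch below illustrates one way such a subset could be assembled from user preference, interaction history, popularity, and installation recency; the specific signals, cutoffs, and names are assumptions for illustration only.

# Illustrative only: choose which applications' lightweight models receive the utterance.
def select_applications(apps, user_prefs, usage_history, popularity, days_since_install):
    selected = []
    for app in apps:
        if app in user_prefs:                           # explicit user preference
            selected.append(app)
        elif usage_history.get(app, 0) >= 3:            # repeatedly chosen in the past
            selected.append(app)
        elif popularity.get(app, 0.0) > 0.8:            # popular with most users
            selected.append(app)
        elif days_since_install.get(app, 999) <= 1:     # installed within the last day
            selected.append(app)
    return selected or list(apps)                       # fall back to all applications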
In some examples, the digital assistant 900 has access to a lightweight natural language model associated with an application that is available to the digital assistant 900 but not installed on the electronic device 100. Thus, the digital assistant 900 can provide the utterance 905 to this lightweight natural language model and utilize the lightweight natural language model to determine a natural language recognition score. Additionally, the digital assistant 900 may determine whether the natural language recognition score exceeds a predetermined threshold. If the digital assistant 900 determines that the natural language recognition score exceeds the predetermined threshold, the digital assistant 900 may retrieve the application associated with the lightweight natural language model (e.g., from a server) and install the application. In some examples, installing the application includes downloading the complex natural language model of the application and providing the utterance 905 to the complex natural language model.
For example, when the utterance 905 is "what is the game score? "when the digital assistant 900 may provide the utterance 905 to a lightweight natural language model associated with a sports application that is available to the digital assistant 900 but not installed on the electronic device 100. The lightweight natural language model may determine that the natural language recognition score is relatively high because of the "score" and "match" used in the utterance 905. Thus, the digital assistant 900 can determine that the utterance 905 exceeds a predetermined correlation threshold, retrieve the sports application from the server, and install the sports application. The digital assistant 900 may then provide the utterance 905 to a complex natural language model associated with the sports application to determine the user intent.
In some examples, the applications available to the digital assistant 900 but not installed are selected in the same manner as described above for selecting installed applications.
It should be appreciated that this process may include any number of lightweight natural language models and any number of complex natural language models, depending on the number of applications available to the digital assistant 900 or installed on the electronic device 100. Thus, the digital assistant 900 may include a third, fourth, fifth, sixth, or seventh lightweight natural language model and complex natural language model. Similarly, the digital assistant 900 may determine a third, fourth, fifth, sixth, or seventh natural language recognition score and a third, fourth, fifth, sixth, or seventh user intent associated with a third, fourth, fifth, sixth, or seventh application.
Fig. 10 is a flow chart illustrating a process for determining user intent according to various examples. Process 1000 is performed at a device (e.g., device 100, 400, 500, 600) having one or more input devices (e.g., touch screen, microphone, camera) and a wireless communication radio (e.g., a Bluetooth connection, a Wi-Fi connection, a mobile broadband connection such as a 4G LTE connection). In some embodiments, the electronic device includes a plurality of cameras. In some embodiments, the electronic device includes only one camera. In some examples, the device includes one or more biometric sensors, optionally including a camera, such as an infrared camera, a thermal imaging camera, or a combination thereof. Some operations in process 1000 are optionally combined, the order of some operations is optionally changed, and some operations are optionally omitted.
In some examples, process 1000 is performed using a client-server system, and the blocks of process 1000 are divided in any manner between a server and a client device (e.g., device 100). In other examples, the blocks of process 1000 are divided between a server and a plurality of client devices (e.g., mobile phones and smart watches). Thus, while portions of process 1000 are described herein as being performed by a particular device of a client-server system, it should be understood that process 1000 is not so limited. In other examples, process 1000 is performed using only one client device or only multiple client devices. In process 1000, some blocks are optionally combined, the order of some blocks is optionally changed, and some blocks are optionally omitted. In some examples, additional steps may be performed in connection with process 1000.
At block 1010, an utterance (e.g., utterance 905) from a user is received. At block 1020, a first natural language recognition score (e.g., natural language recognition scores 906, 907) of the utterance is determined utilizing a first lightweight natural language model (e.g., lightweight natural language model 901, 902) associated with the first application. In some examples, determining the first natural language recognition score of the utterance using the first lightweight natural language model associated with the first application further includes determining whether the utterance is related to the first application.
At block 1030, a second natural language recognition score (e.g., natural language recognition score 906, 907) of the utterance (e.g., utterance 905) is determined using a second lightweight natural language model (e.g., lightweight natural language model 901, 902) associated with the second application.
In some examples, a first lightweight natural language model (e.g., lightweight natural language model 901, 902) is trained based on a first training data set comprising a first plurality of utterances related to a first application and a second lightweight natural language model (e.g., lightweight natural language model 901, 902) is trained based on a second training data set comprising a second plurality of utterances related to a second application before receiving an utterance (e.g., utterance 905) from a user. In some examples, training the first lightweight natural language model based on the first training data set further includes calibrating a third natural language recognition score based on a plurality of utterances of the first training data set that are not related to the first application.
In some examples, a first lightweight natural language model (e.g., lightweight natural language model 901, 902) and a complex natural language model (e.g., complex natural language model 903, 904) associated with a first application are received from a second electronic device. In some examples, the first lightweight natural language model and the complex natural language model associated with the first application are trained simultaneously on the second electronic device.
At block 1040, it is determined whether the first natural language recognition score (e.g., natural language recognition scores 906, 907) exceeds a predetermined threshold.
At block 1050, in accordance with a determination that the first natural language recognition score (e.g., natural language recognition score 906, 907) exceeds a predetermined threshold, an utterance (e.g., utterance 905) is provided to a complex natural language model (e.g., complex natural language model 903, 904) associated with the first application. At block 1060, a complex natural language model is utilized to determine a user intent (e.g., user intents 908, 911) corresponding to the utterance.
In some examples, a complex natural language model (e.g., complex natural language model 903, 904) associated with the first application is trained to determine a user intent (e.g., user intent 908, 911) and a task associated with the user intent, and wherein the first lightweight natural language model (e.g., lightweight natural language model 901, 902) is not trained to determine the user intent. In some examples, the first lightweight natural language model is a simplified natural language model and the complex natural language model associated with the first application is a refined natural language model.
In some examples, it is determined whether the first natural language recognition score (e.g., natural language recognition scores 906, 907) is higher than the second natural language recognition score (e.g., natural language recognition scores 906, 907). In accordance with a determination that the first natural language recognition score is higher than the second natural language recognition score, a task associated with the user intent (e.g., user intents 908, 911) is performed. In some examples, it is determined whether the second natural language recognition score is higher than the first natural language recognition score. In accordance with a determination that the second natural language recognition score is higher than the first natural language recognition score, a task associated with a second user intent (e.g., user intents 908, 911) is performed.
In some examples, in accordance with a determination that the first natural language recognition score (e.g., natural language recognition scores 906, 907) does not exceed a predetermined threshold, it is determined whether the first application is active. In addition, in accordance with determining that the first application is active, the utterance (e.g., utterance 905) is provided to the complex natural language model (e.g., complex natural language model 903, 904) associated with the first application, and a user intent (e.g., user intent 908, 911) corresponding to the utterance is determined using the complex natural language model.
In some examples, it is determined whether the second natural language recognition score (e.g., natural language recognition scores 906, 907) exceeds a predetermined threshold. In accordance with a determination that the second natural language recognition score exceeds the predetermined threshold, the utterance (e.g., the utterance 905) is provided to a complex natural language model (e.g., complex natural language model 903, 904) associated with the second application, and a second user intent (e.g., user intent 908, 911) corresponding to the utterance is determined using the complex natural language model.
In some examples, a third natural language recognition score (e.g., natural language recognition score 906, 907) of the utterance (e.g., utterance 905) is determined using a third lightweight natural language model (e.g., lightweight natural language model 901, 902) associated with a third application, wherein the third application is available to the electronic device but not installed on the electronic device. In some examples, in accordance with a determination that the third natural language recognition score exceeds a predetermined threshold, the third application is retrieved and installed on the electronic device (e.g., electronic device 100). In some examples, the third application is selected based on previous interactions with the third application. In some examples, the third application is selected based on its popularity.
In some examples, the first natural language recognition score (e.g., natural language recognition scores 906, 907) is adjusted based on context data associated with the electronic device (e.g., electronic device 100). In some examples, the second natural language recognition score (e.g., natural language recognition scores 906, 907) is adjusted based on the view of the electronic device.
FIG. 11 illustrates a process 1100 for determining and executing tasks with an integrated application. Process 1100 is performed at a device (e.g., device 100, 400, 500, 600, 1300, 1400) having one or more input devices (e.g., touch screen, microphone, camera) and a wireless communication radio (e.g., a Bluetooth connection, a Wi-Fi connection, a mobile broadband connection such as a 4G LTE connection). In some embodiments, the electronic device includes a plurality of cameras. In some embodiments, the electronic device includes only one camera. In some examples, the device includes one or more biometric sensors, optionally including a camera, such as an infrared camera, a thermal imaging camera, or a combination thereof. Some operations in process 1100 are optionally combined, the order of some operations is optionally changed, and some operations are optionally omitted.
In some examples, process 1100 is performed using a client-server system, and the blocks of process 1100 are divided in any manner between a server and a client device (e.g., device 100). In other examples, the blocks of process 1100 are divided between a server and a plurality of client devices (e.g., mobile phones and smart watches). Thus, while portions of process 1100 are described herein as being performed by a particular device of a client-server system, it should be understood that process 1100 is not limited thereto. In other examples, process 1100 is performed using only one client device or only multiple client devices. In process 1100, some blocks are optionally combined, the order of some blocks is optionally changed, and some blocks are optionally omitted. In some examples, additional steps may be performed in connection with process 1100.
At block 1110, an utterance (e.g., utterances 204, 404, 504, 604, 701, 905, 1205, 1304, 1404) is received from a user. In some examples, the utterance includes a request. For example, the utterance may be "what is the weather?", "turn that green" as described below, or any other utterance that includes a request for a digital assistant (e.g., digital assistants 900, 1200). In some examples, the digital assistant determines whether the utterance includes a request.
In some examples, utterances (e.g., utterances 204, 404, 504, 604, 701, 905, 1205, 1304, 1404) from the user are received during an active digital assistant session between the user and the digital assistant (e.g., digital assistant 201, 900, 1200). In some examples, the utterance from the user is received outside of an active digital assistant session between the user and the digital assistant. Thus, the digital assistant determines whether the utterance is intended for the digital assistant. In some examples, as described above, the digital assistant determines whether the utterance is intended for the digital assistant based on various factors, such as a view (e.g., view 700, 1301, 1401) of an electronic device (e.g., electronic device 100, 400, 500, 600, 1300, 1400), a direction in which a user faces, a volume of the utterance, a signal-to-noise ratio associated with the utterance, and so on.
In some examples, the utterance (e.g., the utterances 404, 504, 604, 701, 905, 1205, 1304, 1404) includes a trigger phrase. In some examples, a digital assistant (e.g., digital assistant 201, 900, 1200) determines whether an utterance includes a trigger phrase, and in accordance with determining that the utterance includes the trigger phrase, initiates a digital assistant session.
At block 1120, one or more representations of the utterance (e.g., utterances 404, 504, 604, 701, 905, 1205, 1304, 1404) are determined using a speech recognition model that is at least partially trained with data representing the application (e.g., applications 405, 505, 605). In some examples, data representing an application is derived from source code of the application. When a developer of an application creates an application, the developer may include source code that specifies information about how the application interacts with other applications or digital assistants. Data representing an application may be extracted from source code at the time of application creation or at the time of application installation on an electronic device (e.g., device 100, 400, 500, 600, 1300, 1400).
Thus, data representing an application (e.g., application 405, 505, 605) may be received from the second electronic device at the time of application installation. In some examples, when an application is installed on a first electronic device, source code is transferred from a second electronic device to the first electronic device (e.g., devices 100, 400, 500, 600, 1300, 1400). Thus, the first electronic device may extract data from the source code after receiving the source code. In some examples, the first electronic device is a user device such as device 100. Additionally, in some examples, the second electronic device is a server communicatively coupled to the first electronic device.
In some examples, source code of an application (e.g., application 405, 505, 605) includes at least one of a model associated with the application (e.g., model 901, 902, 903, 904), an operation associated with the application (e.g., operation 306, sub-operation 307), and an object associated with the application (e.g., object 1225, 1235, text 1302, picture 1303, virtual chair 1402, virtual table 1403). In some examples, the models associated with the application include the lightweight natural language model and the complex natural language model described above with reference to fig. 9 and 10. Thus, after an application is installed on an electronic device, lightweight natural language models and complex natural language models are extracted from source code and installed on the electronic device and/or added to a digital assistant (e.g., digital assistant 201, 900, 1200).
In some examples, when the electronic device receives source code and/or data representing an application (e.g., application 405, 505, 605), operations associated with the application (e.g., operation 306, sub-operation 307) may be added to a database of possible operations as described above with reference to fig. 1-8. Similarly, when the electronic device receives source code and/or data representing an application, objects associated with the application (e.g., objects 1225, 1235, text 1302, pictures 1303, virtual chairs 1402, virtual tables 1403) may be added to the database of operations and objects as described above with reference to fig. 1-8.
In some examples, models (e.g., models 901, 902, 903, 904), operations (e.g., operation 306, sub-operation 307), and objects (e.g., objects 1225, 1235, text 1302, pictures 1303, virtual chairs 1402, virtual tables 1403) associated with applications (e.g., applications 405, 505, 605) can be interacted with by a digital assistant (e.g., digital assistants 201, 900, 1200). For example, the digital assistant may provide utterances to the natural language models and explore the natural language models, as described above with reference to fig. 9-10 and 12-15. In addition, the digital assistant may search the operations and objects to determine the operations and objects corresponding to an utterance, as described above with reference to fig. 1-8. Thus, models, operations, and objects associated with an application are integrated with the digital assistant.
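As a non-limiting illustration, the data extracted from an application's source code and registered with the digital assistant might resemble the following sketch; all names and the database layout are assumptions for illustration only.

# Illustrative only: an application registers its models, operations, and objects.
app_registration = {
    "application": "virtual_furniture",
    "models": {"lightweight": "relevance_scorer.bin",   # assumed artifact names
               "complex": "intent_parser.bin"},
    "operations": ["create_object", "set_color", "delete_object"],
    "objects": ["chair", "table"],
}

def register_application(assistant_db, registration):
    # Add the application's operations and objects to the assistant's database of
    # possible operations and objects, and record its natural language models.
    assistant_db.setdefault("operations", []).extend(registration["operations"])
    assistant_db.setdefault("objects", []).extend(registration["objects"])
    assistant_db.setdefault("models", {})[registration["application"]] = registration["models"]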
In some examples, the speech recognition model is trained or retrained after receiving data representing an application (e.g., application 405, 505, 605). For example, when the application is a carpool application, the digital assistant may receive data representing the carpool application, the data including words or terms associated with the carpool application. Thus, the digital assistant (e.g., digital assistant 201, 900, 1200) can retrain its speech recognition model with the vocabulary and terms associated with the carpool application. In this way, the digital assistant integrates information from the carpool application so that it can understand when a request directed to the carpool application is received. In some examples, the speech recognition model is trained or retrained whenever an application is installed and data representing the corresponding application is received.
At block 1130, one or more representations of the utterance are provided to a plurality of natural language models (e.g., models 901, 902, 903, 904). In some examples, at least one natural language model of the plurality of natural language models is associated with an application (e.g., application 405, 505, 605) and registered with a digital assistant (e.g., digital assistant 201, 900, 1200) when data representing the application is received from the second electronic device. Thus, as described above, in some examples, the natural language model is received from a second electronic device, such as a server, when an application is downloaded and/or installed on the user's electronic device.
In some examples, at least one of the natural language models (e.g., models 901, 902, 903, 904) was previously trained at the second electronic device using training data determined based on data representing the application (e.g., applications 405, 505, 605) and data representing the digital assistant (e.g., digital assistant 201, 900, 1200). In some examples, the training data is a combination of data determined based on source code of the application and data provided by the digital assistant. Thus, the natural language model is trained such that the digital assistant can adequately interact with the natural language model when the application is installed on the electronic device. In some examples, the natural language model is a neural network or machine learning model, and is trained as described above with reference to fig. 9-10.
In some examples, the training data includes application-specific vocabulary (e.g., for applications 405, 505, 605), translations of application-specific terms, or exemplary text to be provided as output by a digital assistant (e.g., digital assistant 201, 900, 1200). In particular, the training data may be data associated with the application that is provided by the application developer, along with the source data representing the application. Thus, a developer may provide specific vocabulary, translations, or other data that the digital assistant would not normally be trained to recognize. In this way, application-specific vocabulary, translations, and exemplary text may be integrated with the digital assistant through the trained natural language model. For example, for a carpool application, the training data may include car models, car brands, locations, or other vocabulary or text required for the carpool application to function properly and interact with the digital assistant.
In some examples, when a natural language model (e.g., model 901, 902, 903, 904) is received from the second electronic device, the natural language model is registered with a digital assistant (e.g., digital assistant 201, 900, 1200). In some examples, registering the natural language model with the digital assistant is part of a process of registering an application (e.g., application 405, 505, 605) with the digital assistant. In some examples, registering the natural language model with the digital assistant includes integrating the natural language model with the digital assistant. In some examples, registering the natural language model further includes receiving a lightweight natural language model associated with the application (e.g., the lightweight natural language model described above with reference to fig. 9 and 10). In addition, registering the natural language model further includes adding the application to a list of applications installed on the electronic device.
In some examples, registering at least one natural language model (e.g., models 901, 902, 903, 904) further includes receiving a complex natural language model (e.g., complex natural language model as described above with reference to fig. 9 and 10) associated with an application (e.g., applications 405, 505, 605) and integrating the complex natural language model associated with the application with a natural language model associated with a digital assistant (e.g., digital assistant 201, 900, 1200). In some examples, integrating the complex natural language model of the application with the natural language model associated with the digital assistant includes retraining the natural language model associated with the digital assistant.
In some examples, integrating the complex natural language model of the application (e.g., application 405, 505, 605) with the natural language model associated with the digital assistant (e.g., digital assistant 201, 900, 1200) includes the digital assistant exploring the complex natural language model of the application. For example, when the digital assistant 1200 receives a complex natural language model associated with a carpool application, the digital assistant 1200 may explore the complex natural language model to learn how to interact with the carpool application. Thus, the digital assistant can determine the functionality of the application and how to interact with the natural language model of the application.
In some examples, providing the one or more representations of the utterance to the plurality of natural language models further includes determining a natural language recognition score (e.g., natural language recognition scores 906, 907) of the one or more representations of the utterance using the lightweight natural language model and determining whether the natural language recognition score exceeds a predetermined threshold, as described above with reference to fig. 9 and 10. In some examples, in accordance with a determination that the natural language identification score exceeds a predetermined threshold, a complex natural language model associated with the application is received. Thus, after receiving a complex natural language model associated with an application, one or more representations of the utterance are provided to the complex natural language model.
At block 1140, a user intent of the utterance is determined based on at least one of the plurality of natural language models and a database including a plurality of operations (e.g., operation 306, sub-operation 307) and objects (e.g., objects 1225, 1235, text 1302, picture 1303, virtual chair 1402, virtual table 1403) associated with the application (e.g., application 405, 505, 605). In some examples, the user intent of the utterance is determined by performing natural language processing using at least one natural language model of the plurality of natural language models. In some examples, the user intent is determined by determining an operation of the database corresponding to the user intent and determining an object of the database corresponding to the user intent, as described above with reference to fig. 1-8. In addition, after determining the user intent, a task based on the operation and the object is performed as described above with reference to fig. 1-8.
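The following end-to-end sketch summarizes blocks 1110-1140 under assumed interfaces; the recognizer, model, and database objects and their methods are illustrative placeholders rather than the actual components described above.

# Illustrative only: speech recognition, registered natural language models, and
# the operations/objects database combine to determine the user intent.
def process_1100(audio, speech_recognizer, nl_models, ops_objects_db):
    # Blocks 1110/1120: receive the utterance and determine candidate representations.
    representations = speech_recognizer.recognize(audio)

    # Block 1130: provide the representations to each registered natural language model.
    candidates = [(app, model.parse(text))
                  for app, model in nl_models.items()
                  for text in representations]

    # Block 1140: accept a candidate whose operation and object are both registered
    # for that application in the database of operations and objects.
    for app, intent in candidates:
        registered = ops_objects_db.get(app, {})
        if (intent.operation in registered.get("operations", [])
                and intent.object in registered.get("objects", [])):
            return app, intent
    return None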
Fig. 12 illustrates an exemplary digital assistant 1200 for resolving references in a user utterance. As shown in fig. 12, the digital assistant 1200 includes a reference resolution model 1210, a natural language model 1220, and a natural language model 1230. In some examples, the digital assistant 1200 is implemented on an electronic device (e.g., electronic device 100, 1300, 1400). In some examples, the digital assistant 1200 is implemented on a device other than the electronic device (e.g., a server). In some examples, some of the modules and functions of the digital assistant 1200 are divided into a server portion and a client portion, where the client portion resides on one or more user devices (e.g., electronic devices 100, 1300, 1400) and communicates with the server portion over one or more networks. It should be noted that the digital assistant 1200 is only one example, and that the digital assistant 1200 may have more or fewer components than shown, may combine two or more components, or may have a different configuration or arrangement of components. The various components of the digital assistant 1200 are implemented in hardware, software instructions for execution by one or more processors, firmware (including one or more signal processing integrated circuits and/or application specific integrated circuits), or a combination thereof.
The digital assistant 1200 receives the user utterance 1205 and determines an object 1235 to which an ambiguous term of the user utterance 1205 refers. As described further below, the digital assistant 1200 determines whether the user utterance 1205 includes an ambiguous term. If the user utterance 1205 includes an ambiguous term, the digital assistant 1200 provides the user utterance 1205 to the reference resolution model 1210. The reference resolution model 1210 determines a plurality of relevant reference factors 1215. The digital assistant 1200 then determines a relevant application based on the relevant reference factors 1215 and determines the object 1235 to which the ambiguous term of the user utterance 1205 refers based on the relevant application.
Figs. 13 and 14 depict example views of electronic devices for use with the reference resolution process, according to various examples. Fig. 13 shows an electronic device 1300 displaying a view 1301 including text 1302 and a picture 1303 on a screen of the electronic device 1300, as well as a user utterance 1304 received by the electronic device 1300. Fig. 14 shows a view 1401 of an electronic device 1400 including a virtual chair 1402 and a virtual table 1403, as well as a user utterance 1404 received by the electronic device 1400. Each of figs. 13 and 14 will be discussed in conjunction with the digital assistant 1200 of fig. 12 for resolving references in a user utterance.
In some examples, the utterance 1205 from the user is received during an active digital assistant session between the user and the digital assistant 1200. In some examples, the utterance 1205 from the user is received outside of an active digital assistant session between the user and the digital assistant 1200. Thus, the digital assistant 1200 determines whether the utterance 1205 is intended for the digital assistant 1200. In some examples, as described above, the digital assistant 1200 determines whether the utterance 1205 is intended for the digital assistant 1200 based on various factors, such as the view of the electronic device, the direction in which the user is facing, the volume of the utterance 1205, the signal-to-noise ratio associated with the utterance 1205, and so on.
In some examples, utterance 1205 includes a trigger phrase. In some examples, digital assistant 1200 determines whether utterance 1205 includes a trigger phrase, and initiates a digital assistant session in accordance with determining that utterance 1205 includes the trigger phrase.
In some examples, user utterance 1205 includes a request. In some examples, digital assistant 1200 determines whether user utterance 1205 includes a request. In some examples, the digital assistant 1200 performs automatic speech recognition and/or natural language processing on the user utterance 1205 to determine whether the user utterance 1205 includes a request. In addition, when the user utterance 1205 includes a request, the digital assistant 1200 performs automatic speech recognition and/or natural language processing on the user utterance 1205 to determine the request of the user utterance 1205.
In particular, the digital assistant 1200 may include one or more ASR systems that process the user utterance 1205 received through an input device (e.g., a microphone) of the electronic device 100. The ASR systems extract representative features from the speech input. For example, a front-end speech pre-processor of the ASR system performs a Fourier transform on the user utterance 1205 to extract spectral features characterizing the speech input as a sequence of representative multidimensional vectors.
In addition, each ASR system of the digital assistant 1200 includes one or more speech recognition models (e.g., acoustic models and/or language models) and implements one or more speech recognition engines. Examples of speech recognition models include hidden Markov models, gaussian mixture models, deep neural network models, n-gram language models, and other statistical models. Examples of speech recognition engines include dynamic time warping based engines and Weighted Finite State Transducer (WFST) based engines. The extracted representative features of the front-end speech pre-processor are processed using one or more speech recognition models and one or more speech recognition engines to produce intermediate recognition results (e.g., phonemes, phoneme strings, and sub-words), and ultimately text recognition results (e.g., words, word strings, or symbol sequences).
In some examples, the digital assistant 1200 determines whether the request of the user utterance 1205 includes an ambiguous term. In some examples, the ambiguous term is an indication (i.e., a deictic reference). An indication is a word or phrase that does not explicitly refer to a particular object, time, person, place, or the like. Exemplary indications include, but are not limited to: "that," "this," "here," "there," "then," "those," "they," "he," "she," etc., particularly when used in questions such as "what is this?", "where is that?", and "who is he?". Thus, the digital assistant 1200 determines whether the request includes one of these words, or words similar to them, and whether the use of the word is ambiguous. For example, in the user utterance 1304 "bold that," the digital assistant 1200 can determine through ASR and/or NLP that "that" is an indication. Similarly, in the spoken input 1404 "turn that green," the digital assistant 1200 determines that "that" is an indication. In both examples, the digital assistant 1200 may determine that "that" is ambiguous because the user input does not include a subject or object to which "that" or "this" could refer.
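A minimal sketch of flagging such terms is shown below; the word list and function are assumptions for illustration and do not reflect how the digital assistant actually performs this determination.

# Illustrative only: flag possible indications (deictic terms) in a request.
DEICTIC_TERMS = {"that", "this", "here", "there", "then", "those", "they", "he", "she", "it"}

def find_ambiguous_terms(request: str) -> list:
    words = [word.strip("?.,!").lower() for word in request.split()]
    return [word for word in words if word in DEICTIC_TERMS]

print(find_ambiguous_terms("bold that"))          # ['that']
print(find_ambiguous_terms("turn that green"))    # ['that']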
In accordance with the digital assistant 1200 determining that the request of the user utterance 1205 includes an ambiguous term, the digital assistant 1200 provides the user utterance 1205 to the reference resolution model 1210. For example, when the digital assistant 1200 determines that "that" in the user utterance 1304 "bold that" is an indication and is thus ambiguous, the digital assistant 1200 provides the user utterance 1304 to the reference resolution model 1210.
In some examples, the reference resolution model 1210 is a neural network, a machine learning model, or similar processing structure. In some examples, the reference resolution model 1210 is trained to determine one or more relevant reference factors prior to receiving the user utterance 1205, as described further below. In some examples, the reference resolution model 1210 is trained on an electronic device that is different from the electronic device that receives the user utterance 1205. In some examples, the reference resolution model 1210 is received at the electronic device 100 from another electronic device after training is complete.
The reference resolution model 1210 then determines a plurality of relevant reference factors 1215. In some examples, reference resolution model 1210 determines a plurality of relevant reference factors 1215 based on user utterance 1205. For example, when reference resolution model 1210 receives user utterance 1304 of "bolded one," reference resolution model 1210 may select reference factors related to "bolded" used in user utterance 1304. As another example, when reference resolution model 1210 receives user utterance 1304 of "bolded that" reference resolution model 1210 may select reference factors that will help resolve reference to "that" rather than reference to "he" or "they".
In some examples, the reference resolution model 1210 determines a plurality of relevant reference factors 1215 based on context information of an electronic device (e.g., electronic devices 1300, 1400). The context data associated with the electronic device includes various characteristics of the electronic device. For example, the context data may indicate a location of the electronic device (e.g., GPS coordinates), whether the electronic device is connected to a network (e.g., a WiFi network), whether the electronic device is connected to one or more other devices (e.g., headphones), and/or a current time, date, and/or workday. If the electronic device is connected to a network or device, the context data may further indicate the name and/or type of the network or device, respectively.
As an example, when receiving the user utterance 1404 "turn that green," the digital assistant 1200 may determine the location of the electronic device 1400 to determine whether the user utterance 1404 is likely to refer to an object in the real world near the user. For instance, the digital assistant 1200 may determine that the user is located at home and therefore is not near any significant or noteworthy physical objects. Thus, the digital assistant 1200 may use this information as a relevant reference factor to help determine that the user may be referring to a virtual object within the view 1401.
In some examples, the reference resolution model 1210 determines the plurality of relevant reference factors 1215 based on default settings of the electronic device (e.g., electronic devices 1300, 1400) or the digital assistant 1200. In some examples, the default settings of the electronic device or the digital assistant 1200 are associated with a particular user. For example, the user providing the user utterance 1205 may have designated a particular carpool application as the default carpool application. Thus, when the reference resolution model 1210 receives the utterance "help me find a ride there," the reference resolution model 1210 may determine that the relevant factors include the default carpool application and parameters associated with the default carpool application.
In some examples, the reference resolution model 1210 determines a plurality of relevant reference factors 1215 based on historical interactions of a user with an electronic device (e.g., electronic device 1300, 1400) or digital assistant 1200. In some examples, the digital assistant 1200 may monitor interactions between the user and the digital assistant 1200 and then determine relevant reference factors based on these interactions. In some examples, the digital assistant 1200 may access the transcript described above to determine possibly related reference factors. For example, when the reference resolution model 1210 receives the user utterance 1404 "turn that green," the reference resolution model 1210 can access the transcript to determine reference factors that may be associated with green, such as previous actions taken with other colors.
In some examples, the plurality of related reference factors 1215 includes the view of the electronic device. For example, when the digital assistant 1200 receives the user utterance 1404 "turn that green" while the electronic device 1400 is providing the view 1401, the reference resolution model 1210 may determine that the view 1401 is a relevant reference factor because the user utterance 1404 may be related to the virtual reality view and the items that the device 1400 is displaying.
In some examples, the digital assistant 1200 determines whether the view of the electronic device includes an object, and if the view of the electronic device includes an object, the reference resolution model 1210 will include the object as a related reference factor in the plurality of related reference factors 1215. For example, as described above, view 1401 may include virtual chair 1402 and virtual table 1403. Thus, since view 1401 of electronic device 1400 includes these virtual objects, reference resolution model 1210 can determine that virtual chair 1402 and virtual table 1403 are related reference factors.
In some examples, the plurality of related reference factors 1215 includes ontologies of applications installed on the electronic device. For example, the reference resolution model 1210 may retrieve the ontologies of all applications installed on the electronic device. As another example, the reference resolution model 1210 may determine a particular application (or applications) relevant to the user utterance 1205 as described above, retrieve the ontologies of those particular applications, and add them to the plurality of related reference factors 1215.
In some examples, the plurality of related reference factors 1215 includes operations and metadata associated with an application installed on the electronic device. For example, when reference resolution model 1210 determines one or more applications that may be relevant to user utterance 1205, reference resolution model 1210 may retrieve or determine the operation or metadata of the application as relevant reference factors, as described above with reference to fig. 1-7. Additionally, the reference resolution model 1210 may retrieve or determine the operation or metadata of the application from a transcript of previously performed operations, as described above.
In some examples, the plurality of related reference factors 1215 includes an application that is open on the electronic device. For example, when the reference resolution model 1210 receives the user utterance 1304 "bold that" or other user utterances, the reference resolution model 1210 can determine that an application open on the electronic device 1300 is a relevant reference factor. In some examples, the plurality of related reference factors 1215 includes an application that is the focus (e.g., being displayed) on the electronic device.
In some examples, the plurality of related reference factors 1215 includes preferences associated with the user of the electronic device. In some examples, the reference resolution model 1210 determines preferences, associated with the user providing the user utterance 1205, for one or more applications installed on the electronic device. For example, when the reference resolution model 1210 receives the user utterance 1404 "turn that green," the reference resolution model 1210 may determine that the user's preference for creating virtual objects with a particular application is a relevant reference factor.
In some examples, the plurality of related reference factors 1215 includes a gaze of a user of the electronic device. In some examples, the digital assistant 1200 determines where in the view of the electronic device the user is viewing and determines whether the user is viewing an application or an object associated with the application. For example, upon receiving the user utterance 1304 "bold that" the digital assistant 1200 may determine that the user is viewing a word processing application that is open on the view 1301 of the electronic device 1300. Thus, the digital assistant 1200 may include this gaze as a relevant reference factor.
In some examples, the plurality of relevant reference factors 1215 includes a natural language recognition score for the user utterance 1205. In some examples, the natural language identification score is determined as described above with reference to fig. 9 and 10.
In some examples, reference resolution model 1210 determines a plurality of relevant reference factors 1215 by selecting a plurality of relevant reference factors 1215 from a plurality of reference factors. Thus, the reference resolution model 1210 may select one or more of the relevant reference factors from the list of reference factors available to the digital assistant 1200 based on the relevance of each of the factors in the various examples described above.
After determining the plurality of relevant reference factors 1215, the reference resolution model 1210 provides the plurality of relevant reference factors 1215 to the digital assistant 1200. The digital assistant 1200 then determines relevant applications based on the plurality of relevant reference factors 1215. In some examples, digital assistant 1200 utilizes reference resolution model 1210 to determine relevant applications based on a plurality of relevant reference factors. Thus, in some examples, in addition to the plurality of related reference factors 1215, reference resolution model 1210 provides related applications to digital assistant 1200.
In some examples, the digital assistant 1200 determines the relevant application based on applications included in the plurality of related reference factors 1215. For example, when the user utterance 1304 "bold that" is received while the word processing application is open and is the focus of the electronic device 1300, the word processing application is included in the plurality of related reference factors 1215. Thus, because the word processing application is included in the plurality of related reference factors 1215, the digital assistant 1200 can determine that the word processing application is the relevant application.
In some examples, the digital assistant 1200 determines relevant applications based on attributes of the ontology of the applications included in the plurality of relevant reference factors 1215. For example, when the user utterance 1404 "turn that green" is received, the color attribute of the ontology associated with the application for making virtual furniture may be identified as a relevant reference factor. Accordingly, the digital assistant 1200 may determine that the application used to make the virtual chair 1402 and the virtual table 1403 is a related application due to the color attribute of the ontology.
In some examples, the digital assistant 1200 determines the relevant application based on user preferences for applications included in the plurality of related reference factors 1215. For example, when the user utterance "help me find a ride" is received, the user's preference for using or ordering a particular type of vehicle in a carpool application may be determined to be a relevant reference factor. Thus, the digital assistant 1200 may determine that the user's preferred carpool application, or an application that can order the type of vehicle the user prefers, is the relevant application.
In some examples, the digital assistant 1200 determines the relevant application by selecting an application associated with a majority of the plurality of relevant reference factors 1215. For example, when the user utterance 1404 "turn that green" is received, the color attribute of the ontology associated with the application for making virtual furniture may be identified as a relevant reference factor. In addition, an application for making virtual furniture may also be open on the electronic device 1400, and the virtual chair 1402 created by the application may be the focus of view 1401. Thus, the digital assistant 1200 can identify that a number of the plurality of related reference factors are all associated with an application for making virtual furniture, such that the application can be selected as the related application.
In some examples, the digital assistant 1200 determines the relevant application by applying a weight to each relevant reference factor of the plurality of relevant reference factors 1215. For example, different weights may be applied to an application that is open on the electronic device, the ontology of an application installed on the electronic device, the view of the electronic device, and so on. In some examples, some of the relevant reference factors are given more weight than others. For example, when the user utterance 1304 "bold that" is received, the reference factor indicating that the word processing application is open may be weighted more heavily than the reference factor indicating that the view 1301 includes the picture 1303.
In addition, the digital assistant 1200 may determine the relevant application by determining the application corresponding to the relevant reference factor with the highest weight. Thus, continuing the above example, when the user utterance 1304 "bold that" is received and the reference factor indicating that the word processing application is open is assigned a relatively high weight, the digital assistant 1200 determines that the word processing application is a related application. Because the reference factor associated with the word processing application is assigned the highest weight, the digital assistant 1200 determines that the word processing application is the related application even when reference factors associated with other applications are also assigned weights.
In some examples, the digital assistant 1200 determines the relevant application by selecting the application associated with a relevant reference factor having a weight that exceeds a predetermined threshold. For example, when the user utterance 1304 "bold that" is received and the reference factor indicating that the word processing application is open is assigned a weight, the digital assistant 1200 may determine whether the weight associated with that reference factor exceeds a predetermined threshold. Upon determining that the weight of the reference factor exceeds the predetermined threshold, the digital assistant 1200 may then determine that the word processing application is a related application.
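The weighted variants can be sketched in a similar way (again an illustration, not the disclosed implementation); the weights and names below are assumptions.

```python
def select_by_weight(weighted_factors, threshold=None):
    """Select the application whose reference factor carries the highest weight.

    weighted_factors maps (factor_name, application_name) -> weight.
    If a threshold is given, only factors whose weight exceeds it are considered.
    """
    candidates = {
        key: weight
        for key, weight in weighted_factors.items()
        if threshold is None or weight > threshold
    }
    if not candidates:
        return None
    # Pick the (factor, application) pair with the highest weight.
    best_factor, best_application = max(candidates, key=candidates.get)
    return best_application

weights = {
    ("open_application", "word_processor"): 0.9,   # open apps weighted heavily
    ("view_object_picture", "photo_viewer"): 0.4,  # objects in view weighted less
}
print(select_by_weight(weights))                 # -> "word_processor"
print(select_by_weight(weights, threshold=0.5))  # -> "word_processor"
```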
In some examples, the digital assistant 1200 determines the relevant application by determining whether a natural language recognition score of the user utterance 1205 exceeds a predetermined threshold. In particular, a lightweight natural language model associated with an application may determine the natural language recognition score of the user utterance 1205, as described above with reference to fig. 9 and 10. The digital assistant 1200 may then determine whether the natural language recognition score of the user utterance 1205 exceeds a predetermined threshold and, if so, select the application associated with that lightweight natural language model as a relevant application.
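The score-based selection can be sketched as follows; the lightweight natural language models are stood in for by simple keyword scorers, which is an assumption made for illustration (the actual models would be trained as described with reference to fig. 9 and 10).

```python
def select_by_recognition_score(utterance, lightweight_models, threshold=0.5):
    """Select applications whose lightweight natural language model scores
    the utterance above a predetermined threshold.

    lightweight_models maps an application name to a scoring callable that
    returns a natural language recognition score in [0, 1].
    """
    relevant = []
    for application, model in lightweight_models.items():
        score = model(utterance)
        if score > threshold:
            relevant.append((application, score))
    # Highest-scoring applications first.
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)

# Toy keyword-based stand-ins for trained lightweight models.
models = {
    "word_processor": lambda u: 0.8 if "bold" in u else 0.1,
    "virtual_furniture": lambda u: 0.9 if "green" in u or "chair" in u else 0.2,
}
print(select_by_recognition_score("bold that", models))  # -> [("word_processor", 0.8)]
```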
It should be appreciated that the digital assistant 1200 may determine the relevant application by combining any of the above-described processes and factors to determine which application (or applications described below) should be selected as the relevant application for further processing.
In some examples, digital assistant 1200 determines a plurality of related applications based on a plurality of related reference factors 1215. For example, the digital assistant 1200 may determine that several reference factors associated with different applications exceed a predetermined threshold. Thus, the digital assistant 1200 may determine that each of the different applications is a related application. Thus, the digital assistant 1200 can select all of these different applications as related applications and use them to determine the object to which the ambiguous term of the request refers, as described below.
After determining the relevant application, the digital assistant 1200 determines the object 1225 to which the ambiguous term of the request refers based on the relevant application. In some examples, the digital assistant 1200 determines the object 1225 to which the ambiguous term of the request refers based on the relevant application by accessing the natural language model 1220 associated with the relevant application. In some examples, natural language model 1220 and natural language model 1230 are complex natural language models, as described above with reference to fig. 9-10. In some examples, the digital assistant 1200 accesses the natural language model 1230 associated with the second related application to determine the object 1235 to which the ambiguous term of the request refers.
In some examples, digital assistant 1200 uses reference resolution model 1210 to determine an object 1225 to which the ambiguous term of the request refers. Thus, reference resolution model 1210 has access to various natural language models associated with an application to determine object 1225. In this manner, all processing to determine object 1225 (including determining a plurality of relevant reference factors 1215, determining one or more relevant applications, and determining object 1225) may be performed using reference resolution model 1210 integrated with digital assistant 1200.
In some examples, accessing the natural language model 1220 associated with the related application includes determining whether a portion of the natural language model 1220 includes an object that is present in the view of the electronic device. For example, upon receiving the user utterance 1304 "bold that," the digital assistant 1200 can access the natural language model 1220 associated with the word processing application. The digital assistant 1200 may then determine whether the natural language model 1220 (or a portion of the natural language model 1220) includes text or pictures, because the view 1301 includes the text 1302 and the picture 1303. Accordingly, the digital assistant 1200 can determine that the natural language model 1220 includes a text object, thereby determining that the object 1225 is the text 1302.
As another example, when the user utterance 1404 "turn that green" is received, the digital assistant 1200 can access the natural language model 1230 associated with the application for making virtual furniture. The digital assistant 1200 may then determine whether the natural language model 1230 (or a portion of the natural language model 1230) includes a chair or a table, because the view 1401 includes the virtual chair 1402 and the virtual table 1403. Accordingly, the digital assistant 1200 can determine that the natural language model 1230 includes a virtual chair object, thereby determining that the object 1235 is the virtual chair 1402.
In some examples, accessing the natural language model 1220 associated with the related application includes determining whether an object of the natural language model 1220 includes an attribute related to a term of the user utterance. For example, upon receiving the user utterance 1304 "bold that," the digital assistant 1200 can access the natural language model 1220 associated with the word processing application. The digital assistant 1200 may then determine whether any object of the natural language model 1220 has a "bold" attribute. Thus, the digital assistant 1200 determines that the text object of the natural language model 1220 has a bold attribute, and thus the object 1225 is the text 1302. Similarly, the digital assistant 1200 determines that the picture object of the natural language model 1220 does not have a bold attribute, and thus the object 1225 is not the picture 1303.
As another example, when the user utterance 1404 "turn that green" is received, the digital assistant 1200 can access the natural language model 1230 associated with the application for making virtual furniture. The digital assistant 1200 may then determine whether any object of the natural language model 1230 has a "color" attribute, because the user utterance 1404 includes a color. Thus, the digital assistant 1200 determines that the chair object of the natural language model 1230 has a color attribute, and thus the object 1235 is the virtual chair 1402. Similarly, the digital assistant 1200 determines that the table object of the natural language model 1230 does not have a color attribute, and thus the object 1235 is not the virtual table 1403.
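Both checks, whether an object of the application's natural language model is present in the view and whether it supports an attribute named in the utterance, can be combined in a small sketch; the object names and attribute sets below are hypothetical and chosen only to mirror the example above.

```python
def resolve_object(view_objects, model_objects, utterance_terms):
    """Resolve an ambiguous reference to a concrete object.

    view_objects    : names of objects currently presented in the device view
    model_objects   : maps an object name in the application's natural
                      language model to the set of attributes it supports
    utterance_terms : attribute-like terms extracted from the utterance
    """
    for name, attributes in model_objects.items():
        in_view = name in view_objects
        supports_term = any(term in attributes for term in utterance_terms)
        if in_view and supports_term:
            return name
    return None

# "Turn that green" while a chair and a table are in view.
model = {
    "virtual_chair": {"color", "size"},
    "virtual_table": {"size"},  # no color attribute
}
print(resolve_object({"virtual_chair", "virtual_table"}, model, {"color"}))
# -> "virtual_chair"
```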
In some examples, the digital assistant 1200 determines a plurality of possible objects 1225 to which the ambiguous term of the request refers based on the relevant application. For example, the natural language model 1220 may have several text objects, so the digital assistant 1200 may determine that each of the text objects that satisfies the requirements described above may be the object to which the request of the user utterance 1205 refers.
In some examples, the digital assistant 1200 receives a first user intent associated with the related application from the natural language model 1220 and determines a first user intent score based on the object 1225 and the received first user intent. For example, when receiving the user utterance 1304 "bold that," the digital assistant 1200 may receive a user intent to bold the text 1302 from the natural language model 1220 associated with the word processing application. Thus, since the user utterance 1304 and the received user intent are similar, the digital assistant 1200 may determine a relatively high user intent score. In some examples, the natural language model 1220 determines a user intent associated with an application, as described above with reference to fig. 9 and 10.
Similarly, in some examples, the digital assistant 1200 receives a second user intent associated with a second related application from the natural language model 1230 and determines a second user intent score based on the object 1235 and the received second user intent. For example, when receiving the user utterance 1304 "bold that," the digital assistant 1200 may receive a user intent to create bolded virtual furniture from the natural language model 1230 associated with the application for making virtual furniture. Thus, because the user utterance 1304 and the received user intent are dissimilar, the digital assistant 1200 may determine a relatively low user intent score.
The digital assistant 1200 then determines whether the first user intent score is higher or the second user intent score is higher. In accordance with a determination that the first user intent score is higher than the second user intent score, the digital assistant 1200 causes a related application associated with the first user intent score to perform a first task associated with the first user intent on the object 1225. Continuing with the previous example, the digital assistant 1200 compares a first user intent score associated with the word processing application with a second user intent score associated with the application for making virtual furniture and then determines that the first user intent score is higher. Thus, the digital assistant 1200 causes the word processing application to bold the text 1302 according to the first user intent.
Similarly, in accordance with a determination that the second user intent score is higher than the first user intent score, the digital assistant 1200 causes the related application associated with the second user intent score to perform a second task associated with the second user intent on the object 1235. For example, when receiving the user utterance 1404 "turn that green," the digital assistant 1200 may receive a first user intent, associated with the word processing application, to turn the text 1302 green, and then determine a first user intent score. The digital assistant 1200 also receives a second user intent, associated with the virtual furniture application, to turn the virtual chair 1402 green, and then determines a second user intent score. In this example, the first user intent score may be neither particularly low nor particularly high, indicating that the first user intent may be related to the user utterance 1404. However, the second user intent score may be relatively high because the user utterance 1404 is received as part of a conversation about the virtual chair 1402. Thus, the digital assistant 1200 may compare the second user intent score with the first user intent score and then determine that the second user intent score is higher. Accordingly, the digital assistant 1200 may cause the virtual furniture application to change the color of the virtual chair 1402 to green.
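A minimal sketch of this comparison is given below, assuming each candidate intent has already been scored and paired with its resolved object and a task callable; all names and score values are illustrative assumptions rather than values from the disclosure.

```python
def dispatch_task(intent_candidates):
    """Compare user intent scores and run the task of the highest-scoring one.

    intent_candidates is a list of dicts with keys:
    "application", "object", "score", and "task" (a callable).
    """
    best = max(intent_candidates, key=lambda c: c["score"])
    # The relevant application associated with the winning intent performs
    # the task on the resolved object.
    return best["task"](best["application"], best["object"])

candidates = [
    {"application": "word_processor", "object": "text_1302", "score": 0.35,
     "task": lambda app, obj: f"{app}: turned {obj} green"},
    {"application": "virtual_furniture", "object": "virtual_chair_1402", "score": 0.85,
     "task": lambda app, obj: f"{app}: turned {obj} green"},
]
print(dispatch_task(candidates))
# -> "virtual_furniture: turned virtual_chair_1402 green"
```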
In some examples, the digital assistant 1200 determines whether the first task was not performed, and then, in accordance with the determination that the first task was not performed, the digital assistant 1200 provides an output including a prompt indicating that the first task was not performed. For example, the digital assistant 1200 may determine that the task is to bold the picture 1303 and then determine that the task was not performed because the picture 1303 cannot be bolded. Thus, the digital assistant 1200 provides the output "The picture cannot be bolded. Please specify which object to bold."
In some examples, the output is a spoken output. For example, the digital assistant 1200 may provide the output "The picture cannot be bolded. Please specify which object to bold." as an audio output from a speaker of the electronic device 1300. In some examples, the output is an output on a display of the electronic device. For example, the digital assistant 1200 may provide the output "The picture cannot be bolded. Please specify which object to bold." on the touch-sensitive screen of the electronic device 1300. As another example, when the electronic device is a virtual reality device, the digital assistant 1200 may project "The picture cannot be bolded. Please specify which object to bold." as virtual text.
In some examples, the digital assistant 1200 receives a response to the prompt. In some examples, the response to the prompt is a spoken input. For example, the digital assistant 1200 may receive the spoken input "bold text" from the user. In some examples, the response to the prompt is an input on a touch-sensitive display of the electronic device. For example, the user may select text that they want to bold, thereby providing an indication to the digital assistant 1200.
In response to receiving the response to the prompt, the digital assistant 1200 causes the related application to perform the first task using the input received in response to the prompt. For example, upon receiving the spoken input "bold text" from the user, the digital assistant 1200 causes the word processing application to perform the task of bolding the text 1302 based on the user input.
In some examples, in accordance with a determination that the first task was not performed, the digital assistant 1200 causes the second related application to perform a second task associated with a second user intent. For example, when the digital assistant 1200 determines that the task of bolding the picture 1303 was not performed, the digital assistant 1200 causes the virtual furniture application to perform the task of creating bolded virtual furniture.
In some examples, the digital assistant 1200 determines whether the second task was not performed, and then, in accordance with the determination that the second task was not performed, the digital assistant 1200 provides an output indicating an error. For example, the digital assistant 1200 may determine that the task of creating the bolded furniture was also not performed, and thus may provide the output "Sorry, I cannot complete that right now." In some examples, the output indicating the error is a spoken output. In some examples, the output indicating the error is an output on a display of the electronic device.
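The fallback behavior described in the last few paragraphs, prompting on failure, retrying with the clarification, falling back to the second task, and finally reporting an error, can be sketched as follows; the task and prompt callables are stand-ins introduced for illustration and are not part of the disclosure.

```python
def run_with_fallback(first_task, second_task, prompt_user, notify_error):
    """Attempt the first task; on failure, prompt the user or fall back.

    Each task callable returns True when it was performed and False otherwise.
    prompt_user returns clarifying input (or None); notify_error emits an error.
    """
    if first_task(None):
        return "first task performed"
    # First task was not performed: ask the user to clarify, then retry.
    clarification = prompt_user(
        "The picture cannot be bolded. Please specify which object to bold."
    )
    if clarification and first_task(clarification):
        return "first task performed after clarification"
    # Otherwise fall back to the second related application's task.
    if second_task():
        return "second task performed"
    notify_error("Sorry, I cannot complete that right now.")
    return "error reported"

result = run_with_fallback(
    first_task=lambda hint: hint == "bold text",
    second_task=lambda: False,
    prompt_user=lambda msg: "bold text",
    notify_error=print,
)
print(result)  # -> "first task performed after clarification"
```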
FIG. 15 illustrates a process 1500 for resolving a reference in a user utterance, according to various examples. The process 1500 is performed at a device (e.g., device 100, 400, 500, 600) having one or more input devices (e.g., a touch screen, a microphone, a camera) and a wireless communication radio (e.g., a Bluetooth connection, a Wi-Fi connection, or a mobile broadband connection such as a 4G LTE connection). In some embodiments, the electronic device includes a plurality of cameras. In some embodiments, the electronic device includes only one camera. In some examples, the device includes one or more biometric sensors, optionally including a camera, such as an infrared camera, a thermal imaging camera, or a combination thereof. Some operations in process 1500 are optionally combined, the order of some operations is optionally changed, and some operations are optionally omitted.
In some examples, process 1500 is performed using a client-server system, and the blocks of process 1500 are divided in any manner between a server and a client device (e.g., device 100). In other examples, the blocks of process 1500 are divided between a server and a plurality of client devices (e.g., mobile phones and smart watches). Thus, while portions of process 1500 are described herein as being performed by a particular device of a client-server system, it should be understood that process 1500 is not so limited. In other examples, process 1500 is performed using only one client device or only multiple client devices. In process 1500, some blocks are optionally combined, the order of some blocks is optionally changed, and some blocks are optionally omitted. In some examples, additional steps may be performed in connection with process 1500.
At block 1510, a user utterance (e.g., user utterances 1205, 1304, 1404) is received that includes a request.
At block 1520, a determination is made as to whether the request includes ambiguous terms.
At block 1530, in accordance with a determination that the request includes ambiguous terms, a user utterance (e.g., user utterances 1205, 1304, 1404) is provided to a reference resolution model (e.g., reference resolution model 1210).
At block 1540, a plurality of relevant reference factors (e.g., a plurality of relevant reference factors 1215) is determined using a reference resolution model (e.g., reference resolution model 1210). In some examples, the plurality of related reference factors includes a view (e.g., views 1301, 1401) of an electronic device (e.g., electronic device 100, 1300, 1400). In some examples, the plurality of related reference factors includes an ontology of an application installed on the electronic device. In some examples, the plurality of relevant reference factors includes a transcript of previously performed operations. In some examples, the plurality of relevant reference factors includes which applications are open on the electronic device. In some examples, the reference resolution model is a neural network trained to determine the objects referenced in the request (e.g., objects 1225, 1235, text 1302, picture 1303, virtual chair 1402, virtual table 1403).
In some examples, determining the plurality of relevant reference factors (e.g., the plurality of relevant reference factors 1215) using the reference resolution model (e.g., reference resolution model 1210) further includes selecting the plurality of relevant reference factors from a plurality of reference factors based on the request and context data of the electronic device (e.g., electronic devices 100, 1300, 1400).
In some examples, it is determined whether a view (e.g., view 1301, 1401) of an electronic device (e.g., electronic device 100, 1300, 1400) includes an object (e.g., object 1225, 1235, text 1302, picture 1303, virtual chair 1402, virtual table 1403). In accordance with a determination that the view of the electronic device includes an object, the object is included as a related reference factor (e.g., a plurality of related reference factors 1215).
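One way to sketch block 1540, selecting the relevant reference factors from the request and the device's context data, is shown below; the context keys and their contents are assumptions made purely for illustration.

```python
def determine_relevant_reference_factors(request, context):
    """Select relevant reference factors based on the request and the
    electronic device's context data.

    context is a dict with hypothetical keys: "view_objects", "open_apps",
    "ontologies", and "transcript".
    """
    factors = []
    # Which applications are open on the device.
    factors += [("open_application", app) for app in context.get("open_apps", [])]
    # Objects that are present in the device's current view.
    factors += [("view_object", obj) for obj in context.get("view_objects", [])]
    # Ontology attributes of installed applications matching terms of the request.
    for app, attributes in context.get("ontologies", {}).items():
        if any(term in attributes for term in request.split()):
            factors.append(("ontology_attribute", app))
    # Previously performed operations recorded in the transcript.
    factors += [("transcript_operation", op) for op in context.get("transcript", [])]
    return factors

context = {
    "open_apps": ["virtual_furniture"],
    "view_objects": ["virtual_chair", "virtual_table"],
    "ontologies": {"virtual_furniture": {"green", "color"}},
    "transcript": ["create virtual_chair"],
}
print(determine_relevant_reference_factors("turn that green", context))
```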
At block 1550, a related application is determined based on the plurality of related reference factors (e.g., the plurality of related reference factors 1215). In some examples, determining the relevant application based on the relevant reference factor further includes determining a natural language recognition score (e.g., natural language recognition score 906, 907) of the user utterance (e.g., user utterance 1205, 1304, 1404) using a natural language model (e.g., natural language model 1220, 1230) associated with the first application, determining whether the natural language recognition score exceeds a predetermined threshold, and selecting the first application as the relevant application in accordance with determining that the natural language recognition score exceeds the predetermined threshold.
At block 1560, the object to which the ambiguous term of the request refers (e.g., object 1225, 1235, text 1302, picture 1303, virtual chair 1402, virtual table 1403) is determined based on the relevant application. In some examples, determining, based on the relevant application, the object to which the ambiguous term of the request refers further includes determining whether a portion of a natural language model (e.g., natural language model 1220, 1230) associated with the relevant application includes an object that is present in a view (e.g., view 1301, 1401) of an electronic device (e.g., electronic device 100, 1300, 1400). In some examples, determining, based on the relevant application, the object to which the ambiguous term of the request refers further includes determining whether an object of the natural language model associated with the relevant application includes an attribute related to a term of the user utterance (e.g., user utterances 1205, 1304, 1404).
In some examples, a user intent associated with the relevant application is received, and then a user intent score is determined based on the determined object and the received user intent.
In some examples, a second relevant application is determined based on a plurality of relevant reference factors (e.g., a plurality of relevant reference factors 1215), a second object (e.g., objects 1225, 1235, text 1302, picture 1303, virtual chair 1402, virtual table 1403) to which the request refers is determined based on the second relevant application, a second user intent associated with the second relevant application is received, and a second user intent score is then determined based on the second object and the second user intent.
In some examples, in accordance with a determination that the first user intent score is higher than the second user intent score, the first related application performs a first task associated with the first user intent on the first object (e.g., object 1225, 1235, text 1302, picture 1303, virtual chair 1402, virtual table 1403). In some examples, in accordance with a determination that the second user intent score is higher than the first user intent score, the second related application performs a second task associated with the second user intent on a second object (e.g., objects 1225, 1235, text 1302, picture 1303, virtual chair 1402, virtual table 1403).
In some examples, a determination is made as to whether the first task was not performed, and then in accordance with the determination that the first task was not performed, an output is provided that includes a hint indicating that the first task was not performed. In some examples, input is received in response to the prompt, and the first related application performs the first task using the input received in response to the prompt. In some examples, in accordance with a determination that the first task is not being performed, the second related application performs a second task associated with a second user intent. In some examples, a determination is made as to whether the second task is not being performed, and then in accordance with the determination that the second task is not being performed, an output is provided indicating an error.
As described above, one aspect of the present technology is the use of voice input to map commands to operations. The present disclosure contemplates that, in some examples, such collected data may include personal information data that uniquely identifies or may be used to contact or locate a particular person. Such personal information data may include demographic data, location-based data, telephone numbers, email addresses, Twitter IDs, home addresses, data or records related to the user's health or fitness level (e.g., vital sign measurements, medication information, exercise information), date of birth, or any other identifying or personal information.
The present disclosure recognizes that the use of such personal information data in the present technology may be used to benefit users. For example, personal information data may be used to quickly and efficiently determine how to respond to user commands. Thus, the use of such personal information data enables a user to have programmatic control over response resolution. In addition, the present disclosure contemplates other uses for personal information data that are beneficial to the user. For example, health and fitness data may be used to provide insight into the overall health of a user, or may be used as positive feedback to individuals using technology to pursue health goals.
The present disclosure contemplates that entities responsible for collecting, analyzing, disclosing, transmitting, storing, or otherwise using such personal information data will adhere to established privacy policies and/or privacy practices. In particular, such entities should implement and adhere to privacy policies and practices that are recognized as meeting or exceeding industry or government requirements for maintaining the privacy and security of personal information data. Such policies should be readily accessible to the user and should be updated as the collection and/or use of the data changes. Personal information from users should be collected for legal and reasonable uses of the entity and not shared or sold outside of these legal uses. In addition, such collection/sharing should be performed after informed consent is received from the user. Moreover, such entities should consider taking any necessary steps to safeguard and secure access to such personal information data and to ensure that others having access to the personal information data adhere to their privacy policies and procedures. In addition, such entities may subject themselves to third-party evaluations to certify their adherence to widely accepted privacy policies and practices. Further, policies and practices should be adapted to the particular types of personal information data being collected and/or accessed and to applicable laws and standards, including jurisdiction-specific considerations. For example, in the United States, the collection of or access to certain health data may be governed by federal and/or state law, such as the Health Insurance Portability and Accountability Act (HIPAA); whereas health data in other countries may be subject to other regulations and policies and should be handled accordingly. Thus, different privacy practices should be maintained for different personal data types in each country.
Notwithstanding the foregoing, the present disclosure also contemplates examples in which a user selectively blocks the use of, or access to, personal information data. That is, the present disclosure contemplates that hardware elements and/or software elements may be provided to prevent or block access to such personal information data. For example, with respect to enabling sensors, the present technology may be configured to allow a user to choose to "opt in" or "opt out" of participating in the collection of personal information data during registration with a service or at any time thereafter. As another example, the user may choose to limit the length of time that captured data and/or requests are retained, or to prohibit the saving of data or requests altogether. In addition to providing "opt-in" and "opt-out" options, the present disclosure also contemplates providing notifications related to accessing or using personal information. For example, the user may be notified that his personal information data will be accessed when the application is downloaded, and then reminded again just before the personal information data is accessed by the application.
Further, it is the intent of the present disclosure that personal information data should be managed and handled in a way that minimizes risks of unintentional or unauthorized access or use. Risk can be minimized by limiting the collection of data and deleting the data once it is no longer needed. In addition, and when applicable, such as in certain health-related applications, data de-identification can be used to protect a user's privacy. De-identification may be facilitated, where appropriate, by removing specific identifiers (e.g., date of birth, etc.), controlling the amount or specificity of data stored (e.g., collecting location data at a city level rather than at an address level), controlling how data is stored (e.g., aggregating data across users), and/or other methods.
Thus, while the present disclosure broadly covers the use of personal information data to implement one or more of the various disclosed examples, the present disclosure also contemplates that the various examples can also be implemented without the need to access such personal information data. That is, the various examples of the present technology are not rendered inoperable due to the lack of all or a portion of such personal information data. For example, the sensors may be enabled by inferring preferences based on non-personal information data or a bare minimum amount of personal information, such as non-personal information available to the digital assistant or publicly available information.

Claims (96)

1. A method, comprising:
at an electronic device having one or more processors and memory:
when an application is opened on the electronic device:
receiving spoken input including a command;
determining whether the command matches at least a portion of metadata associated with an operation of the application;
in accordance with a determination that the command matches at least the portion of the metadata associated with the operation of the application:
associating the command with the operation;
storing the association of the command with the operation for subsequent use by the digital assistant through the application; and
and executing the operation through the application program.
2. The method of claim 1, wherein the application opened on the electronic device is a focus of the electronic device.
3. The method of any of claims 1-2, wherein the application that is open on the electronic device is one of a plurality of open applications.
4. The method of claim 3, wherein determining whether the command matches at least a portion of metadata associated with operation of the application further comprises:
it is determined whether the command matches at least a portion of metadata associated with operation of any one of the plurality of open applications.
5. The method of any of claims 1-4, wherein the operation of the application is an active operation.
6. The method of any of claims 1-5, wherein the operation of the application is one of a plurality of operations, and wherein the plurality of operations includes a plurality of active operations and a plurality of inactive operations.
7. The method of claim 6, wherein the plurality of active operations are operations currently displayed by the application.
8. The method of claim 6, wherein the plurality of inactive operations are operations that are not currently displayed by the application.
9. The method of any of claims 6 to 8, wherein the plurality of operations are presented in a tree link model.
10. The method of claim 9, wherein the tree link model comprises a plurality of hierarchical links between related ones of the plurality of operations.
11. The method of any of claims 1-10, wherein each operation of the plurality of operations is associated with a respective portion of the metadata.
12. The method of any of claims 1 to 10, wherein the metadata comprises synonyms for the operation.
13. The method of any of claims 1 to 10, wherein the metadata comprises an ontology corresponding to the operation.
14. The method of any one of claims 1 to 13, further comprising:
in accordance with a determination that the command matches at least the portion of the metadata of the operation of the application:
the portion of the metadata is stored with the association of the command with the operation.
15. The method of any one of claims 1 to 14, further comprising:
determining a plurality of operations previously accessed by the digital assistant;
determining respective metadata associated with each of the plurality of operations previously accessed by the digital assistant; and
the plurality of operations and the respective metadata are compiled into a transcript.
16. The method of claim 15, further comprising:
the transcript is provided to resolve ambiguous requests.
17. The method of any of claims 1-16, wherein the spoken input including the command is received without an open application on the electronic device.
18. The method of claim 17, further comprising:
it is determined whether the command matches at least a portion of metadata associated with operation of a plurality of applications.
19. The method of claim 18, wherein the plurality of applications includes applications listed in the transcript.
20. The method of claim 18, wherein the plurality of applications comprises applications that are frequently accessed by the digital assistant.
21. The method of claim 18, wherein the plurality of applications comprises applications marked as favorites in a user profile associated with a user providing the spoken input.
22. The method of claim 18, wherein the operation is an operation previously stored in association with the command and the application.
23. The method of any one of claims 1 to 22, further comprising:
receiving spoken input associating the second command with a second operation;
recording activity on the electronic device; and
the recorded activity is stored as the second operation and the association of the second command with the second operation is stored for subsequent use by the digital assistant.
24. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for:
when an application is opened on the electronic device:
receiving spoken input including a command;
determining whether the command matches at least a portion of metadata associated with an operation of the application;
in accordance with a determination that the command matches at least the portion of the metadata associated with the operation of the application:
associating the command with the operation;
storing the association of the command with the operation for subsequent use by the digital assistant through the application; and
and executing the operation through the application program.
25. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to:
when an application is opened on the electronic device:
receiving spoken input including a command;
determining whether the command matches at least a portion of metadata associated with an operation of the application;
in accordance with a determination that the command matches at least the portion of the metadata associated with the operation of the application:
associating the command with the operation;
storing the association of the command with the operation for subsequent use by the digital assistant through the application; and
and executing the operation through the application program.
26. An electronic device, comprising:
when an application is opened on the electronic device:
means for receiving spoken input including a command;
means for determining whether the command matches at least a portion of metadata associated with operation of the application;
in accordance with a determination that the command matches at least the portion of the metadata associated with the operation of the application:
means for associating the command with the operation;
means for storing the association of the command with the operation for subsequent use by the digital assistant through the application; and
means for performing the operation by the application.
27. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method of any of claims 1-23.
28. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to perform the method of any of claims 1-23.
29. An electronic device, comprising:
apparatus for performing the method of any one of claims 1 to 23.
30. A system comprising means for performing any one of the methods of claims 1-23.
31. A method, comprising:
at an electronic device having one or more processors and memory:
receiving an utterance from a user;
determining a first natural language recognition score for the utterance using a first lightweight natural language model associated with the first application;
determining a second natural language recognition score for the utterance using a second lightweight natural language model associated with a second application;
determining whether the first natural language identification score exceeds a predetermined threshold;
in accordance with a determination that the first natural language identification score exceeds the predetermined threshold:
providing the utterance to a complex natural language model associated with the first application; and
using the complex natural language model, a user intent corresponding to the utterance is determined.
32. The method of claim 31, wherein determining the first natural language recognition score of the utterance using the first lightweight natural language model associated with the first application further comprises determining whether the utterance is related to the first application.
33. The method of any of claims 31-32, further comprising, prior to receiving the utterance from the user:
training the first lightweight natural language model based on a first training data set comprising a first plurality of utterances related to the first application; and
the second lightweight natural language model is trained based on a second training data set comprising a second plurality of utterances related to the second application.
34. The method of claim 33, wherein training the first lightweight natural language model based on the first training data set further comprises calibrating a third natural language recognition score based on a plurality of utterances of the first training data set that are not relevant to the first application.
35. The method of any of claims 31-34, wherein the user intent is a first user intent, and the method further comprises:
determining whether the second natural language identification score exceeds the predetermined threshold; and
in accordance with a determination that the second natural language identification score exceeds the predetermined threshold:
providing the utterance to a complex natural language model associated with the second application; and
a second user intent corresponding to the utterance is determined using the complex natural language model.
36. The method of any of claims 31-35, wherein the complex natural language model associated with the first application is trained to determine the user intent and a task associated with the user intent, and wherein the first lightweight natural language model is not trained to determine the user intent.
37. The method of any of claims 31-36, wherein the first lightweight natural language model is a simplified natural language model and the complex natural language model associated with the first application is a refined natural language model.
38. The method of any of claims 31 to 37, further comprising:
determining whether the first natural language identification score is higher than the second natural language identification score; and
in accordance with a determination that the first natural language identification score is higher than the second natural language identification score, a task associated with the user intent is performed.
39. The method of any one of claims 31 to 38, further comprising:
determining whether the second natural language identification score is higher than the first natural language identification score; and
in accordance with a determination that the second natural language identification score is higher than the first natural language identification score, a task associated with the second user intent is performed.
40. The method of any one of claims 31 to 39, further comprising:
in accordance with a determination that the first natural language identification score does not exceed the predetermined threshold:
determining whether the first application is active; and
in accordance with a determination that the first application is active:
providing the utterance to a complex natural language model associated with the first application; and
using the complex natural language model, a user intent corresponding to the utterance is determined.
41. The method of any one of claims 31 to 40, further comprising:
determining a third natural language recognition score for the utterance using a third lightweight natural language model associated with a third application, wherein the third application is available to the electronic device but not installed on the electronic device; and
in accordance with a determination that the third natural language identification score exceeds the predetermined threshold:
retrieving the third application; and
the third application is installed on the electronic device.
42. The method of claim 41, wherein the third application is selected based on previous interactions with the third application.
43. The method of claim 41, wherein the third application is selected based on its popularity ratings.
44. The method of any of claims 31-43, further comprising adjusting the first natural language identification score based on contextual data associated with the electronic device.
45. The method of any of claims 31-44, further comprising adjusting the second natural language identification score based on a view of the electronic device.
46. The method of any of claims 31-45, wherein the first lightweight natural language model and the complex natural language model associated with the first application are received from a second electronic device.
47. The method of any of claims 31-46, wherein the first lightweight natural language model and the complex natural language model associated with the first application are trained simultaneously on the second electronic device.
48. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for:
receiving an utterance from a user;
determining a first natural language recognition score for the utterance using a first lightweight natural language model associated with the first application;
determining a second natural language recognition score for the utterance using a second lightweight natural language model associated with a second application;
determining whether the first natural language identification score exceeds a predetermined threshold;
in accordance with a determination that the first natural language identification score exceeds the predetermined threshold:
providing the utterance to a complex natural language model associated with the first application; and
using the complex natural language model, a user intent corresponding to the utterance is determined.
49. A non-transitory computer readable storage medium storing one or more programs configured for execution by one or more processors of an electronic device, the one or more programs comprising instructions for:
receiving an utterance from a user;
determining a first natural language recognition score for the utterance using a first lightweight natural language model associated with the first application;
determining a second natural language recognition score for the utterance using a second lightweight natural language model associated with a second application;
determining whether the first natural language identification score exceeds a predetermined threshold;
in accordance with a determination that the first natural language identification score exceeds the predetermined threshold:
providing the utterance to a complex natural language model associated with the first application; and
using the complex natural language model, a user intent corresponding to the utterance is determined.
50. An electronic device, comprising:
means for receiving an utterance from a user;
means for determining a first natural language recognition score of the utterance using a first lightweight natural language model associated with a first application;
means for determining a second natural language recognition score for the utterance using a second lightweight natural language model associated with a second application;
means for determining whether the first natural language identification score exceeds a predetermined threshold;
in accordance with a determination that the first natural language identification score exceeds the predetermined threshold:
means for providing the utterance to a complex natural language model associated with the first application; and
means for determining, with the complex natural language model, a user intent corresponding to the utterance.
51. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method of any of claims 31-47.
52. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to perform the method of any of claims 31-47.
53. An electronic device, comprising:
apparatus for performing the method of any one of claims 31 to 47.
54. A system comprising means for performing any one of the methods of claims 31-47.
55. A method, comprising:
at a first electronic device having one or more processors and memory:
receiving an utterance from a user;
determining one or more representations of the utterance using a speech recognition model that is at least partially trained with data representing an application;
providing the one or more representations of the utterance to a plurality of natural language models, wherein at least one natural language model of the plurality of natural language models is associated with the application and registered upon receiving data representing the application from a second electronic device; and
a user intent of the utterance is determined based on the at least one of the plurality of natural language models and a database including a plurality of operations and objects associated with the application.
56. The method of claim 55, wherein the data representing the application is derived from source code of the application.
57. The method of claim 56, wherein the source code of the application includes at least one of: a model associated with the application, an operation associated with the application, and an object associated with the application.
58. The method of claim 57, wherein the model associated with the application, the operation associated with the application, and the object associated with the application are capable of being interacted with by a digital assistant.
59. The method of any of claims 55-58, wherein the at least one natural language model was previously trained at a second electronic device using training data determined based on the data representing the application and data representing the digital assistant.
60. The method of claim 59, wherein the training data includes an application-specific vocabulary, translations of application-specific terms, and exemplary text to be provided as output by the digital assistant.
61. The method of any of claims 55-60, wherein registering the at least one natural language model further comprises:
receiving a lightweight natural language model associated with the application; and
the application is added to a list of applications installed on the electronic device.
62. The method of any of claims 55 to 61, wherein registering the at least one natural language model further comprises:
receiving a complex natural language model associated with the application; and
the complex natural language model associated with the application is integrated with a natural language model associated with a digital assistant.
63. The method of any of claims 55-62, wherein providing the one or more representations of the utterance to a plurality of natural language models further comprises:
determining a natural language recognition score for the one or more representations of the utterance using the lightweight natural language model;
determining whether the natural language identification score exceeds a predetermined threshold; and
in accordance with a determination that the natural language identification score exceeds the predetermined threshold, the complex natural language model associated with the application is received.
64. A method according to any one of claims 55 to 63, wherein the speech recognition model is trained using data representing the application and data representing the digital assistant.
65. The method of any one of claims 55 to 64, further comprising:
the speech recognition model is trained to recognize application specific vocabulary.
66. The method of any of claims 55-65, wherein determining a user intent of the utterance based on the at least one of the plurality of natural language models and a database including a plurality of operations and objects associated with the application further includes:
determining an operation of the database corresponding to the user intent; and
an object of the database corresponding to the user intent is determined.
67. The method of claim 66, further comprising:
executing a task based on the operation and the object.
68. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for:
receiving an utterance from a user;
determining one or more representations of the utterance using a speech recognition model that is at least partially trained with data representing an application;
providing the one or more representations of the utterance to a plurality of natural language models, wherein at least one natural language model of the plurality of natural language models is associated with the application and registered upon receiving data representing the application from a second electronic device; and
a user intent of the utterance is determined based on the at least one of the plurality of natural language models and a database including a plurality of operations and objects associated with the application.
69. A non-transitory computer readable storage medium storing one or more programs configured for execution by one or more processors of an electronic device, the one or more programs comprising instructions for:
receiving an utterance from a user;
determining one or more representations of the utterance using a speech recognition model that is at least partially trained with data representing an application;
providing the one or more representations of the utterance to a plurality of natural language models, wherein at least one natural language model of the plurality of natural language models is associated with the application and registered upon receiving data representing the application from a second electronic device; and
a user intent of the utterance is determined based on the at least one of the plurality of natural language models and a database including a plurality of operations and objects associated with the application.
70. An electronic device, comprising:
means for receiving an utterance from a user;
means for determining one or more representations of the utterance using a speech recognition model that is at least partially trained with data representing an application;
means for providing the one or more representations of the utterance to a plurality of natural language models, wherein at least one natural language model of the plurality of natural language models is associated with the application and registered upon receiving data representing the application from a second electronic device; and
means for determining a user intent of the utterance based on the at least one of the plurality of natural language models and a database comprising a plurality of operations and objects associated with the application.
71. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method of any of claims 55-67.
72. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to perform the method of any of claims 55-67.
73. An electronic device, comprising:
means for performing the method of any one of claims 55 to 67.
74. A method, comprising:
at an electronic device having one or more processors and memory:
receiving a user utterance comprising a request;
determining whether the request includes an ambiguous term;
in accordance with a determination that the request includes the ambiguous term, providing the user utterance to a reference resolution model;
determining a plurality of related reference factors using the reference resolution model;
determining a related application based on the plurality of related reference factors; and
an object to which the ambiguous term of the request refers is determined based on the related application.
75. The method of claim 74, wherein the plurality of related referencing factors comprises a view of the electronic device.
76. The method of any of claims 74-75, wherein the plurality of related reference factors includes an ontology of an application installed on the electronic device.
77. The method of any one of claims 74-76, wherein the plurality of relevant reference factors includes transcripts of previously performed operations.
78. The method of any of claims 74-77, wherein the plurality of related reference factors includes which applications are open on the electronic device.
79. The method of any one of claims 74-78, wherein determining a plurality of relevant reference factors using the reference resolution model further comprises:
the plurality of related reference factors are selected from a plurality of reference factors based on the request and the context data of the electronic device.
80. The method of any one of claims 74-79, further comprising:
determining whether a view of the electronic device includes an object; and
in accordance with a determination that the view of the electronic device includes the object, the object is included as a relevant reference factor.
81. The method of any of claims 74-80, wherein determining the relevant application based on the relevant reference factor further comprises:
determining a natural language recognition score for the user utterance using a natural language model associated with the first application;
determining whether the natural language identification score exceeds a predetermined threshold; and
in accordance with a determination that the natural language identification score exceeds the predetermined threshold, the first application is selected as the relevant application.
82. The method of any of claims 74-81, wherein determining the object to which the ambiguous term of the request refers based on the relevant application further comprises:
a determination is made as to whether a portion of a natural language model associated with the related application includes an object that is presented in a view of the electronic device.
83. The method of any of claims 74-82, wherein determining the object to which the ambiguous term of the request refers based on the relevant application further comprises:
determining whether an object of the natural language model associated with the related application includes an attribute related to a term of the user utterance.
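Claims 82 and 83 resolve the referent by checking whether an object of the related application's natural language model is either shown in the current view or carries an attribute matching a word of the utterance. A sketch under the assumption that model objects are plain name/attribute records:

```swift
// Hypothetical representation of objects in an application's natural language model.
struct ModelObject {
    let name: String
    let attributes: [String: String]   // e.g. ["color": "red", "type": "photo"]
}

func resolveObject(in modelObjects: [ModelObject],
                   viewObjects: Set<String>,
                   utterance: String) -> ModelObject? {
    let words = Set(utterance.lowercased().split(separator: " ").map(String.init))
    return modelObjects.first(where: { object in
        viewObjects.contains(object.name) ||                                           // claim 82
        object.attributes.values.contains(where: { words.contains($0.lowercased()) })  // claim 83
    })
}
```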
84. The method of any one of claims 74-83, further comprising:
receiving a user intent associated with the related application; and
determining a user intent score based on the determined object and the received user intent.
85. The method of claim 84, wherein the related application is a first related application, the object is a first object, the user intent is a first user intent, and the user intent score is a first user intent score, the method further comprising:
determining a second related application based on the plurality of related reference factors;
determining a second object to which the request refers based on the second related application;
receiving a second user intent associated with the second related application; and
determining a second user intent score based on the second object and the second user intent.
86. The method of claim 85, further comprising:
in accordance with a determination that the first user intent score is higher than the second user intent score, causing the first related application to perform a first task associated with the first user intent on the first object; and
in accordance with a determination that the second user intent score is higher than the first user intent score, causing the second related application to perform a second task associated with the second user intent on the second object.
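Claims 84 through 86 score a candidate (object, intent) pair per application and hand the task to whichever candidate scores higher. A minimal sketch, with all names invented for illustration; the claims do not specify tie-breaking, so the sketch arbitrarily favors the first candidate:

```swift
struct Candidate {
    let appIdentifier: String
    let object: String
    let intent: String
    let intentScore: Double
}

// claim 86: the task is performed by whichever candidate's intent score is higher
func dispatch(_ first: Candidate, _ second: Candidate,
              perform: (Candidate) -> Void) {
    perform(first.intentScore >= second.intentScore ? first : second)
}
```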
87. The method of claim 86, further comprising:
determining whether the first task is not performed; and
in accordance with a determination that the first task is not performed:
providing an output including a hint indicating that the first task was not performed;
receiving an input responsive to the prompt; and
causing the first related application to perform the first task using the input received in response to the prompt.
88. The method of claim 86, further comprising:
determining whether the first task is not performed; and
in accordance with a determination that the first task is not performed:
causing the second related application to perform the second task associated with the second user intent.
89. The method of claim 88, further comprising:
determining whether the second task is not performed; and
in accordance with a determination that the second task is not performed, providing an output indicating an error.
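Claims 87 through 89 describe recovery when a dispatched task does not complete: prompt the user and retry, fall back to the second candidate application, or report an error. The claims present the prompt-and-retry and fall-back branches as alternatives; the sketch below chains them sequentially purely for illustration, and all names are invented.

```swift
enum TaskResult { case performed, notPerformed }

func runWithFallback(firstTask: () -> TaskResult,
                     promptAndRetry: () -> TaskResult,
                     secondTask: () -> TaskResult,
                     reportError: () -> Void) {
    guard firstTask() == .notPerformed else { return }   // first task succeeded
    if promptAndRetry() == .performed { return }         // claim 87: re-run with user input
    if secondTask() == .performed { return }             // claim 88: fall back to second candidate
    reportError()                                        // claim 89: surface an error
}
```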
90. The method of any of claims 74-89, wherein the reference resolution model is a neural network trained to determine the object referenced in the request.
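Claim 90 only requires that the reference resolution model be a neural network trained to identify the referenced object; the architecture is left open. As a stand-in for such a model's final scoring step, a single logistic unit over hand-picked features (weights, bias, and features all invented here) might look like:

```swift
import Foundation

// Returns the assumed probability that a candidate object is the referent.
func referenceScore(features: [Double], weights: [Double], bias: Double) -> Double {
    let z = zip(features, weights).reduce(bias) { $0 + $1.0 * $1.1 }
    return 1.0 / (1.0 + exp(-z))
}
```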
91. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for:
receiving a user utterance comprising a request;
determining whether the request includes an ambiguous term;
in accordance with a determination that the request includes the ambiguous term, providing the user utterance to a reference resolution model;
determining a plurality of related reference factors using the reference resolution model;
determining a related application based on the plurality of related reference factors; and
determining an object to which the request refers based on the related application.
92. A non-transitory computer readable storage medium storing one or more programs configured for execution by one or more processors of an electronic device, the one or more programs comprising instructions for:
receiving a user utterance comprising a request;
determining whether the request includes an ambiguous term;
in accordance with a determination that the request includes the ambiguous term, providing the user utterance to a reference resolution model;
determining a plurality of related reference factors using the reference resolution model;
determining a related application based on the plurality of related reference factors; and
determining an object to which the request refers based on the related application.
93. An electronic device, comprising:
means for receiving a user utterance comprising a request;
means for determining whether the request includes an ambiguous term;
in accordance with a determination that the request includes the ambiguous term, means for providing the user utterance to a reference resolution model;
means for determining a plurality of related reference factors using the reference resolution model;
means for determining a related application based on the plurality of related reference factors; and
means for determining an object to which the request refers based on the related application.
94. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method of any of claims 74-90.
95. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to perform the method of any of claims 74-90.
96. An electronic device, comprising:
means for performing the method of any one of claims 74 to 90.
CN202180073163.0A 2020-08-27 2021-08-27 Digital assistant control for application program Pending CN116802601A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/071,087 2020-08-27
US202063113032P 2020-11-12 2020-11-12
US63/113,032 2020-11-12
PCT/US2021/048036 WO2022047214A2 (en) 2020-08-27 2021-08-27 Digital assistant control of applications

Publications (1)

Publication Number Publication Date
CN116802601A (en) 2023-09-22

Family

ID=88035039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180073163.0A Pending CN116802601A (en) 2020-08-27 2021-08-27 Digital assistant control for application program

Country Status (1)

Country Link
CN (1) CN116802601A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination