CN111712790A - Voice control of computing device - Google Patents


Info

Publication number
CN111712790A
Authority
CN
China
Prior art keywords
content
data
instruction data
determining
displayed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880083558.7A
Other languages
Chinese (zh)
Inventor
M·坦加拉斯纳姆
S·戈帕拉克里希南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Amazon Technologies Inc
Original Assignee
Amazon Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 15/836,566 (US11182122B2)
Priority claimed from US 15/836,428 (US10503468B2)
Application filed by Amazon Technologies Inc filed Critical Amazon Technologies Inc
Publication of CN111712790A publication Critical patent/CN111712790A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Systems and methods for voice control of computing devices are disclosed. An application may be downloaded and/or accessed through a device having a display, and content associated with the application may be displayed. Many applications do not allow voice commands to be used to interact with the displayed content. The improvements described herein allow users to interact with the displayed content of non-voice-enabled applications using voice commands, by determining screen data corresponding to the content displayed by the device and by using the screen data to determine an intent associated with the application. Instruction data for performing an action corresponding to the intent may be sent to the device and may be used to perform the action on an object associated with the displayed content.

Description

Voice control of computing device
Cross Reference to Related Applications
The present application claims priority to U.S. patent application number 15/836,566, entitled "Voice control of a computing device," filed on 12/8/2017, and to U.S. patent application number 15/836,428, entitled "Voice-enabled application," filed on 12/8/2017, the entire contents of which are incorporated herein by reference.
Background
Users typically interact with displayed content through tactile input means such as a remote control, a mouse, a keyboard, and/or touch input. Described herein are improvements in the art that are particularly helpful in providing additional input means for interacting with displayed content.
Drawings
The following description will explain specific embodiments with reference to the drawings. In the drawings, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears. The use of the same reference symbols in different drawings indicates similar or identical items. The systems depicted in the drawings are not drawn to scale, and the components in the drawings may not be drawn to scale relative to each other.
FIG. 1 illustrates a schematic diagram of an example environment for controlling a computing device via audible input.
FIG. 2 illustrates a conceptual diagram of components of a user device and a remote system involved in controlling a computing device via audible input.
FIG. 3 shows a flow diagram of an example process for controlling a computing device via audible input.
FIG. 4 illustrates an example user interface for controlling a computing device via audible input.
FIG. 5 illustrates another example user interface for controlling a computing device via audible input.
FIG. 6 shows a flow diagram of an example process for controlling a computing device via audible input.
FIG. 7 illustrates a flow diagram of another example process for controlling a computing device via audible input.
FIG. 8 illustrates a flow diagram of another example process for controlling a computing device via audible input.
FIG. 9 shows a flow diagram of an example process for ordering instruction data to be sent to a device displaying content.
FIG. 10 illustrates a flow diagram of another example process for ordering instruction data to be sent to a device displaying content.
FIG. 11 illustrates a flow diagram of another example process for ordering instruction data to be sent to a device displaying content.
Fig. 12 illustrates a conceptual diagram of components of a speech processing system for processing audio data provided by one or more devices.
Detailed Description
Systems and methods for voice control of a computing device are described herein. Take, as an example, a content viewing application displayed on a user device such as a television. Typically, when a user wishes to interact with content displayed on the device, the user relies on tactile input means, such as pressing buttons on a remote control, moving and clicking a mouse, pressing keys on a keyboard, and/or, in examples where the user device includes a touch screen, providing touch input. While these input means are functional, additional input means may be needed and/or desired by users, particularly when the computing device is displaying content associated with an application that has not been developed with voice control functionality. Such applications are described herein as third-party applications.
For example, a user may download or otherwise access a third-party application that has been optimized for input controls other than voice-based input controls (e.g., touch screen, keyboard, mouse, remote control, etc.). When the user wants to access the third-party application, the user may provide an audible command that represents a request to open or otherwise view the application content. Audio corresponding to the audible command may be captured by a microphone of the user device or an accessory device, which may generate corresponding audio data. The audio data may be transmitted to a remote system, which may determine an intent associated with the audio data; here, the intent may be to open or otherwise display the content of the requested third-party application. Once the content is displayed, the user may wish to interact with it through audible input means.
The user device and/or the accessory device may determine that content of the third-party application is being displayed on the user device. Data indicating that the content of the third-party application is being displayed and/or indicating an identifier of the application may be sent to the remote system. The remote system may determine whether the application is authorized for voice control based at least in part on an indication, from a developer or other party responsible for the application, that the application may be voice-enabled. In examples where the data indicates that the application may be voice-enabled, components of the user device and/or accessory device may query for or otherwise receive contextual information (also described herein as screen data) corresponding to the content displayed on the user device. Examples of the contextual information include indications of the objects displayed on the user device and/or information indicating relationships between those objects. This information may be sent to the remote system and may be used to identify which portions of the content a user may select and/or interact with, and/or the possible actions that may be taken with respect to those objects.
Continuing with the above example, the user may provide a voice command to interact with the content displayed on the user device. A microphone of the user device and/or the accessory device may capture audio corresponding to the voice command and may generate corresponding audio data. The audio data may be transmitted to the remote system, which may perform automatic speech recognition on the audio data to generate corresponding text data. The remote system may utilize natural language understanding techniques, based on the text data, to determine one or more intents corresponding to the voice command. The remote system may perform named entity recognition in conjunction with natural language understanding to identify portions of the text data that correspond to named entities recognizable by the remote system. The process may link a text portion to a particular entity known to the remote system. To perform named entity resolution, the remote system may utilize the contextual information provided by the user device and/or the accessory device. For example, the contextual information can be used for entity resolution by matching results of the automatic speech recognition component to different entities, such as the types of objects displayed on the user device. In this way, a data source database of the remote system may be populated with some or all of the contextual information provided by the user device and/or the accessory device to facilitate named entity recognition.
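As a minimal illustration of how screen data could inform entity resolution (a sketch only; the patent does not specify an implementation), the TypeScript below matches recognized text against the visible text of displayed objects. The DisplayedObject shape, the resolveEntity helper, and the word-overlap scoring are all hypothetical.

```typescript
// Hypothetical shape of one entry of the screen data sent by the device.
interface DisplayedObject {
  nodeId: string;      // identifier of the node backing the displayed object
  text: string;        // visible text, e.g. "Best cat video on earth"
  type: "video" | "button" | "link" | "textField";
}

// Score a displayed object against the recognized utterance by counting
// shared words; the object with the best overlap is treated as the entity.
function resolveEntity(utterance: string, screenData: DisplayedObject[]): DisplayedObject | undefined {
  const words = new Set(utterance.toLowerCase().split(/\s+/));
  let best: { object: DisplayedObject; score: number } | undefined;
  for (const object of screenData) {
    const score = object.text
      .toLowerCase()
      .split(/\s+/)
      .filter((w) => words.has(w)).length;
    if (score > 0 && (!best || score > best.score)) {
      best = { object, score };
    }
  }
  return best?.object;
}

// Example: "play cat video" resolves to the displayed cat video.
const screen: DisplayedObject[] = [
  { nodeId: "n1", text: "Best cat video on earth", type: "video" },
  { nodeId: "n2", text: "Dog training basics", type: "video" },
];
console.log(resolveEntity("play cat video", screen)?.nodeId); // "n1"
```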
Based at least in part on the indication that the content of the third-party application is displayed on the user device and/or based on receipt of the contextual information, a speech applet (speechlet) of the remote system configured to facilitate voice control of the third-party application may be invoked, and data representing the results of the natural language understanding techniques may be transmitted to the speech applet for processing. The speech applet can generate instruction data corresponding to instructions based at least in part on the intent determined by the natural language understanding component and any values associated with the intent. The instruction data may include data indicating that an action is to be taken with respect to one or more portions of the content displayed on the device.
The instruction data can be transmitted to the user device and/or the accessory device, which can determine an action to take with respect to the content based at least in part on the instruction data. A node processing component of the user device and/or the accessory device may receive data corresponding to the action and the object to which the action is to be applied, and may attempt to perform the action on the node corresponding to the object. The process may include matching searchable text associated with the instruction to text associated with nodes of the content displayed on the device. A confidence mapping may be applied across the nodes, and the node with the highest confidence may be selected as the node on which the action is to be performed. The action may then be performed on the selected node. In this way, a user's voice commands may be utilized to interact with the content of the third-party application even when the third-party application is not configured for voice control of the displayed content or of the computing device.
Additionally or alternatively, the user device, the accessory device, and/or the remote system may facilitate interaction with the third-party application via user utterances by generating and/or causing display of prompts for the user to follow. For example, with contextual information indicating the objects displayed on the screen, overlay content may be generated that provides, for example, numbers and/or letters associated with the displayed objects. As described herein, the overlay content may be described as including "prompts" for user interaction with the system. Upon viewing the overlay content, the user may provide a voice command instructing the system to perform an action with respect to a selected number and/or letter. In this manner, the confidence with which the system determines which action to perform based on the voice command may be increased. Additionally or alternatively, in examples where multiple objects displayed on the user device correspond to the same or similar actions, the system may identify relationships between the objects and may generate modified prompts, which may simplify user interaction with the system.
Additionally or alternatively, the system may be configured to identify and/or determine when content displayed on the user device changes. For example, when a user interacts with content displayed on a user device, actions performed with respect to the content may cause the content to be updated and/or different content to be displayed. The user device and/or the accessory device may be configured to identify a content change event and may transmit updated context information to the remote system based at least in part on the content change event. The updated context information may inform natural language understanding, including named entity recognition and/or instruction generation for subsequent voice commands.
Additionally or alternatively, in examples where the determined intent corresponds to more than one action to be performed on a given object, the system may be configured to rank the instruction data and/or the actions. For example, a user utterance may represent an intent that may be determined to correspond to more than one action and/or to actions that may be performed on multiple objects. In these examples, the instruction data and/or actions may be ranked such that an ambiguous utterance results in the highest-ranked instruction data being sent to the user device and/or the highest-ranked action being selected. For example, the ranking of the instruction data and/or actions may be based at least in part on historical usage data, the application associated with the displayed content, the locations of objects displayed on the user device relative to each other, a classification of the intent, previous voice commands, and/or contextual information updates.
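The ranking signals listed above could be combined in many ways; one possibility is a simple weighted score, sketched below. The CandidateInstruction shape, the signal names, and the weights are assumptions made for illustration rather than the disclosed method.

```typescript
// Hypothetical candidate instruction produced for an ambiguous utterance.
interface CandidateInstruction {
  action: string;                  // e.g. "play", "select", "scroll"
  targetNodeId?: string;           // absent for object-independent actions
  historicalFrequency: number;     // 0..1, how often this choice was taken before
  objectProminence: number;        // 0..1, derived from on-screen position/size
  matchesPreviousCommand: boolean; // e.g. "more" following "scroll down"
}

// Combine the signals into a single score and return candidates best-first.
function rankInstructions(candidates: CandidateInstruction[]): CandidateInstruction[] {
  const score = (c: CandidateInstruction) =>
    0.5 * c.historicalFrequency +
    0.3 * c.objectProminence +
    (c.matchesPreviousCommand ? 0.2 : 0);
  return [...candidates].sort((a, b) => score(b) - score(a));
}

// The highest-ranked candidate would be the one sent to the user device.
```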
The present disclosure provides an overall understanding of the principles of the structure, function, manufacture, and use of the systems and methods disclosed herein. One or more examples of the present disclosure are illustrated in the accompanying drawings. Those of ordinary skill in the art will understand that the systems and methods specifically described herein and illustrated in the accompanying drawings are non-limiting embodiments. Features illustrated or described in connection with one embodiment may be combined with features of other embodiments, including as between systems and methods. Such modifications and variations are intended to be included within the scope of the appended claims.
Additional details will be described below with reference to several exemplary embodiments.
FIG. 1 shows a schematic diagram of an example system 100 for voice control of a computing device. The system 100 may include, for example, a user device 102 and one or more accessory devices 104(a) - (b). The user device 102 may include a display 106, which may be configured to display content associated with one or more third-party applications. As shown in FIG. 1, the user device 102 is a television. It should be understood that although a television is used herein as an example user device 102, other devices that display content (e.g., a tablet, mobile phone, projector, computer, and/or other computing device) are included in the present disclosure. The accessory devices 104(a) - (b) may be computing devices configured to communicate with each other, with the user device 102, and/or with the remote system 108 via the network 110. It should be understood that some or all of the operations described herein as being performed with respect to the user device 102 may additionally or alternatively be performed with respect to one or more of the accessory devices 104(a) - (b), and that some or all of the operations described herein as being performed with respect to one or more of the accessory devices 104(a) - (b) may be performed by the user device 102.
The user device 102 and/or the accessory devices 104(a) - (b) may include, for example, one or more processors 112, one or more network interfaces 114, one or more speakers 116, one or more microphones 118, one or more displays 106, and memory 120. The components of the user device 102 and/or the accessory devices 104(a) - (b) will be described in more detail below. The remote system 108 may include, for example, one or more processors 122, one or more network interfaces 124, and memory 126. The components of the remote system will also be described in more detail below.
For example, the microphone 118 of the user device 102 and/or the accessory devices 104(a) - (b) may be configured to capture audio representing one or more voice commands from a user located in an environment associated with the user device 102 and/or the accessory devices 104(a) - (b). The microphone 118 may be further configured to generate audio data corresponding to the captured audio. The speaker 116 may be configured to receive audio data from other components of the user device 102, the accessory devices 104(a) - (b), and/or the remote system 108. The speaker 116 may be further configured to output audio corresponding to the audio data. The display 106 may be configured to present a visual rendering of content associated with an application, such as a third-party application.
As used herein, a processor, such as processor 112 and/or 122, may include multiple processors and/or processors with multiple cores. Further, the processor may include one or more cores of different types. For example, a processor may include an application processor unit, a graphics processing unit, and so on. In one implementation, the processor may include a microcontroller and/or a microprocessor. Processors 112 and/or 122 may include a Graphics Processing Unit (GPU), a microprocessor, a digital signal processor, or other processing units or components known in the art. Alternatively or additionally, the functions described herein may be performed, at least in part, by one or more hardware logic components. By way of example, and not limitation, illustrative types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like. Further, each processor 112 and/or 122 may have its own local memory, which may also store program components, program data, and/or one or more operating systems.
Memory 120 and/or 126 may include volatile and non-volatile memory, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program components, or other data. Such memory 120 and/or 126 includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, RAID storage systems, or any other medium which can be used to store the desired information and which can be accessed by a computing device. Memory 120 and/or 126 may be implemented as a computer-readable storage medium ("CRSM"), which may be any available physical medium that may be accessed by processors 112 and/or 122 to execute instructions stored on memory 120 and/or 126. In a basic implementation, a CRSM may include random access memory ("RAM") and flash memory. In other implementations, the CRSM may include, but is not limited to, read only memory ("ROM"), electrically erasable programmable read only memory ("EEPROM"), or any other tangible medium that can be used to store the desired information and that can be accessed by the processor.
Further, the functional components may be stored in respective memories, or the same functions may alternatively be implemented in hardware, firmware, an application-specific integrated circuit, a field-programmable gate array, or a system on a chip (SoC). Additionally, although not shown, each respective memory discussed herein (e.g., memory 120 and/or 126) may include at least one operating system (OS) component configured to manage hardware resource devices such as the network interfaces and I/O devices of the respective apparatus, and to provide various services to applications or components executing on the processors. Such OS components may implement a variant of the FreeBSD operating system published by the FreeBSD Project; other UNIX or UNIX-like variants; a variant of the Linux operating system distributed by Linus Torvalds; the FireOS operating system from Amazon.com of Seattle, Washington, USA; the Windows operating system from Microsoft Corporation of Redmond, Washington, USA; LynxOS, distributed by Lynx Software Technologies, Inc. of San Jose, California; the embedded operating system (ENEA OSE) distributed by ENEA AB of Sweden; and so forth.
Network interfaces 114 and/or 124 may enable communication between components and/or devices shown in system 100 and/or with one or more other remote systems and other networked devices. Such network interfaces 114 and/or 124 may include one or more Network Interface Controllers (NICs) or other types of transceiver devices to send and receive communications over the network 110.
For example, each of the network interfaces 114 and/or 124 may include a personal area network (PAN) component to enable communication over one or more short-range wireless communication channels. For example, the PAN component may enable communication that conforms to at least one of the following standards: IEEE 802.15.4 (ZigBee), IEEE 802.15.1 (Bluetooth), IEEE 802.11 (Wi-Fi), or any other PAN communication protocol. Further, each of the network interfaces 114 and/or 124 may include a wide area network (WAN) component to enable communication over a wide area network.
In some examples, the remote system 108 may be located in an environment associated with the user device 102 and/or the accessory devices 104(a) - (b). For example, the remote system 108 may be located within the user device 102 and/or the accessory devices 104(a) - (b). In some cases, some or all of the functionality of the remote system 108 may be performed by one or more of the user device 102 and/or the accessory devices 104(a) - (b).
The memory of the user device 102 and/or the accessory devices 104(a) - (b) may include computer-executable instructions, described below as components of the memory 120, that when executed by the one or more processors 112 may cause the one or more processors 112 to perform various operations. Exemplary components of the memory 120 of the user device 102 and/or the accessory devices 104(a) - (b) can include a third-party application storage and/or access component 128, a device event controller 130, an instruction processor 132, a node processing component 134, a keyword processing component 136, a third-party application interface component 138, a ranking component 140, and/or an overlay component 142. Each of these exemplary components of the memory 120 is described below.
The memory 126 of the remote system 108 may include computer-executable instructions described below as components of the memory 126 that, when executed by the one or more processors 122, may cause the one or more processors 122 to perform various operations. Exemplary components of the memory 126 of the remote system 108 can include a user profile and/or account component 144, an automatic speech recognition component 146, a natural language understanding component 148, one or more speech applets 150, a third party application registry 152, and/or a ranking component 154. Each of these exemplary components of memory 126 will be described below.
The user profile/account component 144 of the memory 126 may be configured to store associations between users, user profiles, user accounts, user devices, accessory devices, remote systems, and/or third party applications. In this way, data sent from the user device and/or the accessory device can be associated with the voice command and/or the application for which the voice command is intended. It should be understood that a given user profile may be associated with one or more applications and/or one or more devices, and that a given user account may be associated with one or more user profiles.
To describe the components of memory 120 and/or memory 126 in more detail, the functionality of memory 120 and/or memory 126 will be described with respect to an example voice command and a process of controlling user device 102 based on the voice command.
The third-party application storage and/or access component 128 can be configured to store third-party applications that have been downloaded onto the memory 120 of the user device 102 and/or the accessory devices 104(a) - (b). Additionally or alternatively, the third-party application storage and/or access component 128 can be configured to access third-party applications that the user device 102 and/or the accessory devices 104(a) - (b) have been authorized to use. Additionally or alternatively, the third-party application storage and/or access component 128 can be configured to store and/or access contextual information (also referred to as screen data) associated with the third-party applications, such as Document Object Model (DOM) information.
The third party application interface component 138 may be configured to receive data indicative of an identity of an application corresponding to content being displayed on the user device 102. The third party application interface component 138 may be further configured to receive screen data associated with content displayed on the user device 102. Data indicative of the identity of the application may be sent to the remote system 108 via the network 110. Additionally, screen data may be transmitted to the remote system 108 via the network 110. The screen data may include DOM information associated with the content. The DOM information may include an identification of one or more objects corresponding to the displayed content and/or one or more relationships between the objects.
The DOM may be an Application Programming Interface (API) that represents hypertext markup language (HTML), extensible markup language (XML), and/or other computing languages in a tree structure, where each node of the tree represents an object that represents a portion of the application code. When an object in the tree is acted upon, the corresponding change may be reflected in the display of the application content. One or more libraries associated with the API can be provided to allow one or more actions to be taken with respect to the nodes in the DOM tree.
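For a web-rendered third-party application, the screen data described above could be gathered by walking the DOM and recording, for each node, an identifier, its visible text, whether it appears selectable, and its child relationships. The sketch below uses standard browser DOM APIs; the ScreenNode serialization format and the selectable heuristic are assumptions, not the patent's method.

```typescript
// Hypothetical serialized form of one node in the screen data.
interface ScreenNode {
  nodeId: string;
  text: string;
  selectable: boolean; // true for links, buttons, video elements, etc.
  childIds: string[];  // relationships between displayed objects
}

// Walk the document and flatten its elements into screen data.
function collectScreenData(root: Element): ScreenNode[] {
  const nodes: ScreenNode[] = [];
  let counter = 0;
  const visit = (el: Element): string => {
    const id = `node-${counter++}`;
    const childIds = Array.from(el.children).map(visit);
    nodes.push({
      nodeId: id,
      text: (el.textContent ?? "").trim().slice(0, 120),
      selectable: el.matches("a, button, video, input, [onclick]"),
      childIds,
    });
    return id;
  };
  visit(root);
  return nodes;
}

// The resulting array could then be sent to the remote system as screen data:
// sendToRemoteSystem({ applicationId, screenData: collectScreenData(document.body) });
```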
Based at least in part on receiving data indicating that content associated with the third-party application is displayed on the user device 102, the third-party application registry 152 may determine whether the third-party application is registered or otherwise authorized to provide voice control of the content displayed on the user device 102. For example, when a third party application developer publishes an application for sale or consumption on an application store, the developer may be queried to determine whether the developer wants to voice-enable the application. If the developer indicates that voice enablement is authorized, an indication of the application may be stored in a third party application registry. Thereafter, when the data indicates that content of the application is being displayed on the device, audio data corresponding to the voice command may be processed to voice-enable the application.
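A minimal sketch of the registry check described above, assuming the registry reduces to a lookup from an application identifier to a voice-enablement flag; the identifiers and function names are hypothetical.

```typescript
// Hypothetical registry populated when developers opt in to voice enablement.
const voiceEnabledApplications = new Set<string>([
  "com.example.videoApp", // developer authorized voice control at publication time
]);

// Only process voice commands against applications that have opted in.
function isVoiceEnabled(applicationId: string): boolean {
  return voiceEnabledApplications.has(applicationId);
}

if (isVoiceEnabled("com.example.videoApp")) {
  // forward screen data and audio-derived intents to the speech applet
}
```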
To illustrate additional functionality of memory 120 and/or 126, examples are provided herein in which a user provides voice commands to interact with displayed content. The user may provide audible voice commands that may be captured by the microphone 118. The microphone 118 may generate corresponding audio data that may be transmitted to the remote system 108 via the network 110.
An Automatic Speech Recognition (ASR) component 146 can receive the audio data and can generate corresponding text data. Performing ASR will be described in more detail below with reference to fig. 12. A Natural Language Understanding (NLU) component 148 can receive text data generated by the ASR component 146 and can determine an intent associated with a voice command. Performing NLU will be described in more detail below with reference to fig. 12. As part of determining the intent associated with the voice command, the NLU component 148 may perform named entity recognition in conjunction with natural language understanding to identify portions of the text data corresponding to named entities that may be recognized by the remote system 108. The process may link the text portion to a particular entity known to the remote system 108.
To perform named entity recognition, the remote system can utilize the screen data provided by the third-party application interface component 138 of the user device 102 and/or the accessory devices 104(a) - (b). The screen data may be used for entity recognition, for example, by matching results of the automatic speech recognition component to different entities associated with the application (e.g., objects displayed on the user device 102). As such, the data source database may be populated with some or all of the screen data provided by the user device 102 and/or the accessory devices 104(a) - (b) to facilitate named entity recognition. The NLU component 148 may be trained or otherwise configured to select an intent based on the screen data currently being displayed on the user device 102. Additionally, the NLU component 148 may determine values for one or more slots associated with the intent based on the screen data.
For example, a user viewing content associated with a video playback application may provide a voice command to "play cat video." Based at least in part on the indication that the content of the third-party application is being displayed on the user device 102, screen data indicative of the objects being displayed can be sent to and received by the NLU component 148. The screen data may include an indication of one or more intents that may be specific to the application the user is using and/or an indication of the objects currently being displayed. In the example of a video playback application, the objects may include, for example, one or more play buttons, selectable text associated with a video, a video category, and/or a text entry field. The intents may include, for example, playing a video, selecting an object, and/or performing a keyword search. The NLU component 148 may be configured to determine an intent corresponding to the voice command and to determine one or more values to fill slots associated with the intent. For example, the determined intent may be "play" and the value that may fill the slot associated with the intent may be "cat video." The determination of the intent and slot values may be based at least in part on a personalized finite state transducer to improve intent determination and slot-value determination.
Based at least in part on an indication that the application associated with the displayed content has been authorized for voice enablement using the systems described herein, the remote system 108 can transmit data corresponding to the intent and the values associated therewith to a speech applet 150 configured to generate instructions for the third-party application. The speech applet 150 can generate instructions for execution by the user device 102 based at least in part on the information received from the NLU component 148. Some or all of the screen data associated with the displayed content may be provided by the third-party application interface component 138 of the user device 102 and/or the accessory devices 104(a) - (b). The screen data may be used to generate instructions for execution by the user device 102 and/or the accessory devices 104(a) - (b) to achieve the intent determined by the NLU component 148.
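The patent does not define a wire format for the instruction data; purely as an illustration, the speech applet's output could resemble the structure below, pairing the determined intent with slot values and optional node hints derived from the screen data. Every name here is an assumption.

```typescript
// Hypothetical instruction data sent from the speech applet to the device.
interface InstructionData {
  intent: "play" | "select" | "search" | "scroll";
  slots: Record<string, string>; // e.g. { videoName: "cat video" }
  // Optional hints derived from screen data to help the device pick a node.
  candidateNodeIds?: string[];
}

const instruction: InstructionData = {
  intent: "play",
  slots: { videoName: "cat video" },
  candidateNodeIds: ["node-3"],
};
// remoteSystem.send(userDeviceId, instruction); // transport not specified in the patent
```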
In an example, the ranking component 154 may be configured to rank the instructions where the determined intent corresponds to more than one action and/or to actions that may be performed on multiple objects. For example, a voice command may represent an intent that may be determined to correspond to more than one action and/or to actions that may be performed on multiple objects. In these examples, the instructions may be ranked such that an ambiguous voice command results in the highest-ranked instruction being sent to the user device 102. The ranking of the instructions may be based at least in part on historical usage data, the application associated with the displayed content, the locations of objects displayed on the user device relative to one another, a classification of the intent, previous voice commands, and/or screen data updates.
For example, the historical usage data may indicate that a given voice command, while corresponding to a plurality of instructions, has historically corresponded to a first instruction more frequently than to a second instruction for voice commands received via the user device 102. Additionally or alternatively, data may be used indicating that a given voice command, while corresponding to multiple instructions, has historically corresponded to the first instruction more frequently than to the second instruction for voice commands received via the user device 102 and/or other devices. The application may also indicate which instructions take precedence over others. Additionally or alternatively, data indicating the positions of objects displayed on the user device 102 relative to each other may be used to rank the instructions. For example, an instruction to perform an action on a more prominently displayed object may be prioritized over an instruction to perform an action on a less prominently displayed object. Additionally or alternatively, certain intents may not depend on a particular object displayed on the user device 102 and thus may be associated with predetermined instructions. For example, a voice command to "scroll down" may correspond to an intent to display content on the user device 102 that is not currently in view, and may not correspond to an intent to perform an action with respect to an object displayed on the user device 102. Instructions that perform actions based on such object-independent intents may be prioritized over instructions that perform object-dependent actions.
Additionally or alternatively, data indicative of previous voice commands may be used to order the instructions. For example, a previous voice command may be "scroll down" and a subsequent voice command may be "more". Without context data indicating a previous voice command, a "more" command may correspond to an instruction to perform an action such as displaying more video, providing more information about a certain video, playing more video, etc. However, with the previous voice command of "scroll down", the instructions may be ordered such that the instructions performing the additional scroll down action take precedence over the other instructions. Additionally or alternatively, data indicating that the screen data has changed or has otherwise been updated may be used to order the instructions.
Additionally or alternatively, a predetermined prioritization of instructions may be stored and utilized by the remote system 108. For example, instructions to perform actions on objects associated with an application may be ranked based at least in part on the type of object acted upon. For example, objects associated with both images and text may be prioritized over objects having only text, only images, selectable text, and/or editable text. For example, a voice command to "play video" may be associated with instructions to perform actions on various objects, such as an image representing a video with a play icon overlaid thereon, text reading "play," a play icon, and/or an editable field (e.g., a search field into which the phrase "play video" may be inserted). In this example, the instruction associated with the image with the overlaid play icon may take precedence over the other instructions. Likewise, the play icon may be preferred over the text reading "play," and the text reading "play" may take precedence over the editable field. The ranking of the instructions may be based at least in part on the intent determined by the NLU component 148. For example, a determined "play" intent may correspond to the ordering described above, which is also sketched in code below. Additionally or alternatively, a determined "search" intent may correspond to an ordering in which instructions to perform actions on objects associated with editable fields take precedence over instructions to perform actions on objects associated with selecting an object. Additionally or alternatively, a determined "selection" intent may correspond to an ordering in which instructions that cause an action to be performed on an object whose content is updated when selected take precedence over instructions that cause an action to be performed on other objects (e.g., inserting text into a search field). It should be understood that these examples of instruction ranking are provided for illustration, and that other examples of ranking instructions are included in the present disclosure.
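The type-based prioritization in the "play" example above could be encoded as a per-intent priority list, as in the sketch below. The ordering follows the example (image with overlaid play icon, then play icon, then text reading "play," then editable field); the representation itself is an assumption.

```typescript
// Hypothetical object categories used for ordering.
type ObjectKind = "imageWithPlayIcon" | "playIcon" | "playText" | "editableField";

// Lower index = higher priority, per the "play" intent example above.
const playIntentPriority: ObjectKind[] = [
  "imageWithPlayIcon",
  "playIcon",
  "playText",
  "editableField",
];

// Pick the highest-priority object kind currently present on screen.
function chooseTarget(onScreen: ObjectKind[]): ObjectKind | undefined {
  return playIntentPriority.find((kind) => onScreen.includes(kind));
}

console.log(chooseTarget(["editableField", "playText"])); // "playText"
```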
Once the speech applet 150 generates the instructions, the remote system 108 can transmit data representing the instructions to the user device 102 and/or the accessory devices 104(a) - (b) via the network 110. The instruction processor 132 of the memory 120 may receive the instructions and may determine an action to perform based at least in part on the instructions. For example, the instructions may indicate that a "play" intent is to be performed with respect to the object "cat video." The instruction processor 132, based at least in part on the intent from the instructions, may determine to take an action that causes the video to be played on the user device 102. The instruction processor 132 may also determine that the action causing the video to play is associated with "cat video."
Instruction processor 132 may send data indicating that the selected action is to be performed and a value of "cat video" associated therewith to device event controller 130. Device event controller 130 may then determine which components of user device 102 and/or accessory devices 104(a) - (b) are to be used to perform the actions determined by instruction processor 132. Device event controller 130 may also be configured to identify and/or determine when an event corresponding to a displayed content change and/or update occurs. Examples of such events may include launching an application, user interaction with content that causes the content to be updated, refreshing of the content, and/or time-dependent changes in the displayed content. The device event controller 130, based at least in part on identifying and/or determining that an event has occurred, can cause the third party application interface component 138 to identify and/or determine updated content being displayed on the user device 102.
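For web-rendered content, one way to detect the content-change events described above is a MutationObserver that triggers a fresh screen-data collection. This is a sketch under that assumption; collectScreenData and sendToRemoteSystem refer to the hypothetical helpers from the earlier sketch.

```typescript
// Watch the application's root element and refresh screen data when the
// displayed content changes (e.g. after a refresh or user interaction).
function watchForContentChanges(root: Element, onChange: () => void): MutationObserver {
  const observer = new MutationObserver(() => onChange());
  observer.observe(root, { childList: true, subtree: true, characterData: true });
  return observer;
}

// Hypothetical wiring: re-collect and re-send screen data on each change.
// watchForContentChanges(document.body, () => {
//   sendToRemoteSystem({ applicationId, screenData: collectScreenData(document.body) });
// });
```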
Node processing component 134 may receive data from device event controller 130 indicating the actions to be performed and the objects on which the actions are to be performed. The node processing component 134 may identify node information stored by the third party application storage/access component 128 and/or determined by the third party application interface component 138. Node processing component 134 may attempt to match or substantially match the identified object from the instruction to a node associated with the application. The process may be performed using a keyword search, where the keywords used in the search may be words that describe the object. For example, the object may include or be associated with displayed text that reads "best cat video". The phrase can be used to perform a keyword search on searchable text associated with a node of an application. The node that matches or best matches the searched phrase may be selected as the node on which the action is to be performed. According to the example used herein, the node associated with the video described as "best cat video on earth" may be determined as the best match for "best cat video". An action may be performed on the selected node that causes the video to play. The keyword processing component 136 can be employed to return a searchable list of words in which stop words such as "and", "of" and/or "the" are filtered out. This information can be used to match keywords to the appropriate nodes.
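A minimal sketch of the keyword matching described above: the object description is reduced to keywords by filtering out stop words, and the node whose searchable text shares the most keywords is selected. The stop-word list and the overlap scoring are simplifications, not the disclosed algorithm.

```typescript
const STOP_WORDS = new Set(["and", "of", "the", "a", "on"]);

// Reduce a phrase to searchable keywords, filtering out stop words.
function keywords(phrase: string): string[] {
  return phrase
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
}

// Pick the node whose text shares the most keywords with the target phrase.
function bestMatchingNode(target: string, nodeTexts: Map<string, string>): string | undefined {
  const wanted = new Set(keywords(target));
  let bestId: string | undefined;
  let bestScore = 0;
  for (const [nodeId, text] of nodeTexts) {
    const score = keywords(text).filter((w) => wanted.has(w)).length;
    if (score > bestScore) {
      bestScore = score;
      bestId = nodeId;
    }
  }
  return bestId;
}

// "best cat video" matches the node described as "best cat video on earth".
const nodes = new Map([
  ["n1", "best cat video on earth"],
  ["n2", "funny dog compilation"],
]);
console.log(bestMatchingNode("best cat video", nodes)); // "n1"
```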
In an example, the instructions received from the remote system 108 may be associated with more than one action. For example, a "select" intent may correspond to opening a hyperlink, causing a video to play, causing additional information to be displayed, or another action. The ranking component 140 of the memory 120 can be configured to prioritize actions based at least in part on the intent from the instructions and/or contextual information associated with the application. For example, historical usage data may indicate that a given intent, while corresponding to multiple actions, has historically corresponded to a first action more frequently than to a second action for intents received via the user device 102. Additionally or alternatively, data may be used indicating that a given intent, while corresponding to multiple actions, has historically corresponded to the first action more frequently than to the second action for voice commands received via the user device 102 and/or other devices. The application may also indicate which actions take precedence over other actions.
Additionally or alternatively, data indicating the positions of objects displayed on the user device 102 relative to each other may be used to rank the actions. For example, actions performed on more prominently displayed objects may be prioritized over actions performed on less prominently displayed objects. Additionally or alternatively, certain intents may not depend on a particular object displayed on the user device 102 and thus may be associated with a predetermined action. For example, a voice command to "scroll down" may correspond to an intent to display content on the user device 102 that is not currently in view, and may not correspond to an intent to perform an action with respect to an object displayed on the user device 102. Actions based on such object-independent intents may be prioritized over object-dependent actions.
Additionally or alternatively, data indicative of previous voice commands may be used to rank the actions. For example, a previous voice command may be "scroll down" and a subsequent voice command may be "more". Without context data indicating a previous voice command, a "more" command may correspond to an action such as displaying more video, providing more information about a certain video, playing more video, etc. However, with the previous voice command of "scroll down," the actions may be ordered such that the action of performing the additional scroll down takes precedence over other actions. Additionally or alternatively, data indicating that the screen data has changed or has otherwise been updated may be used to order the actions. Additionally or alternatively, a predetermined prioritization of actions may be stored and utilized by the remote system 108. It should be understood that examples of ordering actions are provided herein for illustration, and that other examples of ordering actions are included in the present disclosure.
The overlay component 142 can be configured to provide one or more "prompts" to help the user provide voice commands more precisely and/or to help the system determine intent from voice commands. For example, with screen data indicating the objects displayed on the screen, overlay content may be generated that provides, for example, numbers and/or letters associated with the displayed objects. As described herein, the overlay content may be described as including "prompts" for user interaction with the system. Upon viewing the overlay content, the user may provide a voice command instructing the system to perform an action with respect to a selected number and/or letter. For example, the user may provide a voice command to "select the number 1." In this manner, the confidence with which the system determines which action to perform based on the voice command may be increased. Additionally or alternatively, in examples where multiple objects displayed on the user device correspond to the same or similar actions, the system may identify relationships between the objects and may generate modified prompts, which may simplify user interaction with the system.
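One way to render the numbered prompts described above, assuming a browser-rendered application, is to position small badges over each selectable element and remember the number-to-node mapping so that a command such as "select the number 1" can be resolved directly. The element handling and styling below are illustrative only.

```typescript
// Map from spoken hint ("1", "2", ...) to the element it labels.
const hintTargets = new Map<string, Element>();

// Draw a numbered badge over every selectable element currently displayed.
function showOverlayHints(selectables: Element[]): void {
  selectables.forEach((el, index) => {
    const label = String(index + 1);
    hintTargets.set(label, el);

    const badge = document.createElement("div");
    badge.textContent = label;
    badge.style.position = "absolute";
    const rect = el.getBoundingClientRect();
    badge.style.left = `${rect.left}px`;
    badge.style.top = `${rect.top}px`;
    badge.style.background = "black";
    badge.style.color = "white";
    badge.style.padding = "2px 6px";
    document.body.appendChild(badge);
  });
}

// Later, a voice command such as "select the number 1" resolves directly:
// (hintTargets.get("1") as HTMLElement)?.click();
```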
FIG. 2 illustrates a conceptual diagram of the components of a user device 202 and a remote system 204 involved in controlling a computing device via audible input. Example flows of how components of the user device 202 and the remote system 204 may interact with each other and how each component of the system may identify, determine, generate, transmit, and/or receive information are described with reference to fig. 2.
For example, the third party application 206 may be stored in a memory of the user device 202 and/or may be accessed by the user device 202. The third party application 206 may include an identifier of the application 206 and data representing content associated with the application 206. The content may be described in terms of nodes of a DOM tree, which may be used to perform actions on the content. As described herein, the object may be displayed on the user device 202. The object may correspond to one or more nodes of the DOM tree of application 206.
The third party application interface component 208 can receive the data from the third party application 206 and/or one or more databases storing the data. For example, the third party application interface component 208 may be configured to receive data indicative of an identity of the application 206 corresponding to content displayed on the user device 202. The third party application interface component 208 may be further configured to receive screen data associated with content displayed on the user device 202. Data indicative of the identity of the application 206 may be sent to the remote system 204 via a network. Additionally, the screen data may be transmitted to the remote system 204 via a network. The screen data may include DOM information associated with the content. The DOM information may include an identification of one or more objects corresponding to the displayed content and/or one or more relationships between the objects.
The DOM may be an Application Programming Interface (API) that represents hypertext markup language (HTML), extensible markup language (XML), and/or other computing languages in a tree structure, where each node of the tree represents an object corresponding to a portion of the application. When an object in the tree is acted upon, the corresponding change may be reflected in the display of the application content. One or more libraries associated with the API can be provided to allow one or more actions to be taken with respect to the nodes in the DOM tree. Additionally or alternatively, the context data may be described by and/or associated with metadata associated with the application. The metadata may provide an indication as to which portions of the content, and/or which presentations of those portions of the content, correspond to selectable objects. For example, the metadata may indicate that a particular portion of the content is associated with a link that, when selected by the user, causes the content displayed by the device to be updated. The syntax associated with portions of the content may indicate that selection of a given portion of the content results in the retrieval of data, the querying of a database, the receipt of content, and/or other actions that, when performed, will cause the content displayed by the device to be updated. For example, a portion of content corresponding to a "movie" in a video playback application may be associated with metadata and/or other contextual information indicating that selection of the presentation of the "movie" portion of the content results in the application acquiring data indicative of movies viewable using the application and displaying indicators of the various movies. Given that the "movie" portion of the content corresponds to a selectable portion of the content, that portion of the content may be identified as an object with which the user may interact via a user utterance.
Based at least in part on receiving the data indicating that the content associated with the third-party application 206 is displayed on the user device 202, the third-party application registry of the remote system 204 can determine whether the third-party application 206 is registered or otherwise authorized to provide voice control of the content displayed on the user device 202. For example, when a third party application developer publishes an application for sale or consumption on an application store, the application store may query the developer to determine whether the developer wants to voice-enable the application. If the developer indicates that voice enablement is authorized, an indication of the application may be stored in a third party application registry. Thereafter, when the data indicates that content of the application is being displayed on the device, audio data corresponding to the voice command may be processed to voice-enable the application.
The context data identified, determined, and/or generated by the third-party application interface component 208 may be sent to the remote system 204 and may be stored, for example, in the data storage database 210. This context data may be used by the remote system 204, as described more fully below.
The user device 202 may have one or more microphones 212, which may be configured to capture audio from the environment in which the user device 202 is disposed. As described herein, an example of audio from the environment may be a human utterance, e.g., a voice command that interacts with content displayed by the user device 202. Additionally or alternatively, accessory devices (e.g., accessory devices 104(a) - (b) of fig. 1) may include a microphone 212. The microphone 212 may generate audio data corresponding to audio. The user device 202 may transmit the audio data, or a portion thereof, to the remote system 204.
An Automatic Speech Recognition (ASR) component 214 of the remote system 204 can receive the audio data and can perform ASR on it to generate text data. Performing ASR on audio data will be described more fully below with reference to fig. 12. A Natural Language Understanding (NLU) component 216 can utilize the text data to determine one or more intents associated with the voice command. Again, performing NLU on text data will be described more fully below with reference to fig. 12. As part of determining the intent associated with the voice command, the NLU component 216 can perform named entity recognition in conjunction with natural language understanding to identify portions of the text data corresponding to named entities recognizable by the remote system 204. The process may link a text portion to a particular entity known to the remote system 204. In fig. 2, the entity recognition component 218 is shown as a separate component from the NLU component 216. However, it is to be understood that the entity recognition component 218 can be a component of the NLU component 216.
To perform named entity recognition, the entity recognition component 218 can utilize the screen data provided by the third-party application interface component 208 of the user device 202. The screen data can be used for entity recognition, for example, by matching results of the ASR component 214 to different entities (e.g., objects displayed on the user device 202) associated with the application 206. As such, the data source database 210 may be populated with some or all of the screen data provided by the user device 202 to facilitate named entity recognition. The NLU component 216 may be trained or otherwise configured to select an intent based on the screen data currently being displayed on the user device 202. Additionally, the NLU component 216 can determine values for one or more slots associated with the intent based on the screen data.
In an example, the intent determined by the NLU component 216, with the assistance of the entity recognition component 218, can be transmitted to a speech applet 220 configured to generate instructions for performing an action with respect to the third-party application 206. Based at least in part on the indication that the application 206 has been authorized for voice enablement using the systems described herein, the intent and associated values may be transmitted to the speech applet 220, which is configured to generate instruction data for the third-party application 206. The speech applet 220 can generate instruction data for execution by the user device 202 based at least in part on the information received from the NLU component 216 and/or the entity recognition component 218. Some or all of the screen data associated with the displayed content may be provided by the third-party application interface component 208. The screen data can be utilized to generate instruction data for execution by the user device 202 and/or one or more accessory devices to achieve the intent determined by the NLU component 216.
Once the speech applet 220 generates the instruction data, the remote system 204 can transmit the instruction data to the user device 202 via the network. The instruction processor 222 of the user device 202 may receive the instruction data and may determine an action to perform based at least in part on the instruction data. For example, the instruction data may indicate that a "play" intent is to be performed with respect to the object "cat video." The instruction processor 222, based at least in part on the intent from the instruction data, can determine that an action is to be taken to cause the video to play on the user device 202. The instruction processor 222 may also determine that the action causing the video to play is associated with "cat video."
The instruction processor 222 may send data to the device event controller 224 indicating the selected action to be performed and the value of the "cat video" associated therewith. The device event controller 224 may then determine which components of the user device 202 will be utilized to perform the actions determined by the instruction processor 222. The device event controller 224 may also be configured to identify and/or determine when an event corresponding to the displayed content changing and/or being updated occurs. Examples of such events may include launching an application, user interaction with content that causes the content to be updated, refreshing of the content, and/or time-dependent changes in the displayed content. The device event controller 224, based at least in part on identifying and/or determining that an event has occurred, can cause the third party application interface component 208 to identify and/or determine updated content being displayed on the user device 202.
The node processing component 226 may receive data from the device event controller 224 indicating the action to be performed and the object on which the action is to be performed. The node processing component 226 may identify stored node information determined by the third party application interface component 208. The node processing component 226 may attempt to match or substantially match the identified object from the instruction to a node associated with the application 206. The process may be performed using a keyword search, where the keywords used in the search may be words that describe the object. For example, the object may include or be associated with displayed text that reads "best cat video". The phrase can be used to perform a keyword search on searchable text associated with the node of the application 206. The node that matches or best matches the searched phrase may be selected as the node on which the action is to be performed. According to the example used herein, the node associated with the video described as "best cat video on earth" may be determined as the best match for "best cat video". An action may be performed on the selected node that causes the video to play. The keyword processing component 228 can be employed to return a list of searchable words in which stop words such as "and", "of" and/or "the" are filtered out. This information can be used to match keywords to the appropriate nodes.
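A minimal sketch of the keyword matching described above, assuming the candidate nodes are DOM elements and using simple keyword counting after stop words are removed; a real implementation could weight or normalize matches differently.

```typescript
// Illustrative stop-word list; the actual list used by the keyword processing
// component is not specified here.
const STOP_WORDS = new Set(["and", "of", "the", "a", "an", "on"]);

// Strip stop words so only meaningful keywords drive the node match.
function toKeywords(phrase: string): string[] {
  return phrase
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w && !STOP_WORDS.has(w));
}

// Score every candidate node by how many keywords its visible text contains and
// return the best match, e.g. "best cat video" -> the node whose text reads
// "Best cat video on earth".
function findBestMatchingNode(
  objectPhrase: string,
  nodes: Element[]
): Element | undefined {
  const keywords = toKeywords(objectPhrase);
  let best: { node: Element; score: number } | undefined;
  for (const node of nodes) {
    const text = (node.textContent ?? "").toLowerCase();
    const score = keywords.filter((k) => text.includes(k)).length;
    if (score > 0 && (!best || score > best.score)) {
      best = { node, score };
    }
  }
  return best?.node;
}
```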
The overlay component 230 can be configured to provide one or more "prompts" to help the user more accurately provide voice commands and/or to help the system determine intent from voice commands. For example, using screen data indicating the objects displayed on the screen, overlay content may be generated that provides, for example, numbers and/or letters associated with the displayed objects. Information associated with the prompts (also referred to as suggestions) may be stored in the node and suggestion database 232. The prompts, the prompt identifiers, and the associations between prompts and nodes may also be stored in the node and suggestion database 232. The information stored in the node and suggestion database 232 can be used by the overlay component 230 to generate the overlay content. Upon viewing the overlay content, the user may provide a voice command instructing the system to perform an action on the object associated with the selected number and/or letter. For example, the user may provide a voice command to "select the number 1". In this manner, the confidence with which the system determines which action to perform based on the voice command may be increased. Additionally or alternatively, in examples where multiple objects displayed on the user device correspond to the same or similar actions, the system may identify relationships between the objects and may generate modified prompts, which may simplify user interaction with the system.
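One way the overlay component might render numeric prompts over selectable objects is sketched below. The PromptRecord shape, the badge styling, and the way prompt-to-node associations are stored are illustrative assumptions only.

```typescript
// A hypothetical record associating a numeric prompt with an on-screen node.
interface PromptRecord {
  promptId: number;
  nodeId: string;
}

// Render a numbered badge over each selectable object and return the
// prompt-to-node associations so they can be stored and sent to the remote system.
function renderPrompts(
  selectable: { nodeId: string; element: HTMLElement }[]
): PromptRecord[] {
  const records: PromptRecord[] = [];
  selectable.forEach((item, index) => {
    const promptId = index + 1;
    const rect = item.element.getBoundingClientRect();
    const badge = document.createElement("div");
    badge.textContent = String(promptId);
    // Fixed positioning uses the same viewport coordinates as getBoundingClientRect.
    badge.style.cssText =
      `position:fixed;left:${rect.left}px;top:${rect.top}px;` +
      `border-radius:50%;background:rgba(0,0,0,0.6);color:#fff;padding:4px 8px;`;
    document.body.appendChild(badge);
    records.push({ promptId, nodeId: item.nodeId });
  });
  // These associations allow "select number 2" to be resolved back to a node.
  return records;
}
```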
The node processing component 226 may send data to the third party application interface component 208 indicating the action to be performed and the node on which the action is to be performed. The third party application interface component 208 may send data to the third party application 206 and/or other components of the user device 202 to cause an action to be performed on the node.
FIG. 3 shows a flow diagram of an example process 300 for controlling a computing device via audible input. The operation of process 300 is described with respect to a user device and/or a remote system, as shown in FIG. 3. The order in which operations or steps are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement process 300.
At block 302, the process 300 may include capturing audio from an environment in which a user device is disposed and generating corresponding audio data. For example, the audio may include voice commands from a user in the environment. As shown in fig. 3, the voice command is "Alexa, open video application". Audio corresponding to the voice command may be captured by one or more microphones of the user device and/or the accessory device, and corresponding audio data may be generated and transmitted to the remote system.
At block 304, the process 300 may include performing Automatic Speech Recognition (ASR) on the audio data to generate corresponding text data. Natural Language Understanding (NLU) techniques may be performed on the text data to determine an intent associated with the voice command. ASR and NLU techniques are described in more detail below with reference to fig. 12. In the example used in fig. 3, the voice command "Alexa, open video application" may correspond to an "open application" intent, and a value indicating which application to open may correspond to "video". At block 306, based at least in part on determining that the voice command corresponds to an intent to open the video application, the process 300 may include generating instruction data for opening the application. The instruction data can be transmitted to the user device and/or the accessory device, which can open the video application at block 308 based at least in part on receiving the instruction data from the remote system.
At block 310, the process 300 may include determining that content associated with an application is currently being displayed on a display associated with a user device. It should be appreciated that the operations described with respect to block 310 may be performed regardless of whether the operations described with respect to blocks 302-308 are performed. For example, a user may provide a tactile input that may cause an application to open or otherwise launch. Determining that the application is currently being displayed may include receiving data from the application and/or another system storing the application indicating that the application is being used. Additionally or alternatively, the event handler may receive an indication that an event corresponding to opening the application has occurred.
At block 312, based at least in part on determining that the content of the application is currently being displayed, the process may include determining whether the application is registered as voice-enabled. For example, when a third party application developer publishes an application for sale or consumption on an application store, the application store may query the developer to determine whether the developer wants to voice-enable the application. If the developer indicates that voice enablement is authorized, an indication of the application may be stored in a registry. Thereafter, when the data indicates that content of the application is being displayed on the device, audio data corresponding to the voice command may be processed to voice-enable the application. If the application is not registered, at block 314, the process 300 may include refraining from performing the voice-enablement operations for the application.
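A sketch of the registry check at blocks 312-314, assuming voice-enablement authorizations are recorded against an application identifier; the registry contents and the collectAndSendScreenData helper are hypothetical.

```typescript
// Hypothetical registry of applications whose developers authorized voice enablement.
const voiceEnabledRegistry = new Set<string>(["com.example.videoapp"]);

function isVoiceEnabled(applicationId: string): boolean {
  return voiceEnabledRegistry.has(applicationId);
}

// Placeholder for the screen data path described at blocks 316-318.
declare function collectAndSendScreenData(applicationId: string): void;

// Only route screen data and subsequent voice commands through the
// voice-enablement path when the application is registered.
function onApplicationDisplayed(applicationId: string): void {
  if (!isVoiceEnabled(applicationId)) {
    return; // block 314: skip the voice-enablement operations
  }
  // block 316: gather screen data and send it to the remote system
  collectAndSendScreenData(applicationId);
}
```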
If an application is registered, at block 316, the process 300 may include determining screen data associated with the displayed content. The screen data may include Document Object Model (DOM) information associated with the content of the application. The DOM information may include an identification of one or more objects corresponding to the displayed content and/or one or more relationships between the objects. The DOM may be an Application Programming Interface (API) that represents hypertext markup language (HTML), extensible markup language (XML), and/or other computing languages in a tree structure, where each node of the tree represents an object that represents a portion of the application content. When an object in the tree is acted upon, the corresponding change may be reflected in the display of the application content. One or more libraries associated with the API can be provided to allow one or more actions to be taken with respect to the nodes in the DOM tree. At block 318, the process 300 may include receiving data corresponding to the screen data at the remote system. The screen data may be used by the remote system in the operations described in more detail below.
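For a web-based application, the screen data described at block 316 could be gathered by walking the DOM for visible, actionable nodes, roughly as in the following sketch; the selector list and the ScreenNode shape are assumptions made for illustration.

```typescript
// A simplified representation of one displayed, actionable node.
interface ScreenNode {
  nodeId: string;
  text: string;
  tag: string;
}

// Walk the visible DOM and collect the nodes a user could act on.
function collectScreenData(root: Document = document): ScreenNode[] {
  const candidates = root.querySelectorAll<HTMLElement>(
    "a, button, input, [role='button'], video"
  );
  const nodes: ScreenNode[] = [];
  candidates.forEach((el, i) => {
    const rect = el.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) {
      return; // skip nodes that are not currently rendered
    }
    nodes.push({
      nodeId: el.id || `node-${i}`,
      text: (el.textContent ?? el.getAttribute("aria-label") ?? "").trim(),
      tag: el.tagName.toLowerCase(),
    });
  });
  return nodes; // serialized and sent to the remote system as screen data
}
```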
At block 320, the process may include capturing audio from an environment in which the user device is disposed and generating corresponding audio data. For example, the audio may include voice commands from the user to interact with the displayed content. As shown in fig. 3, the voice command is "Alexa, search dog video". Audio corresponding to the voice command may be captured by one or more microphones of the user device and/or the accessory device, and corresponding audio data may be generated and transmitted to the remote system.
At block 322, the process 300 may include performing ASR on the audio data to generate corresponding text data. NLU techniques may be performed on the text data at block 324 to determine an intent associated with the voice command. ASR and NLU techniques are described in more detail below with reference to fig. 12. In the example used in fig. 3, the voice command "Alexa, search dog video" may correspond to the "search" intent, and a value indicating which content to search for may correspond to "dog video". As part of determining the intent associated with the voice command, named entity recognition may be performed at block 326 in conjunction with natural language understanding to identify portions of the text data corresponding to named entities that may be recognized by the remote system. The process may link the text portion to a particular entity known to the remote system. As shown in fig. 3, performing entity recognition is shown as a separate operation from the NLU operation described at block 324. However, it should be understood that entity recognition may be performed as part of the NLU operation described at block 324.
To perform named entity recognition, the screen data determined at block 316 may be utilized. The screen data may be used for entity recognition, for example, by matching results of the ASR operation to different entities associated with the application (e.g., objects displayed on the user device). In this way, the data source database may be populated with some or all of the screen data provided by the user device to assist in named entity identification. The NLU component of the remote system may be trained or otherwise configured to select an intent based on screen data corresponding to content currently being displayed on the user device. Additionally, the NLU component can determine values for one or more slots associated with the intent based on the screen data.
In an example, the intent determined by the NLU component can be transmitted, with the assistance of the entity recognition operation, to a verbal applet configured to generate instructions for performing an action with respect to the third-party application. The verbal applet may generate instruction data for execution by the user device at block 328 based at least in part on the intent and entities determined at blocks 324 and 326. The screen data may be used to generate instruction data for execution by the user device and/or one or more accessory devices to achieve the determined intent.
Once the verbal applet generates the instruction data, the remote system can transmit the instruction data to the user device via the network. An instruction processor of the user device may receive the instruction data and may determine an action to perform based at least in part on the instruction data. The instruction processor may send data to the device event controller indicating the selected action to be performed and information about the object on which the action is to be performed. The device event controller may then determine which components of the user device are to be utilized to perform the actions determined by the instruction processor. The device event controller may also be configured to identify and/or determine when an event corresponding to the displayed content changing and/or being updated occurs. Examples of such events may include launching an application, user interaction with content that causes the content to be updated, refreshing of the content, and/or time-dependent changes in the displayed content.
The node processing component of the user device may receive data from the device event controller indicating the action to be performed and the object on which the action is to be performed. The node processing component may identify stored node information. The node processing component may attempt to match or substantially match the identified object from the instruction to a node associated with the application. The process may be performed using a keyword search, where the keywords used in the search may be words that describe the object. The node that matches or best matches the searched phrase may be selected as the node on which the action is to be performed. The keyword processing component of the user device can be used to return a list of searchable words in which stop words such as "and", "of" and/or "the" are filtered out. This information can be used to match keywords to the appropriate nodes. At block 330, after the node on which the action is to be performed has been determined and the action to be performed has been determined, the action may be performed on the node of the application.
FIG. 4 illustrates an example user interface 400 for controlling a computing device via audible input. The user interface 400 may be displayed on a device, such as the user device 102 of fig. 1. The user interface 400 may display content associated with an application, such as a third party application. In the example provided with reference to fig. 4, the third party application is a video playback application. It should be understood that although the example given with reference to fig. 4 is a video playback application, other applications that include objects that may be displayed on a device are also included in the present disclosure.
The user interface 400 may include one or more objects. Objects may be classified as object types, such as text object 402, image object 404, and text input object 406. As described above, a user may interact with various objects through audible input means. For example, a user may provide a voice command to open a video playback application. The user may then provide subsequent voice commands to interact with the content displayed on the user device. Those voice commands may be, for example, "select movie", "play video C", "search dog video", etc. Audio data corresponding to the voice command may be processed by the remote system as described above to determine an intent associated with the voice command. Instructions to perform the action may be sent to the user device and/or the accessory device, which may utilize the instructions to perform the action on the node corresponding to the displayed object. For example, a voice command for "select movie" may cause the "movie" object to be selected as if the user had provided a tactile input selecting the "movie" object displayed on the user device. By way of further example, a voice command for "play video C" may cause "video C" text object 402 to be selected, and/or a play icon overlaid on an image associated with "video C" to be selected, and/or an image associated with "video C" to be selected, as if the user had provided tactile input selecting a "video C", play icon or image displayed on the user device. By way of further example, a voice command for "search for dog video" may cause text input object 406 to be selected and the text "dog video" to be entered into the text input field as if the user had provided tactile input to select the text input field and typed or otherwise entered "dog video" into that field.
Performing one or more actions on one or more objects described with reference to fig. 4 may result in additional and/or different content being displayed on the user device. For example, selection of a "movie" object may result in a change in the images displayed on other portions of the user interface 400, thereby displaying images corresponding to videos identified as being in the category of "movies". The text object 402 corresponding to the description of the video may also be updated to the description corresponding to the newly displayed image. By way of further example, selection of a play icon may cause a video corresponding to the play icon to be launched and displayed on the user interface 400. As the user interacts with the user interface 400, as displayed content is updated, events corresponding to the interaction may be identified and used to update the determination of the screen data being displayed. For example, the remote system may utilize the updated screen data to more accurately determine an intent associated with the voice command to interact with the displayed content.
In examples where the determined intent corresponds to more than one action to be performed on a given object, the user device and/or the remote system may be configured to order instructions to perform the action on the displayed content. For example, a voice command may represent an intent that may be determined to correspond to more than one action and/or may correspond to actions that may be performed for multiple objects. The instructions may be ordered such that an ambiguous voice command results in the highest ordered instruction being sent to the user device and used to perform a given action. For example, the ordering of the instructions may be based at least in part on historical usage data, the application associated with the displayed content, locations of objects displayed on the user device relative to one another, classifications of the intent, previous voice commands, and/or screen data updates.
For example, the historical usage data may indicate that a given voice command, while corresponding to a plurality of instructions, historically corresponds to a first instruction more frequently than a second instruction for voice commands received via the user device. Additionally or alternatively, the instructions may be ordered using data indicating that a given voice command, while corresponding to a plurality of instructions, historically corresponds to the first instruction more frequently than the second instruction for voice commands received via the user device and/or other devices. The application may also indicate which instructions take precedence over others. Additionally or alternatively, data indicating the position of objects displayed on the user device relative to each other may be used to order the instructions. For example, instructions to perform an action on a more highlighted object may be prioritized over instructions to perform an action on a less highlighted object. Additionally or alternatively, certain intents may not be dependent on a particular object displayed on the user device and thus may be associated with predetermined instructions. For example, a voice command to "scroll down" may correspond to an intent to display content on the user device that is not currently in view, and may not correspond to an intent to perform an action with respect to an object displayed on the user device. Instructions that perform actions based on intent that is not dependent on the object may be prioritized over instructions that perform actions dependent on the object.
Additionally or alternatively, data indicative of previous voice commands may be used to order the instructions. For example, a previous voice command may be "scroll down" and a subsequent voice command may be "more". Without context data indicating a previous voice command, a "more" command may correspond to an instruction to perform an action such as displaying more video, providing more information about a certain video, playing more video, etc. However, with the previous voice command of "scroll down", the instructions may be ordered such that the instructions performing the additional scroll down action take precedence over the other instructions. Additionally or alternatively, data indicating that the screen data has changed or has otherwise been updated may be used to order the instructions. Additionally or alternatively, a predetermined prioritization of instructions may be stored and utilized by a remote system.
For example, instructions to perform actions on objects associated with an application may be ordered based at least in part on the type of object acted upon. For example, objects associated with both an image object 404 and a text object 402 may take precedence over a text-only object 402, an image-only object 404, and/or a text input object 406. For example, a voice command to "play video" may be associated with instructions to perform actions on various objects, such as an image representing the video with a play icon overlaid thereon, a text object 402 reading "play", an image object 404 including a play icon, and/or a text input object 406 (such as a search field into which the phrase "play video" may be inserted). In this example, instructions associated with the image and the overlaid play icon may take precedence over the other instructions. Likewise, the play icon may be preferred over the text reading "play". Likewise, the text reading "play" may take precedence over the editable field.
Additionally or alternatively, the ordering of the instructions may be based at least in part on an intent, determined by the remote system, corresponding to the voice command. For example, the determined "play" intent may correspond to the ordering described above. Additionally or alternatively, the determined "search" intent may correspond to an ordering in which instructions to perform an action on an object associated with the text input object 406 take precedence over instructions to perform an action on an object associated with a selection of the object. Additionally or alternatively, the determined "selection" intent may correspond to an ordering in which instructions that cause an action to be performed on an object whose content is updated when selected take precedence over instructions that cause an action to be performed on other objects (e.g., inserting text into a search field). It should be understood that examples of instruction ordering are provided herein for illustration, and that other examples of ordering instructions are included in the present disclosure. Other non-limiting examples of intents may include "scroll," "move," "slide," "page," "return," "fold," "forward," "previous," "next," "resume," "pause," "stop," "rewind," and "fast forward."
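The type-based ordering described above for a "play" intent could be expressed as a simple priority table, as in the following sketch; the priority values and the ObjectType categories are assumptions made for illustration.

```typescript
// Illustrative priority table: for a "play" intent, an image with an overlaid
// play icon wins over a bare icon, which wins over plain text, which wins over
// a text-input field.
type ObjectType = "image+icon" | "icon" | "text" | "text-input";

const PLAY_PRIORITY: Record<ObjectType, number> = {
  "image+icon": 4,
  "icon": 3,
  "text": 2,
  "text-input": 1,
};

interface CandidateInstruction {
  objectType: ObjectType;
  nodeId: string;
}

// Pick the highest-priority candidate instruction for an ambiguous command.
function pickInstruction(
  candidates: CandidateInstruction[]
): CandidateInstruction | undefined {
  return [...candidates].sort(
    (a, b) => PLAY_PRIORITY[b.objectType] - PLAY_PRIORITY[a.objectType]
  )[0];
}
```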
In addition to or in lieu of sequencing instructions generated by a remote system as described above, the instructions received from the remote system may be associated with more than one action. For example, a "select" intent may correspond to opening a hyperlink, causing a video to play, causing additional information to be displayed, or other action. The user device and/or the accessory device can be configured to prioritize the actions based at least in part on an intent from instructions and/or contextual information associated with the application. The ordering of the actions may be performed in the same or similar manner as the ordering of the instructions described with reference to FIG. 4.
FIG. 5 illustrates another example user interface 500 for controlling a computing device via audible input. The user interface 500 may be displayed on a device, such as the user device 102 of fig. 1. The user interface 500 may display content associated with an application, such as a third party application. In the example provided with reference to fig. 5, the third party application is a video playback application. It should be understood that although the example given with reference to fig. 5 is a video playback application, other applications that include objects that may be displayed on a device are also included in the present disclosure.
The user interface 500 may include one or more objects, which may include the same or similar objects as those described with reference to fig. 4 above. As described with reference to fig. 4, a user may provide voice commands to interact with displayed content. Examples provided with reference to FIG. 4 include "select movie", "Play video C", and "search dog video". These voice commands are based at least in part on the user's perception of objects presented by the user device. In an example, a user may wish or need assistance to provide a voice command that results in a desired action being performed on a desired object.
In these examples, the user device, the accessory device, and/or the remote system may be configured to provide one or more "prompts" to help the user more accurately provide voice commands and/or to help the system determine intent from voice commands. For example, using screen data indicating the objects displayed on the screen, overlay content may be generated that provides, for example, numbers and/or letters associated with the displayed objects. Upon viewing the overlay content, the user may provide a voice command instructing the system to perform an action on the object associated with the selected number and/or letter. For example, as shown in FIG. 5, the overlay content may include presentations of one or more prompts 502(a)-(e). In this example, the prompts 502(a)-(e) may be displayed as overlay content on one or more selectable objects displayed on the user interface 500.
Here, the first prompt 502(a) may correspond to a text entry field, the second prompt 502(b) may correspond to a selection of a "home" object, the third prompt 502(c) may correspond to a selection of a play icon associated with "video A", the fourth prompt 502(d) may correspond to a selection of a text object associated with "video A", and/or the fifth prompt 502(e) may correspond to a selection of an image associated with "video B". In an example, a number may be provided for each object displayed on the user interface 500. In other examples, only a portion of the objects may include an overlay number. For example, it may be determined that multiple objects are associated with the same action when selected. In these examples, a single overlay number may be displayed for the multiple objects. To illustrate using FIG. 5, the text object "video B", the image associated with that text object, and the play icon overlaid on the image may all, when selected, cause "video B" to be launched and displayed on the user interface 500. In this example, instead of providing a number for each of the text object, the image, and the play icon, a single prompt 502(e) may be overlaid on an area of the user interface 500 that is common to the multiple objects.
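Collapsing several objects that lead to the same result into a single prompt, as with prompt 502(e), could be done by grouping the selectable objects by a shared target, roughly as sketched below; the target field is a hypothetical grouping key, not part of any actual screen data format.

```typescript
// A selectable object annotated with the target it activates, e.g. "video-b"
// for the "video B" title, its thumbnail, and its play icon.
interface SelectableObject {
  nodeId: string;
  target: string;
}

// Group objects that trigger the same target so one prompt can be rendered per
// group rather than per object.
function groupByTarget(
  objects: SelectableObject[]
): Map<string, SelectableObject[]> {
  const groups = new Map<string, SelectableObject[]>();
  for (const obj of objects) {
    const group = groups.get(obj.target) ?? [];
    group.push(obj);
    groups.set(obj.target, group);
  }
  return groups;
}
```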
The user may then provide a voice command corresponding to the selection of one of the prompts 502(a) - (e). For example, the user may provide voice commands such as "select number 2", "select 2", "2", "select second", etc. Data may be provided to the remote system indicating that prompts are being provided to the user, as well as data indicating which prompts are associated with which objects. As such, audio data corresponding to the voice command may be processed by the remote system to more easily and/or accurately determine that the voice command corresponds to an intent to select one of the prompts provided on the user interface 500 and identify the prompt selected by the user. The remote system may associate the selected prompt with the object corresponding to the prompt and may provide instructions to perform an action on the object, as described more fully above. When a user interacts with the displayed content, for example by selecting a prompt, the content may change and/or be updated. The updated content may be used to determine updated screen data that may be used to generate updated overlay content with updated prompts for use by a user. The updated data can be sent to a remote system to help determine intent and generate instructions for subsequent voice commands.
As shown in fig. 5, the prompts are depicted as numbers. However, it should be understood that the prompts 502(a)-(e) are provided by way of example and not limitation. Other prompt identifiers may be utilized, such as letters, symbols, sounds, shapes, and/or colors. The prompts are also shown with reference to fig. 5 as having a particular shape, i.e., a circle. However, it should be understood that the prompts may be any shape. Additionally or alternatively, the size of the prompts may be static or dynamic. For example, the size of the prompts may be consistent with respect to particular overlay content. Alternatively, the size of the prompts may vary. For example, a size of a presented object may be determined, and that information may be utilized to generate a prompt having a size similar to the presented object. As shown in FIG. 5, for example, prompt 502(b) is larger than prompt 502(d). The size difference may be based at least in part on the different sizes of the presented objects to which the prompts correspond. Additionally or alternatively, the presentation of the prompts may have translucency or transparency, which may allow the user to view some or all of the objects over which the prompts are overlaid on the user interface 500.
FIGS. 6-11 illustrate various processes for voice control of a computing device. The processes described herein are illustrated in logical flow diagrams as collections of blocks, which represent sequences of operations, some or all of which may be implemented in hardware, software, or a combination thereof. In the context of software, the blocks may represent computer-executable instructions stored on one or more computer-readable media, which, when executed by one or more processors, program the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and so forth that perform particular functions or implement particular data types. The order in which the blocks are described should not be construed as a limitation, unless otherwise specified. Any number of the described blocks can be combined in any order and/or in parallel to implement the process or an alternative process, and not all blocks need to be performed. For purposes of discussion, the processes are described with reference to the environments, architectures and systems described in the examples herein (e.g., those described with reference to FIGS. 1-5 and 12), although the processes may be implemented in various other environments, architectures and systems.
FIG. 6 shows a flow diagram of an example process 600 for controlling a computing device via audible input. The order in which operations or steps are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement process 600.
At block 602, the process 600 may include determining that content of an application is being displayed on a device. The application may not have instructions for interacting with the displayed content using voice commands. For example, the application may be a third party application that has not enabled voice interaction with application content. The third party application may include an identifier of the application and data representing content associated with the application. Content can be described as nodes of a DOM tree, which can be used to perform actions on the content. As described herein, an object may be displayed on a user device. The object may correspond to one or more nodes of a DOM tree of the application. Determining that the content of the application is currently being displayed may include receiving data from the application and/or another system storing the application indicating that the application is being used. Additionally or alternatively, the event handler may receive an indication that an event corresponding to opening the application has occurred.
At block 604, the process 600 may include causing the application interface component to identify metadata associated with the content, which may be based at least in part on determining that the content is being displayed. The application interface component can be configured to identify the content being displayed. The application interface component may be a component of the device or of another device in communication with the device. The application interface component may receive data from the application, for example, via one or more APIs, which may indicate the content being displayed on the display.
At block 606, the process 600 may include identifying, via the application interface and according to the metadata, a portion of the content that, when displayed and selected by the user, causes the updated content to be displayed. The portion of content may correspond to an object or selectable object selectable by a user or a node of a document object model associated with an application. Identifying the selectable object may be based at least in part on determining screen data associated with the displayed content. The screen data may include Document Object Model (DOM) information associated with the content of the application. The DOM information may include an identification of one or more objects corresponding to the displayed content and/or one or more relationships between the objects. The DOM may be an Application Programming Interface (API) that represents hypertext markup language (HTML), extensible markup language (XML), and/or other computing languages in a tree structure, where each node of the tree represents an object that represents a portion of the application content. When an object in the tree is acted upon, the corresponding change may be reflected in the display of the application content. One or more libraries associated with the API can be provided to allow one or more actions to be taken with respect to the nodes in the DOM tree.
At block 608, the process 600 may include sending screen data identifying the portion of the content to the remote system. The screen data may be transmitted to the remote system via the network and network interface described herein. Data indicating one or more relationships between the objects may additionally be sent to a remote system.
At block 610, the process 600 may include receiving audio data representing a user utterance. Receiving audio data may include capturing audio from an environment in which the device is disposed via one or more microphones and generating corresponding audio data. For example, the audio may include utterances from users in the environment. Audio corresponding to the user utterance may be captured by one or more microphones of the user device and/or the accessory device, and corresponding audio data may be generated.
At block 612, the process 600 may include transmitting the audio data to a remote system. The audio data may be transmitted to the remote system via the network and network interface described herein. One or more instructions and/or data may be sent to the remote system along with the audio data to associate the audio data with the device, the associated accessory device, a user profile associated with the device, a user account associated with the device, and/or screen data sent to the remote system.
At block 614, process 600 may include receiving instruction data from the remote system to perform an action with respect to the portion of content. The instruction data may be determined by the remote system based on the screen data and the audio data. For example, the remote system may perform Automatic Speech Recognition (ASR) on the audio data to generate corresponding text data. Natural Language Understanding (NLU) techniques may be performed on the text data to determine an intent associated with the utterance. ASR and NLU techniques are described in more detail below with reference to fig. 12. As part of determining the intent associated with the utterance, named entity recognition may be performed in conjunction with natural language understanding to identify portions of the text data that correspond to named entities recognizable by the remote system. The process may link the text portion to a particular entity known to the remote system.
To perform named entity recognition, screen data may be used. The screen data may be used for entity recognition, for example, by matching results of the ASR operation to different entities associated with the application (e.g., objects displayed on the user device). In this way, the data source database may be populated with some or all of the screen data provided by the user device to assist in named entity identification. As such, the NLU component of the remote system may be trained or otherwise configured to select an intent based on screen data corresponding to content currently being displayed on the user device. Additionally, the NLU component can determine values of one or more slots associated with the intent based on the screen data.
In an example, the intent determined by the NLU component can be transmitted with the assistance of an entity recognition operation to a verbal applet configured to generate instruction data for performing an action with respect to a third-party application. The verbal applet can generate instruction data for execution by the device based at least in part on the intent determined by the remote system. The screen data may be used to generate instruction data for execution by the device and/or one or more accessory devices to achieve the determined intent. Based at least in part on determining that the utterance corresponds to a given intent, instruction data corresponding to the intent, and an indication of a subject on which to perform a desired action, the instruction data can be generated and sent to the device.
At block 616, the process 600 may include causing an action to be performed. An instruction processor of the device may receive the instruction data and may determine an action to perform based at least in part on the instruction data. The instruction processor may send data to the device event controller indicating the selected action to be performed and information about the object on which the action is to be performed. The device event controller may then determine which components of the device are to be utilized to perform the actions determined by the instruction processor. The device event controller may also be configured to identify and/or determine when an event corresponding to the displayed content changing and/or being updated occurs. Examples of such events may include launching an application, user interaction with content that causes the content to be updated, refreshing of the content, and/or time-dependent changes in the displayed content.
A node processing component of the device may receive data from the device event controller indicating an action to be performed and an object on which the action is to be performed. The node processing component may identify stored node information. The node processing component may attempt to match or substantially match the identified object from the instruction to a node associated with the application. The process may be performed using a keyword search, where the keywords used in the search may be words that describe the object. The node that matches or best matches the searched phrase may be selected as the node on which the action is to be performed. The keyword processing component of the user device can be used to return a list of searchable words in which stop words such as "and", "of" and/or "the" are filtered out. This information can be used to match keywords to the appropriate nodes. Having determined the node on which the action is to be performed and having determined the action to be performed, the action may be performed on the node of the application.
The process 600 may additionally include receiving event data indicating that an event has occurred for the content. The process 600 may additionally include determining that the event corresponds at least in part to second content being displayed on the display. Based at least in part on determining that the second content is being displayed, a second portion of the second content may be identified. The process 600 may include sending second screen data identifying the second portion to the remote system. The second portion may be different from the first portion. In this manner, the screen data identified, determined, and/or sent to the remote system may be updated as the displayed content is updated. The remote system may utilize the updated screen data to inform the natural language understanding of subsequent voice commands and to generate subsequent instructions for execution by the device. Determining that at least a portion of a user interface displayed on the device has changed may be based, at least in part, on a determination that an event has occurred for the content displayed on the device. For example, the event may include opening an application, user interaction with the content, refreshing of the content, and/or a time-dependent change to the displayed content. Based at least in part on identifying and/or determining that an event has occurred, a device event controller of the device can cause a third party application interface component of the device to identify and/or determine the updated content being displayed on the device.
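For a web-based application, one way the device event controller could notice that displayed content has changed and trigger a screen data refresh is a MutationObserver, as in the following sketch; the sendUpdatedScreenData callback is a hypothetical stand-in for the screen data path described above.

```typescript
// Watch the document for content changes and resend screen data when they occur.
function watchForContentChanges(
  sendUpdatedScreenData: () => void
): MutationObserver {
  const observer = new MutationObserver(() => {
    // Content was added, removed, or updated: recompute and resend screen data.
    sendUpdatedScreenData();
  });
  observer.observe(document.body, {
    childList: true,
    subtree: true,
    characterData: true,
  });
  return observer;
}
```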
The process 600 may additionally or alternatively include causing display of the overlay content on a user interface. The overlay content may include an identifier proximate to the selectable object. The process 600 may also include sending second data associating the identifier with the selectable object to the remote system. The remote system may utilize the second data to generate instructions and/or determine an intent associated with the voice command. For example, the user device, the accessory device, and/or the remote system may be configured to provide one or more "prompts" to help the user more accurately provide voice commands and/or determine intent from voice commands. For example, with screen data indicating objects displayed on the screen, overlay content may be generated that provides, for example, numbers and/or letters associated with the displayed objects. Upon viewing the overlay content, the user may provide a voice command instructing the system to perform an action on the selected number and/or letter. For example, the overlay content may include a presentation of one or more numbers. As used in this example, the numbers may be displayed as overlay content on one or more selectable objects displayed on the user interface.
In an example, a number may be provided for each object displayed on the user interface. In other examples, only a portion of the objects may include an overlay number. For example, it may be determined that multiple objects are associated with the same action when selected. In these examples, a single overlay number may be displayed for the multiple objects. In this example, instead of providing a number for each of a plurality of objects, such as text objects, images, and/or play icons, a single number may be overlaid on an area of the user interface that is common to the plurality of objects.
The user may then provide a voice command corresponding to the selection of one of the numbers. Data may be provided to the remote system indicating that prompts are being provided to the user, as well as data indicating which prompts are associated with which objects. In this manner, audio data corresponding to the voice command may be processed by the remote system to more easily and/or accurately determine that the voice command corresponds to an intent to select one of the prompts provided on the user interface and to identify the prompt selected by the user. The remote system may associate the selected prompt with the object corresponding to the prompt and may provide instructions to perform an action on the object, as described more fully above. When a user interacts with the displayed content, for example by selecting a prompt, the content may change and/or be updated. The updated content may be used to determine updated screen data that may be used to generate updated overlay content with updated prompts for use by the user. The updated data can be sent to the remote system to help determine intent and generate instructions for subsequent voice commands.
The process 600 may additionally or alternatively include determining that the instruction corresponds to a first action and a second action. The first action may be associated with a first priority and the second action may be associated with a second priority. The process 600 may also include determining that the first priority is greater than the second priority, and selecting one of the first action or the second action to perform on the object based at least in part on the priorities. For example, a "select" intent may correspond to opening a hyperlink, causing a video to play, causing additional information to be displayed, or another action. Actions such as these may be prioritized based at least in part on the intent from the instruction data and/or contextual information associated with the application. For example, historical usage data may indicate that a given intent, while corresponding to a plurality of actions, historically corresponds to the first action more frequently than the second action with respect to intents received via the device. Additionally or alternatively, the actions may be ordered using data indicating that the given intent, while corresponding to the plurality of actions, historically corresponds to the first action more frequently than the second action for voice commands received via the device and/or other devices. The application may also indicate which actions take precedence over other actions.
Additionally or alternatively, data indicating the position of objects displayed on the device relative to each other may be used to rank the actions. For example, actions performed on more highlighted objects may be prioritized over actions performed on less highlighted objects. Additionally or alternatively, certain intents may not be dependent on a particular object displayed on the device and thus may be associated with a predetermined action. For example, a voice command to "scroll down" may correspond to an intent to display content on the device that is not currently in view, and may not correspond to an intent to perform an action with respect to an object displayed on the device. Actions based on an intent that is not dependent on an object, such as this, may be prioritized over actions that are dependent on an object.
Additionally or alternatively, data indicative of previous voice commands may be used to rank the actions. For example, a previous voice command may be "scroll down" and a subsequent voice command may be "more". Without context data indicating a previous voice command, a "more" command may correspond to an action such as displaying more video, providing more information about a certain video, playing more video, etc. However, with the previous voice command of "scroll down," the actions may be ordered such that the action of performing the additional scroll down takes precedence over other actions. Additionally or alternatively, data indicating that the screen data has changed or has otherwise been updated may be used to order the actions. Additionally or alternatively, a predetermined prioritization of actions may be stored and utilized. It should be understood that examples of ordering actions are provided herein for illustration, and that other examples of ordering actions are included in the present disclosure.
Additionally or alternatively, the process 600 may include determining, from document object model information indicating nodes associated with the content, a first node of the content corresponding to a value associated with the action and a second node of the content corresponding to the value. The process 600 may also include determining, for each of the first node and the second node, a confidence level indicating a confidence that the node corresponds to the value. The action may be performed based at least in part on which node is associated with the higher confidence level.
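The confidence comparison between candidate nodes might look like the following sketch, which scores each node by the fraction of slot-value tokens its text contains; the scoring function is an assumption made for illustration, not the actual confidence model.

```typescript
// A node paired with a confidence that it corresponds to the slot value.
interface ScoredNode {
  nodeId: string;
  confidence: number; // 0..1
}

// Fraction of the value's tokens that appear in the node's visible text.
function scoreNode(nodeText: string, value: string): number {
  const valueTokens = value.toLowerCase().split(/\s+/).filter(Boolean);
  if (valueTokens.length === 0) return 0;
  const text = nodeText.toLowerCase();
  const hits = valueTokens.filter((t) => text.includes(t)).length;
  return hits / valueTokens.length;
}

// Choose the candidate node with the highest confidence, if any matches at all.
function chooseNode(
  nodes: { nodeId: string; text: string }[],
  value: string
): ScoredNode | undefined {
  const scored = nodes
    .map((n) => ({ nodeId: n.nodeId, confidence: scoreNode(n.text, value) }))
    .filter((n) => n.confidence > 0)
    .sort((a, b) => b.confidence - a.confidence);
  return scored[0];
}
```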
FIG. 7 shows a flow diagram of an example process 700 for controlling a computing device via audible input. The order in which operations or steps are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement process 700.
At block 702, the process 700 may include determining that content of an application is being displayed on a device. The determination may be based at least in part on launching the application. In an example, the application may have no instructions for interacting with the displayed content using voice commands. For example, the application may be a third party application that has not enabled voice interaction with application content. The third party application may include an identifier of the application and data representing content associated with the application. Content can be described as nodes of a DOM tree, which can be used to perform actions on the content. As described herein, an object may be displayed on a user device. The object may correspond to one or more nodes of a DOM tree of the application.
At block 704, the process 700 may include identifying metadata associated with the content based at least in part on determining that the content is being displayed. The metadata may include indicators of which portions of the application content are currently being utilized to render the display of the object on the device.
At block 706, process 700 may include identifying a portion of the selectable content based at least in part on the metadata. For example, a portion of the application content may be associated with a link or other mechanism that, when the user selects the presentation of the content, causes the application and/or the device utilizing the application to update the content being displayed. For example, the portion of the content may be associated with a "play button" object displayed on the device. The user may select the play button object and, as such, the application may include instructions to update the displayed content to something linked to the selection of the play button object. The object may be selected via a user interface of the device and/or may correspond to at least a portion of a node of a document object model associated with the application. Identifying the object may be based at least in part on determining screen data associated with the displayed content. The screen data may include Document Object Model (DOM) information associated with the content of the application. The DOM information may include an identification of one or more objects corresponding to the displayed content and/or one or more relationships between the objects. The DOM may be an Application Programming Interface (API) that represents hypertext markup language (HTML), extensible markup language (XML), and/or other computing languages in a tree structure, where each node of the tree represents an object that represents a portion of the application content. When an object in the tree is acted upon, the corresponding change may be reflected in the display of the application content. One or more libraries associated with the API can be provided to allow one or more actions to be taken with respect to the nodes in the DOM tree.
At block 708, process 700 may include sending screen data identifying the portion of the content to the remote system. The screen data may be transmitted to the remote system via the network and network interface described herein. Data indicating one or more relationships between the objects may additionally be sent to a remote system.
At block 710, the process 700 may include transmitting audio data representing the user utterance to the remote system. The user utterance may correspond to a request to interact with content being displayed on the device.
At block 712, the process 700 may include receiving instruction data from the remote system and based at least in part on the audio data representing the user utterance to perform an action for the portion of the content. The instruction data may be determined by the remote system based at least in part on the screen data and the audio data. For example, the remote system may perform Automatic Speech Recognition (ASR) on the audio data to generate corresponding text data. Natural Language Understanding (NLU) techniques may be performed on the text data to determine an intent associated with the voice command. ASR and NLU techniques are described in more detail below with reference to fig. 12. As part of determining the intent associated with the user utterance, named entity recognition may be performed in conjunction with natural language understanding to identify portions of the text data that correspond to named entities recognizable by the remote system. The process may link the text portion to a particular entity known to the remote system.
To perform named entity recognition, screen data may be used. The screen data may be used for entity recognition, for example, by matching ASR operation results to different entities associated with the application (e.g., a portion of content displayed on a user device). In this way, the data source database may be populated with some or all of the screen data provided by the user device to assist in named entity identification. As such, the NLU component of the remote system may be trained or otherwise configured to select an intent based on screen data corresponding to content currently being displayed on the user device. Additionally, the NLU component can determine values of one or more slots associated with the intent based on the screen data.
In an example, the intent determined by the NLU component can be transmitted with the assistance of an entity recognition operation to a verbal applet configured to generate instruction data for performing an action with respect to a third-party application. The verbal applet can generate instruction data for execution by the device based at least in part on the intent determined by the remote system. The screen data may be used to generate instruction data for execution by the device and/or one or more accessory devices to achieve the determined intent. Based at least in part on determining that the utterance corresponds to a given intent, instruction data corresponding to the intent, and an indication of a subject on which to perform a desired action, the instruction data can be generated and sent to the device.
At block 714, the process 700 may include causing an action to be performed for at least a portion of the content. An instruction processor of the device may receive the instruction data and may determine an action to perform based at least in part on the instruction data. The instruction processor may send data to the device event controller indicating the selected action to be performed and information about the object on which the action is to be performed. The device event controller may then determine which components of the device are to be utilized to perform the actions determined by the instruction processor. The device event controller may also be configured to identify and/or determine when an event corresponding to the displayed content changing and/or being updated occurs. Examples of such events may include launching an application, user interaction with content that causes the content to be updated, refreshing of the content, and/or time-dependent changes in the displayed content.
A node processing component of the device may receive data from the device event controller indicating an action to be performed and an object on which the action is to be performed. The node processing component may identify stored node information. The node processing component may attempt to match or substantially match the identified object from the instruction to a node associated with the application. The process may be performed using a keyword search, where the keywords used in the search may be words that describe the object. The node that matches or best matches the searched phrase may be selected as the node on which the action is to be performed. The keyword processing component of the user device can be used to return a list of searchable words in which stop words such as "and", "of" and/or "the" are filtered out. This information can be used to match keywords to the appropriate nodes. Having determined the node on which the action is to be performed and having determined the action to be performed, the action may be performed on the node of the application.
The process 700 may additionally or alternatively include causing overlay content to be displayed over the content. The overlay content may include an identifier proximate to an object associated with a portion of the content. The process 700 may also include sending data associating the identifier with the object to the remote system. The remote system may utilize this data to generate instructions and/or determine an intent associated with a voice command. For example, the user device, the accessory device, and/or the remote system may be configured to provide one or more "prompts" to help the user more accurately provide voice commands and/or to help determine intent from voice commands. For example, with screen data indicating the objects displayed on the screen, overlay content may be generated that provides, for example, numbers and/or letters associated with the displayed objects. Upon viewing the overlay content, the user may provide a voice command instructing the system to perform an action on the object associated with the selected number and/or letter. For example, the overlay content may include a presentation of one or more numbers. As used in this example, the numbers may be displayed as overlay content on one or more selectable objects displayed on the user interface.
In an example, a number may be provided for each object displayed on the user interface. In other examples, only a portion of the objects may include an overlay number. For example, it may be determined that multiple objects are associated with the same action when selected. In these examples, one overlay number may be displayed for the multiple objects. In this example, instead of providing a number for each of the plurality of objects, such as a text object, an image, and/or a play icon, a single number may be overlaid on an area of the user interface that is common to the plurality of objects.
The user may then provide a voice command corresponding to the selection of one of the numbers. Data indicating that prompts are being provided to the user, along with data indicating which prompts are associated with which objects, may be provided to the remote system. In this manner, audio data corresponding to the voice command may be processed by the remote system to more easily and/or accurately determine that the voice command corresponds to an intent to select one of the prompts provided on the user interface and to identify the prompt selected by the user. The remote system may associate the selected prompt with the object corresponding to the prompt and may provide instructions to perform an action on the object, as described more fully above. When a user interacts with the displayed content, for example by selecting a prompt, the content may change and/or be updated. The updated content may be used to determine updated screen data, which may be used to generate updated overlay content with updated prompts for use by the user. The updated data can be sent to the remote system to help determine intent and generate instructions for subsequent voice commands.
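A short sketch of this numbered-prompt flow follows, assuming one overlay number per distinct action and a simple number-to-object mapping; the data shapes and function names are hypothetical.

```python
# Hypothetical sketch of the prompt flow: assign one overlay number per
# distinct action, remember the mapping, and resolve a spoken selection
# back to the underlying object.
def build_prompts(objects):
    """objects: list of (object_id, action_id). Objects sharing an action get one number."""
    prompts, seen_actions = {}, {}
    next_number = 1
    for object_id, action_id in objects:
        if action_id in seen_actions:
            continue  # a single overlay number covers all objects with this action
        prompts[next_number] = object_id
        seen_actions[action_id] = next_number
        next_number += 1
    return prompts

def resolve_selection(spoken_number: int, prompts: dict):
    """Map the number the user spoke back to the object to act on."""
    return prompts.get(spoken_number)

prompts = build_prompts([("thumbnail", "play-video"), ("play-icon", "play-video"), ("title", "open-page")])
print(prompts)                        # {1: 'thumbnail', 2: 'title'}
print(resolve_selection(1, prompts))  # 'thumbnail'
```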
The process 700 may additionally or alternatively include determining that the instruction corresponds to a first action and a second action. The first action may be associated with a first priority and the second action may be associated with a second priority. Process 700 may also include determining that the first priority is greater than the second priority, and selecting one of the first action or the second action to perform on the object based at least in part on the priorities. For example, a "select" intent may correspond to opening a hyperlink, causing a video to play, causing additional information to be displayed, or another action. Actions such as these may be prioritized based at least in part on the intent from the instruction data and/or contextual information associated with the application. For example, historical usage data may indicate that a given intent, while corresponding to a plurality of actions, has historically corresponded to a first action more frequently than a second action for voice commands received via the device. Additionally or alternatively, the data may indicate that the given intent, while corresponding to the plurality of actions, has historically corresponded to the first action more frequently than the second action for voice commands received via the device and/or other devices. The application may also indicate which actions take precedence over other actions.
Additionally or alternatively, data indicating the position of objects displayed on the device relative to each other may be used to rank the actions. For example, actions performed on more highlighted objects may be prioritized over actions performed on less highlighted objects. Additionally or alternatively, certain intents may not be dependent on a particular object displayed on the device and thus may be associated with a predetermined action. For example, a voice command to "scroll down" may correspond to an intent to display content on the device that is not currently in view, and may not correspond to an intent to perform an action with respect to an object displayed on the device. Actions based on an intent that is not dependent on an object, such as this, may be prioritized over actions that are dependent on an object.
Additionally or alternatively, data indicative of previous voice commands may be used to rank the actions. For example, a previous voice command may be "scroll down" and a subsequent voice command may be "more". Without context data indicating a previous voice command, a "more" command may correspond to an action such as displaying more video, providing more information about a certain video, playing more video, etc. However, with the previous voice command of "scroll down," the actions may be ordered such that the action of performing the additional scroll down takes precedence over other actions. Additionally or alternatively, data indicating that the screen data has changed or has otherwise been updated may be used to order the actions. Additionally or alternatively, a predetermined prioritization of actions may be stored and utilized. It should be understood that examples of ordering actions are provided herein for illustration, and that other examples of ordering actions are included in the present disclosure.
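The ranking considerations above can be combined in many ways; the following is one hedged Python sketch in which object-independent actions (such as scrolling) outrank object-dependent ones and a previous command boosts a related action. The weights and field names are assumptions for illustration only.

```python
# Hypothetical sketch of action ranking using intent/object dependence and the
# previous voice command as context. Weights and rules are illustrative.
def rank_actions(candidates, previous_command=None):
    """candidates: list of dicts with 'name', 'object_dependent', 'base_priority'."""
    def score(action):
        value = action["base_priority"]
        if not action["object_dependent"]:
            value += 10  # intents like "scroll down" do not depend on any object
        if previous_command and previous_command in action.get("related_to", []):
            value += 5   # context: "more" after "scroll down" favors more scrolling
        return value
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"name": "scroll_further", "object_dependent": False, "base_priority": 1, "related_to": ["scroll down"]},
    {"name": "show_more_video_info", "object_dependent": True, "base_priority": 3},
]
print(rank_actions(candidates, previous_command="scroll down")[0]["name"])  # scroll_further
```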
The process 700 may additionally or alternatively include determining that second content associated with the application is being displayed on the device. Based at least in part on determining that the second content is being displayed on the device, the second content displayed on the device may be identified. Process 700 may include sending second screen data identifying second content to the remote system. The second content may be different from the first content. In this manner, the screen data identified, determined and/or sent to the remote system may be updated as the displayed content is updated. The remote system may utilize the updated screen data to inform of the natural language understanding of the subsequent voice command and to generate subsequent instructions for execution by the device. Determining that at least a portion of the content displayed on the device has changed may be based, at least in part, on a determination that an event has occurred for the content displayed on the device. For example, the event may include opening an application, user interaction with content, refreshing of content, and/or time-dependent changes to displayed content. Based at least in part on identifying and/or determining that an event has occurred, a device event controller of the device can cause a third party application interface component of the device to identify and/or determine updated content being displayed on the device.
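A minimal sketch of this update loop is shown below, assuming an event controller that re-collects screen data whenever a content-changing event occurs and forwards it to the remote system; the event names, callables, and transport are hypothetical.

```python
# Hypothetical sketch of the device event controller loop: when an event
# changes the displayed content, re-collect screen data and send it to the
# remote system so later voice commands are interpreted against fresh content.
CONTENT_EVENTS = {"app_opened", "user_interaction", "content_refresh", "timed_update"}

def on_event(event_type, collect_screen_data, send_to_remote_system):
    """collect_screen_data() and send_to_remote_system(data) are injected callables."""
    if event_type not in CONTENT_EVENTS:
        return None
    updated_screen_data = collect_screen_data()
    send_to_remote_system(updated_screen_data)
    return updated_screen_data

# Example wiring with stand-in callables
latest = on_event(
    "content_refresh",
    collect_screen_data=lambda: {"objects": ["video thumbnail", "search box"]},
    send_to_remote_system=lambda data: print("sent:", data),
)
```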
FIG. 8 shows a flow diagram of an example process 800 for controlling a computing device via audible input. The order in which operations or steps are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement process 800.
At block 802, process 800 may include receiving screen data indicating a portion of content of an application being displayed on a device. The application may not have instructions for interacting with the displayed content using voice commands. For example, the application may be a third party application that has not enabled voice interaction with application content. The third party application may include an identifier of the application and data representing content associated with the application. Content can be described as nodes of a DOM tree, which can be used to perform actions on the content. These nodes may also be described as and/or correspond to objects. As described herein, an object may be displayed on a user device. The object may correspond to one or more nodes of a DOM tree of the application. Determining that the application is currently being displayed may include receiving data from the application and/or another system storing the application indicating that the application is being used. Additionally or alternatively, the event handler may receive an indication that an event corresponding to opening the application has occurred.
The object may correspond to at least a portion of a node of a document object model associated with the application. Identifying the object may be based at least in part on determining screen data associated with the displayed content. The screen data may include Document Object Model (DOM) information associated with the content of the application. The DOM information may include an identification of one or more objects corresponding to the displayed content and/or one or more relationships between the objects. The DOM may be an Application Programming Interface (API) that represents hypertext markup language (HTML), extensible markup language (XML), and/or other computing languages in a tree structure, where each node of the tree represents an object that represents a portion of the application content. When an object in the tree is acted upon, the corresponding change may be reflected in the display of the application content. One or more libraries associated with the API can be provided to allow one or more actions to be taken with respect to the nodes in the DOM tree.
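For illustration, a DOM-like tree can be flattened into the kind of per-object records described above; the node fields and record layout in this Python sketch are assumptions, not the actual screen-data format.

```python
# Hypothetical sketch of screen data derived from a DOM-like tree: each node
# becomes an object record with its identifier, type, visible text, and parent.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DomNode:
    node_id: str
    node_type: str                  # e.g. "image", "text", "text_input"
    text: str = ""
    children: List["DomNode"] = field(default_factory=list)

def to_screen_data(node: DomNode, parent_id: Optional[str] = None) -> List[dict]:
    """Flatten the tree into the object records sent as screen data."""
    records = [{"id": node.node_id, "type": node.node_type,
                "text": node.text, "parent": parent_id}]
    for child in node.children:
        records.extend(to_screen_data(child, node.node_id))
    return records

root = DomNode("root", "container", children=[
    DomNode("n1", "image", "Movie trailer thumbnail"),
    DomNode("n2", "text", "Play"),
])
print(to_screen_data(root))
```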
At block 804, the process 800 may include receiving audio data representing a user utterance. The audio data may be associated with a device. The audio data may be generated by one or more microphones that capture corresponding audio within the environment in which the device is disposed. For example, the audio may include user utterances from users in the environment. Audio corresponding to the utterance may be captured by one or more microphones of the user device and/or the accessory device, and corresponding audio data may be generated.
At block 806, the process 800 may include determining intent data based at least in part on the screen data and the audio data. For example, the system can perform Automatic Speech Recognition (ASR) on the audio data to generate corresponding text data. Natural Language Understanding (NLU) techniques may be performed on the text data to determine an intent associated with the voice command. ASR and NLU techniques are described in more detail below with reference to fig. 12. Determining the intent data can be based at least in part on a finite state transducer associated with a verbal applet that generates instruction data to be sent to a device and/or associated with an application. As part of determining intent data associated with the utterance, named entity recognition may be performed in conjunction with natural language understanding to identify portions of the text data that correspond to named entities recognizable by the remote system. The process may link the text portion to a particular entity known to the remote system.
To perform named entity recognition, screen data may be used. The screen data may be used for entity recognition, for example, by matching results of the ASR operation to different entities associated with the application (e.g., objects displayed on the user device). In this way, the data source database may be populated with some or all of the screen data provided by the user device to assist in named entity identification. As such, the NLU component of the remote system may be trained or otherwise configured to select an intent based on screen data corresponding to content currently being displayed on the user device.
At block 808, the process 800 may include generating instruction data associated with the intent data. The generation may be based at least in part on the intent data. The instruction data may indicate an action to perform for the portion of the content. The intent data can be determined by the NLU component and, in an example, with the aid of an entity recognition operation, can be sent to a verbal applet configured to generate instructions for performing an action for a third-party application. The verbal applet can generate instruction data for execution by the device based at least in part on the intent data determined by the remote system. The verbal applet can be one speech processing component of a plurality of speech processing components associated with the remote system. The method can include selecting the verbal applet from the other speech processing components based at least in part on the first data indicating that the content associated with the application is being displayed and/or the second data identifying the portion of the content. The screen data may be used to generate instruction data for execution by the device and/or one or more accessory devices to achieve the determined intent. Based at least in part on determining that the utterance corresponds to a given intent, instruction data corresponding to the intent and indicating the object on which the desired action is to be performed can be generated and sent to the device.
Generating the instruction data may be based at least in part on an indication that the application has been authorized to receive the instruction data. For example, when a third party application developer publishes an application for sale or consumption on an application store, the application store may query the developer to determine whether the developer wants to voice-enable the application. If the developer indicates that voice enablement is authorized, an indication of the application may be stored in a registry. Thereafter, when the data indicates that content of the application is being displayed on the device, audio data corresponding to the voice command may be processed to voice-enable the application.
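A hedged sketch of this authorization check follows, assuming a simple registry of opted-in application identifiers; the registry contents and function names are hypothetical.

```python
# Hypothetical sketch of the authorization check: only applications whose
# developers opted in to voice enablement appear in a registry, and instruction
# data is generated only for registered applications.
VOICE_ENABLED_REGISTRY = {"com.example.videoapp", "com.example.recipes"}

def may_generate_instructions(application_id: str) -> bool:
    """Return True when the developer authorized voice enablement at publish time."""
    return application_id in VOICE_ENABLED_REGISTRY

def handle_voice_command(application_id: str, intent_data, generate_instruction_data):
    if not may_generate_instructions(application_id):
        return None  # the application is not voice-enabled; do nothing
    return generate_instruction_data(intent_data)

print(may_generate_instructions("com.example.videoapp"))    # True
print(may_generate_instructions("com.example.notoptedin"))  # False
```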
At block 810, process 800 may include sending the instruction data to the device. An instruction processor of the device may receive the instruction data and may determine an action to perform based at least in part on the instruction data. The instruction processor may send data to the device event controller indicating the selected action to be performed and information about the object on which the action is to be performed. The device event controller may then determine which components of the device are to be utilized to perform the actions determined by the instruction processor. The device event controller may also be configured to identify and/or determine when an event corresponding to the displayed content changing and/or being updated occurs. Examples of such events may include launching an application, user interaction with content that causes the content to be updated, refreshing of the content, and/or time-dependent changes in the displayed content.
A node processing component of the device may receive data from the device event controller indicating an action to be performed and an object on which the action is to be performed. The node processing component may identify stored node information. The node processing component may attempt to match or substantially match the identified object from the instruction to a node associated with the application. The process may be performed using a keyword search, where the keywords used in the search may be words that describe the object. The node that matches or best matches the searched phrase may be selected as the node on which the action is to be performed. The keyword processing component of the user device can be used to return a list of searchable words in which stop words such as "and", "of" and/or "the" are filtered out. This information can be used to match keywords to the appropriate nodes. Having determined the node on which the action is to be performed and having determined the action to be performed, the action may be performed on the node of the application.
The process 800 may additionally or alternatively include generating an identifier corresponding to at least one of the objects associated with the application and sending the identifier to the device to be displayed. The process 800 may also include determining that the intent corresponds to a selection of an identifier. The generation of the instruction and/or the determination of the intent associated with the voice command may be based at least in part on the selection of the identifier. For example, the user device, the accessory device, and/or the remote system may be configured to provide one or more "prompts" to help the user more accurately provide voice commands and/or determine intent from voice commands. For example, with screen data indicating objects displayed on the screen, overlay content may be generated that provides, for example, numbers and/or letters associated with the displayed objects. Upon viewing the overlay content, the user may provide a voice command instructing the system to perform an action on the selected number and/or letter. For example, the overlay content may include a presentation of one or more numbers. As used in this example, the numbers may be displayed as overlay content on one or more objects displayed on the user interface.
In an example, a number may be provided for each object displayed on the user interface. In other examples, only a portion of the objects may include an overlay number. For example, it may be determined that multiple objects are associated with the same action when selected. In these examples, one overlay number may be displayed for the multiple objects. In this example, instead of providing a number for each of the plurality of objects, such as a text object, an image, and/or a play icon, a single number may be overlaid on an area of the user interface that is common to the plurality of objects.
The user may then provide a voice command corresponding to the selection of one of the numbers. Data indicating that prompts are being provided to the user, along with data indicating which prompts are associated with which objects, may be provided to the remote system. In this manner, audio data corresponding to the voice command may be processed by the remote system to more easily and/or accurately determine that the voice command corresponds to an intent to select one of the prompts provided on the user interface and to identify the prompt selected by the user. The remote system may associate the selected prompt with the object corresponding to the prompt and may provide instructions to perform an action on the object, as described more fully above. When a user interacts with the displayed content, for example by selecting a prompt, the content may change and/or be updated. The updated content may be used to determine updated screen data, which may be used to generate updated overlay content with updated prompts for use by the user. The updated data can be sent to the remote system to help determine intent and generate instructions for subsequent voice commands.
FIG. 9 shows a flow diagram of an example process 900 for ordering instructions. The order in which operations or steps are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement process 900.
At block 902, the process 900 may include receiving audio data representing a user utterance. The audio data may correspond to audio captured via one or more microphones from an environment in which the device is disposed. For example, the audio may include user utterances from users in the environment. Audio corresponding to the utterance may be captured by one or more microphones of the user device and/or the accessory device, and corresponding audio data may be generated.
At block 904, the process 900 may include determining intent data based at least in part on the audio data. For example, Automatic Speech Recognition (ASR) may be performed on the audio data to generate corresponding text data. Natural Language Understanding (NLU) techniques may be performed on the text data to determine an intent related to the user utterance. ASR and NLU techniques are described in more detail below with reference to fig. 12. As part of determining intent data associated with the utterance, named entity recognition may be performed in conjunction with natural language understanding to identify portions of the text data that correspond to named entities recognizable by the remote system. The process may link the text portion to a particular entity known to the remote system.
To perform named entity recognition, screen data indicating objects displayed on the device may be utilized. The screen data may be used for entity recognition, for example, by matching results of the ASR operation to different entities associated with the application (e.g., objects displayed on the user device). In this way, the data source database may be populated with some or all of the screen data provided by the user device to assist in named entity identification. As such, the NLU component of the system may be trained or otherwise configured to select an intent based on screen data corresponding to content currently being displayed on the user device. Additionally, the NLU component can determine values of one or more slots associated with the intent based on the screen data.
At block 906, the process 900 may include identifying first instruction data corresponding to the intent data. The first instruction data may be configured to, when transmitted to a device, cause the device to perform an operation on the portion of the content. In an example, with the aid of an entity recognition operation, intent data determined by the NLU component can be sent to a verbal applet configured to generate instruction data for performing an action with respect to a third-party application. The verbal applet can generate instruction data for execution by the device based at least in part on the intent data determined by the remote system. The screen data may be used to generate instruction data for execution by the device and/or one or more accessory devices to achieve the determined intent.
At block 908, the process 900 may include identifying second instruction data corresponding to the intent data. The second instruction data may be configured to, when sent to the device, cause the device to perform an operation or another operation with respect to the portion of the content. The second instruction data may be identified in a manner similar to how the first instruction data is identified for block 906. For example, the user utterance may represent an intent that may be determined to correspond to more than one instruction. In these examples, the instruction data may be ranked such that ambiguous utterances may result in the highest ranked instruction data being sent to the user device.
At block 910, the process 900 may include determining a first priority associated with the first instruction data based on a first content type associated with the portion of the content. The first content type may include at least one of text content, image content, and/or text input content. The first priority may be determined on a scale of, for example, 1 to 10. It is to be understood that the example scale provided herein is illustrative and not limiting. No scale may be used, or any alternative scale may be used. Additionally, in some examples, 10 may be the highest priority, while 1 may be the lowest priority. Alternatively, 1 may be the highest priority and 10 may be the lowest priority.
At block 912, the process 900 may include determining a second priority associated with the second instruction data based on a second content type associated with the portion of the content. The second content type may include the same or similar content types as described above for block 910. For example, the image content type may be prioritized over the text content type and the text input content type. Other priorities outside this particular example are included in the present disclosure.
At block 914, the process 900 may include determining that the first instruction data is prioritized over the second instruction data based on the first content type being prioritized over the second content type. Additionally or alternatively, prioritizing the instruction data may be based at least in part on historical usage data, applications associated with the displayed content, locations of objects displayed on the user device relative to each other, classifications of intent, previous user utterances, and/or screen data updates.
For example, the historical usage data may indicate that a given utterance, while corresponding to a plurality of instructions, has historically corresponded to the first instruction data more frequently than the second instruction data for utterances received via the user device. Additionally or alternatively, the data may indicate that the given utterance, while corresponding to the plurality of instructions, has historically corresponded to the first instruction data more frequently than the second instruction data for utterances received via the user device and/or other devices. The application may also indicate which instructions take precedence over others. Additionally or alternatively, data indicating the positions of objects corresponding to content displayed on the user device relative to each other may be used to order the instruction data. For example, instructions that perform an action on a more highlighted object may take precedence over instructions that perform an action on a less highlighted object. Additionally or alternatively, certain intents may not depend on the particular content displayed on the user device and thus may be associated with predetermined instruction data. For example, a user utterance of "scroll down" may correspond to an intent to display content on the user device that is not currently in view, and may not correspond to an intent to perform an action with respect to an object displayed on the user device. Instruction data for performing an action based on an intent that does not depend on content, such as this, may be prioritized over instruction data for performing an action that depends on content.
Additionally or alternatively, data indicative of previous utterances may be used to rank the instruction data. For example, the previous utterance may be "scroll down," while the subsequent utterance may be "more." Without context data indicating the previous utterance, a "more" utterance may correspond to instruction data that performs actions such as displaying more videos, providing more information about a certain video, playing more of a video, and so forth. However, with the previous utterance of "scroll down," the instruction data may be ordered such that the instruction to perform an additional scroll-down action takes precedence over other instructions. Additionally or alternatively, data indicating that the screen data has changed or has otherwise been updated may be used to order the instruction data. Additionally or alternatively, a predetermined prioritization of instruction data may be stored and utilized by the remote system.
For example, instruction data to perform an action on an object associated with an application may be ordered based at least in part on the type of content acted upon. For example, content associated with both an image and text may be prioritized over content having only text, only an image, selectable text, and/or editable text. For example, a user utterance of "play video" may be associated with instruction data that performs an action on various objects, such as an image representing the video with a play icon overlaid thereon, text reading "play," a play icon, and/or an editable field (e.g., a search field into which the phrase "play video" may be inserted). In this example, the instruction data associated with the image and the overlaid play icon may take precedence over the other instruction data. Likewise, the play icon may be prioritized over the text reading "play," and the text reading "play" may take precedence over the editable field. The ordering of the instruction data can be based at least in part on the intent determined by the NLU component. For example, a determined "play" intent may correspond to the ordering described above. Additionally or alternatively, a determined "search" intent may correspond to an ordering in which instruction data that performs an action on an object associated with an editable field takes precedence over instruction data that performs an action associated with selection of an object. Additionally or alternatively, a determined "selection" intent may correspond to an ordering in which instruction data that causes an action to be performed on an object whose selection causes the content to be updated takes precedence over instructions that perform actions on other objects (e.g., inserting text into a search field). It should be understood that examples of ordering instruction data are provided herein for illustration, and that other examples of ordering instruction data are included in the present disclosure.
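A minimal sketch of intent-keyed, content-type-based prioritization follows; the ranking tables mirror the "play" and "search" examples above, and the data shapes and identifiers are assumptions made for this illustration.

```python
# Hypothetical sketch of ranking candidate instruction data by the content
# type acted upon, with the ordering keyed to the determined intent.
CONTENT_TYPE_RANKING = {
    "play":   ["image_with_text", "play_icon", "text", "editable_field"],
    "search": ["editable_field", "image_with_text", "text", "play_icon"],
}

def prioritize(intent_name, candidates):
    """candidates: list of dicts with 'instruction_id' and 'content_type'."""
    order = CONTENT_TYPE_RANKING.get(intent_name, [])
    def rank(candidate):
        content_type = candidate["content_type"]
        return order.index(content_type) if content_type in order else len(order)
    return sorted(candidates, key=rank)

candidates = [
    {"instruction_id": "insert-into-search-field", "content_type": "editable_field"},
    {"instruction_id": "tap-thumbnail",            "content_type": "image_with_text"},
]
print(prioritize("play", candidates)[0]["instruction_id"])    # tap-thumbnail
print(prioritize("search", candidates)[0]["instruction_id"])  # insert-into-search-field
```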
At block 916, the process 900 may include selecting the first instruction data based at least in part on the first instruction data being prioritized over the second instruction data. At block 918, the process 900 may include sending first instruction data to the device to cause an action to be performed on the portion of the content.
FIG. 10 shows a flow diagram of an example process 1000 for ordering instructions. The order in which operations or steps are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement process 1000.
At block 1002, process 1000 may include receiving audio data representing a user utterance. The audio data may correspond to audio captured via one or more microphones from an environment in which the device is disposed. For example, the audio may include user utterances from users in the environment. Audio corresponding to the utterance may be captured by one or more microphones of the user device and/or the accessory device, and corresponding audio data may be generated.
At block 1004, the process 1000 may include determining intent data associated with the user utterance based at least in part on the audio data. Automatic Speech Recognition (ASR) may be performed on the audio data to generate corresponding text data. Natural Language Understanding (NLU) techniques may be performed on the text data to determine intent data related to the user utterance. ASR and NLU techniques are described in more detail below with reference to fig. 12. As part of determining intent data associated with the user utterance, named entity recognition may be performed in conjunction with natural language understanding to identify portions of the text data that correspond to named entities recognizable by the remote system. The process may link the text portion to a particular entity known to the remote system.
To perform named entity recognition, screen data indicating objects corresponding to content displayed on the device may be utilized. The screen data may be used for entity recognition, for example, by matching results of the ASR operation to different entities associated with the application (e.g., objects displayed on the user device). In this way, the data source database may be populated with some or all of the screen data provided by the user device to assist in named entity identification. As such, the NLU component of the system may be trained or otherwise configured to select an intent based on screen data corresponding to content currently being displayed on the user device. Additionally, the NLU component can determine values of one or more slots associated with the intent based on the screen data.
At block 1006, the process 1000 may include identifying first instruction data corresponding to the intent data. The first instruction data may be configured to cause the device to perform an operation when transmitted to the device. The first instruction data may be configured to be sent to the device to perform an operation on an object associated with content displayed on the device. In an example, with the aid of an entity recognition operation, intent data determined by the NLU component can be sent to a verbal applet configured to generate instruction data for performing an action with respect to a third-party application. The verbal applet can generate instruction data for execution by the device based at least in part on the intent data determined by the remote system. The screen data may be used to generate instruction data for execution by the device and/or one or more accessory devices to achieve the determined intent.
At block 1008, the process 1000 may include identifying second instruction data corresponding to the intent data. The second instruction data may be configured to cause the device to perform the operation or another operation when transmitted to the device. The second instruction data may be identified in a manner similar to how the first instruction data is identified for block 1006. For example, the user utterance may represent an intent that may be determined to correspond to more than one instruction. In these examples, the instruction data may be ranked such that ambiguous user utterances may result in the highest ranked instruction data being sent to the user device.
At block 1010, the process 1000 may include determining that the first instruction data is prioritized over the second instruction data. Prioritizing the instruction data may be based at least in part on historical usage data, applications associated with the displayed content, locations of objects displayed on the user device relative to one another, classifications of intent, previous voice commands, and/or screen data updates.
For example, the historical usage data may indicate that a given voice command, while corresponding to a plurality of instructions, has historically corresponded to the first instruction data more frequently than the second instruction data for voice commands received via the user device. Additionally or alternatively, the data may indicate that the given utterance, while corresponding to the plurality of instructions, has historically corresponded to the first instruction data more frequently than the second instruction data for utterances received via the user device and/or other devices. The application may also indicate which instruction data takes precedence over other instruction data. Additionally or alternatively, data indicating the positions of objects displayed on the user device relative to each other may be used to order the instruction data. For example, instruction data that performs an action on a more highlighted object may be prioritized over instruction data that performs an action on a less highlighted object. Additionally or alternatively, certain intents may not depend on a particular object displayed on the user device and thus may be associated with predetermined instruction data. For example, a user utterance of "scroll down" may correspond to an intent to display content on the user device that is not currently in view, and may not correspond to an intent to perform an action with respect to an object displayed on the user device. Instruction data that performs an action based on an intent that does not depend on an object, such as this, may be prioritized over instruction data that performs an action that depends on an object. As a further example, process 1000 may include determining a second intent associated with a second user utterance. It may be determined that at least one of the first instruction data or the second instruction data corresponds to the second intent, and third instruction data may be identified as corresponding to the second intent. In this example, the third instruction data may be independent of the object and/or content, such that the action associated with the instruction data does not require a value of the object to render the instruction operational. The third instruction data may be selected based at least in part on the third instruction data being object- and/or content-independent. The third instruction data may then be sent to the device.
Additionally or alternatively, data indicative of previous utterances may be used to rank the instruction data. For example, the previous utterance may be "scroll down," while the subsequent utterance may be "more." Without context data indicating the previous utterance, a "more" utterance may correspond to instruction data that performs actions such as displaying more videos, providing more information about a certain video, playing more of a video, and so forth. However, with the previous utterance of "scroll down," the instruction data may be ordered such that the instruction data for performing an additional scroll-down action takes precedence over the other instruction data. Additionally or alternatively, data indicating that the screen data has changed or has otherwise been updated may be used to order the instruction data. Additionally or alternatively, a predetermined prioritization of instruction data may be stored and utilized by the remote system.
For example, instruction data to perform an action on an object associated with an application may be ordered based at least in part on the type of object acted upon. In these examples, process 1000 may include determining that the first instruction data is associated with a value associated with the intent, where the value may indicate that the first object on which the operation is performed is associated with a first object type. Process 1000 may also include determining that the value is associated with a second object of a second object type. The instruction data may be ordered such that instructions associated with one type of object take precedence over instructions associated with another type of object. For example, an object associated with both an image and text may be prioritized over objects having only text, only an image, selectable text, and/or editable text. For example, a voice command of "play video" may be associated with instructions to perform an action on various objects, such as an image representing the video with a play icon overlaid thereon, text reading "play," a play icon, and/or an editable field (e.g., a search field into which the phrase "play video" may be inserted). In this example, the instructions associated with the image and the overlaid play icon may take precedence over the other instructions. Likewise, the play icon may be prioritized over the text reading "play," and the text reading "play" may take precedence over the editable field.
Prioritizing the instruction data may be based at least in part on intent data determined by the NLU component. For example, the determined "play" intent may correspond to the ordering described above. Additionally or alternatively, the determined "search" intent may correspond to an ordering of instruction data to perform an action on an object associated with the editable field over instruction data to perform an action on an object associated with the selection of the object. Additionally or alternatively, the determined "selection" intent may correspond to an ordering of instruction data that would cause an action to be performed on the object whose selection caused the content to be updated over instructions that perform actions on other objects (e.g., inserting text into a search field). It should be understood that examples of ordering of instruction data are provided herein for illustration, and that other examples of ordering instruction data are included in the present disclosure.
At block 1012, process 1000 may include selecting the first instruction data based at least in part on the first instruction data being prioritized over the second instruction data. At block 1014, the process 1000 may include sending the first instruction data to the device to cause an action to be performed on the object.
Process 1000 may additionally include receiving an indication that the content displayed by the device has been updated and determining a second ordering of the first instruction data and the second instruction data. Process 1000 may also include selecting the second instruction data based at least in part on the second ordering and sending the second instruction data to the device. In these examples, the updated content displayed on the device may include different objects, may be associated with different actions to be performed on the objects, and/or may be associated with contextual information indicating that certain instruction data should be prioritized over other instruction data when sent to the device.
FIG. 11 shows a flow diagram of an example process 1100 for ordering instructions. The order in which operations or steps are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement process 1100.
At block 1102, the process 1100 may include determining intent data associated with the user utterance based at least in part on audio data representing the user utterance. The audio data may correspond to audio captured via one or more microphones from an environment in which the device is disposed. For example, the audio may include user utterances from users in the environment. Audio corresponding to the user utterance may be captured by one or more microphones of the user device and/or the accessory device, and corresponding audio data may be generated. Automatic Speech Recognition (ASR) may be performed on the audio data to generate corresponding text data. Natural Language Understanding (NLU) techniques may be performed on the text data to determine intent data related to the user utterance. ASR and NLU techniques are described in more detail below with reference to fig. 12. As part of determining intent data associated with the user utterance, named entity recognition may be performed in conjunction with natural language understanding to identify portions of the text data that correspond to named entities recognizable by the remote system. The process may link the text portion to a particular entity known to the remote system.
To perform named entity recognition, screen data indicating objects displayed on the device may be utilized. The screen data may be used for entity recognition, for example, by matching results of the ASR operation to different entities associated with the application (e.g., objects displayed on the user device). In this way, the data source database may be populated with some or all of the screen data provided by the user device to assist in named entity identification. As such, the NLU component of the system may be trained or otherwise configured to select an intent based on screen data corresponding to content currently being displayed on the user device. Additionally, the NLU component can determine values for one or more slots associated with the intent based on the screen data.
At block 1104, the process 1100 may include identifying first instruction data corresponding to the intent data. The first instruction data may be configured to be sent to a device to perform an operation. The first instruction data may be configured to be sent to the device to perform an operation on an object associated with content displayed on the device. In an example, with the aid of an entity recognition operation, intent data determined by the NLU component can be sent to a verbal applet configured to generate instruction data for performing an action with respect to a third-party application. The verbal applet can generate instruction data for execution by the device based at least in part on the intent data determined by the remote system. The screen data may be used to generate instruction data for execution by the device and/or one or more accessory devices to achieve the determined intent.
At block 1106, the process 1100 may include identifying second instruction data corresponding to the intent data. The second instruction data may be configured to be sent to the device to perform the operation or another operation. The second instruction data may be identified in a manner similar to how the first instruction data is identified for block 1104. For example, the user utterance may represent an intent that may be determined to correspond to more than one instruction. In these examples, the instruction data may be ranked such that ambiguous utterances may result in the highest ranked instruction data being sent to the user device.
At block 1108, process 1100 may include determining that the first instruction data is prioritized over the second instruction data. Prioritizing the instruction data may be based at least in part on historical usage data, applications associated with the displayed content, locations of objects displayed on the user device relative to one another, classifications of intent, previous voice commands, and/or screen data updates.
For example, the historical usage data may indicate that a given user utterance, while corresponding to a plurality of instructions, has historically corresponded to the first instruction data more frequently than the second instruction data for utterances received via the user device. Additionally or alternatively, the data may indicate that the given utterance, while corresponding to the plurality of instructions, has historically corresponded to the first instruction data more frequently than the second instruction data for utterances received via other devices. The application may also indicate which instruction data takes precedence over other instruction data. Additionally or alternatively, data indicating the positions of objects displayed on the user device relative to each other may be used to order the instruction data. For example, instruction data that performs an action on a more highlighted object may be prioritized over instruction data that performs an action on a less highlighted object. Additionally or alternatively, certain intents may not depend on a particular object displayed on the user device and thus may be associated with predetermined instruction data. For example, a voice command of "scroll down" may correspond to an intent to display content on the user device that is not currently in view, and may not correspond to an intent to perform an action with respect to an object displayed on the user device. Instruction data that performs an action based on an intent that does not depend on an object, such as this, may be prioritized over instruction data that performs an action that depends on an object. As a further example, process 1100 may include determining a second intent associated with a second user utterance. It may be determined that at least one of the first instruction data or the second instruction data corresponds to the second intent, and third instruction data may be identified as corresponding to the second intent. In this example, the third instruction data may be independent of the object, such that the action associated with the instruction data does not require a value of the object to render the instruction operational. The third instruction data may be selected based at least in part on the third instruction data being object-independent. The third instruction data may then be sent to the device.
Additionally or alternatively, data indicative of previous utterances may be used to prioritize the instruction data. For example, the previous utterance may be "scroll down," while the subsequent utterance may be "more." Without context data indicating the previous utterance, a "more" utterance may correspond to instruction data that performs actions such as displaying more videos, providing more information about a certain video, playing more of a video, and so forth. However, with the previous utterance of "scroll down," the instruction data may be ordered such that the instruction data for performing an additional scroll-down action takes precedence over the other instruction data. Additionally or alternatively, data indicating that the screen data has changed or has otherwise been updated may be used to order the instruction data. Additionally or alternatively, a predetermined prioritization of instruction data may be stored and utilized by the remote system.
For example, instruction data to perform an action on an object associated with an application may be prioritized based at least in part on the type of object acted upon. In these examples, process 1100 may include determining that the first instruction data is associated with a value associated with the intent, where the value may indicate that the first object, or the portion of content corresponding to the object on which the operation is performed, is associated with a first object type. The process 1100 may also include determining that the value is associated with a second object of a second object type. The instruction data may be ordered such that instruction data associated with one type of object takes precedence over instruction data associated with another type of object. For example, an object associated with both an image and text may be prioritized over objects having only text, only an image, selectable text, and/or editable text. For example, a user utterance of "play video" may be associated with instruction data that performs an action on various objects, such as an image representing the video with a play icon overlaid thereon, text reading "play," a play icon, and/or an editable field (e.g., a search field into which the phrase "play video" may be inserted). In this example, the instruction data associated with the image and the overlaid play icon may take precedence over the other instruction data. Likewise, the play icon may be prioritized over the text reading "play," and the text reading "play" may take precedence over the editable field.
Prioritizing the instruction data may be based at least in part on intent data determined by the NLU component. For example, the determined "play" intent may correspond to the ordering described above. Additionally or alternatively, the determined "search" intent may correspond to an ordering of instructions that prioritize execution of actions on objects associated with the editable fields over execution of actions on objects associated with selections of the objects. Additionally or alternatively, the determined "selection" intent may correspond to an ordering of instruction data that would cause an action to be performed on the object whose selection caused the content to be updated over instructions that perform actions on other objects (e.g., inserting text into a search field). It should be understood that examples of ordering of instruction data are provided herein for illustration, and that other examples of ordering instruction data are included in the present disclosure.
At block 1110, process 1100 may include sending the first instruction data to the device based at least in part on determining that the first instruction data is prioritized over the second instruction data.
Process 1100 may additionally include receiving an indication that the content displayed by the device has been updated and determining a second ordering of the first instruction data and the second instruction data. Process 1100 may also include selecting the second instruction data based at least in part on the second ordering and sending the second instruction data to the device. In these examples, the updated content displayed on the device may include different objects, may be associated with different actions to be performed on the objects, and/or may be associated with contextual information indicating that certain instructions should be prioritized over other instructions when sent to the device.
Fig. 12 illustrates a conceptual diagram of how a spoken utterance is processed that allows the system to capture and execute a command spoken by a user, such as a spoken command that may follow a wake word or trigger utterance (i.e., a predefined word or phrase for "waking up" a device, causing the device to begin transmitting audio data to a remote system, such as system 108). The various components shown may be located on the same or different physical devices. Communication between the various components shown in fig. 12 may occur directly or through network 110. An audio capture component (e.g., the microphone 118 of the user device 102 or another device) captures audio 1200 corresponding to the spoken utterance. The device 102 or 104 then processes the audio data corresponding to the audio 1200 using the wake word detection module 1201 to determine whether a keyword (e.g., wake word) is detected in the audio data. After detecting the wake word, the device 102 or 104 sends audio data 1202 corresponding to the utterance to the remote system 108 that includes the ASR module 1203. The audio data 1202 may be output from an optional Acoustic Front End (AFE) 1256 located on the device prior to transmission. In other examples, the audio data 1202 may take a different form for processing by the remote AFE 1256 (e.g., the AFE 1256 located in the remote system 108 with the ASR module 1203).
The wake word detection module 1201 works in conjunction with other components of the user device, such as a microphone, to detect keywords in the audio 1200. For example, the device may convert audio 1200 to audio data and process the audio data using wake word detection module 1201 to determine whether human sound is detected and, if so, whether the audio data including human sound matches an audio signature and/or model corresponding to a particular keyword.
The user device may use various techniques to determine whether the audio data includes human sound. Some embodiments may apply Voice Activity Detection (VAD) techniques. Such techniques may determine whether human sound is present in the audio input based on various quantitative aspects of the audio input, such as the spectral slope between one or more frames of the audio input; the energy level of the audio input in one or more spectral bands; the signal-to-noise ratio of the audio input in one or more spectral bands; or other quantitative aspects. In other embodiments, the user device may implement a limited classifier configured to distinguish human sound from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other embodiments, Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) techniques may be applied to compare the audio input to one or more acoustic models in a human sound store, which may include models corresponding to human sound, noise (e.g., ambient or background noise), or silence. Other techniques may also be used to determine whether human sound is present in the audio input.
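A minimal sketch of one of the VAD cues mentioned above (frame energy against a noise floor) follows; real systems combine several cues and statistical models, and the frame size, noise floor, and threshold here are illustrative assumptions.

```python
# Hypothetical sketch of an energy/SNR-based voice activity check on one frame.
import math

def frame_energy(samples):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in samples) / len(samples)

def is_speech(frame, noise_floor=1e-4, snr_threshold_db=10.0):
    """Flag a frame as speech when its energy exceeds the noise floor by the SNR threshold."""
    energy = frame_energy(frame)
    snr_db = 10.0 * math.log10(max(energy, 1e-12) / noise_floor)
    return snr_db >= snr_threshold_db

quiet_frame = [0.001] * 160   # ~10 ms at 16 kHz
loud_frame = [0.05] * 160
print(is_speech(quiet_frame), is_speech(loud_frame))  # False True
```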
Once human sound is detected in the audio received by the user device (or separate from human sound detection), the user device may perform wake word detection using wake word detection module 1201 to determine when the user intends to speak a command to the user device. This process may also be referred to as keyword detection, where a wake word is a particular example of a keyword. In particular, keyword detection may be performed without performing linguistic, textual, or semantic analysis. Instead, the input audio (or audio data) is analyzed to determine whether a particular characteristic of the audio matches a preconfigured acoustic waveform, audio signature, or other data to determine whether the input audio "matches" the stored audio data corresponding to the keyword.
Accordingly, the wake word detection module 1201 may compare the audio data to stored models or data to detect the wake word. One approach for wake word detection employs a general Large Vocabulary Continuous Speech Recognition (LVCSR) system to decode the audio signal, with the wake word search conducted in the resulting lattice or confusion network. LVCSR decoding may require relatively high computational resources. Another approach to wake word spotting builds Hidden Markov Models (HMMs) for the wake word and for non-wake-word speech signals, respectively. Non-wake-word speech includes other spoken words, background noise, and the like. One or more HMMs may be built to model the non-wake-word speech characteristics; these are referred to as filler models. Viterbi decoding is used to search for the best path in the decoding graph, and the decoding output is further processed to make a decision on the presence of the keyword. The approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework. In another embodiment, the wake word spotting system may be built directly on a Deep Neural Network (DNN)/Recurrent Neural Network (RNN) structure without an HMM being involved. Such a system may estimate the posteriors of the wake word with context information, either by stacking frames within a context window for the DNN or by using the RNN. Subsequent posterior threshold tuning or smoothing is applied to the decision. Other techniques for wake word detection, such as those known in the art, may also be used.
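As a non-authoritative illustration of the posterior smoothing and thresholding step mentioned for the DNN/RNN approach, the sketch below assumes some acoustic model (not shown) has already produced per-frame wake-word posteriors; the window length and threshold are arbitrary example values.

```python
import numpy as np

def wake_word_decision(posteriors, window=30, threshold=0.85):
    """Smooth per-frame wake-word posteriors and apply a fixed threshold.

    `posteriors` is assumed to be a 1-D array of P(wake word | frame) values
    from a DNN/RNN acoustic model. A moving average followed by a threshold
    is one simple instance of the posterior smoothing/tuning step.
    """
    posteriors = np.asarray(posteriors, dtype=np.float64)
    if len(posteriors) < window:
        return False
    kernel = np.ones(window) / window
    smoothed = np.convolve(posteriors, kernel, mode="valid")
    return bool(np.max(smoothed) >= threshold)
```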
Upon detecting the wake word, the local device 102 and/or 104 can "wake up" and begin transmitting audio data 1202 corresponding to the input audio 1200 to the remote system 108 for speech processing. Audio data corresponding to the audio can be sent to the remote system 108 for routing to the recipient device, or can be sent to the remote system 108 for speech processing to interpret the contained speech (for purposes of enabling voice communication and/or for purposes of executing commands in the speech). The audio data 1202 may include data corresponding to the wake word, or a portion of the audio data corresponding to the wake word may be removed by the local device 102 and/or 104 prior to transmission. Further, as described herein, the local device may "wake up" when speech/spoken audio is detected above a threshold. Once received by the remote system 108, the ASR module 1203 may convert the audio data 1202 into text. The ASR transcribes the audio data into text data representing the speech words contained in the audio data 1202. This text data may then be used by other components for various purposes, such as executing system commands, entering data, and so forth. The spoken utterance in the audio data is input to a processor configured to perform ASR, which then interprets the utterance based on similarities between the utterance and pre-established language models 1254 stored in an ASR model knowledge base (ASR model memory 1252). For example, the ASR process may compare the input audio data to models of sounds (e.g., subword units or phonemes) and sound sequences to identify words that match the sound sequences spoken in the utterance of the audio data.
The different ways in which a spoken utterance (i.e., different hypotheses) may be interpreted may each be assigned a probability or confidence score that represents the likelihood that a particular word set matches those spoken in the utterance. The confidence score may be based on a number of factors, including, for example, the similarity of the sounds in the utterance to models of language sounds (e.g., the acoustic models 1253 stored in the ASR model memory 1252) and the likelihood that a particular word that matches a sound will be included at a particular location in the sentence (e.g., using language or grammar models). Thus, each potential text interpretation (hypothesis) of the spoken utterance is associated with a confidence score. Based on the factors considered and the assigned confidence scores, the ASR process 1203 outputs the most likely text identified in the audio data. The ASR process may also output a plurality of hypotheses in the form of a lattice or N-best list, where each hypothesis corresponds to a confidence score or other score (e.g., a probability score, etc.).
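The sketch below illustrates, under simple assumptions, how hypotheses carrying acoustic and language model scores might be combined into an N-best list. The linear score combination and the `lm_weight` value are illustrative; production decoders typically also apply tuned scale factors and word-insertion penalties.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    acoustic_score: float   # log-likelihood from the acoustic model
    lm_score: float         # log-probability from the language model

def n_best(hypotheses, lm_weight=0.8, n=5):
    """Rank ASR hypotheses by a weighted combination of acoustic and LM scores."""
    scored = sorted(hypotheses,
                    key=lambda h: h.acoustic_score + lm_weight * h.lm_score,
                    reverse=True)
    return scored[:n]

# Example: the grammatically likelier hypothesis wins despite a slightly worse acoustic score.
print(n_best([Hypothesis("play you're welcome", -120.0, -8.2),
              Hypothesis("play your welcome", -119.5, -11.0)])[0].text)
```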
One or more devices performing ASR processing may include an Acoustic Front End (AFE) 1256 and a speech recognition engine 1258. The AFE 1256 converts the audio data from the microphone into data for processing by the speech recognition engine 1258. The speech recognition engine 1258 compares the speech recognition data to the acoustic models 1253, the language models 1254, and other data models and information used to recognize the speech conveyed in the audio data. The AFE 1256 may reduce noise in the audio data and divide the digitized audio data into frames representing time intervals; for each frame the AFE 1256 determines a number of values (referred to as features) representing qualities of the audio data, along with a set of those values (referred to as a feature vector) representing the features/qualities of the audio data within the frame. As is known in the art, many different features may be determined, and each feature represents some quality of the audio that may be useful for ASR processing. The AFE may process the audio data using a number of approaches, such as mel-frequency cepstral coefficients (MFCCs), Perceptual Linear Prediction (PLP) techniques, neural network feature vector techniques, linear discriminant analysis, semi-tied covariance matrices, or other approaches known to those skilled in the art.
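A hedged example of the feature-extraction step: the sketch below uses the librosa library to compute MFCC frames plus first-order deltas, which is only one of the feature types listed above. The sampling rate, frame/hop sizes, and the use of librosa itself are assumptions made for illustration, not details from this disclosure.

```python
import librosa
import numpy as np

def extract_feature_vectors(wav_path, sr=16000, n_mfcc=13):
    """Produce one feature vector per ~25 ms frame, as an AFE might.

    Returns an array of shape (num_frames, 2 * n_mfcc): MFCCs plus deltas,
    so each frame carries some temporal context.
    """
    samples, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
    delta = librosa.feature.delta(mfcc)   # first-order temporal derivatives
    return np.vstack([mfcc, delta]).T
```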
The speech recognition engine 1258 may process the output from the AFE 1256 with reference to information stored in the speech/model memory (1252). Alternatively, post-front-end processed data (e.g., feature vectors) may be received by the device performing ASR processing from a source other than the internal AFE. For example, the user device may process the audio data into feature vectors (e.g., using the on-device AFE 1256) and transmit that information over the network to the server for ASR processing. The feature vectors may arrive at the remote system 108 encoded, in which case they may be decoded prior to processing by the processor executing the speech recognition engine 1258.
The speech recognition engine 1258 attempts to match the received feature vectors to linguistic phonemes and words known in the stored acoustic models 1253 and language models 1254. The speech recognition engine 1258 calculates a recognition score for the feature vector based on the acoustic information and the language information. The acoustic information is used to calculate an acoustic score that represents the likelihood that an expected sound represented by a set of feature vectors matches a linguistic phoneme. The linguistic information is used to adjust the acoustic scores by considering what sounds and/or words are used in context with each other, thereby increasing the likelihood that the ASR process will output grammatically meaningful verbal results. The particular model used may be a generic model or may be a model corresponding to a particular domain (e.g., music, bank, etc.).
The speech recognition engine 1258 may use a variety of techniques to match the feature vectors to the phonemes, for example, using a Hidden Markov Model (HMM) to determine a probability that a feature vector may be matched to a phoneme. The received sound may be represented as a path between HMM states, and multiple paths may represent multiple possible text matches for the same sound.
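For readers unfamiliar with HMM decoding, the sketch below implements plain Viterbi decoding over a small state space. The inputs (per-frame log emission scores, a log transition matrix, and log initial probabilities) are assumed to come from some acoustic model; none of these structures are specified by this disclosure, and real recognizers decode over far larger, word-constrained graphs.

```python
import numpy as np

def viterbi(log_emissions, log_trans, log_init):
    """Find the most likely HMM state path for a sequence of feature vectors.

    log_emissions: (T, S) log P(observation_t | state_s)
    log_trans:     (S, S) log transition probabilities between states
    log_init:      (S,)   log initial state probabilities
    Alternative paths through this trellis correspond to the multiple
    possible text matches mentioned above.
    """
    T, S = log_emissions.shape
    delta = np.full((T, S), -np.inf)
    backptr = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_emissions[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans          # (S, S): prev state -> next state
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(S)] + log_emissions[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                            # trace the best path backwards
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```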
After ASR processing, the speech recognition engine 1258 may send the ASR results to other processing components, which may be local to the device performing the ASR, and/or distributed throughout the network. For example, the ASR results in the form of a single text representation of the speech, an N-best list including multiple hypotheses and respective scores, a lattice, etc. may be sent to the remote system 108 for Natural Language Understanding (NLU) processing, e.g., converting the text into commands that are executed by the user device, the remote system 108, or another device (e.g., a server running a particular application such as a search engine).
The device performing NLU processing 1205 (e.g., server 108) may include various components, potentially including a dedicated processor, memory, storage, and the like. As shown in fig. 12, the NLU component 1205 can include a recognizer 1263 that includes a Named Entity Recognition (NER) module 1262 for identifying portions of the query text that correspond to named entities recognizable by the system. A downstream process called named entity resolution links each text portion to a specific entity known to the system. To perform named entity resolution, the system may utilize gazetteer information (1284a-1284n) stored in an entity library memory 1282. The gazetteer information may be used for entity resolution, e.g., matching ASR results with different entities (e.g., song titles, contact names). Gazetteers may be linked to users (e.g., a particular gazetteer may be associated with a particular user's music collection), may be linked to certain domains (e.g., shopping), or may be organized in a variety of other ways.
Typically, the NLU process takes text input (e.g., text processed by the ASR 1203 based on the utterance input audio 1200) and attempts to semantically interpret the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. The NLU process 1205 interprets the text string to derive an intent or desired action from the user, as well as the relevant pieces of information in the text that allow the device (e.g., device 102 and/or 104) to complete the action. For example, if a spoken utterance processed using ASR 1203 results in the text "play Jeopardy," the NLU process may determine that the user intends for the device to launch a Jeopardy game.
The NLU can process multiple text inputs related to the same utterance. For example, if the ASR 1203 outputs N text segments (as part of an N-best list), the NLU may process all N outputs to obtain the NLU results.
As will be discussed further below, the NLU process may be configured to parse and tag text as part of annotating the text during NLU processing. For example, for the text "play You're Welcome," "play" may be tagged as a command (to access a song and output the corresponding audio), while "You're Welcome" may be tagged as the particular song to be played.
To correctly perform NLU processing of speech input, NLU processing 1205 can be configured to determine the "domain" of the utterance in order to determine and narrow down which services provided by the endpoint device (e.g., remote system 108 or user device) may be relevant. For example, an endpoint device may provide services that involve interaction with a telephone service, a contact list service, a calendar/scheduling service, a music player service, and so forth. Words in a single text query may relate to multiple services, while some services may be functionally linked (e.g., both a telephone service and a calendar service may use data in a contact list).
A Named Entity Recognition (NER) module 1262 receives the query in the form of an ASR result and attempts to identify relevant grammatical and lexical information that can be used to understand meaning. To this end, the NLU module 1205 may begin by identifying potential domains that may be related to the received query. The NLU memory 1273 includes a device database (1274a-1274n) that identifies domains associated with particular devices. For example, user devices may be associated with domains for music, telephony, calendars, contact lists, and device-specific communications, rather than video. In addition, the entity library may include database entries for particular services on particular devices, indexed by device ID, user ID, or home ID, or some other indicator.
In NLU processing, a domain may represent a discrete set of activities that share a common theme, e.g., "shopping," "music," "calendar," etc. Thus, each domain may be associated with a particular recognizer 1263, a language model and/or grammar database (1276a-1276n), a particular set of intents/actions (1278a-1278n), and a particular personalized lexicon (1286). Each gazetteer (1284a-1284n) may include domain-indexed lexical information associated with a particular user and/or device. For example, gazetteer A (1284a) includes domain-indexed lexical information 1286aa to 1286an. A user's contact-list lexical information may include the names of the user's contacts. Since every user's contact list is presumably different, this personalized information improves entity resolution.
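The per-domain resources enumerated above (intents, grammar frames, and a personalized gazetteer) could be organized roughly as in the sketch below. The class names, fields, and the example music domain are illustrative assumptions only, not structures defined by this disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Gazetteer:
    """Per-user, per-domain lexical entries (e.g. contact names, song titles)."""
    entries: Dict[str, List[str]] = field(default_factory=dict)   # slot type -> known values

@dataclass
class Domain:
    """One NLU domain and the per-domain resources the passage enumerates."""
    name: str
    intents: List[str]
    grammar_frames: Dict[str, List[str]]          # intent -> sentence frames with {slots}
    gazetteer: Gazetteer = field(default_factory=Gazetteer)

# Illustrative wiring only; names and values are assumptions, not content of the patent.
music = Domain(
    name="music",
    intents=["PlaySong", "Mute"],
    grammar_frames={"PlaySong": ["play the song {song name}", "play {song name}"]},
    gazetteer=Gazetteer(entries={"song name": ["You're Welcome", "Thriller"]}),
)
```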
As described above, in conventional NLU processing, a query may be processed by applying the rules, models, and information applicable to each identified domain. For example, if a query potentially implicates both communications and music, the query may be processed substantially in parallel using the grammar models and lexical information for communications and using the grammar models and lexical information for music. The responses to the query generated based on each set of models are scored, with the overall highest-ranked result from across all applied domains ordinarily selected as the correct result.
An Intent Classification (IC) module 1264 parses the query to determine one or more intents for each identified domain, where an intent corresponds to an action to be performed in response to the query. Each domain is associated with a database (1278a-1278n) of words linked to intents. For example, a music intent database may link words and phrases such as "quiet," "volume down," and "silence" to a "silence" intent, while a voice-message intent database may link words and phrases such as "send a message," "send a voice message," and "send the following" to a "send message" intent. The IC module 1264 identifies potential intents for each identified domain by comparing the words in the query to the words and phrases in the intent database 1278. In some cases, the IC module 1264 determines the intent using a set of rules or templates that are processed against the incoming text to identify matching intents.
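A minimal sketch of the rule/template style of intent matching described above follows. The domains, trigger phrases, and intent names are invented for illustration and do not come from this disclosure.

```python
# Each domain maps trigger words/phrases to an intent, and the query is
# matched against every identified domain.
INTENT_RULES = {
    "music": {"quiet": "Silence", "volume down": "Silence", "silence": "Silence",
              "play": "PlaySong"},
    "messaging": {"send a message": "SendMessage", "send a voice message": "SendMessage",
                  "send the following": "SendMessage"},
}

def classify_intents(query: str):
    """Return (domain, intent) candidates whose trigger phrases appear in the query."""
    text = query.lower()
    matches = []
    for domain, rules in INTENT_RULES.items():
        for phrase, intent in rules.items():
            if phrase in text:
                matches.append((domain, intent))
    return matches

print(classify_intents("please turn the volume down"))   # [('music', 'Silence')]
```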
To generate a specifically interpreted response, the NER 1262 applies the grammar models and lexical information associated with the respective domain to actually recognize mentions of one or more entities in the query text. In this manner, the NER 1262 identifies "slots" or values (i.e., particular words in the query text) that may be needed for later command processing. Depending on the complexity of the NER 1262, it may also label each slot with a type of varying specificity (e.g., noun, place, city, artist name, song title, and so on). Each grammar model 1276 includes the names of entities (i.e., nouns) commonly found in speech about the particular domain (i.e., generic terms), whereas the lexical information 1286 from the gazetteer 1284 is personalized to the user and/or the device. For instance, a grammar model associated with the shopping domain may include a database of words commonly used when people discuss shopping.
The intents identified by the IC module 1264 are linked to domain-specific grammar frameworks (included in 1276) with "slots" or "fields" to be populated with values. Each slot/field corresponds to a portion of the query text that the system believes corresponds to an entity. To make resolution more flexible, these frameworks are ordinarily not structured as sentences, but rather are based on associating slots with grammatical tags. For example, if "play song" is the identified intent, one or more grammar (1276) frameworks may correspond to sentence structures such as "play the song {song name}" and/or "play {song name}."
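One simple way to realize such slot-bearing frames is to compile each frame into a capturing regular expression, as in the hedged sketch below; the frame strings and slot names are examples, and a real system would use far richer grammars than pattern matching.

```python
import re

def frame_to_regex(frame: str) -> re.Pattern:
    """Turn a frame like "play the song {song name}" into a capturing regex."""
    parts = re.split(r"\{([^}]+)\}", frame)      # alternating: literal, slot name, literal, ...
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)                         # literal text
        else:
            slot = part.strip().replace(" ", "_")              # slot name -> regex group name
            pattern += f"(?P<{slot}>.+)"
    return re.compile("^" + pattern + "$", re.IGNORECASE)

def fill_slots(query: str, frames):
    """Try each grammar frame in turn; return the slots from the first frame that matches."""
    for frame in frames:
        match = frame_to_regex(frame).match(query.strip())
        if match:
            return {name: value.strip() for name, value in match.groupdict().items()}
    return None

frames = ["play the song {song name}", "play {song name}"]
print(fill_slots("play the song You're Welcome", frames))   # {'song_name': "You're Welcome"}
```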
For example, prior to recognizing named entities, the NER module 1262 may parse the query to identify words as subjects, objects, verbs, prepositions, and so forth, based on grammatical rules and/or models. The identified verbs may be used by the IC module 1264 to identify the intent, which the NER module 1262 then uses to identify a framework. A framework for a "play song" intent, for instance, might specify a list of slots/fields applicable to playing the identified "song" object, as well as any object modifiers (e.g., specifying the music collection from which the song should be accessed), and so on. The NER module 1262 then searches the corresponding fields in the domain-specific and personalized lexicons, attempting to match words and phrases in the query tagged as grammatical objects or object modifiers with those identified in the database.
This process includes semantic tagging, i.e., tagging of words or word combinations according to their type/semantic meaning. The parsing may be performed using heuristic grammar rules, or the NER model may be constructed using techniques such as hidden markov models, maximum entropy models, log linear models, Conditional Random Fields (CRF), and the like.
The frameworks linked to the intent are then used to determine which database fields should be searched to determine the meaning of these phrases, such as searching the user's gazetteer for similarity to the framework slots. If the search of the gazetteer does not resolve the slot/field using gazetteer information, the NER module 1262 may search a database of generic words associated with the domain (in the knowledge base 1272). Thus, for example, if the query was "play You're Welcome," then after failing to determine which song titled "You're Welcome" should be played, the NER component 1262 may search the domain vocabulary for the phrase "You're Welcome." In the alternative, generic words may be checked before the gazetteer information, or both may be tried, potentially producing two different results.
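The two-stage fallback described here (personalized gazetteer first, domain-general vocabulary second) might look like the following sketch, which uses difflib's fuzzy matching purely as a stand-in for whatever similarity measure the system actually applies; the example entries and cutoff are invented.

```python
import difflib

def resolve_entity(slot_value, user_gazetteer, generic_vocabulary, cutoff=0.8):
    """Resolve a slot value against the user's gazetteer, then fall back to generic words."""
    personal = difflib.get_close_matches(slot_value, user_gazetteer, n=1, cutoff=cutoff)
    if personal:
        return personal[0], "gazetteer"
    generic = difflib.get_close_matches(slot_value, generic_vocabulary, n=1, cutoff=cutoff)
    if generic:
        return generic[0], "generic"
    return None, None

print(resolve_entity("You're Welome",                      # misrecognized slot text
                     ["You're Welcome", "Shiny"],           # user's music gazetteer
                     ["Welcome", "Hello"]))                 # domain-general words
```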
The output data from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 1207. The destination command processor 1207 may be determined based on the NLU output. For example, if the NLU output includes a command to send a message, the destination command processor 1207 may be a message-sending application (e.g., an application located on or in the user device) configured to execute the message-sending command. If the NLU output includes a search request, the destination command processor 1207 may include a search engine processor (e.g., one located on a search server) configured to execute the search command. After the appropriate command is generated based on the user's intent, the command processor 1207 may provide some or all of this information to a text-to-speech (TTS) engine 1208. The TTS engine 1208 may then generate an actual audio file for outputting the audio data determined by the command processor 1207 (e.g., "play your song" or "lip sync to …"). After generating the file (or "audio data"), the TTS engine 1208 may provide the data back to the remote system 108.
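A rough sketch of routing NLU output to a destination command processor is shown below. The intent names, handler functions, and dictionary-based dispatch are assumptions made for illustration; they are not interfaces defined by this disclosure.

```python
def handle_send_message(nlu_output):
    print("dispatching to messaging application:", nlu_output["slots"])

def handle_search(nlu_output):
    print("dispatching to search engine processor:", nlu_output["slots"])

# Destination command processors keyed by intent; a TTS step could follow the handler.
COMMAND_PROCESSORS = {
    "SendMessage": handle_send_message,
    "Search": handle_search,
}

def route(nlu_output):
    """Pick a destination command processor from the NLU output and invoke it."""
    processor = COMMAND_PROCESSORS.get(nlu_output["intent"])
    if processor is None:
        raise ValueError(f"no command processor registered for {nlu_output['intent']}")
    processor(nlu_output)

route({"intent": "Search", "slots": {"query": "weather in Seattle"}})
```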
The NLU operation of existing systems may take the form of a multi-domain architecture. Each domain (which may include a set of intent and entity slots defining larger concepts such as music, books, etc., and components such as training models for performing various NLU operations such as NER, IC, etc.) can be constructed separately and made available to the NLU component 1205 during runtime operations, where the NLU operations are performed on text (e.g., text output from the ASR component 1203). Each domain may have specially configured components to perform the various steps of the NLU operation.
For example, in an NLU system, the system may include a multi-domain architecture consisting of multiple domains (e.g., music, video, books, and information) for intents/commands executable by the system (or by other devices connected to the system). The system may include a plurality of domain recognizers, where each domain may include its own recognizer 1263. Each recognizer can include various NLU components, such as an NER component 1262, an IC module 1264, and other components (e.g., an entity resolver or other components).
For example, a messaging domain recognizer 1263-A (domain A) can have an NER component 1262-A that identifies which slots (i.e., portions of the input text) may correspond to particular words associated with that domain. The words may correspond to entities such as (for the messaging domain) a recipient. The NER component 1262 can employ a machine learning model, such as a domain-specific Conditional Random Field (CRF), to identify the portions that correspond to entities as well as the entity type corresponding to each text portion. The messaging domain recognizer 1263-A may also have its own Intent Classification (IC) component 1264-A that determines the intent of the text, assuming the text is within the domain in question. The IC component can identify the intent of the text using a model such as a domain-specific maximum entropy classifier, where the intent is the action that the user wishes the system to perform. To that end, the remote system computing device 108 may include a model training component, which can be used to train the classifiers/machine learning models discussed above.
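As a hedged illustration of the kind of model training mentioned here, the sketch below trains a tiny intent classifier with scikit-learn, using multinomial logistic regression as the usual software stand-in for a maximum entropy classifier; the training sentences and intent labels are invented toy data, not material from this disclosure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy per-domain training data; a production system would train on far larger
# annotated corpora, typically one model per domain.
texts = ["send a message to mom", "send the following to alex",
         "play something quiet", "turn the volume down"]
intents = ["SendMessage", "SendMessage", "PlaySong", "Silence"]

# Bag-of-words features feeding a multinomial logistic regression (max-ent style) classifier.
ic_model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
ic_model.fit(texts, intents)

print(ic_model.predict(["please send a voice message to dad"]))   # likely ['SendMessage']
```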
As described above, multiple devices can be employed in a single speech processing system. In such a multi-device system, each device may include different components for performing different aspects of speech processing. The plurality of devices may include overlapping components. As shown herein, the components of the user device and the remote system 108 are exemplary and may be located in separate devices, or may be included in whole or in part as components of a larger device or system, may be distributed across multiple devices on or connected by a network, and the like.
Although the foregoing invention has been described with respect to specific examples, it should be understood that the scope of the invention is not limited to these specific examples. Since other modifications and changes varied to fit particular operating requirements and environments will be apparent to those skilled in the art, the invention is not considered limited to the examples selected for purposes of disclosure, and covers all changes and modifications that do not depart from the true spirit and scope of the invention.
Although the present application describes embodiments with specific structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are merely illustrative of some embodiments that fall within the scope of the claims of the present application.
Embodiments of the present disclosure may be described according to the following clauses.
1. A system, comprising: one or more processors; and a computer-readable medium storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: determining that content of an application is being displayed on a device; based on determining that the content is being displayed, causing an application interface component to identify metadata associated with the content; identifying, via the application interface and from the metadata, a portion of the content that, when selected by a user, causes updated content to be displayed; transmitting screen data identifying the portion of the content to a remote system; receiving audio data representing a user utterance; transmitting the audio data to the remote system; receiving instruction data from the remote system to perform an action for the portion of the content, the instruction data determined by the remote system from the screen data and the audio data; and causing the action to be performed.
2. The system of clause 1, wherein the content comprises first content, the portion comprises a first portion, the screen data comprises first screen data, and the operations further comprise: receiving, via the application interface, event data indicative of second content being displayed; determining a second portion of the second content based at least in part on the event data indicative of the second content being displayed, the second portion of the second content, when selected by a user, causing third content to be displayed; and transmitting second screen data to the remote system identifying the second portion of the second content, the second portion being different from the first portion.
3. The system of clause 1 or 2, wherein the instruction data indicates a value associated with the action, and causing performance of the action comprises causing performance of an action based at least in part on: determining a node of the content corresponding to the portion of the content according to document object model information indicating a node associated with the content; and causing the action to be performed for the node.
4. The system of clause 1, 2, or 3, wherein the instruction data indicates a value associated with the action, and causing performance of the action comprises causing performance of an action based at least in part on: determining a first node of the content corresponding to the value according to document object model information indicating a node associated with the content; determining a second node of the content corresponding to the value according to the document object model information; determining a first confidence level that the first node corresponds to the value; determining a second confidence level that the second node corresponds to the value; and causing the action to be performed for the first node based on the first confidence level exceeding the second confidence level.
5. A method, comprising: determining that content of an application is being displayed on a device; based at least in part on determining that the content is being displayed, identifying metadata associated with the content; identifying a selectable portion of the content based at least in part on the metadata associated with the content; transmitting screen data identifying the portion of the content to a remote system; transmitting audio data representing a user utterance to the remote system; receiving instruction data from the remote system to perform an action for the portion of the content, the instruction data determined by the remote system from the screen data and the audio data; and causing the action to be performed.
6. The method of clause 5, wherein receiving the instruction data comprises receiving the instruction data based at least in part on an indication that the application has been authorized to receive the instruction data.
7. The method of clause 5 or 6, further comprising: causing display of overlay content, the overlay content including an identifier associated with the portion of the content; sending overlay data to the remote system indicating that the overlay content is displayed; and wherein the instruction data includes an indicator to select the identifier.
8. The method of clause 5, 6 or 7, further comprising: receiving, from the remote system, a first identifier associated with the content displayed by the device; receiving, from the remote system, a second identifier associated with the content displayed by the device; determining that the first identifier and the second identifier correspond to the portion of the content; generating a modified identifier corresponding to the first identifier and the second identifier; and causing display of overlay content, the overlay content including the modified identifier displayed for the portion of the content.
9. The method of clause 5, 6, 7, or 8, wherein the instruction data indicates a value associated with the action, and causing performance of the action comprises causing performance of the action based at least in part on: determining a first node of the content corresponding to the value according to document object model information indicating a node associated with the content; determining a second node of the content corresponding to the value according to the document object model information; determining a first confidence level that the first node corresponds to the value; determining a second confidence level that the second node corresponds to the value; and causing the action to be performed for the first node based at least in part on the first confidence level exceeding the second confidence level.
10. The method of clause 5, 6, 7, 8, 9 or 10, wherein the content comprises first content, the portion comprises a first portion, the screen data comprises first screen data, and further comprising: receiving event data indicative of second content being displayed; identifying a second portion of the second content based at least in part on the event data indicative of the second content being displayed, the second portion of the second content, when selected by a user, causing third content to be displayed; and transmitting second screen data to the remote system identifying the second portion of the second content, the second portion being different from the first portion.
11. The method of clause 5, 6, 7, 8, 9 or 10, wherein identifying the portion of the content comprises identifying the portion of the content based at least in part on document object model data associated with the application.
12. The method of clause 5, 6, 7, 8, 9, 10, or 11, wherein causing the action to be performed comprises causing the action to be performed based at least in part on determining a node corresponding to the portion of the content from document object model information indicating a node associated with the content.
13. A system, comprising: one or more processors; and a computer-readable medium storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: receiving screen data indicating a portion of content of an application being displayed by the device; receiving audio data representing a user utterance, the audio data associated with the device; determining intent data based at least in part on the screen data and the audio data; generating, based at least in part on the intent data, instruction data associated with the intent data, the instruction data indicating an action to perform for the portion of the content; and sending the instruction data to the device.
14. The system of clause 13, wherein generating the instructional data comprises generating the instructional data based, at least in part, on an indication that the application has been authorized to receive the instructional data.
15. The system of clause 13 or 14, wherein generating the instruction data comprises generating the instruction data based at least in part on determining a value associated with the action based at least in part on the screen data and the audio data.
16. The system of clauses 13, 14 or 15, the operations further comprising selecting a speech processing component from speech processing components associated with the system to generate the instruction data, the selecting based at least in part on receiving the screen data.
17. The system of clause 13, 14, 15, or 16, the operations further comprising: generating an identifier corresponding to the portion of the content; transmitting identifier data to the device, the identifier data indicating that the device is to display the identifier content; determining that the intent data corresponds to a selection of the identifier content; and wherein generating the instructions is based at least in part on determining that the intent data corresponds to a selection of the identifier content.
18. The system of clause 13, 14, 15, 16, or 17, wherein the portion of the content comprises a first portion of the content, and the operations further comprise: associating a first identifier with the first portion of the content; associating a second identifier with a second portion of the content; determining that the action is performable on the first portion and the second portion; generating modified identifier content corresponding to the first identifier and the second identifier based at least in part on determining that the action is performable on the first portion and the second portion; and sending the modified identifier content to the device for presentation by the device.
19. The system of clause 13, 14, 15, 16, 17, or 18, wherein determining the intent data comprises determining the intent data based at least in part on a finite state transducer associated with the application.
20. The system of clause 13, 14, 15, 16, 17, 18, or 19, wherein the content comprises first content, the screen data comprises first screen data, and the operations further comprise: receiving, from the device, second screen data indicating that the device is displaying the second content associated with the application; and wherein generating the instruction data comprises generating the instruction data based at least in part on determining a value associated with the action based at least in part on the second screen data.
21. A system, comprising: one or more processors; and a computer-readable medium storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: receiving audio data representing a user utterance; determining intent data based at least in part on the audio data; identifying first instruction data corresponding to the intent data, the first instruction data configured to, when transmitted to a device, cause the device to perform a first operation on a portion of the content displayed by the device; identifying second instruction data corresponding to the intent data, the second instruction data configured to, when sent to the device, cause the device to perform a second operation with respect to the portion of the content; determining a first priority associated with the first instruction data in accordance with a first content type associated with the portion of the content, the first content type being textual content; determining a second priority associated with the second instruction data according to a second content type associated with the portion of the content, the second content type being iconic content; determining that the first instruction data is prioritized over the second instruction data based on the first content type being prioritized over the second content type; selecting the first instruction data based on the first instruction data being prioritized over the second instruction data; and sending the first instruction data to the device to cause an action to be performed for the portion of the content.
22. The system of clause 21, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based on: determining a first number of times the first instruction data has been sent to the device; determining a second number of times the second instruction data has been sent to the device; and determining that the first number of times is greater than the second number of times.
23. The system of clause 21 or 22, wherein the portion of the content comprises a first portion of the content, and determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based on: determining a first rendered size of the first portion of the content relative to the device screen; determining a second rendered size of a second portion of the content relative to the screen; and determining that the first rendered size is greater than the second rendered size.
24. The system of clause 21, 22 or 23, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based on context data indicating that the first instruction data is prioritized over the second instruction data for the application.
25. A method, comprising: receiving audio data representing a user utterance; determining intent data associated with the user utterance based at least in part on the audio data; identifying first instruction data corresponding to the intent data, the first instruction data configured to, when received by a device, cause the device to perform a first operation on a portion of content of an application; identifying second instruction data corresponding to the intent data, the second instruction data configured to, when received by the device, cause the device to perform a second operation with respect to the portion of the content; determining that the first instruction data takes precedence over the second instruction data; selecting the first instruction data based at least in part on the first instruction data being prioritized over the second instruction data; and sending the first instruction data to the device to cause an action to be performed on the portion of the content.
26. The method of clause 25, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based, at least in part, on: determining a first number of times the first instruction data has been sent to the device; determining a second number of times the second instruction data has been sent to the device; and determining that the first number of times is greater than the second number of times.
27. The method of clause 25 or 26, wherein the portion of the content comprises a first portion of the content, and determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based, at least in part, on: determining a first rendered size of the first portion of the content relative to the device screen; determining a second rendered size of a second portion of the content relative to the screen; and determining that the first rendered size is greater than the second rendered size.
28. The method of clause 25, 26 or 27, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based on context data indicating that the first instruction data is prioritized over the second instruction data for the application.
29. The method of clause 25, 26, 27 or 28, wherein the content comprises first content, and further comprising: receiving an indication that second content is displayed by the device instead of the first content; determining, based at least in part on the indication, that the second instruction data is prioritized over the first instruction data; selecting the second instruction data based at least in part on the second instruction data having priority over the first instruction data; and sending the second instruction data to the device.
30. The method of clause 25, 26, 27, 28, or 29, wherein the audio data comprises first audio data, the user utterance comprises a first user utterance, and further comprising: determining second intent data associated with a second user utterance based at least in part on second audio data representing the second user utterance; determining that at least one of the first instruction data or the second instruction data corresponds to the second intent data; identifying third instruction data corresponding to the second intent data, the third instruction data being content-independent; selecting the third instruction data based at least in part on the third instruction data being content independent; and sending the third instruction data to the device.
31. The method of clause 25, 26, 27, 28, 29 or 30, further comprising: determining that the first instruction data is associated with a value associated with the intent data, the value indicating a first portion of the content on which the first operation is performed, the first portion of the content being associated with a first content type; determining that the value is associated with a second portion of the content associated with a second content type; and wherein the first instruction data is prioritized over the second instruction data based at least in part on the first content type being prioritized over the second content type.
32. The method of clause 25, 26, 27, 28, 29, 30 or 31, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based at least in part on previous instruction data sent to the device.
33. A system, comprising: one or more processors; and a computer-readable medium storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: determining intent data associated with a user utterance based at least in part on audio data representing the user utterance; identifying first instruction data corresponding to the intent data, the first instruction data configured to, when transmitted to a device, cause the device to perform a first operation on a portion of content of an application; identifying second instruction data corresponding to the intent data, the second instruction data configured to, when sent to the device, cause the device to perform a second operation with respect to the portion of the content; determining that the first instruction data takes precedence over the second instruction data; and sending the first instruction data to the device based at least in part on determining that the first instruction data takes precedence over the second instruction data.
34. The system of clause 33, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based, at least in part, on: determining a first number of times the first instruction data has been sent to the device; determining a second number of times the second instruction data has been sent to the device; and determining that the first number of times is greater than the second number of times.
35. The system of clause 33 or 34, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based, at least in part, on: determining a first rendered size of the first portion of the content relative to the device screen; determining a second rendered size of a second portion of the content relative to the screen; and determining that the first rendered size is greater than the second rendered size.
36. The system of clause 33, 34, or 35, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based on context data indicating that the first instruction data is prioritized over the second instruction data for the application.
37. The system of clause 33, 34, 35, or 36, wherein the content comprises first content, and the operations further comprise: receiving an indication that second content is displayed by the device instead of the first content; determining, based at least in part on the indication, that the second instruction data is prioritized over the first instruction data; selecting the second instruction data based at least in part on the second instruction data having priority over the first instruction data; and sending the second instruction data to the device.
38. The system of clause 33, 34, 35, 36, or 37, wherein the audio data comprises first audio data, the user utterance comprises a first user utterance, and the operations further comprise: determining second intent data associated with a second user utterance based at least in part on second audio data representing the second user utterance; determining that at least one of the first instruction data or the second instruction data corresponds to the second intent data; identifying third instruction data corresponding to the second intent data, the third instruction data being content-independent; selecting the third instruction data based at least in part on the third instruction data being content independent; and sending the third instruction data to the device.
39. The system of clause 33, 34, 35, 36, 37, or 38, the operations further comprising: determining that the first instruction data is associated with a value associated with the intent data, the value indicating a first portion of the content on which the first operation is performed, the first portion of the content being associated with a first content type; determining that the value is associated with a second portion of the content associated with a second content type; and wherein the first instruction data is prioritized over the second instruction data based at least in part on the first content type being prioritized over the second content type.
40. The system of clause 33, 34, 35, 36, 37, 38, or 39, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based, at least in part, on previous instruction data sent to the device.

Claims (15)

1. A method, comprising:
determining that content of an application is being displayed on a device;
based at least in part on determining that the content is being displayed, identifying metadata associated with the content;
identifying a selectable portion of the content based at least in part on the metadata associated with the content;
transmitting screen data identifying the portion of the content to a remote system;
transmitting audio data representing a user utterance to the remote system;
receiving instruction data from the remote system to perform an action for the portion of the content, the instruction data determined by the remote system from the screen data and the audio data; and
causing the action to be performed.
2. The method of claim 1, wherein receiving the instruction data comprises receiving the instruction data based at least in part on an indication that the application has been authorized to receive the instruction data.
3. The method of claim 1, further comprising:
causing display of overlay content, the overlay content including an identifier associated with the portion of the content;
sending overlay data to the remote system indicating that the overlay content is displayed; and
wherein the instruction data includes an indicator to select the identifier.
4. The method of claim 1, further comprising:
receiving, from the remote system, a first identifier associated with the content displayed by the device;
receiving, from the remote system, a second identifier associated with the content displayed by the device;
determining that the first identifier and the second identifier correspond to the portion of the content;
generating a modified identifier corresponding to the first identifier and the second identifier; and
causing display of overlay content, the overlay content including the modified identifier displayed for the portion of the content.
5. The method of claim 1, wherein the instruction data indicates a value associated with the action, and causing performance of the action comprises causing performance of an action based at least in part on:
determining a first node of the content corresponding to the value according to document object model information indicating a node associated with the content;
determining a second node of the content corresponding to the value according to the document object model information;
determining a first confidence level that the first node corresponds to the value;
determining a second confidence level that the second node corresponds to the value; and
causing the action to be performed for the first node based at least in part on the first confidence level exceeding the second confidence level.
6. The method of claim 1, wherein the content comprises first content, the portion comprises a first portion, the screen data comprises first screen data, and further comprising:
receiving event data indicating that second content is being displayed;
identifying a second portion of the second content based at least in part on the event data indicating that the second content is being displayed, the second portion of the second content, when selected by a user, causing third content to be displayed; and
sending second screen data to the remote system identifying the second portion of the second content, the second portion being different from the first portion.
7. The method of claim 1, wherein identifying the portion of the content comprises identifying the portion of the content based at least in part on document object model data associated with the application.
8. The method of claim 1, wherein causing the action to be performed comprises causing the action to be performed based at least in part on determining a node corresponding to the portion of the content from document object model information indicating a node associated with the content.
9. A method, comprising:
receiving audio data representing a user utterance;
determining intent data associated with the user utterance based at least in part on the audio data;
identifying first instruction data corresponding to the intent data, the first instruction data configured to, when received by a device, cause the device to perform a first operation on a portion of content of an application;
identifying second instruction data corresponding to the intent data, the second instruction data configured to, when received by the device, cause the device to perform a second operation with respect to the portion of the content;
determining that the first instruction data takes precedence over the second instruction data;
selecting the first instruction data based at least in part on the first instruction data being prioritized over the second instruction data; and
sending the first instruction data to the device to perform an action on the portion of the content.
10. The method of claim 9, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based at least in part on:
determining a first number of times the first instruction data has been sent to the device;
determining a second number of times the second instruction data has been sent to the device; and
determining that the first number of times is greater than the second number of times.
11. The method of claim 9, wherein the portion of the content comprises a first portion of the content, and determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based at least in part on:
determining a first rendered size of the first portion of the content relative to the device screen;
determining a second rendered size of a second portion of the content relative to the screen; and
determining that the first rendered size is greater than the second rendered size.
12. The method of claim 9, wherein determining that the first instruction data is prioritized over the second instruction data comprises determining that the first instruction data is prioritized over the second instruction data based on context data indicating that the first instruction data is prioritized over the second instruction data for the application.
13. The method of claim 9, wherein the content comprises first content, and further comprising:
receiving an indication that second content is displayed by the device instead of the first content;
determining, based at least in part on the indication, that the second instruction data is prioritized over the first instruction data;
selecting the second instruction data based at least in part on the second instruction data having priority over the first instruction data; and
sending the second instruction data to the device.
14. The method of claim 9, wherein the audio data comprises first audio data, the user utterance comprises a first user utterance, and further comprising:
determining second intent data associated with a second user utterance based at least in part on second audio data representing the second user utterance;
determining that at least one of the first instruction data or the second instruction data corresponds to the second intent data;
identifying third instruction data corresponding to the second intent data, the third instruction data being content-independent;
selecting the third instruction data based at least in part on the third instruction data being content independent; and
sending the third instruction data to the device.
15. The method of claim 9, further comprising:
determining that the first instruction data is associated with a value associated with the intent data, the value indicating a first portion of the content on which the first operation is performed, the first portion of the content being associated with a first content type;
determining that the value is associated with a second portion of the content associated with a second content type; and
wherein the first instruction data is prioritized over the second instruction data based at least in part on the first content type being prioritized over the second content type.
CN201880083558.7A 2017-12-08 2018-12-07 Voice control of computing device Pending CN111712790A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US15/836,566 2017-12-08
US15/836,566 US11182122B2 (en) 2017-12-08 2017-12-08 Voice control of computing devices
US15/836,428 US10503468B2 (en) 2017-12-08 2017-12-08 Voice enabling applications
US15/836,428 2017-12-08
PCT/US2018/064577 WO2019113516A1 (en) 2017-12-08 2018-12-07 Voice control of computing devices

Publications (1)

Publication Number Publication Date
CN111712790A true CN111712790A (en) 2020-09-25

Family

ID=65019557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880083558.7A Pending CN111712790A (en) 2017-12-08 2018-12-07 Voice control of computing device

Country Status (3)

Country Link
EP (1) EP3704569A1 (en)
CN (1) CN111712790A (en)
WO (1) WO2019113516A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112242141A (en) * 2020-10-15 2021-01-19 广州小鹏汽车科技有限公司 Voice control method, intelligent cabin, server, vehicle and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9582246B2 (en) * 2014-03-04 2017-02-28 Microsoft Technology Licensing, Llc Voice-command suggestions based on computer context

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020059066A1 (en) * 1998-04-08 2002-05-16 O'hagan Timothy P. Speech recognition system and method for employing the same
CN103377028A (en) * 2012-04-20 2013-10-30 纽安斯通讯公司 Methods and systems for speech-enabling a human-to-machine interface
US9424840B1 (en) * 2012-08-31 2016-08-23 Amazon Technologies, Inc. Speech recognition platforms
CN105027194A (en) * 2012-12-20 2015-11-04 亚马逊技术有限公司 Identification of utterance subjects
US9721570B1 (en) * 2013-12-17 2017-08-01 Amazon Technologies, Inc. Outcome-oriented dialogs on a speech recognition platform
CN106462617A (en) * 2014-06-30 2017-02-22 苹果公司 Intelligent automated assistant for tv user interactions
US20160132568A1 (en) * 2014-11-06 2016-05-12 Microsoft Technology Licensing, Llc Context-based search and relevancy generation
CN107077502A (en) * 2014-11-06 2017-08-18 微软技术许可有限责任公司 Search and correlation generation based on context
US9484021B1 (en) * 2015-03-30 2016-11-01 Amazon Technologies, Inc. Disambiguation in speech recognition
CN107430855A (en) * 2015-05-27 2017-12-01 谷歌公司 The sensitive dynamic of context for turning text model to voice in the electronic equipment for supporting voice updates
WO2017200777A1 (en) * 2016-05-17 2017-11-23 Microsoft Technology Licensing, Llc Context-based user agent

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUO JIAQING; BAI YU; CAI DONGFENG; LIU JIYUAN: "A Voice Browser Based on Voice Tags" (基于语音标签的语音浏览器), Journal of Shenyang Institute of Aeronautical Engineering (沈阳航空工业学院学报), no. 02, 15 April 2007 (2007-04-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112242141A (en) * 2020-10-15 2021-01-19 广州小鹏汽车科技有限公司 Voice control method, intelligent cabin, server, vehicle and medium

Also Published As

Publication number Publication date
WO2019113516A1 (en) 2019-06-13
EP3704569A1 (en) 2020-09-09

Similar Documents

Publication Publication Date Title
US20220156039A1 (en) Voice Control of Computing Devices
US10884701B2 (en) Voice enabling applications
US11720326B2 (en) Audio output control
US11669300B1 (en) Wake word detection configuration
US11755756B1 (en) Sensitive data management
US10453117B1 (en) Determining domains for natural language understanding
US9934777B1 (en) Customized speech processing language models
US10917758B1 (en) Voice-based messaging
US20190371329A1 (en) Voice enablement and disablement of speech processing functionality
US11862174B2 (en) Voice command processing for locked devices
US20200184967A1 (en) Speech processing system
US11763816B1 (en) Natural language processing policies
US11289075B1 (en) Routing of natural language inputs to speech processing applications
US11315552B1 (en) Responding with unresponsive content
US11544303B1 (en) Responding with unresponsive content
US11582174B1 (en) Messaging content data storage
US11328713B1 (en) On-device contextual understanding
CN111712790A (en) Voice control of computing device
US11626106B1 (en) Error attribution in natural language processing systems
US11935533B1 (en) Content-related actions based on context
US11804225B1 (en) Dialog management system
US11688394B1 (en) Entity language models for speech processing
US11893994B1 (en) Processing optimization using machine learning
US11380308B1 (en) Natural language processing
US11966663B1 (en) Speech processing and multi-modal widgets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination