EP3047481A1 - Local and remote speech processing - Google Patents

Local and remote speech processing

Info

Publication number
EP3047481A1
Authority
EP
European Patent Office
Prior art keywords
speech
command
expression
function
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP14846698.0A
Other languages
German (de)
French (fr)
Other versions
EP3047481A4 (en)
Inventor
Nikko Strom
Peter Spalding Vanlund
Bjorn HOFFMEISTER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Amazon Technologies Inc
Original Assignee
Amazon Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Amazon Technologies Inc
Publication of EP3047481A1
Publication of EP3047481A4

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • The speech command service 108 may also comprise a natural language understanding component (NLU) 134 that determines user intent based on recognized speech.
  • The speech command service 108 may also comprise a command interpreter and action dispatcher 136 (referred to below simply as a command interpreter 136) that determines functions or commands corresponding to user intents.
  • Some commands may correspond to functions that are to be performed at least in part by the audio device 102, and the command interpreter 136 may in those cases provide responses to the audio device 102 indicating commands for implementing such functions.
  • Examples of commands or functions that may be performed by the audio device in response to directives from the command interpreter 136 include playing music or other media, increasing/decreasing the volume of the speaker 112, generating audible speech through the speaker 112, initiating certain types of communications with users of similar devices, and so forth.
  • The speech command service 108 may also perform functions, in response to speech recognized from received audio, that involve entities or devices that are not shown in FIG. 1.
  • For example, the speech command service 108 may interact with other network-based services to obtain information or services on behalf of the user 106.
  • The speech command service 108 may itself have various elements and functionality that may be responsive to speech uttered by the user 106.
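To make the role of the command interpreter 136 concrete, the following is a minimal sketch (Python; the intent names, command names, and handler registry are illustrative assumptions, not part of the patent text) of how an NLU-determined intent might be mapped either to a service-side action or to a command returned to the audio device 102:

```python
# Hypothetical sketch of a command interpreter / action dispatcher.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Intent:
    name: str      # e.g. "stop_playback", "lookup_showtimes"
    slots: dict    # e.g. {"location": "local cinema"}

@dataclass
class DeviceCommand:
    name: str      # command the audio device executes locally
    session_id: str

# Functions the service can perform on its own (service-identified functions).
SERVICE_HANDLERS: dict[str, Callable[[Intent], None]] = {
    "lookup_showtimes": lambda intent: print("querying cinema listings..."),
}

# Intents fulfilled by sending a command back to the device for execution.
DEVICE_COMMANDS = {
    "stop_playback": "stop",
    "pause_playback": "pause",
    "end_call": "hang-up",
}

def interpret(intent: Intent, session_id: str) -> Optional[DeviceCommand]:
    """Return a command for the device, or handle the intent service-side."""
    if intent.name in DEVICE_COMMANDS:
        return DeviceCommand(DEVICE_COMMANDS[intent.name], session_id)
    handler = SERVICE_HANDLERS.get(intent.name)
    if handler:
        handler(intent)   # performed independently of the audio device
    return None
```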
  • In operation, the microphone 110 of the audio device 102 captures or receives audio containing speech of the user 106.
  • The audio is processed by the audio processing components 120 and the processed audio is received by the speech recognition components 122.
  • The speech recognition components 122 analyze the audio to detect occurrences of a trigger expression in the speech contained in the audio.
  • Upon detection of the trigger expression, the controller 124 begins sending or streaming received audio to the speech command service 108, along with a request for the speech command service 108 to recognize and interpret the user speech and to initiate a function corresponding to any interpreted intent.
  • Concurrently with sending the audio to the speech command service 108, the speech recognition components 122 continue to analyze the received audio to detect an occurrence of a local command expression in the user speech.
  • Upon detection of a local command expression, the controller 124 initiates or performs a device function that corresponds to the local command expression. For example, in response to the local command expression "stop", the controller 124 may initiate a function that stops media playback.
  • The controller 124 may interact with one or more of the functional components 118 when initiating or performing the function.
  • In response to receiving the audio, the speech command service 108 concurrently analyzes the audio to recognize speech, to determine a user intent, and to determine a service-identified function that is to be implemented in response to the user intent.
  • When a local command expression has been detected and acted upon locally, the audio device 102 may take actions to cancel, nullify, or invalidate any service-identified functions that may eventually be initiated by the speech command service 108.
  • For example, the audio device 102 may cancel its previous request by sending a cancellation message to the speech command service 108 and/or by stopping the streaming of the audio to the speech command service 108.
  • The audio device may also ignore or discard any responses or service-specified commands that are received from the speech command service 108 in response to the earlier request.
  • Alternatively, the audio device may inform the speech command service 108 of actions that have been performed locally in response to the local command expression, and the speech command service 108 may modify its subsequent behavior based on this information. For example, the speech command service 108 may forego actions that it might otherwise have performed in response to recognized speech in the received audio.
  • FIG. 2 illustrates an example method 200 that may be performed by the audio device 102 in conjunction with the speech command service 108 in order to recognize and respond to user speech.
  • The method 200 will be described in the context of the system 100 of FIG. 1, although the method 200 may also be performed in other environments and may be implemented in different ways.
  • Actions on the left side of FIG. 2 are performed at or by the local audio device 102. Actions on the right side of FIG. 2 are performed at or by the remote speech command service 108.
  • An action 202 comprises receiving an audio signal that has been captured by or in conjunction with the microphone 110.
  • The audio signal contains or represents audio from the environment 104, and may contain user speech.
  • The audio signal may be an analog electrical signal or may comprise a digital signal such as a digital audio stream.
  • An action 204 comprises detecting an occurrence of a trigger expression in the received audio and/or in the user speech. This may be performed by the speech recognition components 122 as described above, which may in some embodiments comprise keyword spotters. If the trigger expression is not detected, the action 204 is repeated in order to continuously monitor for occurrences of the trigger expression. The remaining actions shown in FIG. 2 are performed in response to detecting the trigger expression.
  • If the trigger expression is detected in the action 204, an action 206 is performed, comprising sending subsequently received audio to the speech command service 108 along with a service request 208 for the speech command service 108 to recognize speech in the audio and to implement a function corresponding to the recognized speech. Functions initiated by the speech command service 108 in this manner are referred to herein as service-identified functions, and may in certain cases be performed in conjunction with the audio device 102. For example, a function may be initiated by sending a command to the audio device 102.
  • The sending 206 may comprise streaming or otherwise transmitting a digital audio stream 210 to the speech command service 108, representing or containing audio that is received from the microphone 110 subsequent to detection of the trigger expression.
  • The action 206 may comprise opening or initiating a communication session between the audio device 102 and the speech command service 108.
  • The request 208 may be used to establish a communication session with the speech command service 108 for the purpose of recognizing speech, understanding intent, and determining actions or functions to be performed in response to user speech.
  • The request 208 may be followed or accompanied by the streamed audio 210.
  • The audio stream 210 provided to the speech command service 108 may include portions of received audio beginning at a time just prior to utterance of the trigger expression.
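The pre-roll behavior just described can be implemented with a simple ring buffer. The sketch below (Python; the frame size and the 500 ms of history are assumed figures for illustration, since the patent only says "a time just prior to utterance of the trigger expression") keeps a short history of audio frames so that, when the trigger fires, the stream sent to the service can begin slightly before the trigger utterance:

```python
# Minimal pre-roll buffer sketch with assumed frame and history sizes.
from collections import deque

FRAME_MS = 20       # duration of one captured audio frame
PRE_ROLL_MS = 500   # how much history to prepend to the stream

class PreRollBuffer:
    def __init__(self) -> None:
        self.frames: deque[bytes] = deque(maxlen=PRE_ROLL_MS // FRAME_MS)

    def push(self, frame: bytes) -> None:
        """Called for every captured frame, before any trigger is seen."""
        self.frames.append(frame)

    def drain(self) -> list[bytes]:
        """On trigger detection: return buffered history to send first."""
        history = list(self.frames)
        self.frames.clear()
        return history
```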
  • The communication session may be associated with a communication or session identifier (ID) that identifies the communication session established between the audio device 102 and the speech command service 108.
  • The session ID may be used or included in subsequent communications relating to a particular user utterance or audio stream.
  • The session ID may be generated by the audio device 102 and provided in the request 208 to the speech command service 108.
  • Alternatively, the session ID may be generated by the speech command service 108 and provided by the speech command service 108 in acknowledgment of the request 208.
  • The term "request(ID)" is used herein to indicate a request having a particular session ID.
  • Similarly, a response from the speech command service 108 relating to the same session, request, or audio stream may be indicated by the term "response(ID)".
  • Each communication session and corresponding session ID may correspond to a single user utterance.
  • The audio device 102 may establish a session upon detecting the trigger expression. The audio device 102 may then continue to stream audio to the speech command service 108 as part of the same session until the end of the user utterance.
  • The speech command service 108 may provide responses to the audio device 102 through the session, using the same session ID. Responses may in some cases indicate commands to be executed by the audio device 102 in response to speech recognized by the speech command service 108 in the received audio 210.
  • The communication session may remain open until the audio device 102 receives a response from the speech command service 108 or until the audio device 102 cancels the request.
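As a rough illustration of the request(ID)/response(ID) convention, the message shapes might look like the following. This is a sketch only; the field names and JSON encoding are assumptions, as the patent does not define a wire format:

```python
# Hypothetical message shapes for one session (one user utterance).
import json
import uuid

def make_request() -> dict:
    """request(ID): opens a session; audio frames follow under the same ID."""
    return {"type": "request", "session_id": str(uuid.uuid4())}

def make_cancel(session_id: str) -> dict:
    """Cancels the session, e.g. after a local command was handled locally."""
    return {"type": "cancel", "session_id": session_id}

def make_response(session_id: str, command: str | None) -> dict:
    """response(ID): may carry a command for the device to execute."""
    return {"type": "response", "session_id": session_id, "command": command}

req = make_request()
print(json.dumps(make_response(req["session_id"], "stop")))
```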
  • The speech command service 108 receives the request 208 and audio stream 210 in an action 212.
  • The speech command service 108 then performs an action 214 of recognizing speech in the received audio and determining a user intent as expressed by the recognized speech, using the speech recognition and natural language understanding components 132 and 134 of the speech command service 108.
  • An action 216, performed by the command interpreter 136, comprises identifying and initiating a service-identified function in fulfillment of the determined user intent.
  • The service-identified function may in some cases be performed by the speech command service 108, independently of the audio device 102. In other cases, the speech command service 108 may identify a function that is to be performed by the audio device 102, and may send a corresponding command to the audio device 102 for execution by the audio device 102.
  • Concurrently with the actions being performed by the speech command service 108, the local audio device 102 performs further actions to determine whether the user has uttered a local command expression and to perform a corresponding local function in response to any such uttered local command expression.
  • An action 218, performed in response to detecting the trigger expression in the action 204, comprises analyzing audio received in the action 202 to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. This may be performed by the speech recognition components 122 of the audio device 102 as described above, which may in some embodiments comprise keyword spotters.
  • If the local command expression is detected, an action 220 is performed of immediately initiating a device function that has been associated with the local command expression.
  • For example, the local command expression "stop" might be associated with a function that stops media playback.
  • In addition, the audio device 102 performs an action 222 of stopping or cancelling the request 208 to the speech command service 108. This may include cancelling or nullifying implementation of the service-identified function that might otherwise have been implemented by the speech command service 108 in response to the received request 208 and accompanying audio 210.
  • The action 222 may comprise sending an explicit notification or command to the speech command service 108, requesting that the speech command service 108 cancel any further recognition activities with respect to the service request 208 and/or cancel implementation of any service-identified functions that might otherwise have been initiated in response to recognized speech.
  • Alternatively, the audio device 102 may simply notify the speech command service 108 regarding any functions that have been performed locally in response to local recognition of the local command expression, and the speech command service 108 may respond by cancelling the service request 208 or by performing other actions as may be appropriate.
  • In some cases, the speech command service 108 may implement the service-identified function by identifying a command to be executed by the audio device 102. In response to receiving a notification that the service request 208 is to be cancelled, the speech command service 108 may forego sending the command to the audio device 102. Alternatively, the speech command service may be allowed to complete its processing and to send a command to the audio device 102, whereupon the audio device 102 may ignore the command or forego execution of the command.
  • In other embodiments, the speech command service may be configured to notify the audio device 102 before initiating a service-identified function, and may delay implementation of the service-identified function until receiving permission from the audio device 102.
  • The audio device 102 may be configured to deny such permission when the local command expression has been recognized locally.
  • The actions of the speech command service 108 shown in FIG. 2 are performed in parallel and asynchronously with the actions 218, 220, and 222 of the audio device 102. It is assumed in some implementations that the audio device 102 is able to detect and act upon the local command expression relatively quickly, so that it may perform the action 222 of cancelling the request 208 and subsequent processing by the speech command service 108 before the service-identified function of the action 216 has been implemented or executed.
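The parallel, asynchronous race between local spotting and the remote round trip might be sketched as follows (Python asyncio; the timing constants and helper names are illustrative assumptions, not the patent's implementation):

```python
# Sketch of method 200: local spotting races the remote request.
import asyncio

async def remote_request(cancel_event: asyncio.Event) -> None:
    """Stand-in for streaming audio and awaiting a service response."""
    await asyncio.sleep(0.8)        # assumed network + ASR/NLU latency
    if cancel_event.is_set():
        return                      # request was cancelled (action 222)
    print("executing service-identified command")

async def local_spotter(cancel_event: asyncio.Event) -> None:
    """Stand-in for the on-device keyword spotter (action 218)."""
    await asyncio.sleep(0.1)        # local detection is assumed faster
    print("local command 'stop' detected; stopping playback")  # action 220
    cancel_event.set()              # cancel the remote request (action 222)

async def main() -> None:
    cancel_event = asyncio.Event()
    await asyncio.gather(remote_request(cancel_event),
                         local_spotter(cancel_event))

asyncio.run(main())
```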
  • FIG. 3 illustrates an example method 300 in which the speech command service 108 returns commands to the audio device 102, and in which the audio device 102 is configured to ignore the commands or forego execution of the commands in situations in which a local command expression has already been detected and acted upon by the audio device 102.
  • Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
  • An action 302 comprises receiving an audio signal containing user speech.
  • An action 304 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 3 are performed in response to detecting the trigger expression.
  • An action 306 comprises sending a request 308 and audio 310 to the speech command service 108.
  • An action 312 comprises receiving the request 308 and the audio 310 at the speech command service 108.
  • An action 314 comprises recognizing user speech and determining user intent based on the recognized user speech.
  • The speech command service 108 then performs an action 316 of sending a command 318 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent.
  • For example, the command may comprise a "stop" command, indicating that the audio device 102 is to stop playback of music.
  • An action 320, performed by the audio device 102, comprises receiving and executing the command.
  • The action 320 is shown in a dashed box to indicate that it is performed conditionally, based on whether a local command expression has been detected and acted upon by the audio device 102. Specifically, the action 320 is not performed if a local command expression has been detected by the audio device 102.
  • Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 322 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 324 is performed of immediately initiating a local device function that has been associated with the local command expression.
  • If the local command expression has been detected, the audio device 102 performs an action 326 of foregoing execution of the received command 318. More specifically, any commands received from the speech command service 108 in response to the request 308 are discarded or ignored. Responses and commands corresponding to the request 308 may be identified by session IDs associated with the responses.
  • If the local command expression is not detected, the audio device performs the action 320 of executing the command 318 received from the speech command service 108.
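A device-side handler for this "discard by session ID" behavior might look like the following sketch (Python; the handler structure and function names are assumptions made for illustration):

```python
# Sketch of method 300: ignore late commands for locally handled sessions.
handled_locally: set[str] = set()   # session IDs resolved by action 324

def stop_playback() -> None:
    print("playback stopped locally")

def execute(command: str) -> None:
    print(f"executing service command: {command}")

def on_local_command(session_id: str) -> None:
    """Actions 324/326: act immediately, then mark the session as handled."""
    stop_playback()
    handled_locally.add(session_id)

def on_service_response(session_id: str, command: str) -> None:
    """Action 320, performed conditionally."""
    if session_id in handled_locally:   # action 326: discard/ignore
        handled_locally.discard(session_id)
        return
    execute(command)                    # action 320

on_local_command("session-42")
on_service_response("session-42", "stop")   # discarded, not executed twice
```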
  • FIG. 4 shows an example method 400 in which the audio device 102 is configured to actively cancel requests to the speech command service 108 after locally detecting a local command expression.
  • Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
  • An action 402 comprises receiving an audio signal containing user speech.
  • An action 404 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 4 are performed in response to detecting the trigger expression.
  • An action 406 comprises sending a request 408 and audio 410 to the speech command service 108.
  • An action 412 comprises receiving the request 408 and the audio 410 at the speech command service 108.
  • An action 414 comprises recognizing user speech and determining user intent based on the recognized user speech.
  • An action 416 comprises determining whether the request 408 has been cancelled by the audio device 102.
  • For example, the audio device 102 may send a cancellation message or may terminate the current communication session in order to cancel the request. If the request has been cancelled by the audio device 102, no further action is taken by the speech command service. If the request has not been cancelled, an action 418 is performed, which comprises sending a command 420 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent.
  • An action 422, performed by the audio device 102, comprises receiving and executing the command.
  • The action 422 is shown in a dashed box to indicate that it is performed conditionally, depending on whether a command has been sent and received from the speech command service 108, which in turn depends on whether the audio device 102 has cancelled the request 408.
  • Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 424 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 426 is performed of immediately initiating a local device function that has been associated with the local command expression.
  • If the local command expression is detected, the audio device 102 also performs an action 428 of requesting the speech command service 108 to cancel the request 408 and/or to cancel implementation of any service-identified functions that might otherwise have been performed in response to recognized speech in the audio received by the speech command service 108 from the audio device 102.
  • This may comprise communicating with the speech command service 108, such as by sending a cancellation notification or request.
  • In some cases, the cancellation may comprise replying to a communication or notification from the speech command service 108 of a pending implementation of a service-identified function by the speech command service.
  • The audio device 102 may reply and may request cancellation of the pending implementation.
  • Alternatively, the audio device 102 may cancel the implementation of any function that might otherwise have been performed locally in response to detecting the local command expression, and may instead instruct the speech command service 108 to proceed with implementation of the pending function.
  • If the local command expression is not detected in the action 424, the audio device 102 performs the action 422 of executing the command 420 received from the speech command service 108.
  • The action 422 may occur asynchronously, upon receiving the command 420 from the speech command service.
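On the service side, the cancellation check of action 416 could be as simple as consulting a set of cancelled session IDs before dispatching the command. The following is a sketch under assumed names; the patent does not prescribe a data structure:

```python
# Sketch of method 400, actions 412-418, on the service side.
from typing import Callable

cancelled_sessions: set[str] = set()

def on_cancel(session_id: str) -> None:
    """Handles a cancellation message or terminated session (action 428)."""
    cancelled_sessions.add(session_id)

def on_intent_ready(session_id: str, command: str,
                    send: Callable[[str, str], None]) -> None:
    """Action 416: check for cancellation; action 418: send the command."""
    if session_id in cancelled_sessions:    # request 408 was cancelled
        cancelled_sessions.discard(session_id)
        return                              # no further action is taken
    send(session_id, command)               # command 420 to the device

# Example: a cancellation arriving before NLU completes suppresses the command.
on_cancel("session-42")
on_intent_ready("session-42", "stop", lambda sid, cmd: print(sid, cmd))
```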
  • The embodiments described above may be implemented programmatically, such as with computers, processors, digital signal processors, analog processors, and so forth. In other embodiments, however, one or more of the components, functions, or elements may be implemented using specialized or dedicated circuits, including analog circuits and/or digital logic circuits.
  • The term "component", as used herein, is intended to include any hardware, software, logic, or combinations of the foregoing that are used to implement the functionality attributed to the component.
  • One or more non-transitory computer-readable media storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
  • A method comprising:
  • cancelling implementation of the first function comprises requesting the speech command service to cancel implementation of the first function.
  • cancelling implementation of the first function comprises requesting the speech command service to cancel the pending implementation of the first function.
  • cancelling implementation of the first function comprises terminating the communication session.
  • cancelling implementation of the first function comprises forgoing execution of the command.
  • A system comprising:
  • one or more speech recognition components configured to recognize user speech in received audio, to detect a trigger expression in the user speech, and to detect a local command expression in the user speech;
  • control logic configured to perform acts in response to detection by the one or more speech recognition components of the trigger expression in the user speech, the acts comprising: sending the audio to a speech command service to recognize speech in the audio and to implement a first function corresponding to the recognized speech;
  • cancelling implementation of the at least one of the first and second functions comprises requesting the speech command service to cancel implementation of the first function.
  • cancelling implementation of the at least one of the first and second functions comprises ignoring a command received from the speech command service.
  • cancelling implementation of the at least one of the first and second functions comprises informing the speech command service that the second function has been initiated.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A user device may be configured to detect a user-uttered trigger expression and to respond by interpreting subsequent words or phrases as commands. The commands may be recognized by sending audio containing the words or phrases to a remote service that is configured to perform speech recognition. Certain commands may be designated as local commands and may be detected locally rather than relying on the remote service. Upon detection of the trigger expression, audio is streamed to the remote service and also analyzed locally to detect utterances of local commands. Upon detecting a local command, a corresponding function is immediately initiated, and subsequent activities or responses by the remote service are canceled or ignored.

Description

LOCAL AND REMOTE SPEECH PROCESSING
RELATED APPLICATIONS
[0001] The present application claims priority to US Patent Application No. 14/033,302 filed on September 20, 2013, entitled "Local and Remote Speech Processing", which is incorporated by reference herein in its entirety.
BACKGROUND
[0002] Homes, offices, automobiles, and public spaces are becoming more wired and connected with the proliferation of computing devices such as notebook computers, tablets, entertainment systems, and portable communication devices. As computing devices evolve, the ways in which users interact with these devices continue to evolve. For example, people can interact with computing devices through mechanical devices (e.g., keyboards, mice, etc.), electrical devices (e.g., touch screens, touch pads, etc.), and optical devices (e.g., motion detectors, cameras, etc.). Another way to interact with computing devices is through audio devices that capture and respond to human speech.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical components or features.
[0004] FIG. 1 is a block diagram of an illustrative voice interaction computing architecture that includes a local audio device and a remote speech processing service.
[0005] FIGS. 2-4 are flow diagrams illustrating example processes for detecting command expressions that may be performed by a local audio device in conjunction with a remote speech processing service.
DETAILED DESCRIPTION
[0006] This disclosure pertains generally to a speech interface system that provides or facilitates speech-based interactions with a user. The system includes a local device having a microphone that captures audio containing user speech. Spoken user commands may be prefaced by a keyword, referred to as a trigger expression or wake expression. Audio following a trigger expression may be streamed to a remote service for speech recognition and the service may respond by performing a function or providing a command to be performed by the audio device.
[0007] Communications with the remote service may introduce response latency, which in most cases can be minimized within acceptable limits. Some spoken commands, however, may call for less latency. As an example, spoken commands related to certain types of media rendering, such as "stop", "pause", "hang up", and so forth, may need to be performed with less perceptible amounts of latency.
[0008] In accordance with various embodiments, certain command expressions, referred to herein as local commands or local command expressions, are detected by or at the local device rather than by the remote service. More specifically, the local device is configured to detect a trigger or alert expression, which indicates that subsequent speech is intended by the user to form a command. Upon detecting the trigger expression, the local device initiates a communication session with the remote service and begins streaming received audio to the service. In response, the remote service performs speech recognition on the received audio and attempts to identify user intent based on the recognized speech. In response to a recognized user intent, the remote service may perform a corresponding function. In some cases, the function may be performed in conjunction with the local device. For example, the remote service may send a command to the local device indicating that the local device should execute the command to perform a corresponding function.
[0009] Concurrently with the activities of the remote service, the local device monitors or analyzes the audio to detect an occurrence of a local command expression following the trigger expression. Upon detecting a local command expression in the audio, the local device immediately implements a corresponding function. In addition, further actions by the remote service are stopped or cancelled to avoid duplicate actions with respect to a single user utterance. Actions by the remote service may be stopped by explicitly notifying the remote service that the utterance has been acted upon locally, by terminating or cancelling a communications session, and/or by foregoing execution of any commands that are specified by the remote service in response to remote recognition of user speech.
[0010] FIG. 1 shows an example of a voice interaction system 100. The system 100 may include or may utilize a local voice-based audio device 102, which may be located within an environment 104 such as a home, and which may be used for interacting with a user 106. The voice interaction system 100 may also include or utilize a remote, network-based speech command service 108 that is configured to receive audio, to recognize speech in the audio, and to perform a function, referred to herein as a service-identified function, in response to the recognized speech. The service-identified function may be implemented by the speech command service 108 independently of the audio device, and/or may be implemented by providing a command to the audio device 102 for local execution.
[0011] In certain embodiments, the primary mode of user interaction with the audio device 102 may be through speech. For example, the audio device 102 may receive spoken command expressions from the user 106 and may provide services in response to the commands. The user may speak a predefined wake or trigger expression (e.g., "Awake"), which may be followed by commands or instructions (e.g., "I'd like to go to a movie. Please tell me what's playing at the local cinema."). Provided services may include performing actions or activities, rendering media, obtaining and/or providing information, providing information via generated or synthesized speech via the audio device 102, initiating Internet-based services on behalf of the user 106, and so forth.
[0012] The local audio device 102 and the speech command service 108 are configured to act in conjunction with each other to receive and respond to command expressions from the user 106. The command expressions may include local command expressions that are detected and acted upon by the local device 102 independently of the speech command service 108. The command expressions may also include commands that are interpreted and acted upon by or in conjunction with the remote speech command service 108.
[0013] The audio device 102 may have one or more microphones 110 and one or more audio speakers or transducers 112 to facilitate audio interactions with the user 106. The microphone 110 produces a microphone signal, also referred to as an input audio signal, representing audio from the environment 104, including sounds or expressions uttered by the user 106.
[0014] In some cases, the microphone 110 may comprise a microphone array that is used in conjunction with audio beamforming techniques to produce an input audio signal that is focused in a selectable direction. Similarly, a plurality of directional microphones 110 may be used to produce an audio signal corresponding to one of multiple available directions.
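As a rough illustration of the beamforming idea, a delay-and-sum beamformer combines the microphone signals with per-microphone delays chosen for a desired look direction. The following sketch (Python with NumPy; the array geometry, sample rate, and look angle are assumptions for illustration, not the patent's design) shows the core computation:

```python
# Minimal delay-and-sum beamformer sketch for a uniform linear array.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 16_000     # Hz
MIC_SPACING = 0.05       # meters between adjacent microphones

def delay_and_sum(mic_signals: np.ndarray, angle_deg: float) -> np.ndarray:
    """mic_signals: shape (num_mics, num_samples). Returns focused signal."""
    num_mics, _ = mic_signals.shape
    # Per-microphone delay (in samples) for a plane wave from angle_deg.
    delays = (np.arange(num_mics) * MIC_SPACING *
              np.sin(np.radians(angle_deg)) / SPEED_OF_SOUND * SAMPLE_RATE)
    out = np.zeros(mic_signals.shape[1])
    for m in range(num_mics):
        out += np.roll(mic_signals[m], -int(round(delays[m])))
    return out / num_mics   # coherent speech adds up; diffuse noise averages out

# Example: focus a 4-microphone array toward 30 degrees.
signals = np.random.randn(4, SAMPLE_RATE)   # 1 second of noise per mic
focused = delay_and_sum(signals, 30.0)
```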
[0015] The audio device 102 includes operational logic, which in many cases may comprise a processor 114 and memory 116. The processor 114 may include multiple processors and/or a processor having multiple cores. The processor 114 may also comprise or include a digital signal processor for processing audio signals.
[0016] The memory 116 may contain applications and programs in the form of computer-executable instructions that are executed by the processor 114 to perform acts or actions that implement desired functionality of the audio device 102, including the functionality specifically described below. The memory 116 may be a type of computer-readable storage media and may include volatile and nonvolatile memory. Thus, the memory 116 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology.
[0017] The audio device 102 may include a plurality of applications, services, and/or functions 118, referred to collectively below as functional components 118, which are executable by the processor 114 to provide services and functionality. The applications and other functional components 118 may include media playback services such as music players. Other services or operations performed or provided by the applications and other functional components 118 may include, as examples, requesting and consuming entertainment (e.g., gaming, finding and playing music, movies or other content, etc.), personal management (e.g., calendaring, note taking, etc.), online shopping, financial transactions, database inquiries, person-to-person voice communications, and so forth.
[0018] In some embodiments, the functional components 118 may be pre-installed on the audio device 102, and may implement core functionality of the audio device 102. In other embodiments, one or more of the applications or other functional components 118 may be installed by the user 106 or otherwise installed after the audio device 102 has been initialized by the user 106, and may implement additional or customized functionality as desired by the user 106.
[0019] The processor 114 may be configured by audio processing functionality or components 120 to process input audio signals generated by the microphone 110 and/or output audio signals provided to the speaker 112. As an example, the audio processing components 120 may implement acoustic echo cancellation to reduce audio echo generated by acoustic coupling between the microphone 110 and the speaker 112. The audio processing components 120 may also implement noise reduction to reduce noise in received audio signals, such as elements of input audio signals other than user speech. In certain embodiments, the audio processing components 120 may include one or more audio beamformers that are responsive to multiple microphones 110 to generate an audio signal that is focused in a direction from which user speech has been detected.
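Acoustic echo cancellation is commonly done with an adaptive filter that estimates the echo path from the speaker signal and subtracts the estimate from the microphone signal. Below is a minimal normalized-LMS sketch (Python with NumPy; the filter length and step size are assumptions, and production cancellers are considerably more elaborate than this):

```python
# Minimal NLMS acoustic echo canceller sketch.
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     taps: int = 128, mu: float = 0.1) -> np.ndarray:
    """mic: microphone samples; ref: speaker (far-end) samples.
    Both are float arrays of equal length. Returns echo-reduced signal."""
    w = np.zeros(taps)                    # adaptive estimate of the echo path
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]         # most recent reference samples
        echo_est = w @ x                  # predicted echo at sample n
        e = mic[n] - echo_est             # error = mic minus estimated echo
        out[n] = e
        w += mu * e * x / (x @ x + 1e-8)  # normalized LMS weight update
    return out
```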
[0020] The audio device 102 may also be configured to implement one or more expression detectors or speech recognition components 122, which may be used to detect a trigger expression in speech captured by the microphone 110. The term "trigger expression" is used herein to indicate a word, phrase, or other utterance that is used to signal the audio device 102 that subsequent user speech is intended by the user to be interpreted as a command. [0021] The one or more speech recognition components 122 may also be used to detect commands or command expressions in the speech captured by the microphone 110. The term "command expression" is used herein to indicate a word, phrase, or other utterance that corresponds to or is associated with a function that is to be performed by the audio device 102 or by a service or other device that is accessible to the audio device 102, such as the speech command service 108. For example, the words "stop", "pause", and "hang-up" may be used as command expressions. The "stop" and "pause" command expressions may indicate that media playback activities should be interrupted. The "hang-up" command expression may indicate that a current person-to-person communication should be terminated. Other command expressions, corresponding to different functions, may also be used. Command expressions may comprise conversation-style directives, such as "Find a nearby Italian restaurant."
[0022] Command expressions may include local command expressions that are to be interpreted by the audio device 102 without relying on the speech command service 108. Generally, local command expressions are relatively short expressions such as single words or short phrases, which can be easily detected by the audio device 102. Local command expressions may correspond to device functions for which relatively low response latencies are desired, such as media control or media playback control functions. The services of the speech command service 108 may be utilized for other command expressions for which greater response latencies are acceptable. Command expressions that are to be acted upon by the speech command service will be referred to herein as remote command expressions.
[0023] In some cases, the speech recognition components 122 may be implemented using automated speech recognition (ASR) techniques. For example, large vocabulary speech recognition techniques may be used for keyword detection, and the output of the speech recognition may be monitored for occurrences of the keyword. As an example, the speech recognition may use hidden Markov models and Gaussian mixture models to recognize voice input and to provide a continuous word stream corresponding to the voice input. The word stream may then be monitored to detect one or more specified words or expressions.
[0024] Alternatively, the speech recognition components 122 may be implemented by one or more keyword spotters. A keyword spotter is a functional component or algorithm that evaluates an audio signal to detect the presence of one or more predefined words or expressions in the audio signal. Generally, a keyword spotter uses simplified ASR techniques to detect a specific word or a limited number of words rather than attempting to recognize a large vocabulary. For example, a keyword spotter may provide a notification when a specified word is detected in a voice signal, rather than providing a textual or word-based output. A keyword spotter using these techniques may compare different words based on hidden Markov models (HMMs), which represent words as series of states. Generally, an utterance is analyzed by comparing its model to a keyword model and to a background model. Comparing the model of the utterance with the keyword model yields a score that represents the likelihood that the utterance corresponds to the keyword. Comparing the model of the utterance with the background model yields a score that represents the likelihood that the utterance corresponds to a generic word other than the keyword. The two scores can be compared to determine whether the keyword was uttered.
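The keyword-versus-background scoring just described can be summarized in a few lines. In the sketch below (Python; the two score functions are placeholders standing in for the HMM log-likelihood computations, which the patent does not detail), the spotter fires only when the keyword model explains the audio better than the background model by some margin:

```python
# Sketch of HMM-style keyword spotting by score comparison.
# keyword_score and background_score stand in for log-likelihoods
# computed by the keyword HMM and the background (filler) model.

def keyword_score(utterance: list[float]) -> float:
    return -sum((x - 1.0) ** 2 for x in utterance)   # placeholder likelihood

def background_score(utterance: list[float]) -> float:
    return -sum(x ** 2 for x in utterance)           # placeholder likelihood

def spot_keyword(utterance: list[float], threshold: float = 0.0) -> bool:
    """Fire only if the keyword model beats the background model by at
    least `threshold` (a log-likelihood-ratio decision)."""
    llr = keyword_score(utterance) - background_score(utterance)
    return llr > threshold

print(spot_keyword([0.9, 1.1, 1.0]))   # resembles the keyword model: True
```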
[0025] The audio device 102 may further comprise control functionality 124, referred to herein as a controller or control logic, that is configured to interact with the other components of the audio device 102 in order to implement the logical functionality of the audio device 102.
[0026] The control logic 124, the audio processing components 120, the speech recognition components 122, and the functional components 118 may comprise executable instructions, programs, and/or program modules that are stored in the memory 116 and executed by the processor 114.
[0027] The speech command service 108 may in some instances be part of a network-accessible computing platform that is maintained and accessible via a network 126 such as the Internet. Network-accessible computing platforms such as this may be referred to using terms such as "on-demand computing", "software as a service (SaaS)", "platform computing", "network-accessible platform", "cloud services", "data centers", and so forth.
[0028] The audio device 102 and/or the speech command service 108 may communicatively couple to the network 126 via wired technologies (e.g., wires, universal serial bus (USB), fiber optic cable, etc.), wireless technologies (e.g., radio frequencies (RF), cellular, mobile telephone networks, satellite, Bluetooth, etc.), or other connection technologies. The network 126 is representative of any type of communication network, including data and/or voice networks, and may be implemented using wired infrastructure (e.g., coaxial cable, fiber optic cable, etc.), wireless infrastructure (e.g., RF, cellular, microwave, satellite, Bluetooth®, etc.), and/or other connection technologies.
[0029] Although the audio device 102 is described herein as a voice- controlled or speech-based interface device, the techniques described herein may be implemented in conjunction with various different types of devices, such as telecommunications devices and components, hands-free devices, entertainment devices, media playback devices, and so forth.
[0030] The speech command service 108 generally provides functionality for receiving an audio stream from the audio device 102, recognizing speech in the audio stream, determining user intent from the recognized speech, and performing an action or service in response to the user intent. The action may in some cases be performed in conjunction with the audio device 102, and in these cases the speech command service 108 may return a response to the audio device 102 indicating a command that is to be executed by the audio device 102.
[0031] The speech command service 108 includes operational logic, which in many cases may comprise one or more servers, computers, and/or processors 128. The speech command service 108 may also have memory 130 containing applications and programs in the form of instructions that are executed by the processor 128 to perform acts or actions that implement desired functionality of the speech command service 108, including the functionality specifically described herein. The memory 130 may be a type of computer storage media and may include volatile and nonvolatile memory. Thus, the memory 130 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology.
[0032] Among other logical and physical components not specifically shown, the speech command service 108 may comprise speech recognition components 132. The speech recognition components 132 may include automatic speech recognition (ASR) functionality that recognizes human speech in an audio signal.
[0033] The speech command service 108 may also comprise a natural language understanding (NLU) component 134 that determines user intent based on recognized speech.
[0034] The speech command service 108 may also comprise a command interpreter and action dispatcher 136 (referred to below simply as a command interpreter 136) that determines functions or commands corresponding to user intents. In some cases, commands may correspond to functions that are to be performed at least in part by the audio device 102, and the command interpreter 136 may in those cases provide responses to the audio device 102 indicating commands for implementing such functions. Examples of commands or functions that may be performed by the audio device 102 in response to directives from the command interpreter 136 include playing music or other media, increasing/decreasing the volume of the speaker 112, generating audible speech through the speaker 112, initiating certain types of communications with users of similar devices, and so forth.
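As a rough sketch of how a command interpreter such as the command interpreter 136 might map determined intents to device-executable commands, a table-driven dispatch is one plausible shape. The intent names and command payloads below are invented for illustration and do not appear in this disclosure.

    # Hypothetical mapping from NLU intents to commands for the device.
    INTENT_TO_COMMAND = {
        "media.stop":      {"target": "device", "command": "stop_playback"},
        "media.play":      {"target": "device", "command": "start_playback"},
        "volume.increase": {"target": "device", "command": "volume_up"},
    }

    def interpret(intent: str) -> dict:
        # Return the command the audio device should execute for the
        # given intent, or an empty dict if the service itself fulfills
        # the intent without involving the device.
        return INTENT_TO_COMMAND.get(intent, {})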
[0035] Note that the speech command service 108 may also perform functions, in response to speech recognized from received audio, that involve entities or devices that are not shown in FIG. 1. For example, the speech command service 108 may interact with other network-based services to obtain information or services on behalf of the user 106. Furthermore, the speech command service 108 may itself have various elements and functionality that may be responsive to speech uttered by the user 106.
[0036] In operation, the microphone 110 of the audio device 102 captures or receives audio containing speech of the user 106. The audio is processed by the audio processing components 120, and the processed audio is received by the speech recognition components 122. The speech recognition components 122 analyze the audio to detect occurrences of a trigger expression in the speech contained in the audio. Upon detection of the trigger expression, the controller 124 begins sending or streaming received audio to the speech command service 108, along with a request for the speech command service 108 to recognize and interpret the user speech and to initiate a function corresponding to any interpreted intent.
[0037] Concurrently with sending the audio to the speech command service 108, the speech recognition components 122 continue to analyze the received audio to detect an occurrence of a local command expression in the user speech. Upon detection of a local command expression, the controller 124 initiates or performs a device function that corresponds to the local command expression. For example, in response to the local command expression "stop", the controller 124 may initiate a function that stops media playback. The controller 124 may interact with one or more of the functional components 118 when initiating or performing the function.
[0038] Meanwhile, the speech command service 108, in response to receiving the audio, concurrently analyzes the audio to recognize speech, to determine a user intent, and to determine a service-identified function that is to be implemented in response to the user intent. However, after locally detecting and acting upon the local command expression, the audio device 102 may take actions to cancel, nullify, or invalidate any service-identified functions that may eventually be initiated by the speech command service 108. For example, the audio device 102 may cancel its previous request by sending a cancellation message to the speech command service 108 and/or by stopping the streaming of the audio to the speech command service 108. As another example, the audio device may ignore or discard any responses or service-specified commands that are received from the speech command service 108 in response to the earlier request. In some cases, the audio device may inform the speech command service 108 of actions that have been performed locally in response to the local command expression, and the speech command service 108 may modify its subsequent behavior based on this information. For example, the speech command service 108 may forego actions that it might otherwise have performed in response to recognized speech in the received audio.
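One way to picture the device-side behavior of paragraphs [0036]-[0038] is the minimal controller sketch below. All of the collaborator interfaces (the spotter, the service client, and the function table) are assumptions standing in for whatever recognition and transport components a real device would use.

    class AudioDeviceController:
        # Minimal sketch of the local control logic; not a complete or
        # authoritative implementation.

        def __init__(self, spotter, service_client, local_functions):
            self.spotter = spotter                  # local keyword spotter
            self.service = service_client           # remote speech command service
            self.local_functions = local_functions  # expression -> callable

        def on_audio(self, audio_frame, session_id):
            # Stream audio to the remote service while continuing to
            # listen locally for a command expression.
            self.service.stream(session_id, audio_frame)
            expression = self.spotter.detect(audio_frame)
            if expression in self.local_functions:
                # Act locally first, then nullify the pending remote
                # request so the service-identified function is not
                # executed a second time.
                self.local_functions[expression]()
                self.service.cancel(session_id)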
[0039] FIG. 2 illustrates an example method 200 that may be performed by the audio device 102 in conjunction with the speech command service 108 in order to recognize and respond to user speech. The method 200 will be described in the context of the system 100 of FIG. 1, although the method 200 may also be performed in other environments and may be implemented in different ways.
[0040] Actions on the left side of FIG. 2 are performed at or by the local audio device 102. Actions on the right side of FIG. 2 are performed at or by the remote speech command service 108.
[0041] An action 202 comprises receiving an audio signal that has been captured by or in conjunction with the microphone 110. The audio signal contains or represents audio from the environment 104, and may contain user speech. The audio signal may be an analog electrical signal or may comprise a digital signal such as a digital audio stream.
[0042] An action 204 comprises detecting an occurrence of a trigger expression in the received audio and/or in the user speech. This may be performed by the speech recognition components 122 as described above, which may in some embodiments comprise keyword spotters. If the trigger expression is not detected, the action 204 is repeated in order to continuously monitor for occurrences of the trigger expression. The remaining actions shown in FIG. 2 are performed in response to detecting the trigger expression.

[0043] If the trigger expression is detected in the action 204, an action 206 is performed, comprising sending subsequently received audio to the speech command service 108 along with a service request 208 for the speech command service 108 to recognize speech in the audio and to implement a function corresponding to the recognized speech. Functions initiated by the speech command service 108 in this manner are referred to herein as service-identified functions, and may in certain cases be performed in conjunction with the audio device 102. For example, a function may be initiated by sending a command to the audio device 102.
[0044] The sending 206 may comprise streaming or otherwise transmitting a digital audio stream 210 to the speech command service 108, representing or containing audio that is received from the microphone 110 subsequent to detection of the trigger expression. In certain embodiments, the action 206 may comprise opening or initiating a communication session between the audio device 102 and the speech command service 108. In particular, the request 208 may be used to establish a communication session with the speech command service 108 for the purpose of recognizing speech, understanding intent, and determining actions or functions to be performed in response to user speech. The request 208 may be followed or accompanied by the streamed audio 210. In some cases, the audio stream 210 provided to the speech command service 108 may include portions of received audio beginning at a time just prior to utterance of the trigger expression.

[0045] The communication session may be associated with a communication or session identifier (ID) that identifies the communication session established between the audio device 102 and the speech command service 108. The session ID may be used or included in future communications relating to a particular user utterance or audio stream. In some cases, the session ID may be generated by the audio device 102 and provided in the request 208 to the speech command service 108. Alternatively, the session ID may be generated by the speech command service 108 and provided by the speech command service 108 in acknowledgment of the request 208. The term "request(ID)" is used herein to indicate a request having a particular session ID. A response from the speech command service 108 relating to the same session, request, or audio stream may be indicated by the term "response(ID)".
[0046] In certain embodiments, each communication session and corresponding session ID may correspond to a single user utterance. For example, the audio device 102 may establish a session upon detecting the trigger expression. The audio device 102 may then continue to stream audio to the speech command service 108 as part of the same session until the end of the user utterance. The speech command service 108 may provide responses to the audio device 102 through the session, using the same session ID. Responses may in some cases indicate commands to be executed by the audio device 102 in response to speech recognized by the speech command service 108 in the received audio 210. The communication session may remain open until the audio device 102 receives a response from the speech command service 108 or until the audio device 102 cancels the request.
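The request(ID)/response(ID) pairing might be modeled as follows. This is a sketch under assumptions: the UUID-based identifier, the dictionary message shapes, and the field names are illustrative only.

    import uuid

    def new_request() -> dict:
        # Device-generated session ID variant; in the alternative
        # described above, the service would generate the ID and return
        # it in its acknowledgment of the request instead.
        return {"type": "request", "session_id": str(uuid.uuid4())}

    def response_matches(request: dict, response: dict) -> bool:
        # A response relates to a given request, utterance, or audio
        # stream only if the session IDs agree.
        return response.get("session_id") == request["session_id"]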
[0047] The speech command service 108 receives the request 208 and audio stream 210 in an action 212. In response, the speech command service 108 performs an action 214 of recognizing speech in the received audio and determining a user intent as expressed by the recognized speech, using the speech recognition and natural language understanding components 132 and 134 of the speech command service 108. An action 216, performed by the command interpreter 136, comprises identifying and initiating a service-identified function in fulfillment of the determined user intent. The service-identified function may in some cases be performed by the speech command service 108, independently of the audio device 102. In other cases, the speech command service 108 may identify a function that is to be performed by the audio device 102, and may send a corresponding command to the audio device 102 for execution by the audio device 102.
[0048] Concurrently with the actions being performed by the speech command service 108, the local audio device 102 performs further actions to determine whether the user has uttered a local command expression and to perform a corresponding local function in response to any such uttered local command expression. Specifically, an action 218, performed in response to detecting the trigger expression in the action 204, comprises analyzing audio received in the action 202 to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. This may be performed by the speech recognition components 122 of the audio device 102 as described above, which may in some embodiments comprise keyword spotters.
[0049] In response to detecting the local command expression in the action 218, an action 220 is performed of immediately initiating a device function that has been associated with the local command expression. For example, the local command expression "stop" might be associated with a function that stops media playback.
[0050] Also in response to detecting the local command expression in the action 218, the audio device 102 performs an action 222 of stopping or cancelling the request 208 to the speech command service 108. This may include cancelling or nullifying implementation of the service-identified function that may have otherwise been implemented by the speech command service 108 in response to the received request 208 and accompanying audio 210.
[0051] In certain implementations, the action 222 may comprise sending an explicit notification or command to the speech command service 108, requesting that the speech command service 108 cancel any further recognition activities with respect to the service request 208, and/or to cancel implementation of any service-identified functions that may otherwise have been initiated in response to recognized speech. Alternatively, the audio device 102 may simply notify the speech command service 108 regarding any functions that have been performed locally in response to local recognition of the local command expression, and the speech command service 108 may respond by cancelling the service request 208 or by performing other actions as may be appropriate.
[0052] In certain implementations, the speech command service 108 may implement the service-identified function by identifying a command to be executed by the audio device 102. In response to receiving a notification that the service request 208 is to be cancelled, the speech command service 108 may forego sending the command to the audio device 102. Alternatively, the speech command service may be allowed to complete its processing and to send a command to the audio device 102, whereupon the audio device 102 may ignore the command or forego execution of the command.
[0053] In some implementations, the speech command service may be configured to notify the audio device 102 before initiating a service-identified function, and may delay implementation of the service-identified function until receiving permission from the audio device 102. In this case, the audio device 102 may be configured to deny such permission when the local command expression has been recognized locally.
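The permission step of paragraph [0053] might look like the small sketch below; the message fields and the grant/deny vocabulary are assumptions made for illustration.

    def permission_reply(pending_notification: dict,
                         local_command_detected: bool) -> dict:
        # The device denies permission when it has already acted on a
        # locally recognized command expression, and grants it otherwise.
        return {
            "session_id": pending_notification["session_id"],
            "permission": "deny" if local_command_detected else "grant",
        }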
[0054] The various approaches described above may be used in situations calling for different amounts of command latency. For example, waiting for communications from the speech command service before implementing a function may introduce relatively higher latencies, which may not be acceptable in some situations, but safeguards against duplicate or unintended actions. Immediately implementing a locally recognized command expression, and either ignoring subsequent commands from the speech command service or subsequently cancelling requests to the speech command service, may be more appropriate in situations where lower latencies are desired.
[0055] Note that the actions of the speech command service 108 shown in FIG. 2 are performed in parallel and asynchronously with the actions 218, 220, and 222 of the audio device 102. It is assumed in some implementations that the audio device 102 is able to detect and act upon the local command expression relatively quickly, so that it may perform the action 222 of cancelling the request 208 and subsequent processing by the speech command service 108 before the service-identified function of the action 216 has been implemented or executed.
[0056] FIG. 3 illustrates an example method 300 in which the speech command service 108 returns commands to the audio device 102, and in which the audio device 102 is configured to ignore the commands or forego execution of the commands in situations in which a local command expression has already been detected and acted upon by the audio device 102. Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
[0057] An action 302 comprises receiving an audio signal containing user speech. An action 304 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 3 are performed in response to detecting the trigger expression.
[0058] An action 306 comprises sending a request 308 and audio 310 to the speech command service 108. An action 312 comprises receiving the request 308 and the audio 310 at the speech command service 108. An action 314 comprises recognizing user speech and determining user intent based on the recognized user speech.
[0059] In response to the determined user intent, the speech command service 108 performs an action 316 of sending a command 318 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent. For example, the command may comprise a "stop" command, indicating that the audio device 102 is to stop playback of music.
[0060] An action 320, performed by the audio device 102, comprises receiving and executing the command. The action 320 is shown in a dashed box to indicate that it is performed conditionally, based on whether a local command expression has been detected and acted upon by the audio device 102. Specifically, the action 320 is not performed if a local command expression has been detected by the audio device 102.
[0061] Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 322 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 324 is performed of immediately initiating a local device function that has been associated with the local command expression.
[0062] Also in response to detecting the local command expression in the action 322, the audio device 102 performs an action 326 of foregoing execution of the received command 318. More specifically, any commands received from the speech command service 108 in response to the request 308 are discarded or ignored. Responses and commands corresponding to the request 308 may be identified by session IDs associated with the responses.
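As a sketch of how the action 326 might be implemented, the device can remember which sessions were superseded by local handling and discard any service commands carrying those session IDs. The structure below is illustrative only.

    cancelled_sessions = set()

    def on_local_command(session_id: str) -> None:
        # Remember sessions whose remote results must be discarded.
        cancelled_sessions.add(session_id)

    def on_service_command(response: dict, execute) -> None:
        # Execute a service-issued command (action 320) only if its
        # session was not superseded by local handling (action 326).
        if response.get("session_id") in cancelled_sessions:
            return  # forego execution; the device already acted locally
        execute(response["command"])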
[0063] If the local command expression is not detected in the action 322, the audio device performs the action 320 of executing the command 318 received from the speech command service 108.
[0064] FIG. 4 shows an example method 400 in which the audio device 102 is configured to actively cancel requests to the speech command service 108 after locally detecting a local command expression. Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
[0065] An action 402 comprises receiving an audio signal containing user speech. An action 404 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 4 are performed in response to detecting the trigger expression.

[0066] An action 406 comprises sending a request 408 and audio 410 to the speech command service 108. An action 412 comprises receiving the request 408 and the audio 410 at the speech command service 108. An action 414 comprises recognizing user speech and determining user intent based on the recognized user speech.
[0067] An action 416 comprises determining whether the request 408 has been cancelled by the audio device 102. As an example, the audio device 102 may send a cancellation message or may terminate the current communication session in order to cancel the request. If the request has been canceled by the audio device 102, no further action is taken by the speech command service. If the request has not been canceled, an action 418 is performed, which comprises sending a command 420 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent.
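The service-side check of the action 416 might be sketched as a simple gate before the command is sent; the function signature and the cancelled-session bookkeeping below are assumptions.

    def maybe_send_command(session_id: str, command: dict,
                           cancelled_sessions: set, send) -> bool:
        # Corresponds to actions 416 and 418: send the command only if
        # the audio device has not cancelled the request for this session.
        if session_id in cancelled_sessions:
            return False  # no further action by the speech command service
        send(session_id, command)
        return True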
[0068] An action 422, performed by the audio device 102, comprises receiving and executing the command. The action 422 is shown in a dashed box to indicate that it is performed conditionally, depending on whether a command has been sent and received from the speech command service 108, which in turn depends on whether the audio device 102 has cancelled the request 408.
[0069] Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 424 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 426 is performed of immediately initiating a local device function that has been associated with the local command expression.
[0070] Also in response to detecting the local command expression in the action 424, the audio device 102 performs an action 428 of requesting the speech command service 108 to cancel the request 408 and/or to cancel implementation of any service-identified functions that may have otherwise been performed in response to recognized speech in the audio received by the speech command service 108 from the audio device 102. This may comprise communicating with the speech command service 108, such as by sending a cancellation notification or request.
[0071] In some cases, the cancellation may comprise replying to a communication or notification from the speech command service 108 of a pending implementation of a service-identified function by the speech command service 108. In response to receiving such a notification, the audio device 102 may reply and may request cancellation of the pending implementation. Alternatively, the audio device 102 may cancel the implementation of any function that might have otherwise been performed in response to detecting the local command expression, and may instruct the speech command service 108 to proceed with implementation of the pending function.

[0072] If the local command expression is not detected in the action 424, the audio device 102 performs the action 422 of executing the command 420 received from the speech command service 108. The action 422 may occur asynchronously, upon receiving the command 420 from the speech command service 108.
[0073] The embodiments described above may be implemented programmatically, such as with computers, processors, digital signal processors, analog processors, and so forth. In other embodiments, however, one or more of the components, functions, or elements may be implemented using specialized or dedicated circuits, including analog circuits and/or digital logic circuits. The term "component", as used herein, is intended to include any hardware, software, logic, or combinations of the foregoing that are used to implement the functionality attributed to the component.
[0074] Although the subject matter has been described in language specific to structural features, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features described. Rather, the specific features are disclosed as illustrative forms of implementing the claims.
Clauses:
1. One or more non-transitory computer-readable media storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
receiving audio that contains user speech;

detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
streaming the received audio to a remote speech command service; and
analyzing the received audio to detect a local command expression following the trigger expression in the user speech, wherein the local command expression is associated with a device function;
initiating the device function in response to detecting the local command expression following the trigger expression in the user speech;
receiving a response from the remote speech command service, wherein the response indicates a command that is to be performed in response to speech recognized by the remote speech command service in the streamed audio;
executing the command indicated by the response if the local command expression is not detected following the trigger expression in the user speech; and
foregoing execution of the command indicated by the response if the local command expression is detected following the trigger expression in the user speech.
2. The one or more computer-readable media of clause 1, wherein the streaming is associated with a communication identifier and wherein the response indicates the communication identifier.

3. The one or more computer-readable media of clause 1, wherein the device function comprises a media control function.
4. The one or more computer-readable media of clause 1, the acts further comprising stopping the streaming of the received audio in response to detecting the command expression.
5. A method, comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
sending the received audio to a speech command service to recognize speech in the received audio and to implement a first function corresponding to the recognized speech; and
analyzing the received audio to detect a local command expression that follows the trigger expression in the received audio, wherein the local command expression is associated with a second function;
in response to detecting the local command expression that follows the trigger expression in the received audio:
initiating the second function; and
cancelling implementation of the first function.

6. The method of clause 5, wherein cancelling implementation of the first function comprises requesting the speech command service to cancel implementation of the first function.
7. The method of clause 5, further comprising receiving a communication from the speech command service indicating a pending implementation of the first function;
wherein cancelling implementation of the first function comprises requesting the speech command service to cancel the pending implementation of the first function.
8. The method of clause 5, further comprising receiving a command corresponding to the first function from the speech command service, wherein cancelling implementation of the first function comprises forgoing execution of the command received from the speech command service.
9. The method of clause 5, further comprising informing the speech command service that the second function has been initiated.
10. The method of clause 5, wherein cancelling implementation of the first function comprises informing the speech command service that the second function has been initiated.

11. The method of clause 5, wherein the second function comprises a media control function.
12. The method of clause 5, further comprising:
establishing a communication session with the speech command service in response to detecting the trigger expression in the audio; and
wherein cancelling implementation of the first function comprises terminating the communication session.
13. The method of clause 5, further comprising:
associating an identifier with the received audio;
receiving a response from the speech command service, wherein the response indicates the identifier and a command corresponding to the first function; and
wherein cancelling implementation of the first function comprises forgoing execution of the command.
14. A system, comprising:
one or more speech recognition components configured to recognize user speech in received audio, to detect a trigger expression in the user speech, and to detect a local command expression in the user speech;
control logic configured to perform acts in response to detection by the one or more speech recognition components of the trigger expression in the user speech, the acts comprising: sending the audio to a speech command service to recognize speech in the audio and to implement a first function corresponding to the recognized speech; and
in response to detection by the one or more speech recognition components of the local command expression in the user speech: (a) identifying a second function corresponding to the local command expression and (b) cancelling implementation of at least one of the first and second functions.
15. The system of clause 14, wherein the one or more speech recognition components comprise one or more keyword spotters.
16. The system of clause 14, wherein cancelling implementation of the at least one of the first and second functions comprises requesting the speech command service to cancel implementation of the first function.
17. The system of clause 14, wherein cancelling implementation of the at least one of the first and second functions comprises ignoring a command received from the speech command service.
18. The system of clause 14, wherein the second function comprises a media control function.

19. The system of clause 14, the acts further comprising stopping the sending of the audio in response to detection of the local command expression in the user speech.
20. The system of clause 14, wherein cancelling implementation of the at least one of the first and second functions comprises informing the speech command service that the second function has been initiated.

Claims

CLAIMS

What is claimed is:
1. A device storing computer-executable instructions that, when executed, cause one or more processors of the device to perform acts comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
streaming the received audio to a remote speech command service; and
analyzing the received audio to detect a local command expression following the trigger expression in the user speech, wherein the local command expression is associated with a device function;
initiating the device function in response to detecting the local command expression following the trigger expression in the user speech;
receiving a response from the remote speech command service, wherein the response indicates a command that is to be performed in response to speech recognized by the remote speech command service in the streamed audio;
executing the command indicated by the response if the local command expression is not detected following the trigger expression in the user speech; and

foregoing execution of the command indicated by the response if the local command expression is detected following the trigger expression in the user speech.
2. The device of claim 1, wherein the streaming is associated with a communication identifier and wherein the response indicates the communication identifier.
3. The device of claim 1, wherein the device function comprises a media control function.
4. The device of claim 1, the acts further comprising stopping the streaming of the received audio in response to detecting the command expression.
5. A method, comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
sending the received audio to a speech command service to recognize speech in the received audio and to implement a first function corresponding to the recognized speech; and
analyzing the received audio to detect a local command expression that follows the trigger expression in the received audio, wherein the local command expression is associated with a second function;
in response to detecting the local command expression that follows the trigger expression in the received audio:
initiating the second function; and
cancelling implementation of the first function.
6. The method of claim 5, wherein cancelling implementation of the first function comprises requesting the speech command service to cancel implementation of the first function.
7. The method of claim 5, further comprising receiving a communication from the speech command service indicating a pending implementation of the first function;
wherein cancelling implementation of the first function comprises requesting the speech command service to cancel the pending implementation of the first function.
8. The method of claim 5, further comprising receiving a command corresponding to the first function from the speech command service, wherein cancelling implementation of the first function comprises forgoing execution of the command received from the speech command service.
9. The method of claim 5, further comprising informing the speech command service that the second function has been initiated.
10. The method of claim 5, further comprising:
associating an identifier with the received audio;
receiving a response from the speech command service, wherein the response indicates the identifier and a command corresponding to the first function; and
wherein cancelling implementation of the first function comprises forgoing execution of the command.
11. A system, comprising:
one or more speech recognition components configured to recognize user speech in received audio, to detect a trigger expression in the user speech, and to detect a local command expression in the user speech;
control logic configured to perform acts in response to detection by the one or more speech recognition components of the trigger expression in the user speech, the acts comprising:
sending the audio to a speech command service to recognize speech in the audio and to implement a first function corresponding to the recognized speech; and
in response to detection by the one or more speech recognition components of the local command expression in the user speech: (a) identifying a second function corresponding to the local command expression and (b) cancelling implementation of at least one of the first and second functions.
12. The system of claim 11, wherein cancelling implementation of the at least one of the first and second functions comprises requesting the speech command service to cancel implementation of the first function.
13. The system of claim 11, wherein cancelling implementation of the at least one of the first and second functions comprises ignoring a command received from the speech command service.
14. The system of claim 11, the acts further comprising stopping the sending of the audio in response to detection of the local command expression in the user speech.
15. The system of claim 11, wherein cancelling implementation of the at least one of the first and second functions comprises informing the speech command service that the second function has been initiated.
EP14846698.0A 2013-09-20 2014-09-09 Local and remote speech processing Withdrawn EP3047481A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201314033302A 2013-09-20 2013-09-20
PCT/US2014/054700 WO2015041892A1 (en) 2013-09-20 2014-09-09 Local and remote speech processing

Publications (2)

Publication Number Publication Date
EP3047481A1 true EP3047481A1 (en) 2016-07-27
EP3047481A4 EP3047481A4 (en) 2017-03-01

Family

ID=52689281

Family Applications (1)

Application Number Title Priority Date Filing Date
EP14846698.0A Withdrawn EP3047481A4 (en) 2013-09-20 2014-09-09 Local and remote speech processing

Country Status (4)

Country Link
EP (1) EP3047481A4 (en)
JP (1) JP2016531375A (en)
CN (1) CN105793923A (en)
WO (1) WO2015041892A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190179610A1 (en) * 2017-12-12 2019-06-13 Amazon Technologies, Inc. Architecture for a hub configured to control a second device while a connection to a remote system is unavailable

Families Citing this family (152)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
EP4138075A1 (en) 2013-02-07 2023-02-22 Apple Inc. Voice trigger for a digital assistant
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
WO2014197335A1 (en) 2013-06-08 2014-12-11 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
EP3008641A1 (en) 2013-06-09 2016-04-20 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
CN105453026A (en) 2013-08-06 2016-03-30 苹果公司 Auto-activating smart responses based on activities from remote devices
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
CN106471570B (en) 2014-05-30 2019-10-01 苹果公司 Order single language input method more
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US10152299B2 (en) 2015-03-06 2018-12-11 Apple Inc. Reducing response latency of intelligent automated assistants
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US9870196B2 (en) * 2015-05-27 2018-01-16 Google Llc Selective aborting of online processing of voice inputs in a voice-enabled electronic device
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US9966073B2 (en) 2015-05-27 2018-05-08 Google Llc Context-sensitive dynamic update of voice to text model in a voice-enabled electronic device
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US10083697B2 (en) 2015-05-27 2018-09-25 Google Llc Local persisting of data for selectively offline capable voice action in a voice-enabled electronic device
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK179588B1 (en) 2016-06-09 2019-02-22 Apple Inc. Intelligent automated assistant in a home environment
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
DK201770383A1 (en) 2017-05-09 2018-12-14 Apple Inc. User interface for correcting recognition errors
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK201770429A1 (en) 2017-05-12 2018-12-14 Apple Inc. Low-latency intelligent automated assistant
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
DK201770411A1 (en) 2017-05-15 2018-12-20 Apple Inc. Multi-modal interfaces
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
US20180336275A1 (en) 2017-05-16 2018-11-22 Apple Inc. Intelligent automated assistant for media exploration
DK179549B1 (en) 2017-05-16 2019-02-12 Apple Inc. Far-field extension for digital assistant services
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
CN107146618A (en) * 2017-06-16 2017-09-08 北京云知声信息技术有限公司 Method of speech processing and device
CN107342083B (en) * 2017-07-05 2021-07-20 百度在线网络技术(北京)有限公司 Method and apparatus for providing voice service
US10599377B2 (en) 2017-07-11 2020-03-24 Roku, Inc. Controlling visual indicators in an audio responsive electronic device, and capturing and providing audio using an API, by native and non-native computing devices and services
BR112019002636A2 (en) * 2017-08-02 2019-05-28 Panasonic Ip Man Co Ltd information processing apparatus, speech recognition system and information processing method
US10455322B2 (en) 2017-08-18 2019-10-22 Roku, Inc. Remote control with presence sensor
US10777197B2 (en) 2017-08-28 2020-09-15 Roku, Inc. Audio responsive device with play/stop and tell me something buttons
US11062702B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Media system with multiple digital assistants
US11062710B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Local and cloud speech recognition
US10515637B1 (en) 2017-09-19 2019-12-24 Amazon Technologies, Inc. Dynamic speech processing
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
CN111629658B (en) * 2017-12-22 2023-09-15 瑞思迈传感器技术有限公司 Apparatus, system, and method for motion sensing
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US11145298B2 (en) 2018-02-13 2021-10-12 Roku, Inc. Trigger word detection with multiple digital assistants
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
CN108320749A (en) * 2018-03-14 2018-07-24 百度在线网络技术(北京)有限公司 Far field voice control device and far field speech control system
US10984799B2 (en) * 2018-03-23 2021-04-20 Amazon Technologies, Inc. Hybrid speech interface device
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT
DK201870355A1 (en) 2018-06-01 2019-12-16 Apple Inc. Virtual assistant operation in multi-device environments
DK179822B1 (en) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10944859B2 (en) 2018-06-03 2021-03-09 Apple Inc. Accelerated task performance
US11373645B1 (en) * 2018-06-18 2022-06-28 Amazon Technologies, Inc. Updating personalized data on a speech interface device
WO2020005241A1 (en) * 2018-06-27 2020-01-02 Google Llc Rendering responses to a spoken utterance of a user utilizing a local text-response map
JP7000268B2 (en) 2018-07-18 2022-01-19 株式会社東芝 Information processing equipment, information processing methods, and programs
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
WO2020096218A1 (en) * 2018-11-05 2020-05-14 Samsung Electronics Co., Ltd. Electronic device and operation method thereof
US10885912B2 (en) 2018-11-13 2021-01-05 Motorola Solutions, Inc. Methods and systems for providing a corrected voice command
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. User activity shortcut suggestions
DK201970510A1 (en) 2019-05-31 2021-02-11 Apple Inc Voice identification in digital assistant systems
US11227599B2 (en) 2019-06-01 2022-01-18 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
JP7451033B2 (en) * 2020-03-06 2024-03-18 アルパイン株式会社 data processing system
US11038934B1 (en) 2020-05-11 2021-06-15 Apple Inc. Digital assistant hardware abstraction
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
WO2023287471A1 (en) * 2021-07-15 2023-01-19 Arris Enterprises Llc Command services manager for secure sharing of commands to registered agents

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS58208799A (en) * 1982-05-28 1983-12-05 トヨタ自動車株式会社 Voice recognition system for vehicle
WO2000058942A2 (en) * 1999-03-26 2000-10-05 Koninklijke Philips Electronics N.V. Client-server speech recognition
JP2001005492A (en) * 1999-06-21 2001-01-12 Matsushita Electric Ind Co Ltd Voice recognizing method and voice recognition device
WO2004017161A2 (en) * 2002-08-16 2004-02-26 Nuasis Corporation High availability voip subsystem
KR100521154B1 (en) * 2004-02-03 2005-10-12 삼성전자주식회사 Apparatus and method processing call in voice/data integration switching system
US9848086B2 (en) * 2004-02-23 2017-12-19 Nokia Technologies Oy Methods, apparatus and computer program products for dispatching and prioritizing communication of generic-recipient messages to recipients
JP4483428B2 (en) * 2004-06-25 2010-06-16 日本電気株式会社 Speech recognition / synthesis system, synchronization control method, synchronization control program, and synchronization control apparatus
CN1728750B (en) * 2004-07-27 2012-07-18 邓里文 Method of packet voice communication
US20070258418A1 (en) * 2006-05-03 2007-11-08 Sprint Spectrum L.P. Method and system for controlling streaming of media to wireless communication devices
JP5380777B2 (en) * 2007-02-21 2014-01-08 ヤマハ株式会社 Audio conferencing equipment
US8090077B2 (en) * 2007-04-02 2012-01-03 Microsoft Corporation Testing acoustic echo cancellation and interference in VoIP telephones
JP4925906B2 (en) * 2007-04-26 2012-05-09 株式会社日立製作所 Control device, information providing method, and information providing program
CN101246687A (en) * 2008-03-20 2008-08-20 北京航空航天大学 Intelligent voice interaction system and method thereof
US8364481B2 (en) * 2008-07-02 2013-01-29 Google Inc. Speech recognition with parallel recognition tasks
US8019608B2 (en) * 2008-08-29 2011-09-13 Multimodal Technologies, Inc. Distributed speech recognition using one way communication
US8676904B2 (en) * 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
JP5244663B2 (en) * 2009-03-18 2013-07-24 KDDI Corp Speech recognition processing method and system for inputting text by speech
US9171541B2 (en) * 2009-11-10 2015-10-27 Voicebox Technologies Corporation System and method for hybrid processing in a natural language voice services environment
US9953653B2 (en) * 2011-01-07 2018-04-24 Nuance Communications, Inc. Configurable speech recognition system using multiple recognizers
JP5658641B2 (en) * 2011-09-15 2015-01-28 NTT Docomo Inc Terminal device, voice recognition program, voice recognition method, and voice recognition system
US20130085753A1 (en) * 2011-09-30 2013-04-04 Google Inc. Hybrid client/server speech recognition in a mobile device
US9620122B2 (en) * 2011-12-08 2017-04-11 Lenovo (Singapore) Pte. Ltd Hybrid speech recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2015041892A1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190179610A1 (en) * 2017-12-12 2019-06-13 Amazon Technologies, Inc. Architecture for a hub configured to control a second device while a connection to a remote system is unavailable
US10713007B2 (en) * 2017-12-12 2020-07-14 Amazon Technologies, Inc. Architecture for a hub configured to control a second device while a connection to a remote system is unavailable

Also Published As

Publication number Publication date
JP2016531375A (en) 2016-10-06
WO2015041892A1 (en) 2015-03-26
EP3047481A4 (en) 2017-03-01
CN105793923A (en) 2016-07-20

Similar Documents

Publication Title
WO2015041892A1 (en) Local and remote speech processing
US11600271B2 (en) Detecting self-generated wake expressions
US9672812B1 (en) Qualifying trigger expressions in speech-based systems
US10354649B2 (en) Altering audio to improve automatic speech recognition
CN108351872B (en) Method and system for responding to user speech
CN107004411B (en) Voice application architecture
EP3084633B1 (en) Attribute-based audio channel arbitration
US10079017B1 (en) Speech-responsive portable speaker
US9734845B1 (en) Mitigating effects of electronic audio sources in expression detection
US9098467B1 (en) Accepting voice commands based on user identity
US9324322B1 (en) Automatic volume attenuation for speech enabled devices
US9293134B1 (en) Source-specific speech interactions
US10297250B1 (en) Asynchronous transfer of audio data
US11004453B2 (en) Avoiding wake word self-triggering
KR20190075800A (en) Intelligent personal assistant interface system
US9224404B2 (en) Dynamic audio processing parameters with automatic speech recognition
CN102591455A (en) Selective transmission of voice data
US20240005918A1 (en) System for recognizing and responding to environmental noises
US10923122B1 (en) Pausing automatic speech recognition
EP2760019B1 (en) Dynamic audio processing parameters with automatic speech recognition

Legal Events

Code Title Description

PUAI Public reference made under Article 153(3) EPC to a published international application that has entered the European phase
Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed
Effective date: 20160302

AK Designated contracting states
Kind code of ref document: A1
Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the European patent
Extension state: BA ME

DAX Request for extension of the European patent (deleted)

A4 Supplementary search report drawn up and despatched
Effective date: 20170127

RIC1 Information provided on IPC code assigned before grant
IPC: G10L 15/22 20060101 ALI 20170123 BHEP
IPC: G10L 15/08 20060101 ALI 20170123 BHEP
IPC: G10L 15/00 20130101 ALI 20170123 BHEP
IPC: G10L 15/32 20130101 ALI 20170123 BHEP
IPC: G10L 15/30 20130101 AFI 20170123 BHEP

17Q First examination report despatched
Effective date: 20180719

STAA Information on the status of an EP patent application or granted EP patent
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn
Effective date: 20181130