EP3047481A1 - Local and remote speech processing - Google Patents

Local and remote speech processing

Info

Publication number
EP3047481A1
Authority
EP
European Patent Office
Prior art keywords
speech
command
expression
function
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP14846698.0A
Other languages
German (de)
French (fr)
Other versions
EP3047481A4 (en)
Inventor
Nikko Strom
Peter Spalding Vanlund
Bjorn HOFFMEISTER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Amazon Technologies Inc
Original Assignee
Amazon Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Amazon Technologies Inc
Publication of EP3047481A1
Publication of EP3047481A4

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • The speech command service 108 may also comprise a natural language understanding component (NLU) 134 that determines user intent based on recognized speech.
  • The speech command service 108 may also comprise a command interpreter and action dispatcher 136 (referred to below simply as a command interpreter 136) that determines functions or commands corresponding to user intents.
  • Some commands may correspond to functions that are to be performed at least in part by the audio device 102, and the command interpreter 136 may in those cases provide responses to the audio device 102 indicating commands for implementing such functions.
  • Examples of commands or functions that may be performed by the audio device in response to directives from the command interpreter 136 include playing music or other media, increasing/decreasing the volume of the speaker 112, generating audible speech through the speaker 112, initiating certain types of communications with users of similar devices, and so forth.
  • The speech command service 108 may also perform functions, in response to speech recognized from received audio, that involve entities or devices that are not shown in FIG. 1.
  • For example, the speech command service 108 may interact with other network-based services to obtain information or services on behalf of the user 106.
  • The speech command service 108 may itself have various elements and functionality that may be responsive to speech uttered by the user 106.
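To make the role of the command interpreter 136 concrete, the following is a minimal sketch (Python; the intent names, command names, and handler registry are illustrative assumptions, not part of the patent text) of how an NLU-determined intent might be mapped either to a service-side action or to a command returned to the audio device 102:

```python
# Hypothetical sketch of a command interpreter / action dispatcher.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Intent:
    name: str      # e.g. "stop_playback", "lookup_showtimes"
    slots: dict    # e.g. {"location": "local cinema"}

@dataclass
class DeviceCommand:
    name: str      # command the audio device executes locally
    session_id: str

# Functions the service can perform on its own (service-identified functions).
SERVICE_HANDLERS: dict[str, Callable[[Intent], None]] = {
    "lookup_showtimes": lambda intent: print("querying cinema listings..."),
}

# Intents fulfilled by sending a command back to the device for execution.
DEVICE_COMMANDS = {
    "stop_playback": "stop",
    "pause_playback": "pause",
    "end_call": "hang-up",
}

def interpret(intent: Intent, session_id: str) -> Optional[DeviceCommand]:
    """Return a command for the device, or handle the intent service-side."""
    if intent.name in DEVICE_COMMANDS:
        return DeviceCommand(DEVICE_COMMANDS[intent.name], session_id)
    handler = SERVICE_HANDLERS.get(intent.name)
    if handler:
        handler(intent)   # performed independently of the audio device
    return None
```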
  • In operation, the microphone 110 of the audio device 102 captures or receives audio containing speech of the user 106.
  • The audio is processed by the audio processing components 120 and the processed audio is received by the speech recognition components 122.
  • The speech recognition components 122 analyze the audio to detect occurrences of a trigger expression in the speech contained in the audio.
  • Upon detection of the trigger expression, the controller 124 begins sending or streaming received audio to the speech command service 108, along with a request for the speech command service 108 to recognize and interpret the user speech and to initiate a function corresponding to any interpreted intent.
  • Concurrently with sending the audio to the speech command service 108, the speech recognition components 122 continue to analyze the received audio to detect an occurrence of a local command expression in the user speech.
  • Upon detection of a local command expression, the controller 124 initiates or performs a device function that corresponds to the local command expression. For example, in response to the local command expression "stop", the controller 124 may initiate a function that stops media playback.
  • The controller 124 may interact with one or more of the functional components 118 when initiating or performing the function.
  • In response to receiving the audio, the speech command service 108 concurrently analyzes the audio to recognize speech, to determine a user intent, and to determine a service-identified function that is to be implemented in response to the user intent.
  • When a local command expression has been detected and acted upon locally, the audio device 102 may take actions to cancel, nullify, or invalidate any service-identified functions that may eventually be initiated by the speech command service 108.
  • For example, the audio device 102 may cancel its previous request by sending a cancellation message to the speech command service 108 and/or by stopping the streaming of the audio to the speech command service 108.
  • The audio device may also ignore or discard any responses or service-specified commands that are received from the speech command service 108 in response to the earlier request.
  • Alternatively, the audio device may inform the speech command service 108 of actions that have been performed locally in response to the local command expression, and the speech command service 108 may modify its subsequent behavior based on this information. For example, the speech command service 108 may forego actions that it might otherwise have performed in response to recognized speech in the received audio.
  • FIG. 2 illustrates an example method 200 that may be performed by the audio device 102 in conjunction with the speech command service 108 in order to recognize and respond to user speech.
  • The method 200 will be described in the context of the system 100 of FIG. 1, although the method 200 may also be performed in other environments and may be implemented in different ways.
  • Actions on the left side of FIG. 2 are performed at or by the local audio device 102. Actions on the right side of FIG. 2 are performed at or by the remote speech command service 108.
  • An action 202 comprises receiving an audio signal that has been captured by or in conjunction with the microphone 110.
  • The audio signal contains or represents audio from the environment 104, and may contain user speech.
  • The audio signal may be an analog electrical signal or may comprise a digital signal such as a digital audio stream.
  • An action 204 comprises detecting an occurrence of a trigger expression in the received audio and/or in the user speech. This may be performed by the speech recognition components 122 as described above, which may in some embodiments comprise keyword spotters. If the trigger expression is not detected, the action 204 is repeated in order to continuously monitor for occurrences of the trigger expression. The remaining actions shown in FIG. 2 are performed in response to detecting the trigger expression.
  • If the trigger expression is detected in the action 204, an action 206 is performed, comprising sending subsequently received audio to the speech command service 108 along with a service request 208 for the speech command service 108 to recognize speech in the audio and to implement a function corresponding to the recognized speech. Functions initiated by the speech command service 108 in this manner are referred to herein as service-identified functions, and may in certain cases be performed in conjunction with the audio device 102. For example, a function may be initiated by sending a command to the audio device 102.
  • The sending 206 may comprise streaming or otherwise transmitting a digital audio stream 210 to the speech command service 108, representing or containing audio that is received from the microphone 110 subsequent to detection of the trigger expression.
  • The action 206 may comprise opening or initiating a communication session between the audio device 102 and the speech command service 108.
  • The request 208 may be used to establish a communication session with the speech command service 108 for the purpose of recognizing speech, understanding intent, and determining actions or functions to be performed in response to user speech.
  • The request 208 may be followed or accompanied by the streamed audio 210.
  • The audio stream 210 provided to the speech command service 108 may include portions of received audio beginning at a time just prior to utterance of the trigger expression.
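The pre-roll behavior just described can be implemented with a simple ring buffer. The sketch below (Python; the frame size and the 500 ms of history are assumed figures for illustration, since the patent only says "a time just prior to utterance of the trigger expression") keeps a short history of audio frames so that, when the trigger fires, the stream sent to the service can begin slightly before the trigger utterance:

```python
# Minimal pre-roll buffer sketch with assumed frame and history sizes.
from collections import deque

FRAME_MS = 20       # duration of one captured audio frame
PRE_ROLL_MS = 500   # how much history to prepend to the stream

class PreRollBuffer:
    def __init__(self) -> None:
        self.frames: deque[bytes] = deque(maxlen=PRE_ROLL_MS // FRAME_MS)

    def push(self, frame: bytes) -> None:
        """Called for every captured frame, before any trigger is seen."""
        self.frames.append(frame)

    def drain(self) -> list[bytes]:
        """On trigger detection: return buffered history to send first."""
        history = list(self.frames)
        self.frames.clear()
        return history
```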
  • The communication session may be associated with a communication or session identifier (ID) that identifies the communication session established between the audio device 102 and the speech command service 108.
  • The session ID may be used or included in subsequent communications relating to a particular user utterance or audio stream.
  • The session ID may be generated by the audio device 102 and provided in the request 208 to the speech command service 108.
  • Alternatively, the session ID may be generated by the speech command service 108 and provided by the speech command service 108 in acknowledgment of the request 208.
  • The term "request(ID)" is used herein to indicate a request having a particular session ID.
  • Similarly, a response from the speech command service 108 relating to the same session, request, or audio stream may be indicated by the term "response(ID)".
  • Each communication session and corresponding session ID may correspond to a single user utterance.
  • The audio device 102 may establish a session upon detecting the trigger expression. The audio device 102 may then continue to stream audio to the speech command service 108 as part of the same session until the end of the user utterance.
  • The speech command service 108 may provide responses to the audio device 102 through the session, using the same session ID. Responses may in some cases indicate commands to be executed by the audio device 102 in response to speech recognized by the speech command service 108 in the received audio 210.
  • The communication session may remain open until the audio device 102 receives a response from the speech command service 108 or until the audio device 102 cancels the request.
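As a rough illustration of the request(ID)/response(ID) convention, the message shapes might look like the following. This is a sketch only; the field names and JSON encoding are assumptions, as the patent does not define a wire format:

```python
# Hypothetical message shapes for one session (one user utterance).
import json
import uuid

def make_request() -> dict:
    """request(ID): opens a session; audio frames follow under the same ID."""
    return {"type": "request", "session_id": str(uuid.uuid4())}

def make_cancel(session_id: str) -> dict:
    """Cancels the session, e.g. after a local command was handled locally."""
    return {"type": "cancel", "session_id": session_id}

def make_response(session_id: str, command: str | None) -> dict:
    """response(ID): may carry a command for the device to execute."""
    return {"type": "response", "session_id": session_id, "command": command}

req = make_request()
print(json.dumps(make_response(req["session_id"], "stop")))
```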
  • The speech command service 108 receives the request 208 and audio stream 210 in an action 212.
  • The speech command service 108 then performs an action 214 of recognizing speech in the received audio and determining a user intent as expressed by the recognized speech, using the speech recognition and natural language understanding components 132 and 134 of the speech command service 108.
  • An action 216, performed by the command interpreter 136, comprises identifying and initiating a service-identified function in fulfillment of the determined user intent.
  • The service-identified function may in some cases be performed by the speech command service 108, independently of the audio device 102. In other cases, the speech command service 108 may identify a function that is to be performed by the audio device 102, and may send a corresponding command to the audio device 102 for execution by the audio device 102.
  • Concurrently with the actions being performed by the speech command service 108, the local audio device 102 performs further actions to determine whether the user has uttered a local command expression and to perform a corresponding local function in response to any such uttered local command expression.
  • An action 218, performed in response to detecting the trigger expression in the action 204, comprises analyzing audio received in the action 202 to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. This may be performed by the speech recognition components 122 of the audio device 102 as described above, which may in some embodiments comprise keyword spotters.
  • If the local command expression is detected, an action 220 is performed of immediately initiating a device function that has been associated with the local command expression.
  • For example, the local command expression "stop" might be associated with a function that stops media playback.
  • In addition, the audio device 102 performs an action 222 of stopping or cancelling the request 208 to the speech command service 108. This may include cancelling or nullifying implementation of the service-identified function that might otherwise have been implemented by the speech command service 108 in response to the received request 208 and accompanying audio 210.
  • The action 222 may comprise sending an explicit notification or command to the speech command service 108, requesting that the speech command service 108 cancel any further recognition activities with respect to the service request 208 and/or cancel implementation of any service-identified functions that might otherwise have been initiated in response to recognized speech.
  • Alternatively, the audio device 102 may simply notify the speech command service 108 regarding any functions that have been performed locally in response to local recognition of the local command expression, and the speech command service 108 may respond by cancelling the service request 208 or by performing other actions as may be appropriate.
  • In some cases, the speech command service 108 may implement the service-identified function by identifying a command to be executed by the audio device 102. In response to receiving a notification that the service request 208 is to be cancelled, the speech command service 108 may forego sending the command to the audio device 102. Alternatively, the speech command service may be allowed to complete its processing and to send a command to the audio device 102, whereupon the audio device 102 may ignore the command or forego execution of the command.
  • In other embodiments, the speech command service may be configured to notify the audio device 102 before initiating a service-identified function, and may delay implementation of the service-identified function until receiving permission from the audio device 102.
  • The audio device 102 may be configured to deny such permission when the local command expression has been recognized locally.
  • The actions of the speech command service 108 shown in FIG. 2 are performed in parallel and asynchronously with the actions 218, 220, and 222 of the audio device 102. It is assumed in some implementations that the audio device 102 is able to detect and act upon the local command expression relatively quickly, so that it may perform the action 222 of cancelling the request 208 and subsequent processing by the speech command service 108 before the service-identified function of the action 216 has been implemented or executed.
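The parallel, asynchronous race between local spotting and the remote round trip might be sketched as follows (Python asyncio; the timing constants and helper names are illustrative assumptions, not the patent's implementation):

```python
# Sketch of method 200: local spotting races the remote request.
import asyncio

async def remote_request(cancel_event: asyncio.Event) -> None:
    """Stand-in for streaming audio and awaiting a service response."""
    await asyncio.sleep(0.8)        # assumed network + ASR/NLU latency
    if cancel_event.is_set():
        return                      # request was cancelled (action 222)
    print("executing service-identified command")

async def local_spotter(cancel_event: asyncio.Event) -> None:
    """Stand-in for the on-device keyword spotter (action 218)."""
    await asyncio.sleep(0.1)        # local detection is assumed faster
    print("local command 'stop' detected; stopping playback")  # action 220
    cancel_event.set()              # cancel the remote request (action 222)

async def main() -> None:
    cancel_event = asyncio.Event()
    await asyncio.gather(remote_request(cancel_event),
                         local_spotter(cancel_event))

asyncio.run(main())
```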
  • FIG. 3 illustrates an example method 300 in which the speech command service 108 returns commands to the audio device 102, and in which the audio device 102 is configured to ignore the commands or forego execution of the commands in situations in which a local command expression has already been detected and acted upon by the audio device 102.
  • Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
  • An action 302 comprises receiving an audio signal containing user speech.
  • An action 304 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 3 are performed in response to detecting the trigger expression.
  • An action 306 comprises sending a request 308 and audio 310 to the speech command service 108.
  • An action 312 comprises receiving the request 308 and the audio 310 at the speech command service 108.
  • An action 314 comprises recognizing user speech and determining user intent based on the recognized user speech.
  • The speech command service 108 then performs an action 316 of sending a command 318 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent.
  • For example, the command may comprise a "stop" command, indicating that the audio device 102 is to stop playback of music.
  • An action 320, performed by the audio device 102, comprises receiving and executing the command.
  • The action 320 is shown in a dashed box to indicate that it is performed conditionally, based on whether a local command expression has been detected and acted upon by the audio device 102. Specifically, the action 320 is not performed if a local command expression has been detected by the audio device 102.
  • Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 322 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 324 is performed of immediately initiating a local device function that has been associated with the local command expression.
  • If the local command expression has been detected, the audio device 102 performs an action 326 of foregoing execution of the received command 318. More specifically, any commands received from the speech command service 108 in response to the request 308 are discarded or ignored. Responses and commands corresponding to the request 308 may be identified by session IDs associated with the responses.
  • If the local command expression is not detected, the audio device performs the action 320 of executing the command 318 received from the speech command service 108.
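A device-side handler for this "discard by session ID" behavior might look like the following sketch (Python; the handler structure and function names are assumptions made for illustration):

```python
# Sketch of method 300: ignore late commands for locally handled sessions.
handled_locally: set[str] = set()   # session IDs resolved by action 324

def stop_playback() -> None:
    print("playback stopped locally")

def execute(command: str) -> None:
    print(f"executing service command: {command}")

def on_local_command(session_id: str) -> None:
    """Actions 324/326: act immediately, then mark the session as handled."""
    stop_playback()
    handled_locally.add(session_id)

def on_service_response(session_id: str, command: str) -> None:
    """Action 320, performed conditionally."""
    if session_id in handled_locally:   # action 326: discard/ignore
        handled_locally.discard(session_id)
        return
    execute(command)                    # action 320

on_local_command("session-42")
on_service_response("session-42", "stop")   # discarded, not executed twice
```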
  • FIG. 4 shows an example method 400 in which the audio device 102 is configured to actively cancel requests to the speech command service 108 after locally detecting a local command expression.
  • Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
  • An action 402 comprises receiving an audio signal containing user speech.
  • An action 404 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 4 are performed in response to detecting the trigger expression.
  • An action 406 comprises sending a request 408 and audio 410 to the speech command service 108.
  • An action 412 comprises receiving the request 408 and the audio 410 at the speech command service 108.
  • An action 414 comprises recognizing user speech and determining user intent based on the recognized user speech.
  • An action 416 comprises determining whether the request 408 has been cancelled by the audio device 102.
  • For example, the audio device 102 may send a cancellation message or may terminate the current communication session in order to cancel the request. If the request has been cancelled by the audio device 102, no further action is taken by the speech command service. If the request has not been cancelled, an action 418 is performed, which comprises sending a command 420 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent.
  • An action 422, performed by the audio device 102, comprises receiving and executing the command.
  • The action 422 is shown in a dashed box to indicate that it is performed conditionally, depending on whether a command has been sent and received from the speech command service 108, which in turn depends on whether the audio device 102 has cancelled the request 408.
  • Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 424 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 426 is performed of immediately initiating a local device function that has been associated with the local command expression.
  • If the local command expression is detected, the audio device 102 also performs an action 428 of requesting the speech command service 108 to cancel the request 408 and/or to cancel implementation of any service-identified functions that might otherwise have been performed in response to recognized speech in the audio received by the speech command service 108 from the audio device 102.
  • This may comprise communicating with the speech command service 108, such as by sending a cancellation notification or request.
  • In some cases, the cancellation may comprise replying to a communication or notification from the speech command service 108 of a pending implementation of a service-identified function by the speech command service.
  • The audio device 102 may reply and may request cancellation of the pending implementation.
  • Alternatively, the audio device 102 may cancel the implementation of any function that might otherwise have been performed locally in response to detecting the local command expression, and may instead instruct the speech command service 108 to proceed with implementation of the pending function.
  • If the local command expression is not detected in the action 424, the audio device 102 performs the action 422 of executing the command 420 received from the speech command service 108.
  • The action 422 may occur asynchronously, upon receiving the command 420 from the speech command service.
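On the service side, the cancellation check of action 416 could be as simple as consulting a set of cancelled session IDs before dispatching the command. The following is a sketch under assumed names; the patent does not prescribe a data structure:

```python
# Sketch of method 400, actions 412-418, on the service side.
from typing import Callable

cancelled_sessions: set[str] = set()

def on_cancel(session_id: str) -> None:
    """Handles a cancellation message or terminated session (action 428)."""
    cancelled_sessions.add(session_id)

def on_intent_ready(session_id: str, command: str,
                    send: Callable[[str, str], None]) -> None:
    """Action 416: check for cancellation; action 418: send the command."""
    if session_id in cancelled_sessions:    # request 408 was cancelled
        cancelled_sessions.discard(session_id)
        return                              # no further action is taken
    send(session_id, command)               # command 420 to the device

# Example: a cancellation arriving before NLU completes suppresses the command.
on_cancel("session-42")
on_intent_ready("session-42", "stop", lambda sid, cmd: print(sid, cmd))
```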
  • The embodiments described above may be implemented programmatically, such as with computers, processors, digital signal processors, analog processors, and so forth. In other embodiments, however, one or more of the components, functions, or elements may be implemented using specialized or dedicated circuits, including analog circuits and/or digital logic circuits.
  • The term "component", as used herein, is intended to include any hardware, software, logic, or combinations of the foregoing that are used to implement the functionality attributed to the component.
  • One or more non-transitory computer-readable media storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
  • A method comprising:
  • cancelling implementation of the first function comprises requesting the speech command service to cancel implementation of the first function.
  • cancelling implementation of the first function comprises requesting the speech command service to cancel the pending implementation of the first function.
  • cancelling implementation of the first function comprises terminating the communication session.
  • cancelling implementation of the first function comprises forgoing execution of the command.
  • A system comprising:
  • one or more speech recognition components configured to recognize user speech in received audio, to detect a trigger expression in the user speech, and to detect a local command expression in the user speech;
  • control logic configured to perform acts in response to detection by the one or more speech recognition components of the trigger expression in the user speech, the acts comprising: sending the audio to a speech command service to recognize speech in the audio and to implement a first function corresponding to the recognized speech;
  • cancelling implementation of the at least one of the first and second functions comprises requesting the speech command service to cancel implementation of the first function.
  • cancelling implementation of the at least one of the first and second functions comprises ignoring a command received from the speech command service.
  • cancelling implementation of the at least one of the first and second functions comprises informing the speech command service that the second function has been initiated.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A user device may be configured to detect a user-uttered trigger expression and to respond by interpreting subsequent words or phrases as commands. The commands may be recognized by sending audio containing the words or phrases to a remote service that is configured to perform speech recognition. Certain commands may be designated as local commands and may be detected locally rather than relying on the remote service. Upon detection of the trigger expression, audio is streamed to the remote service and also analyzed locally to detect utterances of local commands. Upon detecting a local command, a corresponding function is immediately initiated, and subsequent activities or responses by the remote service are canceled or ignored.

Description

LOCAL AND REMOTE SPEECH PROCESSING
RELATED APPLICATIONS
[0001] The present application claims priority to US Patent Application No. 14/033,302 filed on September 20, 2013, entitled "Local and Remote Speech Processing", which is incorporated by reference herein in its entirety.
BACKGROUND
[0002] Homes, offices, automobiles, and public spaces are becoming more wired and connected with the proliferation of computing devices such as notebook computers, tablets, entertainment systems, and portable communication devices. As computing devices evolve, the ways in which users interact with these devices continue to evolve. For example, people can interact with computing devices through mechanical devices (e.g., keyboards, mice, etc.), electrical devices (e.g., touch screens, touch pads, etc.), and optical devices (e.g., motion detectors, cameras, etc.). Another way to interact with computing devices is through audio devices that capture and respond to human speech.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical components or features.
[0004] FIG. 1 is a block diagram of an illustrative voice interaction computing architecture that includes a local audio device and a remote speech processing service.
[0005] FIGS. 2-4 are flow diagrams illustrating example processes for detecting command expressions that may be performed by a local audio device in conjunction with a remote speech processing service.
DETAILED DESCRIPTION
[0006] This disclosure pertains generally to a speech interface system that provides or facilitates speech-based interactions with a user. The system includes a local device having a microphone that captures audio containing user speech. Spoken user commands may be prefaced by a keyword, referred to as a trigger expression or wake expression. Audio following a trigger expression may be streamed to a remote service for speech recognition and the service may respond by performing a function or providing a command to be performed by the audio device.
[0007] Communications with the remote service may introduce response latency, which in most cases can be minimized within acceptable limits. Some spoken commands, however, may call for less latency. As an example, spoken commands related to certain types of media rendering, such as "stop", "pause", "hang up", and so forth, may need to be performed with less perceptible amounts of latency.
[0008] In accordance with various embodiments, certain command expressions, referred to herein as local commands or local command expressions, are detected by or at the local device rather than by the remote service. More specifically, the local device is configured to detect a trigger or alert expression, which indicates that subsequent speech is intended by the user to form a command. Upon detecting the trigger expression, the local device initiates a communication session with the remote service and begins streaming received audio to the service. In response, the remote service performs speech recognition on the received audio and attempts to identify user intent based on the recognized speech. In response to a recognized user intent, the remote service may perform a corresponding function. In some cases, the function may be performed in conjunction with the local device. For example, the remote service may send a command to the local device indicating that the local device should execute the command to perform a corresponding function.
[0009] Concurrently with the activities of the remote service, the local device monitors or analyzes the audio to detect an occurrence of a local command expression following the trigger expression. Upon detecting a local command expression in the audio, the local device immediately implements a corresponding function. In addition, further actions by the remote service are stopped or cancelled to avoid duplicate actions with respect to a single user utterance. Actions by the remote service may be stopped by explicitly notifying the remote service that the utterance has been acted upon locally, by terminating or cancelling a communications session, and/or by foregoing execution of any commands that are specified by the remote service in response to remote recognition of user speech.
[0010] FIG. 1 shows an example of a voice interaction system 100. The system 100 may include or may utilize a local voice-based audio device 102, which may be located within an environment 104 such as a home, and which may be used for interacting with a user 106. The voice interaction system 100 may also include or utilize a remote, network-based speech command service 108 that is configured to receive audio, to recognize speech in the audio, and to perform a function, referred to herein as a service-identified function, in response to the recognized speech. The service-identified function may be implemented by the speech command service 108 independently of the audio device, and/or may be implemented by providing a command to the audio device 102 for local execution.
[0011] In certain embodiments, the primary mode of user interaction with the audio device 102 may be through speech. For example, the audio device 102 may receive spoken command expressions from the user 106 and may provide services in response to the commands. The user may speak a predefined wake or trigger expression (e.g., "Awake"), which may be followed by commands or instructions (e.g., "I'd like to go to a movie. Please tell me what's playing at the local cinema."). Provided services may include performing actions or activities, rendering media, obtaining and/or providing information, providing information via generated or synthesized speech via the audio device 102, initiating Internet-based services on behalf of the user 106, and so forth.
[0012] The local audio device 102 and the speech command service 108 are configured to act in conjunction with each other to receive and respond to command expressions from the user 106. The command expressions may include local command expressions that are detected and acted upon by the local device 102 independently of the speech command service 108. The command expressions may also include commands that are interpreted and acted upon by or in conjunction with the remote speech command service 108.
[0013] The audio device 102 may have one or more microphones 110 and one or more audio speakers or transducers 112 to facilitate audio interactions with the user 106. The microphone 110 produces a microphone signal, also referred to as an input audio signal, representing audio from the environment 104, including sounds or expressions uttered by the user 106.
[0014] In some cases, the microphone 110 may comprise a microphone array that is used in conjunction with audio beamforming techniques to produce an input audio signal that is focused in a selectable direction. Similarly, a plurality of directional microphones 110 may be used to produce an audio signal corresponding to one of multiple available directions.
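As a rough illustration of the beamforming idea, a delay-and-sum beamformer combines the microphone signals with per-microphone delays chosen for a desired look direction. The following sketch (Python with NumPy; the array geometry, sample rate, and look angle are assumptions for illustration, not the patent's design) shows the core computation:

```python
# Minimal delay-and-sum beamformer sketch for a uniform linear array.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 16_000     # Hz
MIC_SPACING = 0.05       # meters between adjacent microphones

def delay_and_sum(mic_signals: np.ndarray, angle_deg: float) -> np.ndarray:
    """mic_signals: shape (num_mics, num_samples). Returns focused signal."""
    num_mics, _ = mic_signals.shape
    # Per-microphone delay (in samples) for a plane wave from angle_deg.
    delays = (np.arange(num_mics) * MIC_SPACING *
              np.sin(np.radians(angle_deg)) / SPEED_OF_SOUND * SAMPLE_RATE)
    out = np.zeros(mic_signals.shape[1])
    for m in range(num_mics):
        out += np.roll(mic_signals[m], -int(round(delays[m])))
    return out / num_mics   # coherent speech adds up; diffuse noise averages out

# Example: focus a 4-microphone array toward 30 degrees.
signals = np.random.randn(4, SAMPLE_RATE)   # 1 second of noise per mic
focused = delay_and_sum(signals, 30.0)
```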
[0015] The audio device 102 includes operational logic, which in many cases may comprise a processor 114 and memory 116. The processor 114 may include multiple processors and/or a processor having multiple cores. The processor 114 may also comprise or include a digital signal processor for processing audio signals.
[0016] The memory 116 may contain applications and programs in the form of computer-executable instructions that are executed by the processor 114 to perform acts or actions that implement desired functionality of the audio device 102, including the functionality specifically described below. The memory 116 may be a type of computer-readable storage media and may include volatile and nonvolatile memory. Thus, the memory 116 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology.
[0017] The audio device 102 may include a plurality of applications, services, and/or functions 118, referred to collectively below as functional components 118, which are executable by the processor 114 to provide services and functionality. The applications and other functional components 118 may include media playback services such as music players. Other services or operations performed or provided by the applications and other functional components 118 may include, as examples, requesting and consuming entertainment (e.g., gaming, finding and playing music, movies or other content, etc.), personal management (e.g., calendaring, note taking, etc.), online shopping, financial transactions, database inquiries, person-to-person voice communications, and so forth.
[0018] In some embodiments, the functional components 118 may be pre-installed on the audio device 102, and may implement core functionality of the audio device 102. In other embodiments, one or more of the applications or other functional components 118 may be installed by the user 106 or otherwise installed after the audio device 102 has been initialized by the user 106, and may implement additional or customized functionality as desired by the user 106.
[0019] The processor 114 may be configured by audio processing functionality or components 120 to process input audio signals generated by the microphone 110 and/or output audio signals provided to the speaker 112. As an example, the audio processing components 120 may implement acoustic echo cancellation to reduce audio echo generated by acoustic coupling between the microphone 110 and the speaker 112. The audio processing components 120 may also implement noise reduction to reduce noise in received audio signals, such as elements of input audio signals other than user speech. In certain embodiments, the audio processing components 120 may include one or more audio beamformers that are responsive to multiple microphones 110 to generate an audio signal that is focused in a direction from which user speech has been detected.
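Acoustic echo cancellation is commonly done with an adaptive filter that estimates the echo path from the speaker signal and subtracts the estimate from the microphone signal. Below is a minimal normalized-LMS sketch (Python with NumPy; the filter length and step size are assumptions, and production cancellers are considerably more elaborate than this):

```python
# Minimal NLMS acoustic echo canceller sketch.
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     taps: int = 128, mu: float = 0.1) -> np.ndarray:
    """mic: microphone samples; ref: speaker (far-end) samples.
    Both are float arrays of equal length. Returns echo-reduced signal."""
    w = np.zeros(taps)                    # adaptive estimate of the echo path
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]         # most recent reference samples
        echo_est = w @ x                  # predicted echo at sample n
        e = mic[n] - echo_est             # error = mic minus estimated echo
        out[n] = e
        w += mu * e * x / (x @ x + 1e-8)  # normalized LMS weight update
    return out
```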
[0020] The audio device 102 may also be configured to implement one or more expression detectors or speech recognition components 122, which may be used to detect a trigger expression in speech captured by the microphone 110. The term "trigger expression" is used herein to indicate a word, phrase, or other utterance that is used to signal the audio device 102 that subsequent user speech is intended by the user to be interpreted as a command. [0021] The one or more speech recognition components 122 may also be used to detect commands or command expressions in the speech captured by the microphone 110. The term "command expression" is used herein to indicate a word, phrase, or other utterance that corresponds to or is associated with a function that is to be performed by the audio device 102 or by a service or other device that is accessible to the audio device 102, such as the speech command service 108. For example, the words "stop", "pause", and "hang-up" may be used as command expressions. The "stop" and "pause" command expressions may indicate that media playback activities should be interrupted. The "hang-up" command expression may indicate that a current person-to-person communication should be terminated. Other command expressions, corresponding to different functions, may also be used. Command expressions may comprise conversation-style directives, such as "Find a nearby Italian restaurant."
[0022] Command expressions may include local command expressions that are to be interpreted by the audio device 102 without relying on the speech command service 108. Generally, local command expressions are relatively short expressions such as single words or short phrases, which can be easily detected by the audio device 102. Local command expressions may correspond to device functions for which relatively low response latencies are desired, such as media control or media playback control functions. The services of the speech command service 108 may be utilized for other command expressions for which greater response latencies are acceptable. Command expressions that are to be acted upon by the speech command service will be referred to herein as remote command expressions.
[0023] In some cases, the speech recognition components 122 may be implemented using automated speech recognition (ASR) techniques. For example, large vocabulary speech recognition techniques may be used for keyword detection, and the output of the speech recognition may be monitored for occurrences of the keyword. As an example, the speech recognition may use hidden Markov models and Gaussian mixture models to recognize voice input and to provide a continuous word stream corresponding to the voice input. The word stream may then be monitored to detect one or more specified words or expressions.
[0024] Alternatively, the speech recognition components 122 may be implemented by one or more keyword spotters. A keyword spotter is a functional component or algorithm that evaluates an audio signal to detect the presence of one or more predefined words or expressions in the audio signal. Generally, a keyword spotter uses simplified ASR techniques to detect a specific word or a limited number of words rather than attempting to recognize a large vocabulary. For example, a keyword spotter may provide a notification when a specified word is detected in a voice signal, rather than providing a textual or word-based output. A keyword spotter using these techniques may compare different words based on hidden Markov models (HMMs), which represent words as series of states. Generally, an utterance is analyzed by comparing its model to a keyword model and to a background model. Comparing the model of the utterance with the keyword model yields a score that represents the likelihood that the utterance corresponds to the keyword. Comparing the model of the utterance with the background model yields a score that represents the likelihood that the utterance corresponds to a generic word other than the keyword. The two scores can be compared to determine whether the keyword was uttered.
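The keyword-versus-background scoring just described can be summarized in a few lines. In the sketch below (Python; the two score functions are placeholders standing in for the HMM log-likelihood computations, which the patent does not detail), the spotter fires only when the keyword model explains the audio better than the background model by some margin:

```python
# Sketch of HMM-style keyword spotting by score comparison.
# keyword_score and background_score stand in for log-likelihoods
# computed by the keyword HMM and the background (filler) model.

def keyword_score(utterance: list[float]) -> float:
    return -sum((x - 1.0) ** 2 for x in utterance)   # placeholder likelihood

def background_score(utterance: list[float]) -> float:
    return -sum(x ** 2 for x in utterance)           # placeholder likelihood

def spot_keyword(utterance: list[float], threshold: float = 0.0) -> bool:
    """Fire only if the keyword model beats the background model by at
    least `threshold` (a log-likelihood-ratio decision)."""
    llr = keyword_score(utterance) - background_score(utterance)
    return llr > threshold

print(spot_keyword([0.9, 1.1, 1.0]))   # resembles the keyword model: True
```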
[0025] The audio device 102 may further comprise control functionality 124, referred to herein as a controller or control logic, that is configured to interact with the other components of the audio device 102 in order to implement the logical functionality of the audio device 102.
[0026] The control logic 124, the audio processing components 120, the speech recognition components 122, and the functional components 118 may comprise executable instructions, programs, and/or program modules that are stored in the memory 116 and executed by the processor 114.
[0027] The speech command service 108 may in some instances be part of a network-accessible computing platform that is maintained and accessible via a network 126 such as the Internet. Network-accessible computing platforms such as this may be referred to using terms such as "on-demand computing", "software as a service (SaaS)", "platform computing", "network-accessible platform", "cloud services", "data centers", and so forth.
[0028] The audio device 102 and/or the speech command service 108 may communicatively couple to the network 126 via wired technologies (e.g., wires, universal serial bus (USB), fiber optic cable, etc.), wireless technologies (e.g., radio frequencies (RF), cellular, mobile telephone networks, satellite, Bluetooth, etc.), or other connection technologies. The network 126 is representative of any type of communication network, including data and/or voice networks, and may be implemented using wired infrastructure (e.g., coaxial cable, fiber optic cable, etc.), wireless infrastructure (e.g., RF, cellular, microwave, satellite, Bluetooth®, etc.), and/or other connection technologies.
[0029] Although the audio device 102 is described herein as a voice- controlled or speech-based interface device, the techniques described herein may be implemented in conjunction with various different types of devices, such as telecommunications devices and components, hands-free devices, entertainment devices, media playback devices, and so forth.
[0030] The speech command service 108 generally provides functionality for receiving an audio stream from the audio device 102, recognizing speech in the audio stream, determining user intent from the recognized speech, and performing an action or service in response to the user intent. The action may in some cases be performed in conjunction with the audio device 102, and in these cases the speech command service 108 may return a response to the audio device 102 indicating a command that is to be executed by the audio device 102.
[0031] The speech command service 108 includes operational logic, which in many cases may comprise one or more servers, computers, and/or processors 128. The speech command service 108 may also have memory 130 containing applications and programs in the form of instructions that are executed by the processor 128 to perform acts or actions that implement desired functionality of the speech command service 108, including the functionality specifically described herein. The memory 130 may be a type of computer storage media and may include volatile and nonvolatile memory. Thus, the memory 130 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology.
[0032] Among other logical and physical components not specifically shown, the speech command service 108 may comprise speech recognition components 132. The speech recognition components 132 may include automatic speech recognition (ASR) functionality that recognizes human speech in an audio signal.
[0033] The speech command service 108 may also comprise a natural language understanding (NLU) component 134 that determines user intent based on recognized speech.
[0034] The speech command service 108 may also comprise a command interpreter and action dispatcher 136 (referred to below simply as a command interpreter 136) that determines functions or commands corresponding to user intents. In some cases, commands may correspond to functions that are to be performed at least in part by the audio device 102, and the command interpreter 136 may in those cases provide responses to the audio device 102 indicating commands for implementing such functions. Examples of commands or functions that may be performed by the audio device 102 in response to directives from the command interpreter 136 include playing music or other media, increasing/decreasing the volume of the speaker 112, generating audible speech through the speaker 112, initiating certain types of communications with users of similar devices, and so forth.
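As a rough sketch of how a command interpreter such as the command interpreter 136 might map determined intents to device-executable commands, a table-driven dispatch is one plausible shape. The intent names and command payloads below are invented for illustration and do not appear in this disclosure.

    # Hypothetical mapping from NLU intents to commands for the device.
    INTENT_TO_COMMAND = {
        "media.stop":      {"target": "device", "command": "stop_playback"},
        "media.play":      {"target": "device", "command": "start_playback"},
        "volume.increase": {"target": "device", "command": "volume_up"},
    }

    def interpret(intent: str) -> dict:
        # Return the command the audio device should execute for the
        # given intent, or an empty dict if the service itself fulfills
        # the intent without involving the device.
        return INTENT_TO_COMMAND.get(intent, {})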
[0035] Note that the speech command service 108 may also perform functions, in response to speech recognized from received audio, that involve entities or devices that are not shown in FIG. 1. For example, the speech command service 108 may interact with other network-based services to obtain information or services on behalf of the user 106. Furthermore, the speech command service 108 may itself have various elements and functionality that may be responsive to speech uttered by the user 106.
[0036] In operation, the microphone 110 of the audio device 102 captures or receives audio containing speech of the user 106. The audio is processed by the audio processing components 120, and the processed audio is received by the speech recognition components 122. The speech recognition components 122 analyze the audio to detect occurrences of a trigger expression in the speech contained in the audio. Upon detection of the trigger expression, the controller 124 begins sending or streaming received audio to the speech command service 108, along with a request for the speech command service 108 to recognize and interpret the user speech and to initiate a function corresponding to any interpreted intent.
[0037] Concurrently with sending the audio to the speech command service 108, the speech recognition components 122 continue to analyze the received audio to detect an occurrence of a local command expression in the user speech. Upon detection of a local command expression, the controller 124 initiates or performs a device function that corresponds to the local command expression. For example, in response to the local command expression "stop", the controller 124 may initiate a function that stops media playback. The controller 124 may interact with one or more of the functional components 118 when initiating or performing the function.
[0038] Meanwhile, the speech command service 108, in response to receiving the audio, concurrently analyzes the audio to recognize speech, to determine a user intent, and to determine a service-identified function that is to be implemented in response to the user intent. However, after locally detecting and acting upon the local command expression, the audio device 102 may take actions to cancel, nullify, or invalidate any service-identified functions that may eventually be initiated by the speech command service 108. For example, the audio device 102 may cancel its previous request by sending a cancellation message to the speech command service 108 and/or by stopping the streaming of the audio to the speech command service 108. As another example, the audio device may ignore or discard any responses or service-specified commands that are received from the speech command service 108 in response to the earlier request. In some cases, the audio device may inform the speech command service 108 of actions that have been performed locally in response to the local command expression, and the speech command service 108 may modify its subsequent behavior based on this information. For example, the speech command service 108 may forego actions that it might otherwise have performed in response to recognized speech in the received audio.
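One way to picture the device-side behavior of paragraphs [0036]-[0038] is the minimal controller sketch below. All of the collaborator interfaces (the spotter, the service client, and the function table) are assumptions standing in for whatever recognition and transport components a real device would use.

    class AudioDeviceController:
        # Minimal sketch of the local control logic; not a complete or
        # authoritative implementation.

        def __init__(self, spotter, service_client, local_functions):
            self.spotter = spotter                  # local keyword spotter
            self.service = service_client           # remote speech command service
            self.local_functions = local_functions  # expression -> callable

        def on_audio(self, audio_frame, session_id):
            # Stream audio to the remote service while continuing to
            # listen locally for a command expression.
            self.service.stream(session_id, audio_frame)
            expression = self.spotter.detect(audio_frame)
            if expression in self.local_functions:
                # Act locally first, then nullify the pending remote
                # request so the service-identified function is not
                # executed a second time.
                self.local_functions[expression]()
                self.service.cancel(session_id)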
[0039] FIG. 2 illustrates an example method 200 that may be performed by the audio device 102 in conjunction with the speech command service 108 in order to recognize and respond to user speech. The method 200 will be described in the context of the system 100 of FIG. 1, although the method 200 may also be performed in other environments and may be implemented in different ways.
[0040] Actions on the left side of FIG. 2 are performed at or by the local audio device 102. Actions on the right side of FIG. 2 are performed at or by the remote speech command service 108.
[0041] An action 202 comprises receiving an audio signal that has been captured by or in conjunction with the microphone 110. The audio signal contains or represents audio from the environment 104, and may contain user speech. The audio signal may be an analog electrical signal or may comprise a digital signal such as a digital audio stream.
[0042] An action 204 comprises detecting an occurrence of a trigger expression in the received audio and/or in the user speech. This may be performed by the speech recognition components 122 as described above, which may in some embodiments comprise keyword spotters. If the trigger expression is not detected, the action 204 is repeated in order to continuously monitor for occurrences of the trigger expression. The remaining actions shown in FIG. 2 are performed in response to detecting the trigger expression.

[0043] If the trigger expression is detected in the action 204, an action 206 is performed, comprising sending subsequently received audio to the speech command service 108 along with a service request 208 for the speech command service 108 to recognize speech in the audio and to implement a function corresponding to the recognized speech. Functions initiated by the speech command service 108 in this manner are referred to herein as service-identified functions, and may in certain cases be performed in conjunction with the audio device 102. For example, a function may be initiated by sending a command to the audio device 102.
[0044] The sending 206 may comprise streaming or otherwise transmitting a digital audio stream 210 to the speech command service 108, representing or containing audio that is received from the microphone 110 subsequent to detection of the trigger expression. In certain embodiments, the action 206 may comprise opening or initiating a communication session between the audio device 102 and the speech command service 108. In particular, the request 208 may be used to establish a communication session with the speech command service 108 for the purpose of recognizing speech, understanding intent, and determining actions or functions to be performed in response to user speech. The request 208 may be followed or accompanied by the streamed audio 210. In some cases, the audio stream 210 provided to the speech command service 108 may include portions of received audio beginning at a time just prior to utterance of the trigger expression.

[0045] The communication session may be associated with a communication or session identifier (ID) that identifies the communication session established between the audio device 102 and the speech command service 108. The session ID may be used or included in future communications relating to a particular user utterance or audio stream. In some cases, the session ID may be generated by the audio device 102 and provided in the request 208 to the speech command service 108. Alternatively, the session ID may be generated by the speech command service 108 and provided by the speech command service 108 in acknowledgment of the request 208. The term "request(ID)" is used herein to indicate a request having a particular session ID. A response from the speech command service 108 relating to the same session, request, or audio stream may be indicated by the term "response(ID)".
[0046] In certain embodiments, each communication session and corresponding session ID may correspond to a single user utterance. For example, the audio device 102 may establish a session upon detecting the trigger expression. The audio device 102 may then continue to stream audio to the speech command service 108 as part of the same session until the end of the user utterance. The speech command service 108 may provide responses to the audio device 102 through the session, using the same session ID. Responses may in some cases indicate commands to be executed by the audio device 102 in response to speech recognized by the speech command service 108 in the received audio 210. The communication session may remain open until the audio device 102 receives a response from the speech command service 108 or until the audio device 102 cancels the request.
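The request(ID)/response(ID) pairing might be modeled as follows. This is a sketch under assumptions: the UUID-based identifier, the dictionary message shapes, and the field names are illustrative only.

    import uuid

    def new_request() -> dict:
        # Device-generated session ID variant; in the alternative
        # described above, the service would generate the ID and return
        # it in its acknowledgment of the request instead.
        return {"type": "request", "session_id": str(uuid.uuid4())}

    def response_matches(request: dict, response: dict) -> bool:
        # A response relates to a given request, utterance, or audio
        # stream only if the session IDs agree.
        return response.get("session_id") == request["session_id"]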
[0047] The speech command service 108 receives the request 208 and audio stream 210 in an action 212. In response, the speech command service 108 performs an action 214 of recognizing speech in the received audio and determining a user intent as expressed by the recognized speech, using the speech recognition and natural language understanding components 132 and 134 of the speech command service 108. An action 216, performed by the command interpreter 136, comprises identifying and initiating a service-identified function in fulfillment of the determined user intent. The service-identified function may in some cases be performed by the speech command service 108, independently of the audio device 102. In other cases, the speech command service 108 may identify a function that is to be performed by the audio device 102, and may send a corresponding command to the audio device 102 for execution by the audio device 102.
[0048] Concurrently with the actions being performed by the speech command service 108, the local audio device 102 performs further actions to determine whether the user has uttered a local command expression and to perform a corresponding local function in response to any such uttered local command expression. Specifically, an action 218, performed in response to detecting the trigger expression in the action 204, comprises analyzing audio received in the action 202 to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. This may be performed by the speech recognition components 122 of the audio device 102 as described above, which may in some embodiments comprise keyword spotters.
[0049] In response to detecting the local command expression in the action 218, an action 220 is performed of immediately initiating a device function that has been associated with the local command expression. For example, the local command expression "stop" might be associated with a function that stops media playback.
[0050] Also in response to detecting the local command expression in the action 218, the audio device 102 performs an action 222 of stopping or cancelling the request 208 to the speech command service 108. This may include cancelling or nullifying implementation of the service-identified function that may have otherwise been implemented by the speech command service 108 in response to the received request 208 and accompanying audio 210.
[0051] In certain implementations, the action 222 may comprise sending an explicit notification or command to the speech command service 108, requesting that the speech command service 108 cancel any further recognition activities with respect to the service request 208, and/or to cancel implementation of any service-identified functions that may otherwise have been initiated in response to recognized speech. Alternatively, the audio device 102 may simply notify the speech command service 108 regarding any functions that have been performed locally in response to local recognition of the local command expression, and the speech command service 108 may respond by cancelling the service request 208 or by performing other actions as may be appropriate.
[0052] In certain implementations, the speech command service 108 may implement the service-identified function by identifying a command to be executed by the audio device 102. In response to receiving a notification that the service request 208 is to be cancelled, the speech command service 108 may forego sending the command to the audio device 102. Alternatively, the speech command service may be allowed to complete its processing and to send a command to the audio device 102, whereupon the audio device 102 may ignore the command or forego execution of the command.
[0053] In some implementations, the speech command service may be configured to notify the audio device 102 before initiating a service-identified function, and may delay implementation of the service-identified function until receiving permission from the audio device 102. In this case, the audio device 102 may be configured to deny such permission when the local command expression has been recognized locally.
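The permission step of paragraph [0053] might look like the small sketch below; the message fields and the grant/deny vocabulary are assumptions made for illustration.

    def permission_reply(pending_notification: dict,
                         local_command_detected: bool) -> dict:
        # The device denies permission when it has already acted on a
        # locally recognized command expression, and grants it otherwise.
        return {
            "session_id": pending_notification["session_id"],
            "permission": "deny" if local_command_detected else "grant",
        }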
[0054] The various approaches described above may be used in situations calling for different amounts of command latency. For example, waiting for communications from the speech command service before implementing a function may introduce relatively higher latencies, which may not be acceptable in some situations, but safeguards against duplicate or unintended actions. Immediately implementing a locally recognized command expression, and either ignoring subsequent commands from the speech command service or subsequently cancelling requests to the speech command service, may be more appropriate in situations where lower latencies are desired.
[0055] Note that the actions of the speech command service 108 shown in FIG. 2 are performed in parallel and asynchronously with the actions 218, 220, and 222 of the audio device 102. It is assumed in some implementations that the audio device 102 is able to detect and act upon the local command expression relatively quickly, so that it may perform the action 222 of cancelling the request 208 and subsequent processing by the speech command service 108 before the service-identified function of the action 216 has been implemented or executed.
[0056] FIG. 3 illustrates an example method 300 in which the speech command service 108 returns commands to the audio device 102, and in which the audio device 102 is configured to ignore the commands or forego execution of the commands in situations in which a local command expression has already been detected and acted upon by the audio device 102. Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
[0057] An action 302 comprises receiving an audio signal containing user speech. An action 304 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 3 are performed in response to detecting the trigger expression.
[0058] An action 306 comprises sending a request 308 and audio 310 to the speech command service 108. An action 312 comprises receiving the request 308 and the audio 310 at the speech command service 108. An action 314 comprises recognizing user speech and determining user intent based on the recognized user speech.
[0059] In response to the determined user intent, the speech command service 108 performs an action 316 of sending a command 318 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent. For example, the command may comprise a "stop" command, indicating that the audio device 102 is to stop playback of music.
[0060] An action 320, performed by the audio device 102, comprises receiving and executing the command. The action 320 is shown in a dashed box to indicate that it is performed conditionally, based on whether a local command expression has been detected and acted upon by the audio device 102. Specifically, the action 320 is not performed if a local command expression has been detected by the audio device 102.
[0061] Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 322 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 324 is performed of immediately initiating a local device function that has been associated with the local command expression.
[0062] Also in response to detecting the local command expression in the action 322, the audio device 102 performs an action 326 of foregoing execution of the received command 318. More specifically, any commands received from the speech command service 108 in response to the request 308 are discarded or ignored. Responses and commands corresponding to the request 308 may be identified by session IDs associated with the responses.
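As a sketch of how the action 326 might be implemented, the device can remember which sessions were superseded by local handling and discard any service commands carrying those session IDs. The structure below is illustrative only.

    cancelled_sessions = set()

    def on_local_command(session_id: str) -> None:
        # Remember sessions whose remote results must be discarded.
        cancelled_sessions.add(session_id)

    def on_service_command(response: dict, execute) -> None:
        # Execute a service-issued command (action 320) only if its
        # session was not superseded by local handling (action 326).
        if response.get("session_id") in cancelled_sessions:
            return  # forego execution; the device already acted locally
        execute(response["command"])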
[0063] If the local command expression is not detected in the action 322, the audio device performs the action 320 of executing the command 318 received from the speech command service 108.
[0064] FIG. 4 shows an example method 400 in which the audio device 102 is configured to actively cancel requests to the speech command service 108 after locally detecting a local command expression. Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
[0065] An action 402 comprises receiving an audio signal containing user speech. An action 404 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 4 are performed in response to detecting the trigger expression.

[0066] An action 406 comprises sending a request 408 and audio 410 to the speech command service 108. An action 412 comprises receiving the request 408 and the audio 410 at the speech command service 108. An action 414 comprises recognizing user speech and determining user intent based on the recognized user speech.
[0067] An action 416 comprises determining whether the request 408 has been cancelled by the audio device 102. As an example, the audio device 102 may send a cancellation message or may terminate the current communication session in order to cancel the request. If the request has been canceled by the audio device 102, no further action is taken by the speech command service. If the request has not been canceled, an action 418 is performed, which comprises sending a command 420 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent.
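The service-side check of the action 416 might be sketched as a simple gate before the command is sent; the function signature and the cancelled-session bookkeeping below are assumptions.

    def maybe_send_command(session_id: str, command: dict,
                           cancelled_sessions: set, send) -> bool:
        # Corresponds to actions 416 and 418: send the command only if
        # the audio device has not cancelled the request for this session.
        if session_id in cancelled_sessions:
            return False  # no further action by the speech command service
        send(session_id, command)
        return True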
[0068] An action 422, performed by the audio device 102, comprises receiving and executing the command. The action 422 is shown in a dashed box to indicate that it is performed conditionally, depending on whether a command has been sent and received from the speech command service 108, which in turn depends on whether the audio device 102 has cancelled the request 408.
[0069] Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 424 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 426 is performed of immediately initiating a local device function that has been associated with the local command expression.
[0070] Also in response to detecting the local command expression in the action 424, the audio device 102 performs an action 428 of requesting the speech command service 108 to cancel the request 408 and/or to cancel implementation of any service-identified functions that may have otherwise been performed in response to recognized speech in the audio received by the speech command service 108 from the audio device 102. This may comprise communicating with the speech command service 108, such as by sending a cancellation notification or request.
[0071] In some cases, the cancellation may comprise replying to a communication or notification from the speech command service 108 of a pending implementation of a service-identified function by the speech command service 108. In response to receiving such a notification, the audio device 102 may reply and may request cancellation of the pending implementation. Alternatively, the audio device 102 may cancel the implementation of any function that might have otherwise been performed in response to detecting the local command expression, and may instruct the speech command service 108 to proceed with implementation of the pending function.

[0072] If the local command expression is not detected in the action 424, the audio device 102 performs the action 422 of executing the command 420 received from the speech command service 108. The action 422 may occur asynchronously, upon receiving the command 420 from the speech command service 108.
[0073] The embodiments described above may be implemented programmatically, such as with computers, processors, digital signal processors, analog processors, and so forth. In other embodiments, however, one or more of the components, functions, or elements may be implemented using specialized or dedicated circuits, including analog circuits and/or digital logic circuits. The term "component", as used herein, is intended to include any hardware, software, logic, or combinations of the foregoing that are used to implement the functionality attributed to the component.
[0074] Although the subject matter has been described in language specific to structural features, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features described. Rather, the specific features are disclosed as illustrative forms of implementing the claims.
Clauses:
1. One or more non-transitory computer-readable media storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
receiving audio that contains user speech;

detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
streaming the received audio to a remote speech command service; and
analyzing the received audio to detect a local command expression following the trigger expression in the user speech, wherein the local command expression is associated with a device function;
initiating the device function in response to detecting the local command expression following the trigger expression in the user speech;
receiving a response from the remote speech command service, wherein the response indicates a command that is to be performed in response to speech recognized by the remote speech command service in the streamed audio;
executing the command indicated by the response if the local command expression is not detected following the trigger expression in the user speech; and
foregoing execution of the command indicated by the response if the local command expression is detected following the trigger expression in the user speech.
2. The one or more computer-readable media of clause 1, wherein the streaming is associated with a communication identifier and wherein the response indicates the communication identifier.

3. The one or more computer-readable media of clause 1, wherein the device function comprises a media control function.
4. The one or more computer-readable media of clause 1, the acts further comprising stopping the streaming of the received audio in response to detecting the command expression.
5. A method, comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
sending the received audio to a speech command service to recognize speech in the received audio and to implement a first function corresponding to the recognized speech; and
analyzing the received audio to detect a local command expression that follows the trigger expression in the received audio, wherein the local command expression is associated with a second function;
in response to detecting the local command expression that follows the trigger expression in the received audio:
initiating the second function; and
cancelling implementation of the first function.

6. The method of clause 5, wherein cancelling implementation of the first function comprises requesting the speech command service to cancel implementation of the first function.
7. The method of clause 5, further comprising receiving a communication from the speech command service indicating a pending implementation of the first function;
wherein cancelling implementation of the first function comprises requesting the speech command service to cancel the pending implementation of the first function.
8. The method of clause 5, further comprising receiving a command corresponding to the first function from the speech command service, wherein cancelling implementation of the first function comprises forgoing execution of the command received from the speech command service.
9. The method of clause 5, further comprising informing the speech command service that the second function has been initiated.
10. The method of clause 5, wherein cancelling implementation of the first function comprises informing the speech command service that the second function has been initiated.

11. The method of clause 5, wherein the second function comprises a media control function.
12. The method of clause 5, further comprising:
establishing a communication session with the speech command service in response to detecting the trigger expression in the audio; and
wherein cancelling implementation of the first function comprises terminating the communication session.
13. The method of clause 5, further comprising:
associating an identifier with the received audio;
receiving a response from the speech command service, wherein the response indicates the identifier and a command corresponding to the first function; and
wherein cancelling implementation of the first function comprises forgoing execution of the command.
14. A system, comprising:
one or more speech recognition components configured to recognize user speech in received audio, to detect a trigger expression in the user speech, and to detect a local command expression in the user speech;
control logic configured to perform acts in response to detection by the one or more speech recognition components of the trigger expression in the user speech, the acts comprising: sending the audio to a speech command service to recognize speech in the audio and to implement a first function corresponding to the recognized speech; and
in response to detection by the one or more speech recognition components of the local command expression in the user speech: (a) identifying a second function corresponding to the local command expression and (b) cancelling implementation of at least one of the first and second functions.
15. The system of clause 14, wherein the one or more speech recognition components comprise one or more keyword spotters.
16. The system of clause 14, wherein cancelling implementation of the at least one of the first and second functions comprises requesting the speech command service to cancel implementation of the first function.
17. The system of clause 14, wherein cancelling implementation of the at least one of the first and second functions comprises ignoring a command received from the speech command service.
18. The system of clause 14, wherein the second function comprises a media control function.

19. The system of clause 14, the acts further comprising stopping the sending of the audio in response to detection of the local command expression in the user speech.
20. The system of clause 14, wherein cancelling implementation of the at least one of the first and second functions comprises informing the speech command service that the second function has been initiated.

Claims

CLAIMS

What is claimed is:
1. A device storing computer-executable instructions that, when executed, cause one or more processors of the device to perform acts comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
streaming the received audio to a remote speech command service; and
analyzing the received audio to detect a local command expression following the trigger expression in the user speech, wherein the local command expression is associated with a device function;
initiating the device function in response to detecting the local command expression following the trigger expression in the user speech;
receiving a response from the remote speech command service, wherein the response indicates a command that is to be performed in response to speech recognized by the remote speech command service in the streamed audio;
executing the command indicated by the response if the local command expression is not detected following the trigger expression in the user speech; and

foregoing execution of the command indicated by the response if the local command expression is detected following the trigger expression in the user speech.
2. The device of claim 1, wherein the streaming is associated with a communication identifier and wherein the response indicates the communication identifier.
3. The device of claim 1, wherein the device function comprises a media control function.
4. The device of claim 1, the acts further comprising stopping the streaming of the received audio in response to detecting the command expression.
5. A method, comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
sending the received audio to a speech command service to recognize speech in the received audio and to implement a first function corresponding to the recognized speech; and
analyzing the received audio to detect a local command expression that follows the trigger expression in the received audio, wherein the local command expression is associated with a second function;
in response to detecting the local command expression that follows the trigger expression in the received audio:
initiating the second function; and
cancelling implementation of the first function.
6. The method of claim 5, wherein cancelling implementation of the first function comprises requesting the speech command service to cancel implementation of the first function.
7. The method of claim 5, further comprising receiving a communication from the speech command service indicating a pending implementation of the first function;
wherein cancelling implementation of the first function comprises requesting the speech command service to cancel the pending implementation of the first function.
8. The method of claim 5, further comprising receiving a command corresponding to the first function from the speech command service, wherein cancelling implementation of the first function comprises forgoing execution of the command received from the speech command service.
9. The method of claim 5, further comprising informing the speech command service that the second function has been initiated.
10. The method of claim 5, further comprising:
associating an identifier with the received audio;
receiving a response from the speech command service, wherein the response indicates the identifier and a command corresponding to the first function; and
wherein cancelling implementation of the first function comprises forgoing execution of the command.
11. A system, comprising:
one or more speech recognition components configured to recognize user speech in received audio, to detect a trigger expression in the user speech, and to detect a local command expression in the user speech;
control logic configured to perform acts in response to detection by the one or more speech recognition components of the trigger expression in the user speech, the acts comprising:
sending the audio to a speech command service to recognize speech in the audio and to implement a first function corresponding to the recognized speech; and
in response to detection by the one or more speech recognition components of the local command expression in the user speech: (a) identifying a second function corresponding to the local command expression and (b) cancelling implementation of at least one of the first and second functions.
12. The system of claim 11, wherein cancelling implementation of the at least one of the first and second functions comprises requesting the speech command service to cancel implementation of the first function.
13. The system of claim 11, wherein cancelling implementation of the at least one of the first and second functions comprises ignoring a command received from the speech command service.
14. The system of claim 11, the acts further comprising stopping the sending of the audio in response to detection of the local command expression in the user speech.
15. The system of claim 11, wherein cancelling implementation of the at least one of the first and second functions comprises informing the speech command service that the second function has been initiated.
EP14846698.0A 2013-09-20 2014-09-09 Local and remote speech processing Withdrawn EP3047481A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201314033302A 2013-09-20 2013-09-20
PCT/US2014/054700 WO2015041892A1 (en) 2013-09-20 2014-09-09 Local and remote speech processing

Publications (2)

Publication Number Publication Date
EP3047481A1 true EP3047481A1 (en) 2016-07-27
EP3047481A4 EP3047481A4 (en) 2017-03-01

Family

ID=52689281

Family Applications (1)

Application Number Title Priority Date Filing Date
EP14846698.0A Withdrawn EP3047481A4 (en) 2013-09-20 2014-09-09 Local and remote speech processing

Country Status (4)

Country Link
EP (1) EP3047481A4 (en)
JP (1) JP2016531375A (en)
CN (1) CN105793923A (en)
WO (1) WO2015041892A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190179610A1 (en) * 2017-12-12 2019-06-13 Amazon Technologies, Inc. Architecture for a hub configured to control a second device while a connection to a remote system is unavailable

Families Citing this family (152)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
EP4138075A1 (en) 2013-02-07 2023-02-22 Apple Inc. Voice trigger for a digital assistant
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
WO2014197335A1 (en) 2013-06-08 2014-12-11 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
EP3008641A1 (en) 2013-06-09 2016-04-20 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
CN105453026A (en) 2013-08-06 2016-03-30 苹果公司 Auto-activating smart responses based on activities from remote devices
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
CN106471570B (en) 2014-05-30 2019-10-01 苹果公司 Order single language input method more
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US10152299B2 (en) 2015-03-06 2018-12-11 Apple Inc. Reducing response latency of intelligent automated assistants
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US9870196B2 (en) * 2015-05-27 2018-01-16 Google Llc Selective aborting of online processing of voice inputs in a voice-enabled electronic device
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US9966073B2 (en) 2015-05-27 2018-05-08 Google Llc Context-sensitive dynamic update of voice to text model in a voice-enabled electronic device
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US10083697B2 (en) 2015-05-27 2018-09-25 Google Llc Local persisting of data for selectively offline capable voice action in a voice-enabled electronic device
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK179588B1 (en) 2016-06-09 2019-02-22 Apple Inc. Intelligent automated assistant in a home environment
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
DK201770383A1 (en) 2017-05-09 2018-12-14 Apple Inc. User interface for correcting recognition errors
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK201770429A1 (en) 2017-05-12 2018-12-14 Apple Inc. Low-latency intelligent automated assistant
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
DK201770411A1 (en) 2017-05-15 2018-12-20 Apple Inc. Multi-modal interfaces
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
US20180336275A1 (en) 2017-05-16 2018-11-22 Apple Inc. Intelligent automated assistant for media exploration
DK179549B1 (en) 2017-05-16 2019-02-12 Apple Inc. Far-field extension for digital assistant services
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
CN107146618A (en) * 2017-06-16 2017-09-08 北京云知声信息技术有限公司 Method of speech processing and device
CN107342083B (en) * 2017-07-05 2021-07-20 百度在线网络技术(北京)有限公司 Method and apparatus for providing voice service
US10599377B2 (en) 2017-07-11 2020-03-24 Roku, Inc. Controlling visual indicators in an audio responsive electronic device, and capturing and providing audio using an API, by native and non-native computing devices and services
BR112019002636A2 (en) * 2017-08-02 2019-05-28 Panasonic Ip Man Co Ltd information processing apparatus, speech recognition system and information processing method
US10455322B2 (en) 2017-08-18 2019-10-22 Roku, Inc. Remote control with presence sensor
US10777197B2 (en) 2017-08-28 2020-09-15 Roku, Inc. Audio responsive device with play/stop and tell me something buttons
US11062702B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Media system with multiple digital assistants
US11062710B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Local and cloud speech recognition
US10515637B1 (en) 2017-09-19 2019-12-24 Amazon Technologies, Inc. Dynamic speech processing
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
CN111629658B (en) * 2017-12-22 2023-09-15 瑞思迈传感器技术有限公司 Apparatus, system, and method for motion sensing
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US11145298B2 (en) 2018-02-13 2021-10-12 Roku, Inc. Trigger word detection with multiple digital assistants
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
CN108320749A (en) * 2018-03-14 2018-07-24 百度在线网络技术(北京)有限公司 Far field voice control device and far field speech control system
US10984799B2 (en) * 2018-03-23 2021-04-20 Amazon Technologies, Inc. Hybrid speech interface device
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT
DK201870355A1 (en) 2018-06-01 2019-12-16 Apple Inc. Virtual assistant operation in multi-device environments
DK179822B1 (en) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10944859B2 (en) 2018-06-03 2021-03-09 Apple Inc. Accelerated task performance
US11373645B1 (en) * 2018-06-18 2022-06-28 Amazon Technologies, Inc. Updating personalized data on a speech interface device
WO2020005241A1 (en) * 2018-06-27 2020-01-02 Google Llc Rendering responses to a spoken utterance of a user utilizing a local text-response map
JP7000268B2 (en) 2018-07-18 2022-01-19 株式会社東芝 Information processing equipment, information processing methods, and programs
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
WO2020096218A1 (en) * 2018-11-05 2020-05-14 Samsung Electronics Co., Ltd. Electronic device and operation method thereof
US10885912B2 (en) 2018-11-13 2021-01-05 Motorola Solutions, Inc. Methods and systems for providing a corrected voice command
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. User activity shortcut suggestions
DK201970510A1 (en) 2019-05-31 2021-02-11 Apple Inc Voice identification in digital assistant systems
US11227599B2 (en) 2019-06-01 2022-01-18 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
JP7451033B2 (en) * 2020-03-06 2024-03-18 アルパイン株式会社 data processing system
US11038934B1 (en) 2020-05-11 2021-06-15 Apple Inc. Digital assistant hardware abstraction
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
WO2023287471A1 (en) * 2021-07-15 2023-01-19 Arris Enterprises Llc Command services manager for secure sharing of commands to registered agents

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS58208799A (en) * 1982-05-28 1983-12-05 トヨタ自動車株式会社 Voice recognition system for vehicle
WO2000058942A2 (en) * 1999-03-26 2000-10-05 Koninklijke Philips Electronics N.V. Client-server speech recognition
JP2001005492A (en) * 1999-06-21 2001-01-12 Matsushita Electric Ind Co Ltd Voice recognizing method and voice recognition device
WO2004017161A2 (en) * 2002-08-16 2004-02-26 Nuasis Corporation High availability voip subsystem
KR100521154B1 (en) * 2004-02-03 2005-10-12 삼성전자주식회사 Apparatus and method processing call in voice/data integration switching system
US9848086B2 (en) * 2004-02-23 2017-12-19 Nokia Technologies Oy Methods, apparatus and computer program products for dispatching and prioritizing communication of generic-recipient messages to recipients
JP4483428B2 (en) * 2004-06-25 2010-06-16 日本電気株式会社 Speech recognition / synthesis system, synchronization control method, synchronization control program, and synchronization control apparatus
CN1728750B (en) * 2004-07-27 2012-07-18 邓里文 Method of packet voice communication
US20070258418A1 (en) * 2006-05-03 2007-11-08 Sprint Spectrum L.P. Method and system for controlling streaming of media to wireless communication devices
JP5380777B2 (en) * 2007-02-21 2014-01-08 ヤマハ株式会社 Audio conferencing equipment
US8090077B2 (en) * 2007-04-02 2012-01-03 Microsoft Corporation Testing acoustic echo cancellation and interference in VoIP telephones
JP4925906B2 (en) * 2007-04-26 2012-05-09 株式会社日立製作所 Control device, information providing method, and information providing program
CN101246687A (en) * 2008-03-20 2008-08-20 北京航空航天大学 Intelligent voice interaction system and method thereof
US8364481B2 (en) * 2008-07-02 2013-01-29 Google Inc. Speech recognition with parallel recognition tasks
US8019608B2 (en) * 2008-08-29 2011-09-13 Multimodal Technologies, Inc. Distributed speech recognition using one way communication
US8676904B2 (en) * 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
JP5244663B2 (en) * 2009-03-18 2013-07-24 KDDI Corp Speech recognition processing method and system for inputting text by speech
US9171541B2 (en) * 2009-11-10 2015-10-27 Voicebox Technologies Corporation System and method for hybrid processing in a natural language voice services environment
US9953653B2 (en) * 2011-01-07 2018-04-24 Nuance Communications, Inc. Configurable speech recognition system using multiple recognizers
JP5658641B2 (en) * 2011-09-15 2015-01-28 NTT Docomo Inc Terminal device, voice recognition program, voice recognition method, and voice recognition system
US20130085753A1 (en) * 2011-09-30 2013-04-04 Google Inc. Hybrid client/server speech recognition in a mobile device
US9620122B2 (en) * 2011-12-08 2017-04-11 Lenovo (Singapore) Pte. Ltd Hybrid speech recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2015041892A1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190179610A1 (en) * 2017-12-12 2019-06-13 Amazon Technologies, Inc. Architecture for a hub configured to control a second device while a connection to a remote system is unavailable
US10713007B2 (en) * 2017-12-12 2020-07-14 Amazon Technologies, Inc. Architecture for a hub configured to control a second device while a connection to a remote system is unavailable

Also Published As

Publication number Publication date
JP2016531375A (en) 2016-10-06
WO2015041892A1 (en) 2015-03-26
EP3047481A4 (en) 2017-03-01
CN105793923A (en) 2016-07-20

Similar Documents

Publication Title
WO2015041892A1 (en) Local and remote speech processing
US11600271B2 (en) Detecting self-generated wake expressions
US9672812B1 (en) Qualifying trigger expressions in speech-based systems
US10354649B2 (en) Altering audio to improve automatic speech recognition
CN108351872B (en) Method and system for responding to user speech
CN107004411B (en) Voice application architecture
EP3084633B1 (en) Attribute-based audio channel arbitration
US10079017B1 (en) Speech-responsive portable speaker
US9734845B1 (en) Mitigating effects of electronic audio sources in expression detection
US9098467B1 (en) Accepting voice commands based on user identity
US9324322B1 (en) Automatic volume attenuation for speech enabled devices
US9293134B1 (en) Source-specific speech interactions
US10297250B1 (en) Asynchronous transfer of audio data
US11004453B2 (en) Avoiding wake word self-triggering
KR20190075800A (en) Intelligent personal assistant interface system
US9224404B2 (en) Dynamic audio processing parameters with automatic speech recognition
CN102591455A (en) Selective transmission of voice data
US20240005918A1 (en) System for recognizing and responding to environmental noises
US10923122B1 (en) Pausing automatic speech recognition
EP2760019B1 (en) Dynamic audio processing parameters with automatic speech recognition

Legal Events

Code Title Description

PUAI Public reference made under Article 153(3) EPC to a published international application that has entered the European phase
Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed
Effective date: 20160302

AK Designated contracting states
Kind code of ref document: A1
Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the European patent
Extension state: BA ME

DAX Request for extension of the European patent (deleted)

A4 Supplementary search report drawn up and despatched
Effective date: 20170127

RIC1 Information provided on IPC code assigned before grant
IPC: G10L 15/22 20060101 ALI 20170123 BHEP
IPC: G10L 15/08 20060101 ALI 20170123 BHEP
IPC: G10L 15/00 20130101 ALI 20170123 BHEP
IPC: G10L 15/32 20130101 ALI 20170123 BHEP
IPC: G10L 15/30 20130101 AFI 20170123 BHEP

17Q First examination report despatched
Effective date: 20180719

STAA Information on the status of an EP patent application or granted EP patent
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn
Effective date: 20181130