WO2015041892A1 - Local and remote speech processing

Local and remote speech processing

Info

Publication number
WO2015041892A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
command
expression
function
audio
Prior art date
2013-09-20
Application number
PCT/US2014/054700
Other languages
French (fr)
Inventor
Nikko Strom
Peter Spalding VANLUND
Bjorn HOFFMEISTER
Original Assignee
Rawles LLC
Priority date
2013-09-20
Filing date
Publication date
Application filed by Rawles LLC
Priority to EP14846698.0A (published as EP3047481A4)
Priority to JP2016543926A (published as JP2016531375A)
Priority to CN201480050711.8A (published as CN105793923A)
Publication of WO2015041892A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 2015/088 Word spotting
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • The speech command service 108 may also comprise a natural language understanding (NLU) component 134 that determines user intent based on recognized speech.
  • The speech command service 108 may also comprise a command interpreter and action dispatcher 136 (referred to below simply as a command interpreter 136) that determines functions or commands corresponding to user intents.
  • Such commands may correspond to functions that are to be performed at least in part by the audio device 102, and the command interpreter 136 may in those cases provide responses to the audio device 102 indicating commands for implementing such functions.
  • Examples of commands or functions that may be performed by the audio device in response to directives from the command interpreter 136 may include playing music or other media, increasing/decreasing the volume of the speaker 112, generating audible speech through the speaker 112, initiating certain types of communications with users of similar devices, and so forth.
  • The speech command service 108 may also perform functions, in response to speech recognized from received audio, that involve entities or devices that are not shown in FIG. 1.
  • For example, the speech command service 108 may interact with other network-based services to obtain information or services on behalf of the user 106.
  • The speech command service 108 may itself have various elements and functionality that may be responsive to speech uttered by the user 106.
  • In operation, the microphone 110 of the audio device 102 captures or receives audio containing speech of the user 106.
  • The audio is processed by the audio processing components 120 and the processed audio is received by the speech recognition components 122.
  • The speech recognition components 122 analyze the audio to detect occurrences of a trigger expression in the speech contained in the audio.
  • Upon detection of the trigger expression, the controller 124 begins sending or streaming received audio to the speech command service 108 along with a request for the speech command service 108 to recognize and interpret the user speech, and to initiate a function corresponding to any interpreted intent.
  • Concurrently with sending the audio to the speech command service 108, the speech recognition components 122 continue to analyze the received audio to detect an occurrence of a local command expression in the user speech.
  • Upon detection of a local command expression, the controller 124 initiates or performs a device function that corresponds to the local command expression. For example, in response to the local command expression "stop", the controller 124 may initiate a function that stops media playback.
  • The controller 124 may interact with one or more of the functional components 118 when initiating or performing the function.
  • In response to receiving the audio, the speech command service 108 concurrently analyzes the audio to recognize speech, to determine a user intent, and to determine a service-identified function that is to be implemented in response to the user intent.
  • Upon detecting the local command expression, the audio device 102 may take actions to cancel, nullify, or invalidate any service-identified functions that may eventually be initiated by the speech command service 108.
  • For example, the audio device 102 may cancel its previous request by sending a cancellation message to the speech command service 108 and/or by stopping the streaming of the audio to the speech command service 108.
  • Alternatively, the audio device may ignore or discard any responses or service-specified commands that are received from the speech command service 108 in response to the earlier request.
  • As another alternative, the audio device may inform the speech command service 108 of actions that have been performed locally in response to the local command expression, and the speech command service 108 may modify its subsequent behavior based on this information. For example, the speech command service 108 may forego actions that it might otherwise have performed in response to recognized speech in the received audio.
  • FIG. 2 illustrates an example method 200 that may be performed by the audio device 102 in conjunction with the speech command service 108 in order to recognize and respond to user speech.
  • The method 200 will be described in the context of the system 100 of FIG. 1, although the method 200 may also be performed in other environments and may be implemented in different ways.
  • Actions on the left side of FIG. 2 are performed at or by the local audio device 102. Actions on the right side of FIG. 2 are performed at or by the remote speech command service 108.
  • An action 202 comprises receiving an audio signal that has been captured by or in conjunction with the microphone 110.
  • The audio signal contains or represents audio from the environment 104, and may contain user speech.
  • The audio signal may be an analog electrical signal or may comprise a digital signal such as a digital audio stream.
  • An action 204 comprises detecting an occurrence of a trigger expression in the received audio and/or in the user speech. This may be performed by the speech recognition components 122 as described above, which may in some embodiments comprise keyword spotters. If the trigger expression is not detected, the action 204 is repeated in order to continuously monitor for occurrences of the trigger expression. The remaining actions shown in FIG. 2 are performed in response to detecting the trigger expression.
  • If the trigger expression is detected in the action 204, an action 206 is performed, comprising sending subsequently received audio to the speech command service 108 along with a service request 208 for the speech command service 108 to recognize speech in the audio and to implement a function corresponding to the recognized speech. Functions initiated by the speech command service 108 in this manner are referred to herein as service-identified functions, and may in certain cases be performed in conjunction with the audio device 102. For example, a function may be initiated by sending a command to the audio device 102.
  • The sending 206 may comprise streaming or otherwise transmitting a digital audio stream 210 to the speech command service 108, representing or containing audio that is received from the microphone 110 subsequent to detection of the trigger expression.
  • The action 206 may comprise opening or initiating a communication session between the audio device 102 and the speech command service 108.
  • The request 208 may be used to establish a communication session with the speech command service 108 for the purpose of recognizing speech, understanding intent, and determining actions or functions to be performed in response to user speech.
  • The request 208 may be followed or accompanied by the streamed audio 210.
  • The audio stream 210 provided to the speech command service 108 may include portions of received audio beginning at a time just prior to utterance of the trigger expression, as illustrated by the sketch below.
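One way a device might retain audio from just before the trigger utterance is a rolling pre-roll buffer of recent frames. The following minimal Python sketch illustrates the idea; the frame duration, buffer depth, and class name are illustrative assumptions rather than details taken from this disclosure.

```python
from collections import deque

FRAME_MS = 20        # assumed frame duration
PREROLL_FRAMES = 25  # assumed depth: roughly 500 ms retained before the trigger

class PrerollBuffer:
    """Keeps the most recent audio frames so streaming to the remote
    service can begin slightly before the detected trigger expression."""

    def __init__(self, max_frames: int = PREROLL_FRAMES) -> None:
        self._frames = deque(maxlen=max_frames)  # oldest frames fall off

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)

    def drain(self) -> list:
        """Return the buffered frames (oldest first) and clear them."""
        frames = list(self._frames)
        self._frames.clear()
        return frames

# On trigger detection the device would first send buffer.drain() to the
# service, then continue forwarding live frames as the audio stream 210.
```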
  • The communication session may be associated with a communication or session identifier (ID) that identifies the communication session established between the audio device 102 and the speech command service 108.
  • The session ID may be used or included in future communications relating to a particular user utterance or audio stream.
  • The session ID may be generated by the audio device 102 and provided in the request 208 to the speech command service 108.
  • Alternatively, the session ID may be generated by the speech command service 108 and provided in acknowledgment of the request 208.
  • The term "request(ID)" is used herein to indicate a request having a particular session ID.
  • Similarly, a response from the speech command service 108 relating to the same session, request, or audio stream may be indicated by the term "response(ID)".
  • Each communication session and corresponding session ID may correspond to a single user utterance.
  • The audio device 102 may establish a session upon detecting the trigger expression. The audio device 102 may then continue to stream audio to the speech command service 108 as part of the same session until the end of the user utterance.
  • The speech command service 108 may provide responses to the audio device 102 through the session, using the same session ID. Responses may in some cases indicate commands to be executed by the audio device 102 in response to speech recognized by the speech command service 108 in the received audio 210.
  • The communication session may remain open until the audio device 102 receives a response from the speech command service 108 or until the audio device 102 cancels the request. A minimal sketch of this session bookkeeping follows.
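The following Python sketch shows the request(ID)/response(ID) matching described above. The class and method names are hypothetical; only the one-session-per-utterance and ID-matching behavior come from the text.

```python
import uuid

class SpeechSession:
    """One communication session, corresponding to a single user
    utterance, between the audio device and the speech command service."""

    def __init__(self) -> None:
        self.id = str(uuid.uuid4())  # session ID carried by request(ID)
        self.cancelled = False

    def accepts(self, response_id: str) -> bool:
        """A response(ID) applies only if its ID matches this session
        and the session has not been cancelled locally."""
        return response_id == self.id and not self.cancelled

# Usage: the device creates a session when the trigger expression is
# detected, sends request(session.id) along with the audio stream, and
# discards any response whose ID fails session.accepts(...).
session = SpeechSession()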
  • The speech command service 108 receives the request 208 and audio stream 210 in an action 212.
  • The speech command service 108 performs an action 214 of recognizing speech in the received audio and determining a user intent as expressed by the recognized speech, using the speech recognition and natural language understanding components 132 and 134 of the speech command service 108.
  • An action 216, performed by the command interpreter 136, comprises identifying and initiating a service-identified function in fulfillment of the determined user intent.
  • The service-identified function may in some cases be performed by the speech command service 108, independently of the audio device 102. In other cases, the speech command service 108 may identify a function that is to be performed by the audio device 102, and may send a corresponding command to the audio device 102 for execution by the audio device 102.
  • Concurrently with the actions being performed by the speech command service 108, the local audio device 102 performs further actions to determine whether the user has uttered a local command expression and to perform a corresponding local function in response to any such uttered local command expression.
  • An action 218, performed in response to detecting the trigger expression in the action 204, comprises analyzing audio received in the action 202 to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. This may be performed by the speech recognition components 122 of the audio device 102 as described above, which may in some embodiments comprise keyword spotters.
  • In response to detecting a local command expression in the action 218, an action 220 is performed of immediately initiating a device function that has been associated with the local command expression.
  • For example, the local command expression "stop" might be associated with a function that stops media playback.
  • In addition, the audio device 102 performs an action 222 of stopping or cancelling the request 208 to the speech command service 108. This may include cancelling or nullifying implementation of the service-identified function that may have otherwise been implemented by the speech command service 108 in response to the received request 208 and accompanying audio 210.
  • The action 222 may comprise sending an explicit notification or command to the speech command service 108, requesting that the speech command service 108 cancel any further recognition activities with respect to the service request 208, and/or to cancel implementation of any service-identified functions that may otherwise have been initiated in response to recognized speech.
  • Alternatively, the audio device 102 may simply notify the speech command service 108 regarding any functions that have been performed locally in response to local recognition of the local command expression, and the speech command service 108 may respond by cancelling the service request 208 or by performing other actions as may be appropriate.
  • In some cases, the speech command service 108 may implement the service-identified function by identifying a command to be executed by the audio device 102. In response to receiving a notification that the service request 208 is to be cancelled, the speech command service 108 may forego sending the command to the audio device 102. Alternatively, the speech command service may be allowed to complete its processing and to send a command to the audio device 102, whereupon the audio device 102 may ignore the command or forego execution of the command.
  • In some embodiments, the speech command service may be configured to notify the audio device 102 before initiating a service-identified function, and may delay implementation of the service-identified function until receiving permission from the audio device 102.
  • The audio device 102 may be configured to deny such permission when the local command expression has been recognized locally.
  • The actions of the speech command service 108 shown in FIG. 2 are performed in parallel and asynchronously with the actions 218, 220, and 222 of the audio device 102. It is assumed in some implementations that the audio device 102 is able to detect and act upon the local command expression relatively quickly, so that it may perform the action 222 of cancelling the request 208 and subsequent processing by the speech command service 108 before the service-identified function of the action 216 has been implemented or executed. The sketch below illustrates this race between local detection and remote processing.
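This parallel behavior can be pictured as a race between local keyword spotting and the remote request. The asyncio sketch below is only an illustration under assumed interfaces; `service`, `spotter`, and `device` are hypothetical objects, not components named in the disclosure.

```python
import asyncio

async def handle_utterance(audio, service, spotter, device):
    """Stream audio to the remote service while concurrently spotting
    local command expressions; whichever path finishes first wins."""
    remote = asyncio.create_task(service.recognize(audio))            # actions 212-216
    local = asyncio.create_task(spotter.detect_local_command(audio))  # action 218

    done, _ = await asyncio.wait({remote, local},
                                 return_when=asyncio.FIRST_COMPLETED)

    if local in done and local.result() is not None:
        device.execute_local(local.result())  # action 220: act immediately
        remote.cancel()                       # action 222: cancel the request
        await service.cancel_request()        # explicit cancellation message
    else:
        command = await remote                # service-identified function
        local.cancel()
        device.execute(command)
```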
  • FIG. 3 illustrates an example method 300 in which the speech command service 108 returns commands to the audio device 102, and in which the audio device 102 is configured to ignore the commands or forego execution of the commands in situations in which a local command expression has already been detected and acted upon by the audio device 102.
  • Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
  • An action 302 comprises receiving an audio signal containing user speech.
  • An action 304 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 3 are performed in response to detecting the trigger expression.
  • An action 306 comprises sending a request 308 and audio 310 to the speech command service 108.
  • An action 312 comprises receiving the request 308 and the audio 310 at the speech command service 108.
  • An action 314 comprises recognizing user speech and determining user intent based on the recognized user speech.
  • The speech command service 108 then performs an action 316 of sending a command 318 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent.
  • For example, the command may comprise a "stop" command, indicating that the audio device 102 is to stop playback of music.
  • An action 320, performed by the audio device 102, comprises receiving and executing the command.
  • The action 320 is shown in a dashed box to indicate that it is performed conditionally, based on whether a local command expression has been detected and acted upon by the audio device 102. Specifically, the action 320 is not performed if a local command expression has been detected by the audio device 102.
  • Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 322 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 324 is performed of immediately initiating a local device function that has been associated with the local command expression.
  • Having detected and acted upon the local command expression, the audio device 102 performs an action 326 of foregoing execution of the received command 318. More specifically, any commands received from the speech command service 108 in response to the request 308 are discarded or ignored. Responses and commands corresponding to the request 308 may be identified by session IDs associated with the responses.
  • If the local command expression is not detected in the action 322, the audio device performs the action 320 of executing the command 318 received from the speech command service 108. A sketch of this conditional execution follows.
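A minimal sketch of the device-side bookkeeping for actions 320-326, assuming responses carry the session ID of the originating request; the function names here are illustrative, not taken from the disclosure.

```python
handled_locally = set()  # session IDs already acted upon by the device

def on_local_command(session_id: str, device_function) -> None:
    """Actions 324/326: run the local function immediately and remember
    that any later command for this session must be discarded."""
    device_function()
    handled_locally.add(session_id)

def on_service_command(session_id: str, command) -> None:
    """Action 320: execute the service's command only if no local
    command expression was detected for the same utterance."""
    if session_id in handled_locally:
        return  # action 326: discard the redundant command 318
    command()
```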
  • FIG. 4 shows an example method 400 in which the audio device 102 is configured to actively cancel requests to the speech command service 108 after locally detecting a local command expression.
  • Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
  • An action 402 comprises receiving an audio signal containing user speech.
  • An action 404 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 4 are performed in response to detecting the trigger expression.
  • An action 406 comprises sending a request 408 and audio 410 to the speech command service 108.
  • An action 412 comprises receiving the request 408 and the audio 410 at the speech command service 108.
  • An action 414 comprises recognizing user speech and determining user intent based on the recognized user speech.
  • An action 416 comprises determining whether the request 408 has been cancelled by the audio device 102.
  • The audio device 102 may send a cancellation message or may terminate the current communication session in order to cancel the request. If the request has been cancelled by the audio device 102, no further action is taken by the speech command service. If the request has not been cancelled, an action 418 is performed, which comprises sending a command 420 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent. A minimal sketch of this service-side check follows.
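On the service side, the check in action 416 can be as simple as consulting a set of cancelled request IDs before dispatching a command. This Python sketch assumes a `send` callable and illustrative function names; it is not the patented implementation itself.

```python
cancelled_requests = set()  # request IDs cancelled by the audio device

def on_cancellation(request_id: str) -> None:
    """Record a cancellation message (or terminated session) received
    from the device, as in action 428."""
    cancelled_requests.add(request_id)

def dispatch_command(request_id: str, command, send) -> None:
    """Actions 416/418: send the command 420 only if the originating
    request is still live; otherwise take no further action."""
    if request_id in cancelled_requests:
        return
    send(command)
```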
  • An action 422, performed by the audio device 102, comprises receiving and executing the command.
  • The action 422 is shown in a dashed box to indicate that it is performed conditionally, depending on whether a command has been sent and received from the speech command service 108, which in turn depends on whether the audio device 102 has cancelled the request 408.
  • Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 424 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 426 is performed of immediately initiating a local device function that has been associated with the local command expression.
  • In addition, the audio device 102 performs an action 428 of requesting the speech command service 108 to cancel the request 408 and/or to cancel implementation of any service-identified functions that may have otherwise been performed in response to recognized speech in the audio received by the speech command service 108 from the audio device 102.
  • This may comprise communicating with the speech command service 108, such as by sending a cancellation notification or request.
  • In some cases, the cancellation may comprise replying to a communication or notification from the speech command service 108 of a pending implementation of a service-identified function by the speech command service.
  • For example, the audio device 102 may reply and may request cancellation of the pending implementation.
  • Alternatively, the audio device 102 may cancel the implementation of any function that might have otherwise been performed in response to detecting the local command expression, and may instruct the speech command service 108 to proceed with implementation of the pending function.
  • If the local command expression is not detected in the action 424, the audio device 102 performs the action 422 of executing the command 420 received from the speech command service 108.
  • The action 422 may occur asynchronously, upon receiving the command 420 from the speech command service.
  • The embodiments described above may be implemented programmatically, such as with computers, processors, digital signal processors, analog processors, and so forth. In other embodiments, however, one or more of the components, functions, or elements may be implemented using specialized or dedicated circuits, including analog circuits and/or digital logic circuits.
  • The term "component", as used herein, is intended to include any hardware, software, logic, or combinations of the foregoing that are used to implement the functionality attributed to the component.
  • One or more non-transitory computer-readable media storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
  • A method comprising:
  • Cancelling implementation of the first function comprises requesting the speech command service to cancel implementation of the first function.
  • Cancelling implementation of the first function comprises requesting the speech command service to cancel the pending implementation of the first function.
  • Cancelling implementation of the first function comprises terminating the communication session.
  • Cancelling implementation of the first function comprises forgoing execution of the command.
  • A system comprising:
  • one or more speech recognition components configured to recognize user speech in received audio, to detect a trigger expression in the user speech, and to detect a local command expression in the user speech;
  • control logic configured to perform acts in response to detection by the one or more speech recognition components of the trigger expression in the user speech, the acts comprising: sending the audio to a speech command service to recognize speech in the audio and to implement a first function corresponding to the recognized speech;
  • Cancelling implementation of the at least one of the first and second functions comprises requesting the speech command service to cancel implementation of the first function.
  • Cancelling implementation of the at least one of the first and second functions comprises ignoring a command received from the speech command service.
  • Cancelling implementation of the at least one of the first and second functions comprises informing the speech command service that the second function has been initiated.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A user device may be configured to detect a user-uttered trigger expression and to respond by interpreting subsequent words or phrases as commands. The commands may be recognized by sending audio containing the words or phrases to a remote service that is configured to perform speech recognition. Certain commands may be designated as local commands and may be detected locally rather than relying on the remote service. Upon detection of the trigger expression, audio is streamed to the remote service and also analyzed locally to detect utterances of local commands. Upon detecting a local command, a corresponding function is immediately initiated, and subsequent activities or responses by the remote service are canceled or ignored.

Description

LOCAL AND REMOTE SPEECH PROCESSING
RELATED APPLICATIONS
[0001] The present application claims priority to US Patent Application No. 14/033,302 filed on September 20, 2013, entitled "Local and Remote Speech Processing", which is incorporated by reference herein in its entirety.
BACKGROUND
[0002] Homes, offices, automobiles, and public spaces are becoming more wired and connected with the proliferation of computing devices such as notebook computers, tablets, entertainment systems, and portable communication devices. As computing devices evolve, the ways in which users interact with these devices continue to evolve. For example, people can interact with computing devices through mechanical devices (e.g., keyboards, mice, etc.), electrical devices (e.g., touch screens, touch pads, etc.), and optical devices (e.g., motion detectors, camera, etc.). Another way to interact with computing devices is through audio devices that capture and respond to human speech.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical components or features.
[0004] FIG. 1 is a block diagram of an illustrative voice interaction computing architecture that includes a local audio device and a remote speech processing service.
[0005] FIG. 2-4 are flow diagrams illustrating example processes for detecting command expressions that may be performed by a local audio device in conjunction with a remote speech processing service.
DETAILED DESCRIPTION
[0006] This disclosure pertains generally to a speech interface system that provides or facilitates speech-based interactions with a user. The system includes a local device having a microphone that captures audio containing user speech. Spoken user commands may be prefaced by a keyword, referred to as a trigger expression or wake expression. Audio following a trigger expression may be streamed to a remote service for speech recognition and the service may respond by performing a function or providing a command to be performed by the audio device.
[0007] Communications with the remote service may introduce response latency, which in most cases can be minimized within acceptable limits. Some spoken commands, however, may call for less latency. As an example, spoken commands related to certain types of media rendering, such as "stop", "pause", "hang up", and so forth may need to be performed with less perceptible amounts of latency.
[0008] In accordance with various embodiments, certain command expressions, referred to herein as local commands or local command expressions, are detected by or at the local device rather than by the remote service. More specifically, the local device is configured to detect a trigger or alert expression, which indicates that subsequent speech is intended by the user to form a command. Upon detecting the trigger expression, the local device initiates a communication session with the remote service and begins streaming received audio to the service. In response, the remote service performs speech recognition on the received audio and attempts to identify user intent based on the recognized speech. In response to a recognized user intent, the remote service may perform a corresponding function. In some cases, the function may be performed in conjunction with the local device. For example, the remote service may send a command to the local device indicating that the local device should execute the command to perform a corresponding function.
[0009] Concurrently with the activities of the remote service, the local device monitors or analyzes the audio to detect an occurrence of a local command expression following the trigger expression. Upon detecting a local command expression in the audio, the local device immediately implements a corresponding function. In addition, further actions by the remote service are stopped or cancelled to avoid duplicate actions with respect to a single user utterance. Actions by the remote service may be stopped by explicitly notifying the remote service that the utterance has been acted upon locally, by terminating or cancelling a communications session, and/or by foregoing execution of any commands that are specified by the remote service in response to remote recognition of user speech.
[0010] FIG. 1 shows an example of a voice interaction system 100. The system 100 may include or may utilize a local voice-based audio device 102, which may be located within an environment 104 such as a home, and which may be used for interacting with a user 106. The voice interaction system 100 may also include or utilize a remote, network-based speech command service 108 that is configured to receive audio, to recognize speech in the audio, and to perform a function, referred to herein as a service-identified function, in response to the recognized speech. The service-identified function may be implemented by the speech command service 108 independently of the audio device, and/or may be implemented by providing a command to the audio device 102 for local execution.
[0011] In certain embodiments, the primary mode of user interaction with the audio device 102 may be through speech. For example, the audio device 102 may receive spoken command expressions from the user 106 and may provide services in response to the commands. The user may speak a predefined wake or trigger expression (e.g., "Awake"), which may be followed by commands or instructions (e.g., "I'd like to go to a movie. Please tell me what's playing at the local cinema."). Provided services may include performing actions or activities, rendering media, obtaining and/or providing information, providing information via generated or synthesized speech via the audio device 102, initiating Internet-based services on behalf of the user 106, and so forth.
[0012] The local audio device 102 and the speech command service 108 are configured to act in conjunction with each other to receive and respond to command expressions from the user 106. The command expressions may include local command expressions that are detected and acted upon by the local device 102 independently of the speech command service 108. The command expressions may also include commands that are interpreted and acted upon by or in conjunction with the remote speech command service 108.
[0013] The audio device 102 may have one or more microphones 110 and one or more audio speakers or transducers 112 to facilitate audio interactions with the user 106. The microphone 110 produces a microphone signal, also referred to as an input audio signal, representing audio from the environment 104, including sounds or expressions uttered by the user 106.
[0014] In some cases, the microphone 110 may comprise a microphone array that is used in conjunction with audio beamforming techniques to produce an input audio signal that is focused in a selectable direction. Similarly, a plurality of directional microphones 110 may be used to produce an audio signal corresponding to one of multiple available directions.
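As one illustration of the beamforming mentioned above, a basic delay-and-sum beamformer time-shifts each microphone channel toward a chosen direction and averages the result. This numpy sketch is a simplification (integer-sample delays, wrap-around via np.roll) and is not drawn from the disclosure itself.

```python
import numpy as np

def delay_and_sum(mic_signals: np.ndarray, delays_samples: np.ndarray) -> np.ndarray:
    """Focus a microphone array in one direction.

    mic_signals: (n_mics, n_samples) array of captured audio.
    delays_samples: per-microphone steering delays, in samples.
    """
    n_mics, _ = mic_signals.shape
    aligned = [np.roll(sig, -int(d))  # advance each channel by its delay
               for sig, d in zip(mic_signals, delays_samples)]
    # Averaging reinforces sound arriving from the steered direction
    # while partially cancelling sound from other directions.
    return np.sum(aligned, axis=0) / n_mics
```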
[0015] The audio device 102 includes operational logic, which in many cases may comprise a processor 114 and memory 116. The processor 114 may include multiple processors and/or a processor having multiple cores. The processor 114 may also comprise or include a digital signal processor for processing audio signals.
[0016] The memory 116 may contain applications and programs in the form of computer-executable instructions that are executed by the processor 114 to perform acts or actions that implement desired functionality of the audio device 102, including the functionality specifically described below. The memory 116 may be a type of computer-readable storage media and may include volatile and nonvolatile memory. Thus, the memory 116 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology.
[0017] The audio device 102 may include a plurality of applications, services, and/or functions 118, referred to collectively below as functional components 118, which are executable by the processor 114 to provide services and functionality. The applications and other functional components 118 may include media playback services such as music players. Other services or operations performed or provided by the applications and other functional components 118 may include, as examples, requesting and consuming entertainment (e.g., gaming, finding and playing music, movies or other content, etc.), personal management (e.g., calendaring, note taking, etc.), online shopping, financial transactions, database inquiries, person-to-person voice communications, and so forth.
[0018] In some embodiments, the functional components 118 may be pre-installed on the audio device 102, and may implement core functionality of the audio device 102. In other embodiments, one or more of the applications or other functional components 118 may be installed by the user 106 or otherwise installed after the audio device 102 has been initialized by the user 106, and may implement additional or customized functionality as desired by the user 106.
[0019] The processor 114 may be configured by audio processing functionality or components 120 to process input audio signals generated by the microphone 110 and/or output audio signals provided to the speaker 112. As an example, the audio processing components 120 may implement acoustic echo cancellation to reduce audio echo generated by acoustic coupling between the microphone 110 and the speaker 112. The audio processing components 120 may also implement noise reduction to reduce noise in received audio signals, such as elements of input audio signals other than user speech. In certain embodiments, the audio processing components 120 may include one or more audio beamformers that are responsive to multiple microphones 110 to generate an audio signal that is focused in a direction from which user speech has been detected.
[0020] The audio device 102 may also be configured to implement one or more expression detectors or speech recognition components 122, which may be used to detect a trigger expression in speech captured by the microphone 110. The term "trigger expression" is used herein to indicate a word, phrase, or other utterance that is used to signal the audio device 102 that subsequent user speech is intended by the user to be interpreted as a command.
[0021] The one or more speech recognition components 122 may also be used to detect commands or command expressions in the speech captured by the microphone 110. The term "command expression" is used herein to indicate a word, phrase, or other utterance that corresponds to or is associated with a function that is to be performed by the audio device 102 or by a service or other device that is accessible to the audio device 102, such as the speech command service 108. For example, the words "stop", "pause", and "hang-up" may be used as command expressions. The "stop" and "pause" command expressions may indicate that media playback activities should be interrupted. The "hang-up" command expression may indicate that a current person-to-person communication should be terminated. Other command expressions, corresponding to different functions, may also be used. Command expressions may comprise conversation-style directives, such as "Find a nearby Italian restaurant."
[0022] Command expressions may include local command expressions that are to be interpreted by the audio device 102 without relying on the speech command service 108. Generally, local command expressions are relatively short expressions such as single words or short phrases, which can be easily detected by the audio device 102. Local command expressions may correspond to device functions for which relatively low response latencies are desired, such as media control or media playback control functions. The services of the speech command service 108 may be utilized for other command expressions for which greater response latencies are acceptable. Command expressions that are to be acted upon by the speech command service will be referred to herein as remote command expressions.
[0023] In some cases, the speech recognition components 122 may be implemented using automated speech recognition (ASR) techniques. For example, large vocabulary speech recognition techniques may be used for keyword detection, and the output of the speech recognition may be monitored for occurrences of the keyword. As an example, the speech recognition may use hidden Markov models and Gaussian mixture models to recognize voice input and to provide a continuous word stream corresponding to the voice input. The word stream may then be monitored to detect one or more specified words or expressions.
[0024] Alternatively, the speech recognition components 122 may be implemented by one or more keyword spotters. A keyword spotter is a functional component or algorithm that evaluates an audio signal to detect the presence of one or more predefined words or expressions in the audio signal. Generally, a keyword spotter uses simplified ASR techniques to detect a specific word or a limited number of words rather than attempting to recognize a large vocabulary. For example, a keyword spotter may provide a notification when a specified word is detected in a voice signal, rather than providing a textual or word-based output. A keyword spotter using these techniques may compare different words based on hidden Markov models (HMMs), which represent words as series of states. Generally, an utterance is analyzed by comparing its model to a keyword model and to a background model. Comparing the model of the utterance with the keyword model yields a score that represents the likelihood that the utterance corresponds to the keyword. Comparing the model of the utterance with the background model yields a score that represents the likelihood that the utterance corresponds to a generic word other than the keyword. The two scores can be compared to determine whether the keyword was uttered.
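The score comparison described above amounts to a log-likelihood-ratio test. Below is a minimal sketch, assuming keyword and background models that expose a `log_likelihood` method; that interface and the threshold are invented for illustration.

```python
def keyword_detected(features, keyword_model, background_model,
                     threshold: float = 0.0) -> bool:
    """Compare an utterance against the keyword model and the background
    model; report a detection when the keyword score wins by a margin."""
    kw_score = keyword_model.log_likelihood(features)    # fit to the keyword
    bg_score = background_model.log_likelihood(features) # fit to a generic word
    return (kw_score - bg_score) > threshold
```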
[0025] The audio device 102 may further comprise control functionality 124, referred to herein as a controller or control logic, that is configured to interact with the other components of the audio device 102 in order to implement the logical functionality of the audio device 102.
[0026] The control logic 124, the audio processing components 120, the speech recognition components 122, and the functional components 118 may comprise executable instructions, programs, and/or program modules that are stored in the memory 116 and executed by the processor 114.
[0027] The speech command service 108 may in some instances be part of a network-accessible computing platform that is maintained and accessible via a network 126 such as the Internet. Network-accessible computing platforms such as this may be referred to using terms such as "on-demand computing", "software as a service (SaaS)", "platform computing", "network-accessible platform", "cloud services", "data centers", and so forth.
[0028] The audio device 102 and/or the speech command service 108 may communicatively couple to the network 126 via wired technologies (e.g., wires, universal serial bus (USB), fiber optic cable, etc.), wireless technologies (e.g., radio frequencies (RF), cellular, mobile telephone networks, satellite, Bluetooth, etc.), or other connection technologies. The network 126 is representative of any type of communication network, including data and/or voice networks, and may be implemented using wired infrastructure (e.g., coaxial cable, fiber optic cable, etc.), wireless infrastructure (e.g., RF, cellular, microwave, satellite, Bluetooth®, etc.), and/or other connection technologies.
[0029] Although the audio device 102 is described herein as a voice-controlled or speech-based interface device, the techniques described herein may be implemented in conjunction with various different types of devices, such as telecommunications devices and components, hands-free devices, entertainment devices, media playback devices, and so forth.
[0030] The speech command service 108 generally provides functionality for receiving an audio stream from the audio device 102, recognizing speech in the audio stream, determining user intent from the recognized speech, and performing an action or service in response to the user intent. The provided action may in some cases be performed in conjunction with the audio device 102, and in these cases the speech command service 108 may return a response to the audio device 102 indicating a command that is to be executed by the audio device 102.
[0031] The speech command service 108 includes operational logic, which in many cases may comprise one or more servers, computers, and/or processors 128. The speech command service 108 may also have memory 130 containing applications and programs in the form of instructions that are executed by the processor 128 to perform acts or actions that implement desired functionality of the speech command service, including the functionality specifically described herein. The memory 130 may be a type of computer storage media and may include volatile and nonvolatile memory. Thus, the memory 130 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology.
[0032] Among other logical and physical components not specifically shown, the speech command service 108 may comprise speech recognition components 132. The speech recognition components 132 may include automatic speech recognition (ASR) functionality that recognizes human speech in an audio signal.
[0033] The speech command service 108 may also comprise a natural language understanding (NLU) component 134 that determines user intent based on recognized speech.
[0034] The speech command service 108 may also comprise a command interpreter and action dispatcher 136 (referred to below simply as a command interpreter 136) that determines functions or commands corresponding to user intents. In some cases, commands may correspond to functions that are to be performed at least in part by the audio device 102, and the command interpreter 136 may in those cases provide responses to the audio device 102 indicating commands for implementing such functions. Examples of commands or functions that may be performed by the audio device in response to directives from the command interpreter 136 may include playing music or other media, increasing/decreasing the volume of the speaker 112, generating audible speech through the speaker 112, initiating certain types of communications with users of similar devices, and so forth.
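Purely as an illustration, a command interpreter of this kind might map NLU-determined intents to device commands roughly as in the sketch below; the intent names and command payloads are invented for the example.

```python
# Hypothetical sketch of a command interpreter's intent-to-command mapping.
# Intent names and command payloads are illustrative inventions, not values
# taken from the disclosure.
INTENT_TO_COMMAND = {
    "media.play": {"command": "play"},
    "media.volume_up": {"command": "volume", "delta": +1},
    "tts.speak": {"command": "speak"},
}

def interpret(intent):
    """Return the device command for an intent, or None if the intent is
    handled entirely by the service without involving the audio device."""
    return INTENT_TO_COMMAND.get(intent)
```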
[0035] Note that the speech command service 108 may also perform functions, in response to speech recognized from received audio, that involve entities or devices that are not shown in FIG. 1. For example, the speech command service 108 may interact with other network-based services to obtain information or services on behalf of the user 106. Furthermore, the speech command service 108 may itself have various elements and functionality that may be responsive to speech uttered by the user 106.
[0036] In operation, the microphone 110 of the audio device 102 captures or receives audio containing speech of the user 106. The audio is processed by the audio processing components 120 and the processed audio is received by the speech recognition components 122. The speech recognition components 122 analyze the audio to detect occurrences of a trigger expression in the speech contained in the audio. Upon detection of the trigger expression, the controller 124 begins sending or streaming received audio to the speech command service 108 along with a request for the speech command service 108 to recognize and interpret the user speech, and to initiate a function corresponding to any interpreted intent.
[0037] Concurrently with sending the audio to the speech command service 108, the speech recognition components 122 continue to analyze the received audio to detect an occurrence of a local command expression in the user speech. Upon detection of a local command expression, the controller 124 initiates or performs a device function that corresponds to the local command expression. For example, in response to the local command expression "stop", the controller 124 may initiate a function that stops media playback. The controller 124 may interact with one or more of the functional components 1 18 when initiating or performing the function.
[0038] Meanwhile, the speech command service 108, in response to receiving the audio, concurrently analyzes the audio to recognize speech, to determine a user intent, and to determine a service-identified function that is to be implemented in response to the user intent. However, after locally detecting and acting upon the local command expression, the audio device 102 may take actions to cancel, nullify, or invalidate any service-identified functions that may eventually be initiated by the speech command service 108. For example, the audio device 102 may cancel its previous request by sending a cancellation message to the speech command service 108 and/or by stopping the streaming of the audio to the speech command service 108. As another example, the audio device may ignore or discard any responses or service-specified commands that are received from the speech command service 108 in response to the earlier request. In some cases, the audio device may inform the speech command service 108 of actions that have been performed locally in response to the local command expression, and the speech command service 108 may modify its subsequent behavior based on this information. For example, the speech command service 108 may forego actions that it might otherwise have performed in response to recognized speech in the received audio.
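One possible shape for this concurrent detect-and-cancel flow is sketched below. It is a minimal illustration only; every callable passed into the function is a hypothetical placeholder rather than an interface from the disclosure.

```python
import threading
import uuid

# Hedged sketch of the concurrent device-side flow described above: after the
# trigger expression, audio is streamed to the service while a local spotter
# keeps listening; a locally detected command expression triggers the local
# function and cancels the outstanding request.
def on_trigger_detected(audio_source, service,
                        stream_audio, spot_local_command, run_local_function):
    session_id = str(uuid.uuid4())          # identifies this request/stream
    stop_streaming = threading.Event()
    threading.Thread(
        target=stream_audio,
        args=(audio_source, service, session_id, stop_streaming),
    ).start()

    expression = spot_local_command(audio_source)   # None if nothing spotted
    if expression is not None:
        run_local_function(expression)              # e.g. stop media playback
        stop_streaming.set()                        # stop sending audio
        service.cancel(session_id)                  # nullify the pending request
```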
[0039] FIG. 2 illustrates an example method 200 that may be performed by the audio device 102 in conjunction with the speech command service 108 in order to recognize and respond to user speech. The method 200 will be described in the context of the system 100 of FIG. 1, although the method 200 may also be performed in other environments and may be implemented in different ways.
[0040] Actions on the left side of FIG. 2 are performed at or by the local audio device 102. Actions on the right side of FIG. 2 are performed at or by the remote speech command service 108.
[0041] An action 202 comprises receiving an audio signal that has been captured by or in conjunction with the microphone 110. The audio signal contains or represents audio from the environment 104, and may contain user speech. The audio signal may be an analog electrical signal or may comprise a digital signal such as a digital audio stream.
[0042] An action 204 comprises detecting an occurrence of a trigger expression in the received audio and/or in the user speech. This may be performed by the speech recognition components 122 as described above, which may in some embodiments comprise keyword spotters. If the trigger expression is not detected, the action 204 is repeated in order to continuously monitor for occurrences of the trigger expression. The remaining actions shown in FIG. 2 are performed in response to detecting the trigger expression.
[0043] If the trigger expression is detected in the action 204, an action 206 is performed, comprising sending subsequently received audio to the speech command service 108 along with a service request 208 for the speech command service 108 to recognize speech in the audio and to implement a function corresponding to the recognized speech. Functions initiated by the speech command service 108 in this manner are referred to herein as service-identified functions, and may in certain cases be performed in conjunction with the audio device 102. For example, a function may be initiated by sending a command to the audio device 102.
[0044] The sending 206 may comprise streaming or otherwise transmitting a digital audio stream 210 to the speech command service 108, representing or containing audio that is received from the microphone 110 subsequent to detection of the trigger expression. In certain embodiments, the action 206 may comprise opening or initiating a communication session between the audio device 102 and the speech command service 108. In particular, the request 208 may be used to establish a communication session with the speech command service 108 for the purpose of recognizing speech, understanding intent, and determining actions or functions to be performed in response to user speech. The request 208 may be followed or accompanied by the streamed audio 210. In some cases, the audio stream 210 provided to the speech command service 108 may include portions of received audio beginning at a time just prior to utterance of the trigger expression.
[0045] The communication session may be associated with a communication or session identifier (ID) that identifies the communication session established between the audio device 102 and the speech command service 108. The session ID may be used or included in future communications relating to a particular user utterance or audio stream. In some cases, the session ID may be generated by the audio device 102 and provided in the request 208 to the speech command service 108. Alternatively, the session ID may be generated by the speech command service 108 and provided by the speech command service 108 in acknowledgment of the request 208. The term "request(ID)" is used herein to indicate a request having a particular session ID. A response from the speech command service 108 relating to the same session, request, or audio stream may be indicated by the term "response(ID)".
[0046] In certain embodiments, each communication session and corresponding session ID may correspond to a single user utterance. For example, the audio device 102 may establish a session upon detecting the trigger expression. The audio device 102 may then continue to stream audio to the speech command service 108 as part of the same session until the end of the user utterance. The speech command service 108 may provide responses to the audio device 102 through the session, using the same session ID. Responses may in some cases indicate commands to be executed by the audio device 102 in response to speech recognized by the speech command service 108 in the received audio 210. The communication session may remain open until the audio device 102 receives a response from the speech command service 108 or until the audio device 102 cancels the request.
[0047] The speech command service 108 receives the request 208 and audio stream 210 in an action 212. In response, the speech command service 108 performs an action 214 of recognizing speech in the received audio and determining a user intent as expressed by the recognized speech, using the speech recognition and natural language understanding components 132 and 134 of the speech command service 108. An action 216, performed by the command interpreter 136, comprises identifying and initiating a service-identified function in fulfillment of the determined user intent. The service-identified function may in some cases be performed by the speech command service 108, independently of the audio device 102. In other cases, the speech command service 108 may identify a function that is to be performed by the audio device 102, and may send a corresponding command to the audio device 102 for execution by the audio device 102.
[0048] Concurrently with the actions being performed by the speech command service 108, the local audio device 102 performs further actions to determine whether the user has uttered a local command expression and to perform a corresponding local function in response to any such uttered local command expression. Specifically, an action 218, performed in response to detecting the trigger expression in the action 204, comprises analyzing audio received in the action 202 to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. This may be performed by the speech recognition components 122 of the audio device 102 as described above, which may in some embodiments comprise keyword spotters.
[0049] In response to detecting the local command expression in the action 218, an action 220 is performed of immediately initiating a device function that has been associated with the local command expression. For example, the local command expression "stop" might be associated with a function that stops media playback.
[0050] Also in response to detecting the local command expression in the action 218, the audio device 102 performs an action 222 of stopping or cancelling the request 208 to the speech command service 108. This may include cancelling or nullifying implementation of the service-identified function that may have otherwise been implemented by the speech command service 108 in response to the received request 208 and accompanying audio 210.
[0051] In certain implementations, the action 222 may comprise sending an explicit notification or command to the speech command service 108, requesting that the speech command service 108 cancel any further recognition activities with respect to the service request 208, and/or to cancel implementation of any service-identified functions that may otherwise have been initiated in response to recognized speech. Alternatively, the audio device 102 may simply notify the speech command service 108 regarding any functions that have been performed locally in response to local recognition of the local command expression, and the speech command service 108 may respond by cancelling the service request 208 or by performing other actions as may be appropriate.
[0052] In certain implementations, the speech command service 108 may implement the service-identified function by identifying a command to be executed by the audio device 102. In response to receiving a notification that the service request 208 is to be cancelled, the speech command service 108 may forego sending the command to the audio device 102. Alternatively, the speech command service may be allowed to complete its processing and to send a command to the audio device 102, whereupon the audio device 102 may ignore the command or forego execution of the command.
[0053] In some implementations, the speech command service may be configured to notify the audio device 102 before initiating a service-identified function, and may delay implementation of the service-identified function until receiving permission from the audio device 102. In this case, the audio device 102 may be configured to deny such permission when the local command expression has been recognized locally.
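A minimal sketch of this permission handshake, with invented message names, might look like the following; the decision logic mirrors the behavior described in the preceding paragraph.

```python
# Hypothetical sketch of the permission handshake: the service announces a
# pending service-identified function and waits; the device denies permission
# for sessions it has already satisfied locally.
def on_pending_function_notice(session_id, locally_handled, reply):
    """`locally_handled` is a set of session IDs acted on locally; `reply`
    sends the decision back to the speech command service."""
    if session_id in locally_handled:
        reply(session_id, "deny")   # local function already performed
    else:
        reply(session_id, "allow")  # service may proceed with its function
```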
[0054] The various approaches described above may be used in situations calling for different amounts of command latency. For example, waiting for communications from the speech command service may introduce relatively higher latencies, which may not be acceptable in some situations. Such communications prior to implementing a function may safeguard against duplicate or unintended actions. Immediately implementing a locally recognized command expression and either ignoring subsequent commands from the speech command service or subsequently cancelling requests to the speech command service may be more appropriate in situations where lower latencies are desired.
[0055] Note that the actions of the speech command service 108 shown in FIG. 2 are performed in parallel and asynchronously with the actions 218, 220, and 222 of the audio device 102. It is assumed in some implementations that the audio device 102 is able to detect and act upon the local command expression relatively quickly, so that it may perform the action 222 of cancelling the request 208 and subsequent processing by the speech command service 108 before the service-identified function of the action 216 has been implemented or executed.
[0056] FIG. 3 illustrates an example method 300 in which the speech command service 108 returns commands to the audio device 102, and in which the audio device 102 is configured to ignore the commands or forego execution of the commands in situations in which a local command expression has already been detected and acted upon by the audio device 102. Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
[0057] An action 302 comprises receiving an audio signal containing user speech. An action 304 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 3 are performed in response to detecting the trigger expression.
[0058] An action 306 comprises sending a request 308 and audio 310 to the speech command service 108. An action 312 comprises receiving the request 308 and the audio 310 at the speech command service 108. An action 314 comprises recognizing user speech and determining user intent based on the recognized user speech.
[0059] In response to the determined user intent, the speech command service 108 performs an action 316 of sending a command 318 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent. For example, the command may comprise a "stop" command, indicating that the audio device 102 is to stop playback of music.
[0060] An action 320, performed by the audio device 102, comprises receiving and executing the command. The action 320 is shown in a dashed box to indicate that it is performed conditionally, based on whether a local command expression has been detected and acted upon by the audio device 102. Specifically, the action 320 is not performed if a local command expression has been detected by the audio device 102.
[0061] Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 322 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 324 is performed of immediately initiating a local device function that has been associated with the local command expression.
[0062] Also in response to detecting the local command expression in the action 322, the audio device 102 performs an action 326 of foregoing execution of the received command 318. More specifically, any commands received from the speech command service 108 in response to the request 308 are discarded or ignored. Responses and commands corresponding to the request 308 may be identified by session IDs associated with the responses.
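The session-ID-based filtering described here might be sketched as below, assuming a hypothetical response object that carries the session ID of the request it answers; this is an illustration, not the disclosed implementation.

```python
# Minimal sketch of the filtering in FIG. 3: responses are matched to requests
# by session ID, and responses for sessions already handled locally are
# discarded rather than executed.
cancelled_sessions = set()

def on_service_response(response, device):
    if response["session_id"] in cancelled_sessions:
        return                           # forego execution of the command
    device.execute(response["command"])  # otherwise execute as usual
```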
[0063] If the local command expression is not detected in the action 322, the audio device performs the action 320 of executing the command 318 received from the speech command service 108.
[0064] FIG. 4 shows an example method 400 in which the audio device 102 is configured to actively cancel requests to the speech command service 108 after locally detecting a local command expression. Initial actions are similar or identical to those described above. Actions performed by the audio device 102 are shown on the left and actions performed by the speech command service 108 are shown on the right.
[0065] An action 402 comprises receiving an audio signal containing user speech. An action 404 comprises analyzing the audio signal to detect a trigger expression in the user speech. Subsequent actions shown in FIG. 4 are performed in response to detecting the trigger expression.
[0066] An action 406 comprises sending a request 408 and audio 410 to the speech command service 108. An action 412 comprises receiving the request 408 and the audio 410 at the speech command service 108. An action 414 comprises recognizing user speech and determining user intent based on the recognized user speech.
[0067] An action 416 comprises determining whether the request 408 has been cancelled by the audio device 102. As an example, the audio device 102 may send a cancellation message or may terminate the current communication session in order to cancel the request. If the request has been cancelled by the audio device 102, no further action is taken by the speech command service. If the request has not been cancelled, an action 418 is performed, which comprises sending a command 420 to the audio device 102 for execution by the audio device 102 in order to implement a service-identified function corresponding to the recognized user intent.
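On the service side, the check of the action 416 might be sketched as follows; the request-record fields are assumptions made for illustration only.

```python
# Service-side sketch of action 416: consult the request's cancellation state
# before dispatching the command of action 418. `pending_requests` maps
# session IDs to request records with a hypothetical "cancelled" flag.
def maybe_dispatch(session_id, command, pending_requests, send_to_device):
    request = pending_requests.get(session_id)
    if request is None or request.get("cancelled"):
        return                          # request withdrawn: take no action
    send_to_device({"session_id": session_id, "command": command})
```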
[0068] An action 422, performed by the audio device 102, comprises receiving and executing the command. The action 422 is shown in a dashed box to indicate that it is performed conditionally, depending on whether a command has been sent and received from the speech command service 108, which in turn depends on whether the audio device 102 has cancelled the request 408.
[0069] Concurrently with the actions performed by the speech command service 108, the audio device 102 performs an action 424 of analyzing received audio to detect an occurrence of a local command expression that follows or immediately follows the trigger expression in the received user speech. In response to detecting the local command expression, an action 426 is performed of immediately initiating a local device function that has been associated with the local command expression.
[0070] Also in response to detecting the local command expression in the action 424, the audio device 102 performs an action 428 of requesting the speech command service 108 to cancel the request 408 and/or to cancel implementation of any service-identified functions that may have otherwise been performed in response to recognized speech in the audio received by the speech command service 108 from the audio device 102. This may comprise communicating with the speech command service 108, such as by sending a cancellation notification or request.
[0071] In some cases, the cancellation may comprise replying to a communication or notification from the speech command service 108 of a pending implementation of a service-identified function by the speech command service. In response to receiving such a notification, the audio device 102 may reply and may request cancellation of the pending implementation. Alternatively, the audio device 102 may cancel the implementation of any function that might have otherwise been performed in response to detecting the local command expression, and may instruct the speech command service 108 to proceed with implementation of the pending function.
[0072] If the local command expression is not detected in the action 424, the audio device 102 performs the action 422 of executing the command 420 received from the speech command service 108. The action 422 may occur asynchronously, upon receiving the command 420 from the speech command service.
[0073] The embodiments described above may be implemented programmatically, such as with computers, processors, digital signal processors, analog processors, and so forth. In other embodiments, however, one or more of the components, functions, or elements may be implemented using specialized or dedicated circuits, including analog circuits and/or digital logic circuits. The term "component", as used herein, is intended to include any hardware, software, logic, or combinations of the foregoing that are used to implement the functionality attributed to the component.
[0074] Although the subject matter has been described in language specific to structural features, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features described. Rather, the specific features are disclosed as illustrative forms of implementing the claims.
Clauses:
1. One or more non-transitory computer-readable media storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
streaming the received audio to a remote speech command service; and
analyzing the received audio to detect a local command expression following the trigger expression in the user speech, wherein the local command expression is associated with a device function;
initiating the device function in response to detecting the local command expression following the trigger expression in the user speech;
receiving a response from the remote speech command service, wherein the response indicates a command that is to be performed in response to speech recognized by the remote speech command service in the streamed audio;
executing the command indicated by the response if the local command expression is not detected following the trigger expression in the user speech; and
foregoing execution of the command indicated by the response if the local command expression is detected following the trigger expression in the user speech.
2. The one or more computer-readable media of clause 1, wherein the streaming is associated with a communication identifier and wherein the response indicates the communication identifier.
3. The one or more computer-readable media of clause 1, wherein the device function comprises a media control function.
4. The one or more computer-readable media of clause 1, the acts further comprising stopping the streaming of the received audio in response to detecting the command expression.
5. A method, comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
sending the received audio to a speech command service to recognize speech in the received audio and to implement a first function corresponding to the recognized speech; and
analyzing the received audio to detect a local command expression that follows the trigger expression in the received audio, wherein the local command expression is associated with a second function;
in response to detecting the local command expression that follows the trigger expression in the received audio:
initiating the second function; and
cancelling implementation of the first function.
6. The method of clause 5, wherein cancelling implementation of the first function comprises requesting the speech command service to cancel implementation of the first function.
7. The method of clause 5, further comprising receiving a communication from the speech command service indicating a pending implementation of the first function;
wherein cancelling implementation of the first function comprises requesting the speech command service to cancel the pending implementation of the first function.
8. The method of clause 5, further comprising receiving a command corresponding to the first function from the speech command service, wherein cancelling implementation of the first function comprises forgoing execution of the command received from the speech command service.
9. The method of clause 5, further comprising informing the speech command service that the second function has been initiated.
10. The method of clause 5, wherein cancelling implementation of the first function comprises informing the speech command service that the second function has been initiated.
11. The method of clause 5, wherein the second function comprises a media control function.
12. The method of clause 5, further comprising:
establishing a communication session with the speech command service in response to detecting the trigger expression in the audio; and
wherein cancelling implementation of the first function comprises terminating the communication session.
13. The method of clause 5, further comprising:
associating an identifier with the received audio;
receiving a response from the speech command service, wherein the response indicates the identifier and a command corresponding to the first function; and
wherein cancelling implementation of the first function comprises forgoing execution of the command.
14. A system, comprising:
one or more speech recognition components configured to recognize user speech in received audio, to detect a trigger expression in the user speech, and to detect a local command expression in the user speech;
control logic configured to perform acts in response to detection by the one or more speech recognition components of the trigger expression in the user speech, the acts comprising: sending the audio to a speech command service to recognize speech in the audio and to implement a first function corresponding to the recognized speech; and
in response to detection by the one or more speech recognition components of the local command expression in the user speech: (a) identifying a second function corresponding to the local command expression and (b) cancelling implementation of at least one of the first and second functions.
15. The system of clause 14, wherein the one or more speech recognition components comprise one or more keyword spotters.
16. The system of clause 14, wherein cancelling implementation of the at least one of the first and second functions comprises requesting the speech command service to cancel implementation of the first function.
17. The system of clause 14, wherein cancelling implementation of the at least one of the first and second functions comprises ignoring a command received from the speech command service.
18. The system of clause 14, wherein the second function comprises a media control function.
19. The system of clause 14, the acts further comprising stopping the sending of the audio in response to detection of the local command expression in the user speech.
20. The system of clause 14, wherein cancelling implementation of the at least one of the first and second functions comprises informing the speech command service that the second function has been initiated.

Claims

CLAIMS
What is claimed is:
1. A device storing computer-executable instructions that, when executed, cause one or more processors of the device to perform acts comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
streaming the received audio to a remote speech command service; and
analyzing the received audio to detect a local command expression following the trigger expression in the user speech, wherein the local command expression is associated with a device function;
initiating the device function in response to detecting the local command expression following the trigger expression in the user speech;
receiving a response from the remote speech command service, wherein the response indicates a command that is to be performed in response to speech recognized by the remote speech command service in the streamed audio;
executing the command indicated by the response if the local command expression is not detected following the trigger expression in the user speech; and
foregoing execution of the command indicated by the response if the local command expression is detected following the trigger expression in the user speech.
2. The device of claim 1, wherein the streaming is associated with a communication identifier and wherein the response indicates the communication identifier.
3. The device of claim 1, wherein the device function comprises a media control function.
4. The device of claim 1, the acts further comprising stopping the streaming of the received audio in response to detecting the command expression.
5. A method, comprising:
receiving audio that contains user speech;
detecting a trigger expression in the user speech;
in response to detecting the trigger expression in the user speech:
sending the received audio to a speech command service to recognize speech in the received audio and to implement a first function corresponding to the recognized speech; and
analyzing the received audio to detect a local command expression that follows the trigger expression in the received audio, wherein the local command expression is associated with a second function;
in response to detecting the local command expression that follows the trigger expression in the received audio:
initiating the second function; and
cancelling implementation of the first function.
6. The method of claim 5, wherein cancelling implementation of the first function comprises requesting the speech command service to cancel implementation of the first function.
7. The method of claim 5, further comprising receiving a communication from the speech command service indicating a pending implementation of the first function;
wherein cancelling implementation of the first function comprises requesting the speech command service to cancel the pending implementation of the first function.
8. The method of claim 5, further comprising receiving a command corresponding to the first function from the speech command service, wherein cancelling implementation of the first function comprises forgoing execution of the command received from the speech command service.
9. The method of claim 5, further comprising informing the speech command service that the second function has been initiated.
10. The method of claim 5, further comprising:
associating an identifier with the received audio;
receiving a response from the speech command service, wherein the response indicates the identifier and a command corresponding to the first function; and
wherein cancelling implementation of the first function comprises forgoing execution of the command.
11. A system, comprising:
one or more speech recognition components configured to recognize user speech in received audio, to detect a trigger expression in the user speech, and to detect a local command expression in the user speech;
control logic configured to perform acts in response to detection by the one or more speech recognition components of the trigger expression in the user speech, the acts comprising:
sending the audio to a speech command service to recognize speech in the audio and to implement a first function corresponding to the recognized speech; and
in response to detection by the one or more speech recognition components of the local command expression in the user speech: (a) identifying a second function corresponding to the local command expression and (b) cancelling implementation of at least one of the first and second functions.
12. The system of claim 11, wherein cancelling implementation of the at least one of the first and second functions comprises requesting the speech command service to cancel implementation of the first function.
13. The system of claim 11, wherein cancelling implementation of the at least one of the first and second functions comprises ignoring a command received from the speech command service.
14. The system of claim 11, the acts further comprising stopping the sending of the audio in response to detection of the local command expression in the user speech.
15. The system of claim 11, wherein cancelling implementation of the at least one of the first and second functions comprises informing the speech command service that the second function has been initiated.
PCT/US2014/054700 2013-09-20 2014-09-09 Local and remote speech processing WO2015041892A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP14846698.0A EP3047481A4 (en) 2013-09-20 2014-09-09 Local and remote speech processing
JP2016543926A JP2016531375A (en) 2013-09-20 2014-09-09 Local and remote speech processing
CN201480050711.8A CN105793923A (en) 2013-09-20 2014-09-09 Local and remote speech processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201314033302A 2013-09-20 2013-09-20
US14/033,302 2013-09-20

Publications (1)

Publication Number Publication Date
WO2015041892A1 true WO2015041892A1 (en) 2015-03-26

Family

ID=52689281

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/054700 WO2015041892A1 (en) 2013-09-20 2014-09-09 Local and remote speech processing

Country Status (4)

Country Link
EP (1) EP3047481A4 (en)
JP (1) JP2016531375A (en)
CN (1) CN105793923A (en)
WO (1) WO2015041892A1 (en)

Cited By (143)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US9870196B2 (en) * 2015-05-27 2018-01-16 Google Llc Selective aborting of online processing of voice inputs in a voice-enabled electronic device
DK201670578A1 (en) * 2016-06-09 2018-02-26 Apple Inc Intelligent automated assistant in a home environment
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9966073B2 (en) 2015-05-27 2018-05-08 Google Llc Context-sensitive dynamic update of voice to text model in a voice-enabled electronic device
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10083697B2 (en) 2015-05-27 2018-09-25 Google Llc Local persisting of data for selectively offline capable voice action in a voice-enabled electronic device
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
WO2019046170A1 (en) * 2017-08-28 2019-03-07 Roku, Inc. Local and cloud speech recognition
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US20190295552A1 (en) * 2018-03-23 2019-09-26 Amazon Technologies, Inc. Speech interface device
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10455322B2 (en) 2017-08-18 2019-10-22 Roku, Inc. Remote control with presence sensor
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10515637B1 (en) 2017-09-19 2019-12-24 Amazon Technologies, Inc. Dynamic speech processing
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10580409B2 (en) 2016-06-11 2020-03-03 Apple Inc. Application integration with a digital assistant
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
CN111145735A (en) * 2018-11-05 2020-05-12 三星电子株式会社 Electronic device and operation method thereof
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
WO2020101865A1 (en) * 2018-11-13 2020-05-22 Motorola Solutions, Inc. Methods and systems for providing a corrected voice command
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
CN111629658A (en) * 2017-12-22 2020-09-04 瑞思迈传感器技术有限公司 Apparatus, system and method for motion sensing
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US10777197B2 (en) 2017-08-28 2020-09-15 Roku, Inc. Audio responsive device with play/stop and tell me something buttons
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US10942702B2 (en) 2016-06-11 2021-03-09 Apple Inc. Intelligent device arbitration and control
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11062702B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Media system with multiple digital assistants
US11062705B2 (en) 2018-07-18 2021-07-13 Kabushiki Kaisha Toshiba Information processing apparatus, information processing method, and computer program product
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11126389B2 (en) 2017-07-11 2021-09-21 Roku, Inc. Controlling visual indicators in an audio responsive electronic device, and capturing and providing audio using an API, by native and non-native computing devices and services
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11145298B2 (en) 2018-02-13 2021-10-12 Roku, Inc. Trigger word detection with multiple digital assistants
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11373645B1 (en) * 2018-06-18 2022-06-28 Amazon Technologies, Inc. Updating personalized data on a speech interface device
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
WO2023287471A1 (en) * 2021-07-15 2023-01-19 Arris Enterprises Llc Command services manager for secure sharing of commands to registered agents
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107146618A (en) * 2017-06-16 2017-09-08 北京云知声信息技术有限公司 Method of speech processing and device
CN107342083B (en) * 2017-07-05 2021-07-20 百度在线网络技术(北京)有限公司 Method and apparatus for providing voice service
WO2019026313A1 (en) * 2017-08-02 2019-02-07 パナソニックIpマネジメント株式会社 Information processing device, speech recognition system, and information processing method
US10713007B2 (en) * 2017-12-12 2020-07-14 Amazon Technologies, Inc. Architecture for a hub configured to control a second device while a connection to a remote system is unavailable
CN108320749A (en) * 2018-03-14 2018-07-24 百度在线网络技术(北京)有限公司 Far field voice control device and far field speech control system
EP3613037B1 (en) * 2018-06-27 2020-10-21 Google LLC Rendering responses to a spoken utterance of a user utilizing a local text-response map
JP7451033B2 (en) 2020-03-06 2024-03-18 アルパイン株式会社 data processing system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070258418A1 (en) * 2006-05-03 2007-11-08 Sprint Spectrum L.P. Method and system for controlling streaming of media to wireless communication devices
US20120179469A1 (en) 2011-01-07 2012-07-12 Nuance Communication, Inc. Configurable speech recognition system using multiple recognizers
US8296383B2 (en) * 2008-10-02 2012-10-23 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US8364481B2 (en) * 2008-07-02 2013-01-29 Google Inc. Speech recognition with parallel recognition tasks
US20130151250A1 (en) * 2011-12-08 2013-06-13 Lenovo (Singapore) Pte. Ltd Hybrid speech recognition

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS58208799A (en) * 1982-05-28 1983-12-05 トヨタ自動車株式会社 Voice recognition system for vehicle
WO2000058942A2 (en) * 1999-03-26 2000-10-05 Koninklijke Philips Electronics N.V. Client-server speech recognition
JP2001005492A (en) * 1999-06-21 2001-01-12 Matsushita Electric Ind Co Ltd Voice recognizing method and voice recognition device
AU2003263957A1 (en) * 2002-08-16 2004-03-03 Nuasis Corporation Contact center architecture
KR100521154B1 (en) * 2004-02-03 2005-10-12 삼성전자주식회사 Apparatus and method processing call in voice/data integration switching system
US9848086B2 (en) * 2004-02-23 2017-12-19 Nokia Technologies Oy Methods, apparatus and computer program products for dispatching and prioritizing communication of generic-recipient messages to recipients
JP4483428B2 (en) * 2004-06-25 2010-06-16 日本電気株式会社 Speech recognition / synthesis system, synchronization control method, synchronization control program, and synchronization control apparatus
CN1728750B (en) * 2004-07-27 2012-07-18 邓里文 Method of packet voice communication
JP5380777B2 (en) * 2007-02-21 2014-01-08 ヤマハ株式会社 Audio conferencing equipment
US8090077B2 (en) * 2007-04-02 2012-01-03 Microsoft Corporation Testing acoustic echo cancellation and interference in VoIP telephones
JP4925906B2 (en) * 2007-04-26 2012-05-09 Hitachi Ltd. Control device, information providing method, and information providing program
CN101246687A (en) * 2008-03-20 2008-08-20 Beihang University Intelligent voice interaction system and method thereof
US8019608B2 (en) * 2008-08-29 2011-09-13 Multimodal Technologies, Inc. Distributed speech recognition using one way communication
JP5244663B2 (en) * 2009-03-18 2013-07-24 KDDI Corp. Speech recognition processing method and system for inputting text by speech
US9171541B2 (en) * 2009-11-10 2015-10-27 Voicebox Technologies Corporation System and method for hybrid processing in a natural language voice services environment
JP5658641B2 (en) * 2011-09-15 2015-01-28 NTT Docomo Inc. Terminal device, voice recognition program, voice recognition method, and voice recognition system
US20130085753A1 (en) * 2011-09-30 2013-04-04 Google Inc. Hybrid Client/Server Speech Recognition In A Mobile Device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070258418A1 (en) * 2006-05-03 2007-11-08 Sprint Spectrum L.P. Method and system for controlling streaming of media to wireless communication devices
US8364481B2 (en) * 2008-07-02 2013-01-29 Google Inc. Speech recognition with parallel recognition tasks
US8296383B2 (en) * 2008-10-02 2012-10-23 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US20120179469A1 (en) 2011-01-07 2012-07-12 Nuance Communications, Inc. Configurable speech recognition system using multiple recognizers
US20130151250A1 (en) * 2011-12-08 2013-06-13 Lenovo (Singapore) Pte. Ltd. Hybrid speech recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FENN.: "Hype Cycle for Emerging Technologies", vol. 1-72, 2 August 2010 (2010-08-02), XP055327561, Retrieved from the Internet <URL:http://www.chinnovate.com/wp-content/uploads/2011/09/Hype-Cycle-for-Emerging-Technologies-2010.pdf> [retrieved on 2014-10-30] *
See also references of EP3047481A4 *

Cited By (230)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11928604B2 (en) 2005-09-08 2024-03-12 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US11900936B2 (en) 2008-10-02 2024-02-13 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US10692504B2 (en) 2010-02-25 2020-06-23 Apple Inc. User profiling for voice input processing
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US11321116B2 (en) 2012-05-15 2022-05-03 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US11636869B2 (en) 2013-02-07 2023-04-25 Apple Inc. Voice trigger for a digital assistant
US11557310B2 (en) 2013-02-07 2023-01-17 Apple Inc. Voice trigger for a digital assistant
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US11699448B2 (en) 2014-05-30 2023-07-11 Apple Inc. Intelligent assistant for home automation
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US10714095B2 (en) 2014-05-30 2020-07-14 Apple Inc. Intelligent assistant for home automation
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US11810562B2 (en) 2014-05-30 2023-11-07 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11670289B2 (en) 2014-05-30 2023-06-06 Apple Inc. Multi-command single utterance input method
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10657966B2 (en) 2014-05-30 2020-05-19 Apple Inc. Better resolution when referencing to concepts
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10390213B2 (en) 2014-09-30 2019-08-20 Apple Inc. Social reminders
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US11842734B2 (en) 2015-03-08 2023-12-12 Apple Inc. Virtual assistant activation
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US10986214B2 (en) 2015-05-27 2021-04-20 Google Llc Local persisting of data for selectively offline capable voice action in a voice-enabled electronic device
US9870196B2 (en) * 2015-05-27 2018-01-16 Google Llc Selective aborting of online processing of voice inputs in a voice-enabled electronic device
US10482883B2 (en) 2015-05-27 2019-11-19 Google Llc Context-sensitive dynamic update of voice to text model in a voice-enabled electronic device
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US11676606B2 (en) 2015-05-27 2023-06-13 Google Llc Context-sensitive dynamic update of voice to text model in a voice-enabled electronic device
US10083697B2 (en) 2015-05-27 2018-09-25 Google Llc Local persisting of data for selectively offline capable voice action in a voice-enabled electronic device
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US10334080B2 (en) 2015-05-27 2019-06-25 Google Llc Local persisting of data for selectively offline capable voice action in a voice-enabled electronic device
US9966073B2 (en) 2015-05-27 2018-05-08 Google Llc Context-sensitive dynamic update of voice to text model in a voice-enabled electronic device
US11087762B2 (en) 2015-05-27 2021-08-10 Google Llc Context-sensitive dynamic update of voice to text model in a voice-enabled electronic device
US10681212B2 (en) 2015-06-05 2020-06-09 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US11550542B2 (en) 2015-09-08 2023-01-10 Apple Inc. Zero latency digital assistant
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11954405B2 (en) 2015-09-08 2024-04-09 Apple Inc. Zero latency digital assistant
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
DK201670578A1 (en) * 2016-06-09 2018-02-26 Apple Inc Intelligent automated assistant in a home environment
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11657820B2 (en) 2016-06-10 2023-05-23 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10580409B2 (en) 2016-06-11 2020-03-03 Apple Inc. Application integration with a digital assistant
US11749275B2 (en) 2016-06-11 2023-09-05 Apple Inc. Application integration with a digital assistant
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US10942702B2 (en) 2016-06-11 2021-03-09 Apple Inc. Intelligent device arbitration and control
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US10847142B2 (en) 2017-05-11 2020-11-24 Apple Inc. Maintaining privacy of personal information
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US11538469B2 (en) 2017-05-12 2022-12-27 Apple Inc. Low-latency intelligent automated assistant
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11837237B2 (en) 2017-05-12 2023-12-05 Apple Inc. User-specific acoustic models
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US10909171B2 (en) 2017-05-16 2021-02-02 Apple Inc. Intelligent automated assistant for media exploration
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US11126389B2 (en) 2017-07-11 2021-09-21 Roku, Inc. Controlling visual indicators in an audio responsive electronic device, and capturing and providing audio using an API, by native and non-native computing devices and services
US10455322B2 (en) 2017-08-18 2019-10-22 Roku, Inc. Remote control with presence sensor
US10777197B2 (en) 2017-08-28 2020-09-15 Roku, Inc. Audio responsive device with play/stop and tell me something buttons
US11646025B2 (en) 2017-08-28 2023-05-09 Roku, Inc. Media system with multiple digital assistants
US11804227B2 (en) 2017-08-28 2023-10-31 Roku, Inc. Local and cloud speech recognition
US11062702B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Media system with multiple digital assistants
US11062710B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Local and cloud speech recognition
WO2019046170A1 (en) * 2017-08-28 2019-03-07 Roku, Inc. Local and cloud speech recognition
US11961521B2 (en) 2017-08-28 2024-04-16 Roku, Inc. Media system with multiple digital assistants
US10515637B1 (en) 2017-09-19 2019-12-24 Amazon Technologies, Inc. Dynamic speech processing
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
CN111629658B (en) * 2017-12-22 2023-09-15 ResMed Sensor Technologies Ltd. Apparatus, system, and method for motion sensing
CN111629658A (en) * 2017-12-22 2020-09-04 ResMed Sensor Technologies Ltd. Apparatus, system and method for motion sensing
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US11145298B2 (en) 2018-02-13 2021-10-12 Roku, Inc. Trigger word detection with multiple digital assistants
US11935537B2 (en) 2018-02-13 2024-03-19 Roku, Inc. Trigger word detection with multiple digital assistants
US11664026B2 (en) 2018-02-13 2023-05-30 Roku, Inc. Trigger word detection with multiple digital assistants
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10984799B2 (en) * 2018-03-23 2021-04-20 Amazon Technologies, Inc. Hybrid speech interface device
US20190295552A1 (en) * 2018-03-23 2019-09-26 Amazon Technologies, Inc. Speech interface device
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US11487364B2 (en) 2018-05-07 2022-11-01 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11900923B2 (en) 2018-05-07 2024-02-13 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10984798B2 (en) 2018-06-01 2021-04-20 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11360577B2 (en) 2018-06-01 2022-06-14 Apple Inc. Attention aware virtual assistant dismissal
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US11630525B2 (en) 2018-06-01 2023-04-18 Apple Inc. Attention aware virtual assistant dismissal
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US10944859B2 (en) 2018-06-03 2021-03-09 Apple Inc. Accelerated task performance
US10504518B1 (en) 2018-06-03 2019-12-10 Apple Inc. Accelerated task performance
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US11373645B1 (en) * 2018-06-18 2022-06-28 Amazon Technologies, Inc. Updating personalized data on a speech interface device
US11062705B2 (en) 2018-07-18 2021-07-13 Kabushiki Kaisha Toshiba Information processing apparatus, information processing method, and computer program product
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
CN111145735A (en) * 2018-11-05 2020-05-12 Samsung Electronics Co., Ltd. Electronic device and operation method thereof
CN111145735B (en) * 2018-11-05 2023-10-24 Samsung Electronics Co., Ltd. Electronic device and method of operating the same
US10885912B2 (en) 2018-11-13 2021-01-05 Motorola Solutions, Inc. Methods and systems for providing a corrected voice command
WO2020101865A1 (en) * 2018-11-13 2020-05-22 Motorola Solutions, Inc. Methods and systems for providing a corrected voice command
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11783815B2 (en) 2019-03-18 2023-10-10 Apple Inc. Multimodality in digital assistant systems
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11675491B2 (en) 2019-05-06 2023-06-13 Apple Inc. User configurable task triggers
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US20230013916A1 (en) * 2021-07-15 2023-01-19 Arris Enterprises Llc Command services manager for secure sharing of commands to registered agents
WO2023287471A1 (en) * 2021-07-15 2023-01-19 Arris Enterprises Llc Command services manager for secure sharing of commands to registered agents

Also Published As

Publication number Publication date
EP3047481A4 (en) 2017-03-01
EP3047481A1 (en) 2016-07-27
JP2016531375A (en) 2016-10-06
CN105793923A (en) 2016-07-20

Similar Documents

Publication Publication Date Title
WO2015041892A1 (en) Local and remote speech processing
US11600271B2 (en) Detecting self-generated wake expressions
US9672812B1 (en) Qualifying trigger expressions in speech-based systems
US10354649B2 (en) Altering audio to improve automatic speech recognition
CN108351872B (en) Method and system for responding to user speech
CN107004411B (en) Voice application architecture
EP3084633B1 (en) Attribute-based audio channel arbitration
US9734845B1 (en) Mitigating effects of electronic audio sources in expression detection
US9098467B1 (en) Accepting voice commands based on user identity
US9324322B1 (en) Automatic volume attenuation for speech enabled devices
US9293134B1 (en) Source-specific speech interactions
US10297250B1 (en) Asynchronous transfer of audio data
KR20190075800A (en) Intelligent personal assistant interface system
US9224404B2 (en) Dynamic audio processing parameters with automatic speech recognition
CN102591455A (en) Selective transmission of voice data
US11862153B1 (en) System for recognizing and responding to environmental noises
EP2760019B1 (en) Dynamic audio processing parameters with automatic speech recognition

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 14846698

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2014846698

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2014846698

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2016543926

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE