CN105793923A

CN105793923A - Local and remote speech processing

Info

Publication number: CN105793923A
Application number: CN201480050711.8A
Authority: CN
Inventors: 尼克尔·斯特罗姆; 彼得·斯伯丁·万兰德; 比约恩·霍夫梅斯特
Original assignee: Amazon Technologies Inc
Current assignee: Amazon Technologies Inc
Priority date: 2013-09-20
Filing date: 2014-09-09
Publication date: 2016-07-20
Also published as: EP3047481A4; WO2015041892A1; EP3047481A1; JP2016531375A

Abstract

A user device may be configured to detect a user-uttered trigger expression and to respond by interpreting subsequent words or phrases as commands. The commands may be recognized by sending audio containing the words or phrases to a remote service that is configured to perform speech recognition. Certain commands may be designated as local commands and may be detected locally rather than relying on the remote service. Upon detection of the trigger expression, audio is streamed to the remote service and also analyzed locally to detect utterances of local commands. Upon detecting a local command, a corresponding function is immediately initiated, and subsequent activities or responses by the remote service are canceled or ignored.

Description

Local and remote speech processes

Related application

This application claims the U.S. Patent Application No. 14/033 that title is " LocalandRemoteSpeechProcessing (local and remote speech processes) " of JIUYUE in 2013 submission on the 20th, the priority of 302, described patent application is incorporated herein in its entirety by reference.

Background

Family, office, automobile and public space become and computing equipment, such as notebook computer, panel computer, entertainment systems and portable communication device develop rapidly contact more and more tightr.Along with the development of computing equipment, the mode that user and these equipment interact also continues to development.Such as, people can pass through plant equipment (such as, keyboard, mouse etc.), electrical equipment (such as, touch screen, Trackpad etc.) and optical device (such as, motion detector, video camera etc.) and computing equipment interact.The another way interacted with computing equipment is by catching human speech the audio frequency apparatus that described human speech is responded.

Accompanying drawing is sketched

Detailed description is described with reference to.In the drawings, in reference number, the Digital ID of the leftmost side occurs the figure of described reference number first.The same reference numbers used in various figures indicates similar or identical parts or feature.

Fig. 1 is the block diagram of illustrative speech (voice) the interactive computing architecture including local audio devices and remote speech process service.

Fig. 2-4 illustrates to be processed, by local audio devices and remote speech, the flow chart that service combines the example process of the order statement performed for detecting.

Describe in detail

The disclosure relates in general to and a kind of provide or promote and the voice-based mutual speech interface system of user.Described system includes the local device with mike, and described microphones capture comprises the audio frequency of user speech.With keyword, triggering statement can be referred to as or wakes statement up before spoken user order.Audio frequency after triggering statement can be streamed to remote service to carry out speech recognition, and the order performed by audio frequency apparatus can be responded by described service by execution function or offer.

Communication with remote service can introduce the response waiting time, and in most of the cases the described response waiting time can minimize in acceptable limits.But, some verbal orders may call for the less waiting time.For example, about the verbal order of certain form of media presentation, such as " stopping ", " time-out ", " termination " etc., it may be necessary to perform with amount of less appreciable waiting time.

According to various embodiments, some orders statement, is referred to herein as local command or local command statement, local device but not remote service detects or detect at local device but not remote service.More specifically, it is that user view is to form order that local device is configured to detect triggering or warning statement, described triggering or warning statement instruction voice subsequently.When triggering statement being detected, local device starts the communication session with remote service and starts the audio streaming received is transferred to described service.As response, the remote service audio frequency to receiving performs speech recognition and attempts to identify user view based on the voice of institute's identification.In response to the user view of institute's identification, remote service can perform the function of correspondence.In some cases, can be combined with local device and perform function.Such as, remote service can send order to local device, and instruction local device should perform described order to perform the function of correspondence.

With the active synchronization of remote service, local device monitoring or analysis audio frequency to detect the generation of local command statement after triggering is stated.When local command in audio frequency being detected is stated, local device realizes the function of correspondence immediately.Additionally, stop or cancelling other actions performed by remote service to avoid the palikinesia about unique user language.The action performed by remote service can be stopped in the following manner: clearly notice remote service described in language this locality implement, by terminating or cancel communication session and/or by abandoning any order specified by remote service in response to long-distance user's speech recognition.

Fig. 1 illustrates the example of voice interaction system 100.System 100 can include or the available local audio frequency apparatus 102 based on speech, and described audio frequency apparatus 102 can be located in environment 104 (such as family) and can be used for interacting with user 106.Voice interaction system 100 may also include or utilize long-range network voice command service 108, institute's speech commands service 108 is configured to receive the voice in audio frequency, identification audio frequency and the voice in response to institute's identification performs function, is referred to herein as the function that service identifies.The function that service identifies can be serviced 108 by voice command and realize independent of audio frequency apparatus, and/or can by providing order to realize locally executing to audio frequency apparatus 102.

In certain embodiments, user can be through voice with the mutual Main Patterns of audio frequency apparatus 102.Such as, the verbal order that audio frequency apparatus 102 can receive from user 106 is stated and may be in response to described order to provide service.User can say and predefined wake or trigger statement (such as, " waking up ") up, described in wake or trigger statement up after can be that (such as, " I wants to go to the cinema for order or instruction.Could you tell me what film local cinema is playing.”).The service provided can include execution action or activity, presents media, acquisition and/or offer information, provide information by the voice generated by audio frequency apparatus 102 or synthesize, represent user 106 and start service based on the Internet etc..

Local audio devices 102 and voice command service 108 are configured to be bonded to each other work and state with the order received from user 106 and it is responded.The local command that order statement can include being carried out detecting and implementing independent of voice command service 108 by local device 102 is stated.Order statement may also include by remote speech command service 108 or is combined, with remote speech command service 108, the order making an explanation and implementing.

Audio frequency apparatus 102 can have one or more mike 110 and one or more audio tweeter or changer 112, in order to promotes mutual with the audio frequency of user 106.Mike 110 produces microphone signal, is also referred to as input audio signal, and it represents the audio frequency from environment 104, including the sound sent by user 106 or statement.

In some cases, mike 110 can include microphone array, and described microphone array and audio signal beam form the input audio signal that technology is utilized in conjunction with concentrating on optional direction with generation.Similarly, multiple directions mike 110 can be used to produce the audio signal corresponding in multiple usable directions.

Audio frequency apparatus 102 includes operation logic, and described operation logic can include processor 114 and memorizer 116 in many cases.Processor 114 can include multiple processor and/or have the processor of multiple core.Processor 114 also can contain or comprise the digital signal processor for processing audio signal.

Memorizer 116 can comprise the application in form of computer-executable instructions and program, and processor 114 performs computer executable instructions to perform to realize operation or the action of the desired function (including the following function being expressly recited) of audio frequency apparatus 102.Memorizer 116 can be a class computer-readable recording medium and can include volatibility and nonvolatile memory.Therefore, memorizer 116 may include but be not limited to RAM, ROM, EEPROM, flash memory or other memory technologies.

Audio frequency apparatus 102 can include can performing to provide service and multiple application of function, service and/or function 118 by processor 114, is hereafter referred to collectively as functional unit 118.Application and other functional units 118 can include media playback services, such as music player.By application and other functional units 118 perform or provide other service or operation can include (as an example) request and consumer entertainment (such as, game, search and play music, film or other guide etc.), personal management (such as, schedule formulation, notepaper making etc.), online shopping, financial transaction, data base querying, interpersonal Speech Communication etc..

In some embodiments, functional unit 118 can be pre-installed on audio frequency apparatus 102, and can realize the Core Feature of audio frequency apparatus 102.In other embodiments, application or other functional units 118 one or more can undertake installing or otherwise installing by user 106 user 106 has initialized audio frequency apparatus 102 after, and function that is other or that customize can be realized according to the expectation of user 106.

Processor 114 can be configured by Audio Processing function or assembly 120 input audio signal that reason mike 110 generates and/or the output audio signal providing speaker 112.For example, audio processing components 120 can realize acoustic echo and eliminate the audio echo being coupled generation with minimizing by the acoustics between mike 110 with speaker 112.Audio processing components 120 also can realize noise decrease with reduce noise in reception audio signal, such as input audio signal but not the element of user speech.In certain embodiments, audio processing components 120 can include one or more audio signal beam shaper, and described audio signal beam shaper concentrates on audio signal on the direction having detected that user speech in response to multiple mikes 110 to generate.

Audio frequency apparatus 102 can also be configured to implement one or more statement detector or speech recognition assembly 122, and the one or more statement detector or speech recognition assembly 122 can be used for the triggering statement in the voice that detection is caught by mike 110.It is that user view is interpreted the word of order, phrase or other language that term " trigger statement " to be used for indicating for signaling audio frequency apparatus 102 user speech subsequently in this article.

One or more speech recognition assembly 122 can also be used to detect the order in the voice caught by mike 110 or order statement.Term " order statement " is in this article for indicating corresponding to by by audio frequency apparatus 102 or the function performed by the addressable service of audio frequency apparatus 102 or other equipment (such as voice command service 108) or the word being associated with described function, phrase or other language.Such as, word " stopping ", " time-out ", " termination " can be used as order statement." stopping " and " time-out " order statement may indicate that should interrupt media playback activity." termination " order statement may indicate that current interpersonal communication should terminate.It is used as other order statements corresponding to difference in functionality.Order statement can include conversational instruction, such as " finds neighbouring Italian restaurant ".

Order statement can include the local command statement that will be made an explanation when being independent of voice command service 108 by audio frequency apparatus 102.In general, local command statement is relatively short statement, and such as word or short phrase, it can easily be detected by audio frequency apparatus 102.Local command statement may correspond to the functions of the equipments of expectation relatively low response waiting time, and such as media control or media playback controls function.The service of voice command service 108 can be used for other order statements of acceptable bigger response waiting time.The order implemented by voice command service is expressed in and will be referred to as remote command statement herein.

In some cases, speech recognition assembly 122 can use automatic speech recognizing (ASR) technology to realize.Such as, large vocabulary speech recognition technology can be used to carry out keyword search, and the output appearance to find keyword of speech recognition can be monitored.For example, speech recognition can use hidden Markov model and gauss hybrid models to carry out identification speech and input and provide the continuous word stream inputted corresponding to described speech.Subsequently, word stream can be monitored to detect one or more word specified or statement.

As an alternative, speech recognition assembly 122 can be realized by one or more keyword direction finders.Keyword direction finder is functional unit or algorithm, and its assessment audio signal is to detect the existence of one or more predefined words or statement in audio signal.In general, keyword direction finder uses the ASR technology simplified detect the word of certain words or limited quantity rather than attempt identification large vocabulary.Such as, when specified word being detected in voice signal, keyword direction finder can provide notice rather than provide text or the output based on word.Various words can be compared by the keyword direction finder using these technology based on hidden Markov model (HMM), and word list is shown as state sequence by described hidden Markov model.In general, by by discourse model and keyword model and be compared to language is analyzed with background model.The model of language and keyword model are compared and draws the score representing language corresponding to the probability of keyword.The model of language and background model are compared and draws the score representing language corresponding to the probability of the generic word except keyword.Can compare to determine whether to have said keyword by two scores.

Audio frequency apparatus 102 may also include and is referred to herein as controller or controls the control function 124 of logic, and described control function 124 is configured to other assemblies with audio frequency apparatus 102 and interacts to realize the logic function of audio frequency apparatus 102.

Control executable instruction, program and/or or program module that logic 124, audio processing components 120, speech recognition assembly 122 and functional unit 118 can include being stored in memorizer 116 and being performed by processor 114.

Voice command service 108 can be the part that network-accessible calculates platform in some cases, and described network-accessible calculates platform to be undertaken safeguarding and may have access to by network 126 (such as the Internet).Such network-accessible calculates platform and term such as " on-demand computing ", " namely software service (SaaS) ", " platform calculating ", " network-accessible platform ", " cloud service ", " data center " etc. can be used to refer to.

Audio frequency apparatus 102 and/or voice command service 108 can pass through cable technology (such as, electric wire, USB (universal serial bus) (USB), fiber optic cables etc.), wireless technology (such as, radio frequency (RF), honeycomb, mobile telephone network, satellite, bluetooth etc.) or other interconnection techniques be communicatively coupled to network 126.Network 126 represents any kind of communication network, including data and/or voice network, and can use non-wireless infrastructures (such as, coaxial cable, fiber optic cables etc.), radio infrastructure (such as, RF, honeycomb, microwave, satellite,Deng) and/or other interconnection techniques realize.

Although audio frequency apparatus 102 is described herein as speech control or voice-based interface equipment, but the techniques described herein realize in combinations with various types of equipment, such as telecommunication apparatus and assembly, hand free device, amusement equipment, media-playback device etc..

Voice command service 108 is commonly provided for following function: receives the voice from the voice in the audio stream of audio frequency apparatus 102, identification audio stream, according to institute's identification and determines user view and in response to user view execution action or service.The action provided performs in combinations with audio frequency apparatus 102 in some cases, and voice command service 108 can indicate the response of the order performed by audio frequency apparatus 102 to audio frequency apparatus 102 return in these cases.

Voice command service 108 includes operation logic, and described operation logic can include one or more server, computer and or processor 128 in many cases.Voice command service 108 also can have the memorizer 130 comprising application and program in instruction type, and processor 128 performs instruction to perform to realize operation or the action of the desired function (including function explicitly described herein) of voice command service.Memorizer 130 can be a class computer-readable storage medium and can include volatibility and nonvolatile memory.Therefore, memorizer 130 may include but be not limited to RAM, ROM, EEPROM, flash memory or other memory technologies.

Be not explicitly depicted other logically and physically among assembly, voice command service 108 can include speech recognition assembly 132.Speech recognition assembly 132 can include automatic speech recognizing (ASR) function of the human speech in identification audio signal.

Voice command service 108 may also include the voice based on institute's identification and determines the natural language understanding assembly (NLU) 134 of user view.

Voice command service 108 may also include determining that command interpreter and the action allotter 136 (hereinafter referred to as command interpreter 136) of the function corresponding to user view or order.In some cases, order may correspond to the function that will be performed at least in part by audio frequency apparatus 102, and command interpreter 136 can provide instruction for realizing the response of the order of this type of function to audio frequency apparatus 102 in these cases.The order that can be performed in response to the instruction from command interpreter 136 by audio frequency apparatus or the example of function can include playing music or other media, increase/reductions speaker 112 volume, generated the certain form of communication etc. that the user of audible voice, startup and similar devices carries out by speaker 112.

It should be noted that voice command service 108 may also be responsive to perform to relate to the function of unshowned entity or equipment in Fig. 1 in the voice gone out from the audible recognition received.Such as, voice command service 108 can interact with other network services to represent user 106 obtaining information or service.Additionally, voice command service 108 itself can have the various elements and function that can the voice that user 106 sends be responded.

In operation, the audio frequency of the voice comprising user 106 is caught or received to the mike 110 of audio frequency apparatus 102.The audio frequency that audio frequency is undertaken processing and processing by audio processing components 120 is received by speech recognition assembly 122.Speech recognition assembly 122 analyzes described audio frequency to detect the appearance triggering statement in the voice that audio frequency comprises.When triggering statement being detected, controller 124 starts to send the audio frequency that receives together with the request that voice command services 108 or stream transmission is to voice command service 108, with identification with explain user speech and start the function corresponding to any explained intention.

Tong Bu with sending the audio to voice command service 108, speech recognition assembly 122 continues to analyze the audio frequency received to detect the appearance of local command statement in user speech.When detecting that local command is stated, controller 124 starts or performs the functions of the equipments stated corresponding to described local command.Such as, stating " stopping " in response to local command, controller 124 can start the function stopping media playback.When starting or perform function, controller 124 can interact with one or more in functional unit 118.

Meanwhile, in response to receiving audio frequency, voice command service 108 synchronizes to be analyzed with identification voice to described audio frequency, it is determined that user view, and determines the function that the service that will realize identifies in response to user view.But, after this locality detection and implementing local command statement, audio frequency apparatus 102 can take action to cancel, abolish and may finally be serviced the function of any service identification of 108 startups by voice command or make it invalid.Such as, audio frequency apparatus 102 can be cancelled message and/or service 108 Streaming audio to voice command and cancel its previous Request by stopping by sending to voice command service 108.As another example, audio frequency apparatus can ignore or abandon the order of any response or the service identification received in response to early stage request from voice command service 108.In some cases, audio frequency apparatus can notify that voice command service 108 is stated in the action locally executed in response to local command, and voice command service 108 can revise its behavior subsequently based on this information.Such as, voice command service 108 can abandon the action that is otherwise likely to perform in response to the voice of identification in received audio frequency.

Fig. 2 illustrates illustrative methods 200, and described method 200 can be combined with voice command service 108 by audio frequency apparatus 102 and perform so that discriminating user voice it is responded.By method 200 described in the context in the system 100 of Fig. 1, although method 200 can also perform in other environments and can realize in a different manner.

Action in the left side of Fig. 2 is to perform at local audio devices 102 place or be executed by.Action on the right side of Fig. 2 is to perform at remote speech command service 108 place or be executed by.

Action 202 includes receiving by mike 110 or the audio signal caught in conjunction with described mike 110.Audio signal comprises or represents the audio frequency from environment 104, and can comprise user speech.Audio signal can be analog electrical signal or can include digital signal, such as digital audio stream.

Action 204 includes detecting the appearance triggering statement in the audio frequency received and/or user speech.This action can be performed by speech recognition assembly 122 as above, and described speech recognition assembly 122 can include keyword direction finder in some embodiments.If being not detected by triggering statement, then palikinesia 204 is so that monitoring triggers the appearance of statement continuously.All the other actions shown in Fig. 2 are in response to and detect what triggering statement performed.

If triggering statement being detected in action 204, so execution action 206, send, to voice command service 108, the audio frequency that receives including with rear and voice command is serviced the service request 208 of 108, in order to voice in identification audio frequency and realize the function of the voice corresponding to institute's identification.The function started by this way by voice command service 108 is referred to herein as the function that service identifies, and can be combined with audio frequency apparatus 102 in some cases and perform.Such as, can pass through to send commands to start function to audio frequency apparatus 102.

Described transmission 206 may be included in after triggering statement being detected, would indicate that or voice command service 108 is transmitted or be otherwise transferred to the digital audio stream 210 that comprises the audio frequency received from mike 110 as a stream.In certain embodiments, action 206 can include opening or start the communication session between audio frequency apparatus 102 and voice command service 108.Specifically, it is possible to use request 208 set up and voice command service 108 communication session, in order to identification voice, understand be intended to and determine the action or function that will perform in response to user speech.Request 208 can be followed by or be attended by streaming audio 210.In some cases, it is provided that multiple parts of the audio frequency received just started can be included in the time said before triggering statement to the audio stream 210 of voice command service 108.

Communication session can be associated with communication or Session ID (ID), and described communication or session id mark service, at audio frequency apparatus 102 and voice command, the communication session set up between 108.Session id can use in the future communications relevant to specific user's language or audio stream or include wherein.In some cases, session id can be generated by audio frequency apparatus 102 and be provided to voice command service 108 in request 208.As an alternative, session id can be serviced 108 generations by voice command and be serviced 108 offers in the confirmation to request 208 by voice command.Term " request (ID) " is in this article for indicating the request with special session ID.The response relevant to same session, request or audio stream servicing 108 from voice command can by term " response (ID) " instruction.

In certain embodiments, each communication session and corresponding session id may correspond to unique user language.Such as, audio frequency apparatus 102 can set up session when triggering statement being detected.Audio frequency apparatus 102 may thereafter continue to part audio streaming being transferred to voice command service 108 as same session, until user spoken utterances terminates.Voice command service 108 can use identical session id to provide response to audio frequency apparatus 102 by session.Response may indicate that the order that will be performed in response to the voice by voice command service 108 identification in the audio frequency 210 received by audio frequency apparatus 102 in some cases.Communication session can stay open until audio frequency apparatus 102 receives the response or until the audio frequency apparatus 102 cancellation request that service 108 from voice command.

In action 212,208 and audio stream 210 are asked in voice command service 108 reception.As response, voice command services 108 execution actions 214: use speech recognition and the natural language understanding assembly 132 and 134 of voice command service 108, voice in the audio frequency that identification receives and determining such as the user view of the phonetic representation by institute's identification.The action 214 performed by command interpreter 136 includes the function identifying and starting service identification to fulfil determined user view.The function that service identifies can be serviced 108 by voice command in some cases and perform independent of audio frequency apparatus 102.In other cases, the recognizable function that will be performed by audio frequency apparatus 102 of voice command service 108, and can send, to audio frequency apparatus 102, the corresponding order performed for audio frequency apparatus 102.

Tong Bu with the action performed by voice command service 108, local audio devices 102 performs other actions to determine whether user says local command statement and the local function performing correspondence in response to any this local command statement said.Specifically, in response to detecting that in action 204 action 218 triggering statement and perform includes analyzing the audio frequency received in action 202 to detect the appearance of the local command statement after triggering statement or immediately after in received speech.This action can be performed by the speech recognition assembly 122 of audio frequency apparatus 102 as above, and described speech recognition assembly 122 can include keyword direction finder in some embodiments.

Detect that local command is stated in response in action 218, perform to start immediately the action 220 of the functions of the equipments being associated with local command statement.Such as, local command statement " stopping " can being associated with the function stopping media playback.

It addition, detect that local command is stated in response in action 218, audio frequency apparatus 102 performs to stop or cancelling the action 222 of the request 208 that voice command services 108.This action can include cancelling or abolish the realization of function that service identifies, the function of described service identification is otherwise likely to be realized in response to received request 208 and the audio frequency 210 enclosed by voice command service 108.

In some implementation, action 222 can include clearly notifying or order to voice command service 108 transmission, and the realization of the function being otherwise likely to any service identification started in response to the voice of institute's identification about any other identification activity and/or cancellation of servicing request 208 is cancelled in request voice command service 108.As an alternative, audio frequency apparatus 102 can notify simply voice command service 108 about in response to local command statement local identification and in any function locally executed, and voice command service 108 can pass through cancel service request 208 or by perform other actions (depending on the circumstances) respond.

In some implementation, voice command service 108 can by identifying the function that the order performed by audio frequency apparatus 102 realizes service identification.In response to receiving the notice that service request 208 will be cancelled, voice command service 108 can be abandoned sending order to audio frequency apparatus 102.As an alternative, voice command service can being allowed to complete it and process and send order to audio frequency apparatus 102, audio frequency apparatus 102 can be ignored described order or abandon performing described order when the time comes.

In some implementations, voice command service can be configured to starting notification audio equipment 102 before the function that service identifies, and can postpone the realization of the function that described service identifies until receive license from audio frequency apparatus 102.In this case, audio frequency apparatus 102 can be configured to when picking out local command statement in this locality and refuse this license.

Kind described above method can use when needing different order waiting time amounts.Such as, waiting that the communication from voice command service can introduce the of a relatively high waiting time, this is probably unacceptable in some cases.This type of communication before realizing function can prevent from repeating or unexpected action.Realize the order statement of local identification immediately and ignore subsequently from the order of voice command service or cancel the request to voice command service subsequently and be more likely to be appropriate for expecting the situation of relatively low latency.

It should be noted that the action 218,220 and 222 of the action of the voice command service 108 shown in Fig. 2 and audio frequency apparatus 102 performs parallel and asynchronously.In some implementations, assume that audio frequency apparatus 102 can relatively quickly detect and implement local command statement, so that it can perform action 222: before the function of the service identification of action 216 has completed or performed, cancel request 208 and the process subsequently of 108 execution will be serviced by voice command.

Fig. 3 illustrates illustrative methods 300, wherein voice command service 108 is to audio frequency apparatus 102 return command, and wherein audio frequency apparatus 102 is configured to ignore described order or abandon performing described order when local command statement is detected by audio frequency apparatus 102 and implemented.Initial actuating is similar with those described above or identical.The action performed by audio frequency apparatus 102 illustrates in left side and the action that performed by voice command service 108 illustrates on right side.

Action 302 includes receiving the audio signal comprising user speech.Action 304 includes the triggering statement analyzing audio signal to detect in user speech.Action subsequently shown in Fig. 3 is in response to and detects what triggering statement performed.

Action 306 includes servicing 108 transmission request 308 and audio frequency 310 to voice command.Action 312 includes servicing 108 places at voice command and receives request 308 and audio frequency 310.Action 314 includes discriminating user voice and the user speech based on institute's identification determines user view.

In response to determined user view, voice command services 108 execution actions 316: send order 318 to audio frequency apparatus 102, and described order 318 is for being performed to realize the function that the service of the user view corresponding to institute's identification identifies by audio frequency apparatus 102.Such as, order can include " stopping " order, and its instruction audio frequency apparatus 102 should stop the playback of music.

The action 320 performed by audio frequency apparatus 102 includes receiving and performing order.Action 320 illustrates in a dotted box, for indicating it to be based on whether audio frequency apparatus 102 detects and implement local command statement and be conditionally executed.Specifically, if audio frequency apparatus 102 has detected that local command is stated, then do not perform action 320.

Tong Bu with the action performed by voice command service 108, audio frequency apparatus 102 performs action 322: analyze the audio frequency that receives with in detection institute receptions user speech trigger state after or the appearance stated of local command immediately after.In response to detecting that local command is stated, perform to start immediately the action 324 of the local device function being associated with local command statement.

It addition, detect that local command is stated in response in action 322, audio frequency apparatus 102 performs to abandon the action 326 of the order 318 that execution receives.More specifically, abandon or ignore any order received from voice command service 108 in response to request 308.Response and order corresponding to request 308 can be identified by the session id being associated with response.

If being not detected by local command statement in action 322, then audio frequency apparatus execution action 320: perform the order 318 received from voice command service 108.

Fig. 4 illustrates illustrative methods 400, and wherein the request that voice command services 108 actively cancelled by audio frequency apparatus 102 after being configured to be in that the statement of locally detected local command.Initial actuating is similar with those described above or identical.The action performed by audio frequency apparatus 102 illustrates in left side and the action that performed by voice command service 108 illustrates on right side.

Action 402 includes receiving the audio signal comprising user speech.Action 404 includes the triggering statement analyzing audio signal to detect in user speech.Action subsequently shown in Fig. 4 is in response to and detects what triggering statement performed.

Action 406 includes servicing 108 transmission request 408 and audio frequency 410 to voice command.Action 412 includes servicing 108 places at voice command and receives request 408 and audio frequency 410.Action 414 includes discriminating user voice and the user speech based on institute's identification determines user view.

Action 416 includes determining whether request 408 is cancelled by audio frequency apparatus 102.For example, audio frequency apparatus 102 can send cancellation message or can terminate present communications session to cancel request.If request is cancelled by audio frequency apparatus 102, then further action is no longer taked in voice command service.If request is not yet cancelled, then execution action 418, described action 418 includes: send order 420 to audio frequency apparatus 102, and described order 420 is for being performed to realize the function that the service of the user view corresponding to institute's identification identifies by audio frequency apparatus 102.

The action 422 performed by audio frequency apparatus 102 includes receiving and performing order.Action 422 illustrates in a dotted box, is used for indicating it to be depending on whether voice command service 108 has sent and received order, then depended on whether audio frequency apparatus 102 has been cancelled request 408 and be conditionally executed.

Tong Bu with the action performed by voice command service 108, audio frequency apparatus 102 performs action 424: analyze the audio frequency that receives with in detection institute receptions user speech trigger state after or the appearance stated of local command immediately after.In response to detecting that local command is stated, perform to start immediately the action 426 of the local device function being associated with local command statement.

Additionally, detect that local command is stated in response in action 424, audio frequency apparatus 102 performs action 428: request voice command service 108 is cancelled request 408 and/or cancels the realization of function of any service identification, and the function of described service identification is otherwise likely to perform in response to the voice by the voice command service 108 institute's identification from the audio frequency that audio frequency apparatus 102 receives.This action can include communicating with voice command service 108, such as by sending cancellation notice or request.

In some cases, the response that can include the realization undetermined making the function that the service undertaken by institute's speech commands service identifies from the communication of voice command service 108 or notice is cancelled.In response to receiving this notice, audio frequency apparatus 102 can respond and can ask to cancel described realization undetermined.As an alternative, audio frequency apparatus 102 can cancel the realization of any function being otherwise likely to perform in response to local command statement being detected, and may indicate that voice command service 108 proceeds the realization of function undetermined.

If being not detected by local command statement in action 424, then audio frequency apparatus 102 performs action 422: perform the order 420 received from voice command service 108.When receiving the order 420 from voice command service, action 422 can occur asynchronously.

The mode that embodiments described above can program realizes, and such as utilizes computer, processor, digital signal processor, analog processor etc..But, in other embodiments, one or more in assembly, function or element use special or special circuit to realize, including analog circuit and/or Digital Logical Circuits.As used herein, term " assembly " is intended to include any hardware of the function for realizing belonging to assembly, software, logic or aforementioned every combination.

Although describing theme with the language specific to architectural feature, but it is to be understood that in claims, the theme of definition is not necessarily limited to described specific features.It practice, specific features discloses as the illustrative form implementing claim.

Clause:

1. the non-transitory computer-readable medium of one or more storage computer executable instructions, described instruction is when executed so that one or more processors perform to include following action:

Receive the audio frequency comprising user speech；

Detect the triggering statement in described user speech；

State in response to described triggering theed detect in described user speech:

The audio streaming received is transferred to remote speech command service；And

Analyzing the audio frequency the received described local command statement triggered after statement to detect in described user speech, the statement of wherein said local command is associated with functions of the equipments；

Described functions of the equipments are started in response to the described described local command statement triggered after stating detected in described user speech；

Receiving the response from described remote speech command service, wherein said response indicates the order that will perform in response to the voice by described remote speech command service identification in described streaming audio；

If the described described local command statement triggered after statement being not detected by described user speech, then perform by the described order of described response instruction；And

If be detected that the described described local command statement triggered after statement in described user speech, then abandon performing by the described order of described response instruction.

2. one or more computer-readable mediums as described in clause 1, wherein said stream transmission is associated with communication identifier and wherein said response indicates described communication identifier.

3. one or more computer-readable mediums as described in clause 1, wherein said functions of the equipments include media control function.

4. one or more computer-readable mediums as described in clause 1, described action also includes in response to detecting that described order statement stops the described stream transmission of received audio frequency.

5. a method, comprising:

Receive the audio frequency comprising user speech；

Detect the triggering statement in described user speech；

Send the voice in the audio frequency that the audio frequency that receives receives with identification to voice command service and realize the first function of the voice corresponding to institute's identification；And

Analyzing the audio frequency that receives to state with the described local command triggered after statement in detection institute receptions audio frequency, wherein said local command is stated and is associated with the second function；

The described described local command statement triggered after statement in response to detecting in received audio frequency:

Start described second function；And

Cancel the realization of described first function.

6. the method as described in clause 5, the realization wherein cancelling described first function includes the speech commands service of request institute to cancel the realization of described first function.

7. the method as described in clause 5, its communication also including receiving the realization undetermined indicating described first function from institute's speech commands service；

The realization wherein cancelling described first function includes the speech commands service of request institute to cancel the realization undetermined of described first function.

8. the method as described in clause 5, it also includes receiving the order corresponding to described first function from institute's speech commands service, and the realization wherein cancelling described first function includes abandoning the described order that execution receives from institute's speech commands service.

9. the method as described in clause 5, it also includes notice institute described second function of speech commands service and has been turned on.

10. the method as described in clause 5, the realization wherein cancelling described first function includes notifying that institute's speech commands services described second function and has been turned on.

11. the method as described in clause 5, wherein said second function includes media control function.

12. the method as described in clause 5, it also includes:

In response to the described communication session triggering statement and foundation and institute's speech commands service detected in described audio frequency；And

The realization wherein cancelling described first function includes terminating described communication session.

13. the method as described in clause 5, it also includes:

The audio frequency making identifier and receive is associated；

Receiving the response from institute's speech commands service, wherein said response indicates described identifier and the order corresponding to described first function；And

The realization wherein cancelling described first function includes abandoning performing described order.

14. a system, comprising:

One or more speech recognition assemblies, the one or more speech recognition assembly is configured to the user speech in the received audio frequency of identification, detect and trigger statement and the local command statement detecting in described user speech in described user speech；

Controlling logic, described control logic is configured to perform action in response to the described triggering statement in the one or more speech recognition component detection to described user speech, and described action includes:

Described audio frequency is sent with the voice in audio frequency described in identification and the first function realizing the voice corresponding to institute's identification to voice command service；And

State in response to the described local command in the one or more speech recognition component detection to described user speech: (a) identifies that the second function stated corresponding to described local command and (b) cancel the realization of at least one in described first function and described second function.

15. the system as described in clause 14, wherein said one or more speech recognition assemblies include one or more keyword direction finder.

16. the system as described in clause 14, wherein cancel the realization of at least one in described first function and described second function and include the speech commands service of request institute to cancel the realization of described first function.

17. the system as described in clause 14, wherein cancel the realization of at least one in described first function and described second function and include ignoring the order received from institute's speech commands service.

18. the system as described in clause 14, wherein said second function includes media control function.

19. the system as described in clause 14, described action also includes the described transmission stopping described audio frequency in response to the described local command statement detecting in described user speech.

20. the system as described in clause 14, wherein cancel the realization of at least one in described first function and described second function and include notifying that institute's speech commands services described second function and has been turned on.

Claims

1. storing an equipment for computer executable instructions, described instruction is when executed so that one or more processors of described equipment perform to include following action:

Receive the audio frequency comprising user speech；

Detect the triggering statement in described user speech；

2. equipment as claimed in claim 1, wherein said stream transmission is associated with communication identifier and wherein said response indicates described communication identifier.

3. equipment as claimed in claim 1, wherein said functions of the equipments include media control function.

4. equipment as claimed in claim 1, described action also includes in response to detecting that described order statement stops the described stream transmission of received audio frequency.

5. a method, comprising:

Receive the audio frequency comprising user speech；

Detect the triggering statement in described user speech；

Start described second function；And

Cancel the realization of described first function.

6. method as claimed in claim 5, the realization wherein cancelling described first function includes the speech commands service of request institute to cancel the realization of described first function.

7. method as claimed in claim 5, its communication also including receiving the realization undetermined indicating described first function from institute's speech commands service；

The realization wherein cancelling described first function includes the speech commands service of request institute to cancel the realization described undetermined of described first function.

8. method as claimed in claim 5, it also includes receiving the order corresponding to described first function from institute's speech commands service, and the realization wherein cancelling described first function includes abandoning the described order that execution receives from institute's speech commands service.

9. method as claimed in claim 5, it also includes notice institute described second function of speech commands service and has been turned on.

10. method as claimed in claim 5, it also includes:

The audio frequency making identifier and receive is associated；

11. a system, comprising:

12. system as claimed in claim 11, wherein cancel the realization of at least one described in described first function and described second function and include the speech commands service of request institute to cancel the realization of described first function.

13. system as claimed in claim 11, wherein cancel the realization of at least one described in described first function and described second function and include ignoring the order received from institute's speech commands service.

14. system as claimed in claim 11, described action also includes the described transmission stopping described audio frequency in response to the described local command statement detecting in described user speech.

15. system as claimed in claim 11, wherein cancel the realization of at least one described in described first function and described second function and include notifying that institute's speech commands services described second function and has been turned on.