CN109791764A - Communication based on speech - Google Patents
- Publication number
- CN109791764A (application CN201780060299.1A)
- Authority
- CN
- China
- Prior art keywords
- equipment
- speech
- audio data
- message
- text
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/04—Real-time or near real-time messaging, e.g. instant messaging [IM]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/07—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
- H04L51/18—Commands or executable codes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/42229—Personal communication services, i.e. services related to one subscriber independent of his terminal and/or location
- H04M3/42263—Personal communication services, i.e. services related to one subscriber independent of his terminal and/or location where the same subscriber uses different terminals, i.e. nomadism
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M7/00—Arrangements for interconnection between switching centres
- H04M7/0024—Services and arrangements where telephone services are combined with data services
- H04M7/0042—Services and arrangements where telephone services are combined with data services where the data service is a text-based messaging service
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/14—Delay circuits; Timers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2203/00—Aspects of automatic or semi-automatic exchanges
- H04M2203/65—Aspects of automatic or semi-automatic exchanges related to applications where calls are combined with other types of communication
- H04M2203/652—Call initiation triggered by text message
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- General Engineering & Computer Science (AREA)
- Telephonic Communication Services (AREA)
Abstract
Systems, methods, and devices for escalating voice-based interactions via speech-controlled devices are described. A speech-controlled device captures audio including a wakeword portion and a payload portion for sending to a server that relays messages between speech-controlled devices. In response to determining the occurrence of an escalation event, such as repeated messages between the same two devices, the system may automatically alter the mode of the speech-controlled devices, for example so that a wakeword is no longer needed, the desired recipient no longer needs to be stated, or the two speech-controlled devices are connected automatically in a voice-chat mode. In response to determining the occurrence of a further escalation event, the system may initiate a real-time call between the speech-controlled devices.
Description
Cross-Reference to Related Application Data
This application claims priority to U.S. Patent Application No. 15/254,359, entitled "Voice-Based Communications," filed on September 1, 2016, in the name of Christo Frank Devaraj et al.

This application also claims priority to U.S. Patent Application No. 15/254,458, entitled "Indicator for Voice-Based Communications," filed on September 1, 2016, in the name of Christo Frank Devaraj et al.

This application also claims priority to U.S. Patent Application No. 15/254,600, entitled "Indicator for Voice-Based Communications," filed on September 1, 2016, in the name of Christo Frank Devaraj et al.

The above applications are incorporated herein by reference in their entirety.
Background
Speech recognition systems have progressed to the point where humans can interact with computing devices using speech. Such systems employ techniques to identify the words spoken by a human user based on the various qualities of a received audio input. Speech recognition combined with natural language understanding processing techniques enables speech-based user control of a computing device to perform tasks based on the user's spoken commands. The combination of speech recognition and natural language understanding processing is referred to herein as "speech processing." Speech processing may also involve converting a user's speech into text data, which may then be provided to various text-based software applications.

Speech processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions.
Brief Description of the Drawings
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

Figure 1A illustrates a system for altering voice-based interactions via speech-controlled devices.

Figure 1B illustrates a system for outputting a signal to a user via a speech-controlled device during message exchange.

Figure 2 is a conceptual diagram of a speech processing system.

Figure 3 is a conceptual diagram of a multi-domain architecture approach to natural language understanding.

Figure 4 illustrates data stored in association with user profiles.

Figures 5A through 5D are a signal flow diagram illustrating the alteration of a voice-based interaction via speech-controlled devices.

Figures 6A and 6B are a signal flow diagram illustrating the alteration of a voice-based interaction via speech-controlled devices.

Figure 7 is a signal flow diagram illustrating the alteration of a voice-based interaction via speech-controlled devices.

Figures 8A and 8B are a signal flow diagram illustrating signaling output via a user interface of a speech-controlled device.

Figure 9 is a signal flow diagram illustrating signaling output via a user interface of a speech-controlled device.

Figures 10A through 10C illustrate example signals output to a user by a speech-controlled device.

Figures 11A and 11B illustrate example signals output to a user by a speech-controlled device.

Figure 12 illustrates an example signal output to a user by a speech-controlled device.

Figure 13 is a block diagram conceptually illustrating example components of a speech-controlled device according to embodiments of the present disclosure.

Figure 14 is a block diagram conceptually illustrating example components of a server according to embodiments of the present disclosure.

Figure 15 illustrates an example of a computer network for use with the system of the present disclosure.
Detailed Description
Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics concerned with transforming audio data associated with speech into text representative of that speech. Similarly, natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from text input containing natural language. ASR and NLU are often used together as part of a speech processing system.

ASR and NLU can be computationally expensive. That is, significant computing resources may be needed to process ASR and NLU within a reasonable time frame. Because of this, a distributed computing environment may be used when performing speech processing. A typical distributed environment may involve a local or other type of client device having one or more microphones configured to capture sounds from a user speaking and to convert those sounds into an audio signal. The audio signal may then be sent to a remote device for further processing, such as converting the audio signal into an ultimate command. The command may then be executed by a combination of the remote and user devices, depending on the command itself.
In certain configurations, a speech processing system may be configured to communicate spoken messages between devices. That is, a first device may capture an utterance commanding the system to send a message to a recipient associated with a second device. In response, a user of the second device may speak an utterance that is captured by the second device and then sent to the system for processing and delivery back to the user of the first device. In this manner, a speech-controlled system may facilitate spoken message exchange between devices.

One drawback to such message exchange, however, is that for each spoken interaction with the system the user may need to speak both a wakeword (to "wake up" the user's device) and the recipient of the message so that the system knows how to route the message included in the utterance. This conventional arrangement can introduce friction into the interaction between the users and the system, particularly when two users are exchanging multiple messages between them.
The present disclosure provides techniques for altering voice-based interactions via speech-controlled devices. A speech-controlled device captures audio that includes a wakeword portion and a payload portion, for sending to a server that relays messages between speech-controlled devices. In response to determining the occurrence of a communication alteration trigger (such as repeated messages between the same two devices), the system may automatically alter the mode of the speech-controlled devices, for example so that a wakeword is no longer needed, the desired recipient no longer needs to be stated, or the two speech-controlled devices are connected automatically in a voice-chat mode. When the mode of the speech-controlled devices changes, the system may use different protocols to manage how the system exchanges messages and other data between the devices. For example, when the system switches from exchanging voice messages between the devices to initiating a synchronous call between the devices (e.g., a telephone call), the system may stop using a messaging protocol and activate or invoke a real-time protocol (e.g., Voice over Internet Protocol (VoIP)). In response to determining the occurrence of a further communication alteration trigger, the system may initiate a real-time synchronous call between the speech-controlled devices. Various examples of communication alteration triggers, and their processing by the system, are described below. Communication alteration triggers may be determined by the system described herein based on satisfaction of configured thresholds. That is, the system may be configured to alter a communication exchange without receiving an explicit instruction from a user to do so.
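By way of illustration only, the following is a minimal sketch (not taken from the disclosure) of how a server might evaluate the kind of message-count and time-window communication alteration triggers described above; the threshold values, mode names, and class interface are illustrative assumptions.

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds; the disclosure leaves exact values to system configuration.
MESSAGE_COUNT_THRESHOLD = 4    # messages exchanged between the same two devices
TIME_WINDOW_SECONDS = 120      # window within which those messages must occur

MODES = ["wakeword_and_recipient", "wakeword_only", "voice_chat", "real_time_call"]


class EscalationTracker:
    """Tracks message exchange between device pairs and decides when to alter the mode."""

    def __init__(self):
        self.history = defaultdict(deque)              # (device_a, device_b) -> timestamps
        self.mode = defaultdict(lambda: MODES[0])      # current mode per device pair

    def record_message(self, sender: str, recipient: str) -> str:
        pair = tuple(sorted((sender, recipient)))
        now = time.time()
        window = self.history[pair]
        window.append(now)
        # Drop messages that fall outside the time window.
        while window and now - window[0] > TIME_WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MESSAGE_COUNT_THRESHOLD:
            self._escalate(pair)
            window.clear()
        return self.mode[pair]

    def _escalate(self, pair):
        current = MODES.index(self.mode[pair])
        if current < len(MODES) - 1:
            self.mode[pair] = MODES[current + 1]


tracker = EscalationTracker()
print(tracker.record_message("device_110a", "device_110b"))
```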
The present disclosure also provides techniques for outputting visual (or audible, tactile, etc.) indications regarding voice-based interactions. Such indications may use the user interface of a first device to provide feedback indicating that an input component (e.g., a microphone) of a second device is in the process of receiving user input (such as a reply to a message sent from the first user's device). After the server sends message content to a recipient's speech-controlled device, the server may receive, from the recipient's speech-controlled device, an indication that the device is detecting speech. In response, the server may then cause the first speech-controlled device to output a visual indication representing that the recipient's speech-controlled device is detecting speech. It may be appreciated that, in this way, the visual indications may keep users of the speech-controlled devices from "talking over" each other (i.e., may prevent the users of the speech-controlled devices from speaking messages at the same time).
Figure 1A shows a system 100 configured to alter voice-based interactions between speech-controlled devices. Although Figure 1A and the following figures/discussion illustrate the operation of the system 100 in a particular order, the steps described may be performed in a different order (and certain steps may be removed or added) without departing from the intent of the disclosure. As shown in Figure 1A, the system 100 may include one or more speech-controlled devices 110a and 110b local to a first user 5 and a second user 7, respectively. The system 100 also includes one or more networks 199 and one or more servers 120 connected to the devices 110a and 110b across the network 199. The server 120 (which may be one or more different physical devices) may be capable of performing traditional speech processing (ASR, NLU, query parsing, etc.) as described herein. A single server may be capable of performing all of the speech processing, or multiple servers may combine to perform the speech processing. Further, the server 120 may be configured to execute certain commands, such as answering queries spoken by the first user 5 and/or the second user 7. In addition, certain speech detection or command execution functions may be performed by the devices 110a and 110b.

As shown in Figure 1A, a user 5 may speak an utterance (represented by input audio 11). The input audio 11 may be captured by one or more microphones 103a of the device 110a and/or by a microphone array (not illustrated) separate from the device 110a. The microphone array may be connected to the device 110a such that, when the microphone array receives the input audio 11, the microphone array sends audio data corresponding to the input audio 11 to the device 110a. Alternatively, the microphone array may be connected to a companion application of a mobile computing device (not illustrated), such as a smart phone or tablet. In this example, when the microphone array captures the input audio 11, the microphone array sends audio data corresponding to the input audio 11 to the companion application, which forwards the audio data to the device 110a. If the device 110a captures the input audio 11, the device 110a may convert the input audio 11 into audio data and send the audio data to the server 120. Alternatively, if the device 110a receives audio data corresponding to the input audio 11 from the microphone array or the companion application, the device 110a may simply forward the received audio data to the server 120.
The server 120 initially exchanges (150) messages between the speech-controlled devices in response to receiving audio data that includes a wakeword portion and a payload portion. The payload portion may include recipient information and message content. Such message exchange may occur using a message domain and associated protocols as detailed herein. The server 120 exchanges messages in this manner until the server 120 determines (152) the occurrence of a first communication alteration trigger. Illustrative communication alteration triggers include whether a threshold number of message exchanges between the first speech-controlled device 110a and the second speech-controlled device 110b is met or exceeded, a threshold number of message exchanges occurring within a threshold amount of time, or the users of the two speech-controlled devices 110a/110b being simultaneously within threshold distances of their respective devices. After determining the occurrence of the first communication alteration trigger, the server 120 exchanges (154) messages between the same speech-controlled devices in response to receiving audio data that includes payload data (e.g., message content data). The exchange of messages may occur using a messaging domain and associated protocols as detailed herein. The server 120 exchanges messages using the messaging domain until the server 120 determines (156) the occurrence of a second communication alteration trigger. After determining the occurrence of the second communication alteration trigger, the server 120 then initiates (158) a real-time call between the speech-controlled devices. Initiating the real-time call may involve using a real-time call domain as detailed herein and associated real-time protocols. A real-time communication session/call may involve exchanging audio data between the devices (within operating parameters) as the audio data is received.

Alternatively, after determining (152) the first communication alteration trigger, the server 120 may directly initiate (158) the real-time call. This may occur under different configured conditions, such as where the communication alteration trigger is premised on a certain recipient. For example, a user profile associated with the originating speech-controlled device 110a may indicate that communications with "mom" are to occur via real-time calls. Thus, if an original message is directed to "mom," the server 120 may facilitate a real-time call in response to determining that the recipient of the first message is "mom."
According to various embodiments, the server 120 may cause one or both of the speech-controlled devices to output a visual indication using the user interface of the respective device, where the visual indication represents which domain is being used to exchange communications/messages. For example, a light of the speech-controlled device may emit blue when a wakeword is needed, may emit green when a wakeword is no longer needed, and may emit yellow when a real-time call is being facilitated.

In addition to altering speech-based calls into speech-based exchanges as described above, the teachings above may also be used in the context of video communications. For example, if two individuals are exchanging video messages, the techniques described herein may be used to alter the exchange of video messages into a video call. In another example, if it is determined, while exchanging speech-based messages, that an individual is within a camera's field of view, the system may be configured to alter the communication into a video call based on the individual being within the camera's field of view. Thus, the teachings below regarding detecting speech, capturing audio, and the like may also apply to detecting video, capturing video, and the like.
Each speech-controlled device may have more than one user. The system 100 may use speech-based speaker IDs or user IDs to identify the speaker of captured audio. Each speaker ID or user ID may be a voice signature that enables the system to determine which user of a device is speaking. This is beneficial because, when a communication alteration trigger involves a single user of a device, it allows the system to alter communications as described herein. The speaker ID or user ID may be used to determine who is speaking and to automatically identify the speaker's user profile for downstream processing. For example, if a first user of a device speaks a message and a second user of the device thereafter speaks a message, the system may distinguish the two users based on their voice signatures, thereby preventing the system from determining a single communication alteration trigger based on messages spoken by different users.
Figure 1B shows a system for outputting a signal to a user via a device user interface during message exchange, to indicate that the recipient's device has detected responsive speech. As shown in Figure 1B, the system receives (160) input audio from a first speech-controlled device 110a. The system then determines (162) that the input audio corresponds to message content intended for a second speech-controlled device 110b. The system then sends (164) the message content to the second speech-controlled device 110b. The system then detects (166) speech using the second speech-controlled device 110b, and causes (168) the first speech-controlled device 110a to output an indicator, where the indicator represents that the second device is detecting speech, where the speech may be responsive to the message content, thereby notifying the user of the first speech-controlled device 110a that a reply may be forthcoming. The indicator may be visual, audible, or tactile. In an example, the indicator may be visual for devices that support video.
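The following is a minimal sketch, under assumed message shapes and method names, of how a server-side relay could propagate a "recipient is speaking" indication back to the sender's device, as in the flow of Figure 1B; the transport object and indicator value are illustrative assumptions.

```python
# Hypothetical indicator type; the disclosure mentions visual, audible, and tactile indicators.
INDICATOR_SPEECH_DETECTED = "recipient_speech_detected"


class MessageRelay:
    """Relays message content and speech-detection indications between two devices."""

    def __init__(self, transport):
        self.transport = transport   # assumed object exposing send(device_id, payload)
        self.last_sender = {}        # recipient device -> device that last messaged it

    def deliver_message(self, sender_id: str, recipient_id: str, message_audio: bytes):
        # Steps (160)-(164): receive message content and send it to the recipient device.
        self.last_sender[recipient_id] = sender_id
        self.transport.send(recipient_id, {"type": "message", "audio": message_audio})

    def on_speech_detected(self, device_id: str):
        # Step (166): the recipient device reports that it is detecting speech.
        original_sender = self.last_sender.get(device_id)
        if original_sender is not None:
            # Step (168): cause the original sender's device to output an indicator.
            self.transport.send(original_sender,
                                {"type": "indicator", "value": INDICATOR_SPEECH_DETECTED})
```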
Further details of altering voice-based interactions are discussed below, following a discussion of the overall speech processing system of Figure 2. Figure 2 is a conceptual diagram of how a spoken utterance is traditionally processed, allowing a system to capture and execute commands spoken by a user, such as spoken commands that may follow a wakeword. The various components illustrated may be located on the same or on different physical devices. Communication between the various components illustrated in Figure 2 may occur directly or across a network 199. An audio capture component, such as the microphone 103 of the device 110, captures audio 11 corresponding to a spoken utterance. The device 110, using a wakeword detection module 220, then processes the audio, or audio data corresponding to the audio, to determine if a keyword (such as a wakeword) is detected in the audio. Following detection of the wakeword, the device sends audio data 111 corresponding to the utterance to a server 120 that includes an ASR module 250. The audio data 111 may be output from an acoustic front end (AFE) 256 located on the device 110 prior to transmission, or the audio data 111 may be in a different form for processing by a remote AFE 256, such as the AFE 256 located with the ASR module 250.

The wakeword detection module 220 works in conjunction with other components of the device 110, for example a microphone (not illustrated), to detect keywords in the audio 11. For example, the device 110 may convert the audio 11 into audio data and process the audio data with the wakeword detection module 220 to determine whether speech is detected, and if so, whether the audio data comprising the speech matches an audio signature and/or model corresponding to a particular keyword.
The device 110 may use various techniques to determine whether the audio data includes speech. Some embodiments may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in an audio input based on various quantitative aspects of the audio input, such as the spectral slope between one or more frames of the audio input, the energy levels of the audio input in one or more spectral bands, the signal-to-noise ratios of the audio input in one or more spectral bands, or other quantitative aspects. In other embodiments, the device 110 may implement a limited classifier configured to distinguish speech from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other embodiments, Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) techniques may be applied to compare the audio input to one or more acoustic models stored on the device, which acoustic models may include models corresponding to speech, noise (such as environmental or background noise), or silence. Still other techniques may be used to determine whether speech is present in the audio input.
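A minimal sketch, assuming a simple energy-plus-signal-to-noise heuristic over 10 ms frames, of the kind of frame-level VAD decision described above; the thresholds and frame sizes are illustrative and not taken from the disclosure.

```python
import numpy as np

FRAME_LEN = 160            # 10 ms at 16 kHz (illustrative)
ENERGY_THRESHOLD = 1e-3    # illustrative
SNR_THRESHOLD_DB = 6.0     # illustrative


def frame_has_speech(frame: np.ndarray, noise_floor: float) -> bool:
    """Very rough VAD decision for a single frame of float samples in [-1, 1]."""
    energy = float(np.mean(frame ** 2))
    snr_db = 10.0 * np.log10(energy / max(noise_floor, 1e-10))
    return energy > ENERGY_THRESHOLD and snr_db > SNR_THRESHOLD_DB


def detect_speech(samples: np.ndarray) -> bool:
    """Returns True if any frame of the input audio looks like speech."""
    if len(samples) < FRAME_LEN:
        return False
    noise_floor = float(np.mean(samples[:FRAME_LEN] ** 2))   # assume leading frame is noise
    frames = [samples[i:i + FRAME_LEN] for i in range(0, len(samples) - FRAME_LEN, FRAME_LEN)]
    return any(frame_has_speech(f, noise_floor) for f in frames)
```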
Once speech is detected in the audio received by the device 110 (or separately from speech detection), the device 110 may use the wakeword detection module 220 to perform wakeword detection to determine when a user intends to speak a command to the device 110. This process may also be referred to as keyword detection, with the wakeword being a specific example of a keyword. Specifically, keyword detection is typically performed without performing linguistic analysis, textual analysis, or semantic analysis. Instead, incoming audio (or audio data) is analyzed to determine if specific characteristics of the audio match preconfigured acoustic waveforms, audio signatures, or other data to determine if the incoming audio "matches" stored audio data corresponding to a keyword.

Thus, the wakeword detection module 220 may compare audio data to stored models or data to detect a wakeword. One approach for wakeword detection applies general large-vocabulary continuous speech recognition (LVCSR) systems to decode the audio signals, with wakeword searching conducted in the resulting lattices or confusion networks. LVCSR decoding may require relatively high computational resources. Another approach for wakeword spotting builds hidden Markov models (HMMs) for each key wakeword word and for non-wakeword speech signals respectively. The non-wakeword speech includes other spoken words, background noise, and the like. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search for the best path in the decoding graph, and the decoding output is further processed to make the decision on keyword presence. This approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework. In another embodiment, the wakeword spotting system may be built on deep neural network (DNN)/recurrent neural network (RNN) structures directly, without an HMM involved. Such a system may estimate the posteriors of wakewords with context information, either by stacking frames within a context window for a DNN or by using an RNN. Posterior threshold tuning or smoothing is then applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
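A minimal sketch of the posterior-smoothing-and-threshold decision step mentioned above, assuming an external acoustic model that already emits a per-frame wakeword posterior; the smoothing window and threshold values are illustrative.

```python
from collections import deque

SMOOTHING_WINDOW = 30       # frames over which posteriors are averaged (illustrative)
WAKEWORD_THRESHOLD = 0.8    # smoothed posterior required to trigger (illustrative)


class WakewordDecider:
    """Smooths per-frame wakeword posteriors and applies a trigger threshold."""

    def __init__(self):
        self.posteriors = deque(maxlen=SMOOTHING_WINDOW)

    def step(self, frame_posterior: float) -> bool:
        """frame_posterior: P(wakeword | acoustic features of this frame), from a DNN/RNN."""
        self.posteriors.append(frame_posterior)
        smoothed = sum(self.posteriors) / len(self.posteriors)
        return smoothed >= WAKEWORD_THRESHOLD


decider = WakewordDecider()
for posterior in [0.1, 0.2, 0.9, 0.95, 0.97, 0.99]:
    if decider.step(posterior):
        print("wakeword detected; begin sending audio data 111 to server 120")
        break
```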
Once the wakeword is detected, the local device 110 may "wake" and begin transmitting audio data 111 corresponding to the input audio 11 to the server 120 for speech processing. The audio data corresponding to that audio may be sent to the server 120 for routing to a recipient device, or may be sent to the server for speech processing for interpretation of the included speech (either for purposes of enabling voice communications and/or for purposes of executing a command in the speech). The audio data 111 may include data corresponding to the wakeword, or the portion of the audio data corresponding to the wakeword may be removed by the local device 110 prior to sending. Further, as described herein, the local device 110 may "wake" upon detection of speech/spoken audio above a threshold. Upon receipt by the server 120, an ASR module 250 may convert the audio data 111 into text. The ASR transcribes the audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. A spoken utterance in the audio data is input to a processor configured to perform ASR, which then interprets the utterance based on the similarity between the utterance and pre-established language models 254 stored in an ASR model storage (252c). For example, the ASR process may compare the input audio data with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.

The different ways a spoken utterance may be interpreted (i.e., the different hypotheses) may each be assigned a probability or confidence score representing the likelihood that a particular set of words matches those spoken in the utterance. The confidence score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model 253 stored in the ASR model storage 252), and the likelihood that a particular word matching the sounds would be included at the specific location in the sentence (e.g., using a language or grammar model). Thus each potential textual interpretation of the spoken utterance (hypothesis) is associated with a confidence score. Based on the considered factors and the assigned confidence scores, the ASR process 250 outputs the most likely text recognized in the audio data. The ASR process may also output multiple hypotheses in the form of a lattice or an N-best list, with each hypothesis corresponding to a confidence score or other score (such as a probability score, etc.).
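For illustration only, an N-best list like the one described above might be represented as a simple list of hypothesis/score pairs; the hypotheses and confidence values below are made up.

```python
# Hypothetical ASR N-best output for one utterance; scores are illustrative confidences.
n_best = [
    {"hypothesis": "call mom", "confidence": 0.91},
    {"hypothesis": "call tom", "confidence": 0.72},
    {"hypothesis": "all mom",  "confidence": 0.15},
]

best_hypothesis = max(n_best, key=lambda h: h["confidence"])
print(best_hypothesis["hypothesis"])   # -> "call mom"
```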
The device or devices performing the ASR processing may include an acoustic front end (AFE) 256 and a speech recognition engine 258. The acoustic front end (AFE) 256 transforms the audio data from the microphone into data for processing by the speech recognition engine. The speech recognition engine 258 compares the speech recognition data with acoustic models 253, language models 254, and other data models and information for recognizing the speech conveyed in the audio data. The AFE may reduce noise in the audio data and divide the digitized audio data into frames representing time intervals, for which the AFE determines a number of values (called features) representing the qualities of the audio data, along with a set of those values (called a feature vector) representing the features/qualities of the audio data within the frame. Many different features may be determined, as known in the art, and each feature represents some quality of the audio that may be useful for ASR processing. A number of approaches may be used by the AFE to process the audio data, such as mel-frequency cepstral coefficients (MFCCs), perceptual linear predictive (PLP) techniques, neural network feature vector techniques, linear discriminant analysis, semi-tied covariance matrices, or other approaches known to those of skill in the art.
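A minimal sketch of the framing-plus-MFCC feature extraction step described above, using librosa as an assumed helper library; the frame sizes and feature count are illustrative choices rather than values from the disclosure.

```python
import numpy as np
import librosa   # assumed available; any MFCC implementation would serve

SAMPLE_RATE = 16000
FRAME_LENGTH = 400   # 25 ms frames (illustrative)
HOP_LENGTH = 160     # 10 ms hop (illustrative)
NUM_FEATURES = 13    # MFCCs per frame (illustrative)


def extract_feature_vectors(samples: np.ndarray) -> np.ndarray:
    """Returns one feature vector (here, 13 MFCCs) per audio frame."""
    mfccs = librosa.feature.mfcc(
        y=samples, sr=SAMPLE_RATE, n_mfcc=NUM_FEATURES,
        n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH,
    )
    return mfccs.T   # shape: (num_frames, NUM_FEATURES)
```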
The speech recognition engine 258 may process the output from the AFE 256 with reference to information stored in a speech/model storage (252). Alternatively, post-front-end processed data (such as feature vectors) may be received by the device executing ASR processing from another source besides the internal AFE. For example, the device 110 may process audio data into feature vectors (for example using an on-device AFE 256) and transmit that information to the server across the network 199 for ASR processing. Feature vectors may arrive at the server encoded, in which case they may be decoded prior to processing by the processor executing the speech recognition engine 258.

The speech recognition engine 258 attempts to match received feature vectors to language phonemes and words as known in the stored acoustic models 253 and language models 254. The speech recognition engine 258 computes recognition scores for the feature vectors based on acoustic information and language information. The acoustic information is used to calculate an acoustic score representing a likelihood that the intended sound represented by a group of feature vectors matches a language phoneme. The language information is used to adjust the acoustic score by considering what sounds and/or words are used in context with each other, thereby improving the likelihood that the ASR process will output speech results that make sense grammatically. The specific models used may be general models, or may be models corresponding to a particular domain, such as music, banking, etc.

The speech recognition engine 258 may use a number of techniques to match feature vectors to phonemes, for example using Hidden Markov Models (HMMs) to determine probabilities that feature vectors may match phonemes. Sounds received may be represented as paths between states of the HMM, and multiple paths may represent multiple possible text matches for the same sound.
Following ASR processing, the ASR results may be sent by the speech recognition engine 258 to other processing components, which may be local to the device performing ASR and/or distributed across the network(s) 199. For example, ASR results in the form of a single textual representation of the speech, an N-best list including multiple hypotheses and respective scores, a lattice, etc. may be sent to a server, such as server 120, for natural language understanding (NLU) processing, such as conversion of the text into commands for execution, either by the device 110, by the server 120, or by another device (such as a server running a specific application like a search engine, etc.).

The device performing NLU processing 260 (e.g., server 120) may include various components, including potentially dedicated processor(s), memory, storage, etc. The device configured for NLU processing may include a named entity recognition (NER) module 262 and an intent classification (IC) module 264, a result ranking and distribution module 266, and NLU storage 273. The NLU process may also utilize gazetteer information (284a-284n) stored in an entity library storage 282. The gazetteer information may be used for entity resolution, for example matching ASR results with different entities (such as song titles, contact names, etc.). Gazetteers may be linked to users (for example, a particular gazetteer may be associated with a specific user's music collection), may be linked to certain domains (such as shopping), or may be organized in a variety of other ways.

The NLU process takes textual input (such as that processed from ASR 250 based on the utterance 11) and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 260 interprets a text string to derive an intent or a desired action from the user, as well as the pertinent pieces of information in the text that allow a device (e.g., device 110) to complete that action. For example, if a spoken utterance is processed using ASR 250 and outputs the text "call mom," the NLU process may determine that the user intended to activate a telephone in his/her device and to initiate a call with a contact matching the entity "mom."

The NLU may process several textual inputs related to the same utterance. For example, if the ASR 250 outputs N text segments (as part of an N-best list), the NLU may process all N outputs to obtain NLU results.
The NLU process may be configured to parse, tag, and annotate text as part of NLU processing. For example, for the text "call mom," "call" may be tagged as a command (to execute a phone call), and "mom" may be tagged as a specific entity and target of the command (and the telephone number for the entity corresponding to "mom" stored in a contact list may be included in the annotated result).
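For illustration, the parsed/tagged/annotated result for "call mom" described above might be represented as a simple structure like the following; the field names and values are illustrative, not taken from the disclosure.

```python
# Hypothetical annotated NLU result for the utterance "call mom".
annotated_result = {
    "domain": "communication",
    "intent": "initiate_call",
    "command": "call",
    "entities": [
        {
            "slot": "recipient",
            "value": "mom",
            # Resolved from the user's contact list during annotation.
            "resolved_contact": {"name": "Mom", "phone_number": "+1-555-0100"},
        },
    ],
}
```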
To correctly perform NLU processing of speech input, the NLU process 260 may be configured to determine a "domain" of the utterance so as to determine and narrow down which services offered by an endpoint device (e.g., server 120 or device 110) may be relevant. For example, an endpoint device may offer services relating to interactions with a telephone service, a contact list service, a calendar/scheduling service, a music player service, etc. Words in a single text query may implicate more than one service, and some services may be functionally linked (e.g., both a telephone service and a calendar service may utilize data from a contact list).

The named entity recognition module 262 receives a query in the form of ASR results and attempts to identify relevant grammars and lexical information that may be used to construe meaning. To do so, the named entity recognition module 262 may begin by identifying potential domains that may relate to the received query. The NLU storage 273 includes a database of device domains (274a-274n) identifying domains associated with specific devices. For example, the device 110 may be associated with domains for music, telephony, calendaring, contact lists, and device-specific communications, but not video. In addition, the entity library may include database entries about specific services on a specific device, indexed by device ID, user ID, household ID, or some other indicator.

A domain may represent a discrete set of activities having a common theme, such as "shopping," "music," "calendaring," etc. As such, each domain may be associated with a particular language model and/or grammar database (276a-276n), a particular set of intents/actions (278a-278n), and a particular personalized lexicon (286). Each gazetteer (284a-284n) may include domain-indexed lexical information associated with a particular user and/or device. For example, Gazetteer A (284a) includes domain-indexed lexical information 286aa to 286an. A user's music-domain lexical information might include album titles, artist names, and song names, for example, whereas a user's contact-list lexical information might include the names of contacts. Since every user's music collection and contact list is presumably different, this personalized information improves entity resolution.

A query is processed applying the rules, models, and information applicable to each identified domain. For example, if a query potentially implicates both communications and music, the query will be NLU-processed using the grammar models and lexical information for communications, and will also be processed using the grammar models and lexical information for music. The responses to the query produced by each set of models are scored (as discussed further below), with the overall highest-ranked result from all applied domains ordinarily selected as the correct result.

An intent classification (IC) module 264 parses the query to determine an intent or intents for each identified domain, where the intent corresponds to the action to be performed that is responsive to the query. Each domain is associated with a database (278a-278n) of words linked to intents. For example, a music intent database may link words and phrases such as "quiet," "volume off," and "mute" to a "mute" intent. The IC module 264 identifies potential intents for each identified domain by comparing words in the query to the words and phrases in the intents database 278.
In order to generate a particular interpreted response, the NER 262 applies the grammar models and lexical information associated with the respective domain. Each grammar model 276 includes the names of entities (i.e., nouns) commonly found in speech about the particular domain (i.e., generic terms), whereas the lexical information 286 from the gazetteer 284 is personalized to the user and/or the device. For instance, a grammar model associated with the shopping domain may include a database of words commonly used when people discuss shopping.

The intents identified by the IC module 264 are linked to domain-specific grammar frameworks (included in 276) with "slots" or "fields" to be filled. For example, if "play music" is an identified intent, one or more grammar (276) frameworks may correspond to sentence structures such as "Play {Artist Name}," "Play {Album Name}," "Play {Song Name}," "Play {Song Name} by {Artist Name}," etc. However, to make recognition more flexible, these frameworks would ordinarily not be structured as sentences, but rather based on associating slots with grammatical tags.
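A minimal sketch of how such intent-linked frameworks might be represented, with slot names drawn from the examples above; the data-structure shape itself is an illustrative assumption.

```python
# Hypothetical grammar frameworks linked to the "play music" intent.
play_music_frameworks = [
    {"intent": "play_music", "slots": ["artist_name"]},
    {"intent": "play_music", "slots": ["album_name"]},
    {"intent": "play_music", "slots": ["song_name"]},
    {"intent": "play_music", "slots": ["song_name", "artist_name"]},
]


def frameworks_for(intent: str):
    """Returns the candidate slot sets the slot-filling stage will try to satisfy."""
    return [f["slots"] for f in play_music_frameworks if f["intent"] == intent]


print(frameworks_for("play_music"))
```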
For example, the NER module 260 may parse the query to identify words as subject, object, verb, preposition, etc., based on grammar rules and models, prior to recognizing named entities. The identified verb may be used by the IC module 264 to identify the intent, which the NER module 262 then uses to identify frameworks. A framework for an intent of "play" may specify a list of slots/fields applicable to play the identified "object" and any object modifier (e.g., a prepositional phrase), such as {Artist Name}, {Album Name}, {Song Name}, etc. The NER module 260 then searches the corresponding fields in the domain-specific and personalized lexicon(s), attempting to match words and phrases in the query tagged as a grammatical object or object modifier with those identified in the database(s).

This process includes semantic tagging, which is the labeling of a word or combination of words according to their type/semantic meaning. Parsing may be performed using heuristic grammar rules, or an NER model may be constructed using techniques such as hidden Markov models, maximum entropy models, log linear models, conditional random fields (CRF), and the like.

For instance, a query of "play mother's little helper by the rolling stones" might be parsed and tagged as {Verb}: "Play," {Object}: "mother's little helper," {Object Preposition}: "by," and {Object Modifier}: "the rolling stones." At this point in the process, "Play" is identified as a verb based on a word database associated with the music domain, which the IC module 264 will determine corresponds to the "play music" intent. No determination has yet been made as to the meaning of "mother's little helper" and "the rolling stones," but based on grammar rules and models, it is determined that these phrases relate to the grammatical object of the query.

The frameworks linked to the intent are then used to determine what database fields should be searched to determine the meaning of these phrases, such as searching a user's gazetteer for similarity with the framework slots. So a framework for the "play music" intent might indicate an attempt to resolve the identified object based on {Artist Name}, {Album Name}, and {Song Name}, and another framework for the same intent might indicate an attempt to resolve the object modifier based on {Artist Name}, and to resolve the object based on {Album Name} and {Song Name} linked to the identified {Artist Name}. If the search of the gazetteer does not resolve a slot/field using gazetteer information, the NER module 262 may search a database of generic words associated with the domain (in the NLU storage 273). So for instance, if the query was "play songs by the rolling stones," after failing to determine an album name or song name called "songs" by "the rolling stones," the NER 262 may search the domain vocabulary for the word "songs." In the alternative, generic words may be checked before the gazetteer information, or both may be tried, potentially producing two different results.

The comparison process used by the NER module 262 may classify (i.e., score) how closely a database entry compares to a tagged query word or phrase, how closely the grammatical structure of the query corresponds to the applied grammatical framework, and whether the database indicates a relationship between an entry and information identified to fill other slots of the framework.
The NER module 262 may also use contextual operational rules to fill slots. For example, if a user had previously requested to pause a particular song and thereafter requested that the voice-controlled device "please un-pause my music," the NER module 262 may apply an inference-based rule to fill the slot associated with the name of the song that the user currently wishes to play, namely the song that was playing at the time the user requested to pause the music.

The results of NLU processing may be tagged to attribute meaning to the query. So, for instance, "play mother's little helper by the rolling stones" might produce a result of: {domain} Music, {intent} Play Music, {artist name} "rolling stones," {media type} SONG, and {song title} "mother's little helper." As another example, "play songs by the rolling stones" might produce: {domain} Music, {intent} Play Music, {artist name} "rolling stones," and {media type} SONG.

The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on the same or a separate server 120 as part of the system 100. The destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on the device 110 or in a music playing appliance, configured to execute a music playing command. If the NLU output includes a search request, the destination command processor 290 may include a search engine processor, such as one located on a search server, configured to execute a search command.
The NLU operation of system as described herein can use the form of multiple domain framework, such as that is more shown in Fig. 3
Domain framework.In multiple domain framework, each domain (its may include define one group of intentions of more major concept such as music, books and
Entity time slot) individually constructed, and the operation that NLU is operated is being executed to text (such as the text exported from ASR component 250)
When operation during for NLU component 260 use.Each domain can have the component of special configuration to execute each of NLU operation
A step.For example, message field 302 (domain A) can have NER component 262-A, identify which time slot (that is, the portion of input text
Point) it can correspond to special entity relevant to the domain.NER component 262-A can be used machine learning model, such as domain is specific
Condition random field (CRF) corresponds to the entity type of textual portions to identify corresponding to the part of entity and identification.For example, right
In text " telling john smith, I says hello to him ", by that can be recognized for the NER 262-A of the training of message field 302
The part [john smith] of text corresponds to entity.Message field 302 can also have portion intent classifier (IC) of their own
Part 264-A determines the intention of text, it is assumed that text is in the domain by defined.It is specific that such as domain can be used in IC component
The model of maximum entropy classifiers etc identifies the intention of text.Message field 302 can also have the time slot filling part of their own
Part 310-A, can using rule or other instruction with by from previous stage label or token be standardized as intention/time slot
It indicates.Accurate conversion is likely to be dependent on domain and (for example, for domain of travelling, refers to that the text reference on " Boston airport " can turn
It is changed to the standard BOS three-letter codes for indicating airport).Message field 302 can also have the entity resolution component 312- of their own
A, can be used to specifically identify and identify in being passed to text with reference to authoritative source (such as domain specific knowledge library), the authority source
The accurate physical quoted in entity reference.Specific intended/time slot combination can also be tied to particular source, and the spy then can be used
Source is determined to parse text (such as order by providing information or executing in response to user query).From entity resolution component
The output of 312-A may include order, information or other NLU result datas, it is indicated that the specific NLU in domain processing how to handle text with
And system should how response text, according to the special domain.
As illustrated in FIG. 3, multiple domains may operate substantially in parallel, with different domain-specific components. In addition, each domain may implement certain protocols when exchanging messages or other communications. For example, Domain B 304, for real-time calls, may have its own NER component 262-B, IC component 264-B, slot filler component 310-B, and entity resolution component 312-B. The system may include additional domains not described herein. The same text that is input into the NLU pipeline for Domain A 302 may also be input into the NLU pipeline for Domain B 304, where the components for Domain B 304 operate on the text as if the text related to Domain B, and so on for the different NLU pipelines for the different domains. Each domain-specific NLU pipeline creates its own domain-specific NLU results, for example NLU results A (for Domain A), NLU results B (for Domain B), NLU results C (for Domain C), and so on.
Such a multi-domain architecture results in narrowly defined intents and slots that are particular for each specific domain. This is due, in part, to the different models and components (such as the domain-specific NER component, IC component, etc., and related models) being trained to operate only for the designated domain. Further, the separation of domains results in similar actions being represented separately across the domains even if the action is in common. For example, "next song," "next book," and "next" may all be indicators of the same action, but will be defined differently in different domains due to the domain-specific processing restrictions.
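The parallel, per-domain processing described above can be pictured as the same text being fanned out to independent domain pipelines whose domain-specific results are then compared. The sketch below is illustrative only; the function and class names (message_domain, NluResult, etc.) are hypothetical stand-ins for the NER 262, IC 264, slot filler 310, and entity resolution 312 components, not the actual implementation.

```python
# Minimal sketch of a multi-domain NLU pass; every name below is illustrative.
from dataclasses import dataclass, field

@dataclass
class NluResult:
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)
    score: float = 0.0

def message_domain(text: str) -> NluResult:
    # Stand-in for NER 262-A / IC 264-A / slot filler 310-A / resolver 312-A.
    if text.startswith("tell "):
        recipient, _, content = text[len("tell "):].partition(" I said ")
        return NluResult("message", "SendMessage",
                         {"recipient": recipient, "content": content}, 0.9)
    return NluResult("message", "Unknown", {}, 0.1)

def call_domain(text: str) -> NluResult:
    # Stand-in for the real-time call domain (Domain B 304).
    if "call" in text:
        return NluResult("call", "StartCall", {"recipient": text.split()[-1]}, 0.9)
    return NluResult("call", "Unknown", {}, 0.1)

def run_all_domains(text: str) -> NluResult:
    # The same text runs through every domain pipeline "in parallel";
    # the highest-scoring domain-specific result is kept.
    results = [pipeline(text) for pipeline in (message_domain, call_domain)]
    return max(results, key=lambda r: r.score)

print(run_all_domains("tell John Smith I said hello"))
```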
The server 120 may also include data regarding user accounts, shown by the user profile storage 402 illustrated in FIG. 4. The user profile storage may be located proximate to the server 120, or may otherwise be in communication with various components, for example over the network 199. The user profile storage 402 may include a variety of information related to individual users, accounts, etc. that interact with the system 100. For illustration, as shown in FIG. 4, the user profile storage 402 may include data regarding the devices associated with particular individual user accounts 404. In an example, the user profile storage 402 is a cloud-based storage. Such data may include device identifier (ID) and internet protocol (IP) address information for different devices, as well as names of the devices and locations of the devices and users. The user profile storage may additionally include communication alteration triggers specific to each device, indication preferences of each device, and the like. In an example, the type of indication to be output by each device may not be stored in the user profile. Rather, the type of indication may be context dependent. For example, if the system is exchanging video messages, the indication may be visual. For further example, if the system is exchanging audio messages, the indication may be audible.
Each user profile may store one or more communication alteration paths. Moreover, each communication alteration path may include either a single communication alteration trigger or multiple communication alteration triggers that indicate when an alteration of communications should occur. It should be appreciated that N communication alteration paths having M communication alteration triggers may be stored within a single user profile. Each communication alteration path may be unique to a different individual with whom the user communicates. For example, one communication alteration path may be used when the user communicates with the user's mother, another communication alteration path may be used when the user communicates with the user's spouse, etc. Each communication alteration path may also be unique to a type of communication (e.g., audio message passing, video message passing, etc.). Each communication alteration path may also be unique to the types of devices involved in the communication. For example, the user may have a first communication alteration path for devices configured in the user's vehicle, a second communication alteration path for devices configured in the user's bedroom, etc.
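One way to picture the N-paths/M-triggers arrangement is as a small configuration structure keyed by contact, communication type, and device type. This is only a sketch under assumed names (AlterationPath, triggers, target); the actual layout of the user profile storage 402 is not specified at this level of detail.

```python
# Illustrative sketch of communication alteration paths in a user profile;
# field names are assumptions, not the actual schema of storage 402.
from dataclasses import dataclass, field

@dataclass
class AlterationPath:
    contact: str                  # who the user is communicating with
    comm_type: str                # e.g. "audio_message", "video_message"
    device_type: str              # e.g. "car", "bedroom"
    triggers: list = field(default_factory=list)   # one or more triggers
    target: str = "real_time_call"                  # what to alter into

user_profile = {
    "paths": [
        AlterationPath("mother", "audio_message", "bedroom",
                       triggers=["message_count>=5"], target="real_time_call"),
        AlterationPath("spouse", "audio_message", "car",
                       triggers=["message_count>=3", "both_users_near_device"],
                       target="no_wakeword_messaging"),
    ]
}

def select_path(contact, comm_type, device_type):
    # A single profile may hold N paths with M triggers; pick the matching one.
    for path in user_profile["paths"]:
        if (path.contact, path.comm_type, path.device_type) == (contact, comm_type, device_type):
            return path
    return None

print(select_path("mother", "audio_message", "bedroom"))
```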
Some or all of the communication alteration paths of a user profile may be dynamic. That is, a communication alteration path may depend upon external signals. An illustrative external signal includes proximity to a device. For example, one communication alteration path may be used when communicating with the user's mother while the user's mother is not proximate to her device, and a second communication alteration path may be used when communicating with the user's mother while the user's mother is proximate to her device. For example, the speech-controlled device 110 may capture one or more images and send corresponding image data to the server 120. The server 120 may determine that the image data includes a representation of a human. The server 120 may also determine a proximity of the human to the device 110 based on a location of the representation of the human within the image data. Dynamic selection of communication alteration paths may also be influenced by machine learning. For example, a communication alteration path may be configured to alter communications into a real-time call when the user communicates with the user's mother after a certain time at night. The system may then determine that the user alters communications a certain percentage of the time within a threshold amount of time. Based on this determination, the system may suggest that the user modify/update the communication alteration path so that message passing is altered into a real-time call sooner.
Each communication escalation path may include one or more communication alterations. One type of communication alteration involves eliminating the need for a wakeword portion, so that spoken audio need only include a command (e.g., an utterance causing the system to send a message) and message content. A second type of communication alteration involves eliminating the need for both the wakeword portion and the command, so that spoken audio need only include message content. A third type of communication alteration involves replacing the default wakeword, so that the name of the message recipient (e.g., Mom, John, etc.) becomes the wakeword. A fourth type of communication alteration is altering a message exchange into a real-time call.
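The four alteration types can be summarized by the progressively smaller set of spoken elements a user must still include, or by an outright switch to a call. The enum below is a hypothetical summary for illustration only.

```python
# Hypothetical summary of the four communication alteration types above.
from enum import Enum

class Alteration(Enum):
    DROP_WAKEWORD = 1               # utterance = command + message content
    DROP_WAKEWORD_AND_COMMAND = 2   # utterance = message content only
    RECIPIENT_NAME_AS_WAKEWORD = 3  # "Mom", "John", etc. replaces the default wakeword
    ESCALATE_TO_REAL_TIME_CALL = 4  # message exchange becomes a real-time call

def required_speech(alteration: Alteration) -> list:
    # What the speaker still has to say after each alteration is applied
    # (the type-3 entry is an assumption about what remains required).
    return {
        Alteration.DROP_WAKEWORD: ["command", "message content"],
        Alteration.DROP_WAKEWORD_AND_COMMAND: ["message content"],
        Alteration.RECIPIENT_NAME_AS_WAKEWORD: ["recipient name", "message content"],
        Alteration.ESCALATE_TO_REAL_TIME_CALL: [],  # free-form speech, no structure
    }[alteration]

for a in Alteration:
    print(a.name, "->", required_speech(a))
```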
FIGS. 5A through 5D illustrate the altering of speech-based interactions through speech-controlled devices. The first speech-controlled device 110a captures spoken audio including a wakeword portion and a payload portion (shown as 502). For example, the speech-controlled device 110a may be in a sleep mode until detection of a spoken wakeword, which triggers the speech-controlled device 110a to wake and capture audio (which may include the spoken wakeword and speech thereafter) for processing and sending to the server 120. The speech-controlled device 110a sends audio data corresponding to the captured spoken audio to the server 120 (shown as 504).
The server 120 performs ASR on the received audio data to determine text (shown as 506). The server 120 may determine the wakeword portion and the payload portion of the text, and performs NLU on the payload portion (shown as 508). Performing the NLU processing may include the server 120 tagging recipient information of the payload portion (shown as 510), tagging message content information of the payload portion (shown as 512), and tagging the entire payload portion with a "send message" intent tag (shown as 514). For example, the payload portion of the received audio data may correspond to the text "tell John Smith I said hello." According to this example, the server 120 may tag "John Smith" as recipient information, may tag "said hello" as message content information, and may tag the utterance with a "send message" intent tag. The message domain 302 may be used to tag the payload portion with the message intent tag and/or may cause the system to perform further message passing commands using, for example, a message passing command processor 290.
Using the tagged recipient information, the server 120 determines a device associated with the recipient (e.g., speech-controlled device 110b) (shown as 516 in FIG. 5B). To determine the recipient device, the server 120 may use a user profile associated with the speech-controlled device 110a and/or the user that spoke the initial audio. For example, the server 120 may access a table of the user profile to match text therein corresponding to the tagged recipient information (i.e., "John Smith"). Once matching text is identified, the server 120 may identify a recipient device associated with the matching text in the table.
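The lookup at step 516 amounts to matching the tagged recipient string against a table in the sender's user profile and returning the associated device. A minimal sketch, with an assumed table layout:

```python
# Minimal sketch of the recipient-device lookup at step 516; the table
# layout and device identifiers below are assumptions for illustration.
profile_contacts = {
    "john smith": {"device_id": "speech-controlled-device-110b", "ip": "203.0.113.7"},
    "mom":        {"device_id": "speech-controlled-device-110c", "ip": "203.0.113.9"},
}

def find_recipient_device(tagged_recipient: str):
    entry = profile_contacts.get(tagged_recipient.lower())
    return entry["device_id"] if entry else None

print(find_recipient_device("John Smith"))   # speech-controlled-device-110b
```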
The server 120 also uses a domain of the server 120, and associated protocol(s), associated with the "send message" intent tag to generate output audio data (shown as 518). The output audio data may include the spoken audio received from the speech-controlled device 110a. Alternatively, the output audio data may include computer-generated text-to-speech (TTS) audio data generated based on text of the message content received from the speech-controlled device 110a. The server 120 sends the output audio data to the recipient device (shown as 520), which outputs audio of the audio data to the recipient (shown as 522). In an example, the recipient's speech-controlled device 110b may not output the audio data until it detects a command to do so from the recipient. Such a command may be an utterance of the recipient corresponding to "What are my messages?", "Do I have any messages?", or the like.
The server 120 performs message passing between the first speech-controlled device 110a and the second speech-controlled device 110b as described in detail above with respect to steps 502-522 of FIGS. 5A and 5B (e.g., through a message domain) (shown as 524), until the server 120 determines the occurrence of a communication alteration trigger (shown as 526). The communication alteration trigger may cause the server 120 to use another domain, and corresponding protocol(s), different from the domain used to perform the earlier communications/processes, to perform subsequent communications/processes. Alternatively, the system may adjust the processing of future messages (such as no longer requiring certain spoken data, such as the wakeword or an indication of the recipient). The determined communication alteration trigger may take on any number of forms. The communication alteration trigger may be based on whether a threshold number of message exchanges between the first speech-controlled device 110a and the second speech-controlled device 110b is met or exceeded. For example, the threshold number of message exchanges may be configured by the user of either speech-controlled device 110a/110b, and may be indicated in the respective user profile. It should be appreciated that the threshold number of message exchanges associated with the user profile of the first speech-controlled device 110a may be different from the threshold number of message exchanges associated with the user profile of the second speech-controlled device 110b. In this situation, the threshold used by the server 120 to determine when the communication alteration should occur may be the threshold that is met or exceeded first (i.e., the threshold having the lesser number of required message exchanges). The communication alteration trigger may also or alternatively be based on a threshold number of message exchanges occurring within a threshold amount of time. For example, the threshold number of message exchanges and/or the threshold amount of time may be configured by the user of either speech-controlled device 110a/110b, and may be indicated in the respective user profile. It should be appreciated that the threshold number of message exchanges and the threshold amount of time associated with the user profile of the first speech-controlled device 110a may be different from those associated with the user profile of the second speech-controlled device 110b. In this situation, the threshold used by the server 120 to determine when the communication alteration should occur may be the threshold that is met or exceeded first. The communication alteration trigger may also or alternatively be based on the users of both speech-controlled devices 110a/110b simultaneously being within threshold distances of their respective devices. It should be appreciated that the alteration of communications may occur based on the satisfaction of a single communication alteration trigger. It should also be appreciated that the alteration of communications may occur based on the satisfaction of more than one communication alteration trigger.
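A first alteration trigger may therefore combine an exchange-count threshold, a time-window threshold, and a proximity condition, using whichever configured threshold is met first. The check below is a simplified sketch of that logic; the field names and example values are assumptions for illustration.

```python
# Simplified sketch of evaluating a communication alteration trigger;
# thresholds and field names are assumptions for illustration only.
import time

def trigger_fired(state: dict, profile_a: dict, profile_b: dict) -> bool:
    # Use whichever party's configured threshold is met first (i.e. the smaller one).
    count_threshold = min(profile_a["exchange_threshold"], profile_b["exchange_threshold"])
    if state["exchange_count"] >= count_threshold:
        return True
    # Threshold number of exchanges within a threshold amount of time.
    window = min(profile_a["window_seconds"], profile_b["window_seconds"])
    recent = [t for t in state["exchange_times"] if time.time() - t <= window]
    if len(recent) >= count_threshold:
        return True
    # Both users simultaneously within a threshold distance of their devices.
    return state["user_a_near_device"] and state["user_b_near_device"]

state = {"exchange_count": 5, "exchange_times": [],
         "user_a_near_device": False, "user_b_near_device": True}
profile_a = {"exchange_threshold": 5, "window_seconds": 300}
profile_b = {"exchange_threshold": 7, "window_seconds": 600}
print(trigger_fired(state, profile_a, profile_b))   # True: count threshold of 5 met
```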
Once the one or more communication alteration triggers are determined, depending upon implementation, the server 120 reconfigures so as to no longer require utterances from the first/second speech-controlled devices to include a wakeword portion or recipient information in the received audio data (shown as 528). This may be accomplished, for example, using the message domain 302 and associated protocol(s). Further, the reconfiguration occurring at step 528 may indicate that the speech-controlled device 110b is to output received communications without first detecting speech corresponding to a command to do so. The server 120 may also send a signal to one or both of the speech-controlled devices 110a/110b indicating that communications between the first speech-controlled device 110a and the second speech-controlled device 110b are being altered (shown as 530). A speech-controlled device may output an indication representing that the device is "listening" in an attempt to capture message content. A speech-controlled device may additionally output an indication representing that the recipient's device is capturing spoken message content. The speech-controlled device 110a and/or the speech-controlled device 110b may then output a signal indicating that wakeword audio is no longer needed (shown as 532 in FIG. 5C). The signal output by one or both of the speech-controlled devices 110a/110b may be a static indication or a motion indication, as described below.
Thereafter, the speech-controlled device 110a captures spoken audio from the user including only payload information (shown as 534), and sends audio data corresponding to the payload information to the server 120 (shown as 536). The server 120 performs ASR on the received audio data to determine text (shown as 538), and performs NLU processing on the payload information text (shown as 540). Performing the NLU processing may include the server 120 tagging recipient information of the payload information text, tagging message content information of the payload information text, and tagging the entire payload information text with an instant message intent tag. For example, the payload information of the received audio data may state "when will you finish the project?". According to this example, the server 120 may tag "when will you finish the project" as message content information, and may tag the utterance with a "send instant message" intent tag. Tagging the payload information text with the message intent tag may cause the server 120 to perform downstream processes using the message domain 302. By not requiring recipient information to be present in the input audio, the server 120 may assume that the recipient device is the same recipient device used in the earlier communications, without the server 120 again determining the recipient device.
The server 120 also uses a domain of the server 120, and associated protocol(s), associated with the "send instant message" intent tag to generate output audio data (shown as 542). For example, the message domain 302 may be associated with the instant message intent tag. The output audio data may include the spoken audio received from the speech-controlled device 110a. Alternatively, the output audio data may include computer-generated text-to-speech (TTS) audio data generated based on the spoken audio received from the speech-controlled device 110a. The server 120 sends the output audio data to the recipient device (i.e., the speech-controlled device 110b) (shown as 544), which outputs audio of the audio data to the recipient (shown as 546 in FIG. 5D). As described above, the reconfiguration occurring at step 528 may indicate that the speech-controlled device 110b is to output received communications without first receiving a command from the user to do so. As such, it should be appreciated that the speech-controlled device 110b may output the audio data to the recipient at step 546 without first receiving a command to do so. That is, the speech-controlled device 110b may automatically play the audio data.
The server 120 performs instant message passing between the first speech-controlled device 110a and the second speech-controlled device 110b as described in detail above with respect to steps 534-546 of FIGS. 5C and 5D (e.g., through an instant message domain and without the need for wakeword audio data) (shown as 548), until the server 120 determines the occurrence of another communication alteration trigger (shown as 550). The determined second communication alteration trigger may take on any number of forms. Like the first communication alteration trigger, the second communication alteration trigger may be based on whether a threshold number of message exchanges between the first speech-controlled device 110a and the second speech-controlled device 110b is met or exceeded, based on a threshold number of message exchanges occurring within a threshold amount of time, and/or based on the users of both speech-controlled devices 110a/110b simultaneously being within threshold distances of their respective devices. The thresholds used to determine the first communication alteration trigger and the second communication alteration trigger may be the same (e.g., each requiring 5 message exchanges) or different (e.g., the first communication alteration occurring after 5 message exchanges using the message domain 302 and the second communication alteration occurring after 7 message exchanges using the message domain 302). A single counter, not reset after the first communication alteration, may be used to determine the message exchanges for each communication alteration trigger. According to the preceding example, the first communication alteration may occur after the counter reaches 5 message exchanges (i.e., 5 message exchanges using the message domain 302), and the second communication alteration may occur after the counter reaches 12 message exchanges (i.e., 7 further message exchanges using the message domain 302). Alternatively, different counters, or a single counter that is reset after the first communication alteration, may be used to determine the message exchanges for each communication alteration. According to the preceding example, the first communication alteration may occur after the counter reaches 5 message exchanges (i.e., 5 message exchanges using the message domain 302), the counter may then be reset to zero, and the second communication alteration may occur after the counter reaches 7 message exchanges (i.e., 7 message exchanges using the message domain 302). The threshold distances of the users from the speech-controlled devices 110a/110b may be the same or different for the first communication alteration and the second communication alteration. Moreover, like the first communication alteration, the second communication alteration may occur based on the satisfaction of a single communication alteration trigger or more than one communication alteration trigger.
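The two counting schemes described above (one running counter versus a counter reset after the first alteration) differ only in when the count is cleared. A small sketch using the 5-and-7 exchange example from the text; the function names are illustrative.

```python
# Sketch of the two counter schemes for successive alteration triggers,
# using the 5 / 7 exchange example; names are illustrative only.
def alterations_single_counter(total_exchanges: int) -> int:
    # One counter that is never reset: first alteration at 5, second at 12.
    if total_exchanges >= 12:
        return 2
    if total_exchanges >= 5:
        return 1
    return 0

def alteration_due_reset_counter(exchanges_since_last_alteration: int,
                                 already_altered: bool) -> bool:
    # Counter reset to zero after the first alteration: 5 before the first
    # alteration, then 7 more before the second.
    threshold = 7 if already_altered else 5
    return exchanges_since_last_alteration >= threshold

print(alterations_single_counter(12))          # 2
print(alteration_due_reset_counter(7, True))   # True
```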
Once the second communication alteration trigger is determined, depending upon implementation, the server 120 reconfigures so as to use a domain, and associated protocol(s), for establishing a real-time call between the speech-controlled device 110a and the speech-controlled device 110b (shown as 552). For example, such a domain may be the real-time call domain 304. A real-time call, as used herein, refers to a call between the speech-controlled devices 110a/110b facilitated by the server 120, where a direct communication channel may be opened between the speech-controlled devices. For example, during a real-time call the system may send audio data from the first speech-controlled device 110a to the second speech-controlled device 110b without performing speech processing (such as ASR or NLU) on the audio data, so that the user of the first speech-controlled device 110a may speak "directly" with the user of the second speech-controlled device 110b. Alternatively, the system may perform speech processing (such as ASR or NLU) to check for commands intended for the system while still passing the audio data back and forth between the devices 110a/110b. For example, the real-time call may be terminated as discussed below with respect to FIG. 7.
The server 120 may send a signal to one or both of the speech-controlled devices 110a/110b indicating that the real-time call has been established (shown as 554). The speech-controlled device 110a and/or the speech-controlled device 110b then outputs a signal indicating that the user may simply speak as if he/she were performing a point-to-point call (shown as 556). A real-time or point-to-point call/communication, as used herein, refers to a call between the speech-controlled devices 110a/110b facilitated by the server 120. That is, a real-time or point-to-point call is a communication in which audio is simply captured by a device, sent to the server as audio data, and the server simply sends the received audio data to the recipient device, which outputs the audio without first receiving a command to do so. The signal output by one or both of the speech-controlled devices 110a/110b may be a static indication or a motion indication, as described below. The system then performs a real-time communication session (shown as 558). The real-time communication session may be performed by the system until a downgrade trigger is determined (as detailed herein).
Various types of protocols may be used by the system when performing communications between speech-controlled devices, to control data size, transmission speed, and the like. For example, a first protocol may be used to control the exchange of communications that need to include a wakeword portion and recipient content. A second protocol may be used to control the exchange of communications that do not need a wakeword portion but still need recipient content. A third protocol may be used to control the exchange of communications that do not include an NLU intent. That is, the third protocol may be used when neither a wakeword portion nor recipient content is needed, because the system assumes the recipient based on previously occurring message exchanges. A real-time protocol, such as VoIP, may be used when performing real-time calls between the speech-controlled devices.
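The protocol choice can thus be summarized as a function of which spoken elements are still required and whether the exchange has become a real-time call. The mapping below is a hypothetical restatement of this paragraph; only VoIP is named in the text, and the other protocol labels are placeholders.

```python
# Hypothetical restatement of the protocol selection described above;
# only VoIP is named in the text, the other labels are placeholders.
def select_protocol(needs_wakeword: bool, needs_recipient: bool,
                    real_time_call: bool) -> str:
    if real_time_call:
        return "VoIP"           # real-time protocol for simultaneous calls
    if needs_wakeword and needs_recipient:
        return "protocol-1"     # wakeword + recipient content required
    if needs_recipient:
        return "protocol-2"     # no wakeword, recipient content still required
    return "protocol-3"         # neither; recipient assumed from prior exchanges

print(select_protocol(True, True, False))    # protocol-1
print(select_protocol(False, False, False))  # protocol-3
print(select_protocol(False, False, True))   # VoIP
```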
FIGS. 6A and 6B illustrate the altering of speech-based interactions through speech-controlled devices based on the intended recipient of a message. The first speech-controlled device 110a captures spoken audio including a wakeword portion and a payload portion (shown as 502). For example, the speech-controlled device 110a may be in a sleep mode until detection of a spoken wakeword, which triggers the speech-controlled device 110a to wake and capture audio including the spoken wakeword and speech thereafter. The speech-controlled device 110a sends audio data corresponding to the captured spoken audio to the server 120 (shown as 504).
The server 120 performs ASR on the received audio data to determine text (shown as 506). The server 120 determines the wakeword portion and the payload portion of the text, and performs NLU on the payload portion (shown as 508). Performing the NLU processing may include the server 120 tagging recipient information of the payload portion (shown as 510), tagging message content information of the payload portion (shown as 512), and tagging the entire payload portion with a "send message" intent tag (shown as 514). For example, the payload portion of the received audio data may state "tell Mom I said I'll be there soon." According to this example, the server 120 may tag "Mom" as recipient information, may tag "I'll be there soon" as message content information, and may associate the utterance with a "send message" intent tag. As described above, communication alteration paths and communication alteration triggers may be configured via the user profile. According to this embodiment, the server 120 may determine a communication alteration based on the intended recipient of the message. For example, using the tagged recipient information, the server 120 may access the user profile of the speech-controlled device 110a and determine a communication alteration path indicating that communications with "Mom" are to be performed via real-time call (shown as 602 in FIG. 6B). Thereafter, the server 120 reconfigures so as to use a domain, and associated protocol(s), for establishing a real-time call between the speech-controlled device 110a and the speech-controlled device 110b (shown as 552). For example, such a domain may be the real-time call domain 304. The server 120 may send a signal to one or both of the speech-controlled devices 110a/110b indicating that the real-time call has been established (shown as 554). The speech-controlled device 110a and/or the speech-controlled device 110b then outputs a signal indicating that the user may simply speak as if he/she were performing a point-to-point call (shown as 556). The signal output by one or both of the speech-controlled devices 110a/110b may be a static indication or a motion indication, as described below. The system then performs a real-time communication session (shown as 558). The real-time communication session may be performed by the system until another communication alteration trigger is determined (as detailed herein).
FIG. 7 illustrates the altering of speech-based interactions through speech-controlled devices. The server 120 exchanges communications between the speech-controlled devices 110a/110b through a domain, and associated protocol(s), associated with real-time calls (shown as 702), until the server 120 determines the occurrence of a communication alteration trigger (shown as 704). For example, such a domain may be the real-time call domain 304. The communication alteration trigger may take on various forms. The communication alteration trigger may be based on the user of either speech-controlled device 110a/110b multitasking (i.e., causing the server 120 to perform a task unrelated to the real-time call). The communication alteration trigger may also or alternatively be based on a threshold length of inactivity being met or exceeded (e.g., determining that no exchanges have occurred for n minutes). The communication alteration trigger may also or alternatively be based on a user instruction (e.g., the user of either speech-controlled device 110a/110b stating, for example, "close the call," "stop the call," "end the call"). The communication alteration trigger may also or alternatively be based on instructions originating from the users of both speech-controlled devices 110a/110b (e.g., the users saying "goodbye," "bye," etc. to each other within a threshold number of seconds). Moreover, the communication alteration trigger may also or alternatively be based on the server 120 detecting a wakeword within the exchanges of the real-time call. The alteration of communications may occur based on determining that one or more than one communication alteration trigger is satisfied.
After determining the alteration should occur, the server 120 ceases the real-time call (shown as 706) and sends a signal indicating as much to one or both of the speech-controlled devices 110a/110b (shown as 708). The speech-controlled device 110a and/or the speech-controlled device 110b then outputs a signal indicating that the real-time call has ceased (shown as 710). The signal output by one or both of the speech-controlled devices 110a/110b may be a static indication or a motion indication, as described below. Altering the communication may involve ceasing all communications between the speech-controlled devices 110a/110b at that point in time. Alternatively, altering the communication may involve altering the communication into a second form of communication different from the real-time call. For example, the second form of communication may involve the server 120 performing instant messaging between the first speech-controlled device 110a and the second speech-controlled device 110b, as described in detail above with respect to steps 534-546 of FIGS. 5C and 5D (shown as 548), until the server 120 determines the occurrence of a communication alteration trigger.
FIGS. 8A and 8B illustrate the signaling of communications output via a user interface of a speech-controlled device. The speech-controlled device 110a captures spoken audio (shown as 802), compiles the captured spoken audio into audio data, and sends the audio data to the server 120 (shown as 504).
The server 120 performs ASR on the audio data to determine text (e.g., "tell John Smith I said hello") (shown as 506), and performs NLU on the text (shown as 804). The server 120 locates tagged recipient information (i.e., "John Smith") within the NLU-processed text (shown as 806), and therefrom determines a recipient device (shown as 808). For example, the server 120 may access a user profile associated with the speech-controlled device 110a and/or its user. Using the user profile, the server 120 may locate text within a table corresponding to the recipient information (i.e., "John Smith"), and may identify recipient device information associated with the recipient information in the table. The server 120 also determines tagged message content (e.g., "hello") within the NLU-processed text (shown as 810).
The server 120 sends, to the speech-controlled device 110a from which the original spoken audio data originated, a signal indicating that the message content is being or will be sent to the recipient device (i.e., the speech-controlled device 110b) (shown as 812). In response to receiving the signal, the speech-controlled device 110a outputs a visual indication representing that the message content (i.e., hello) is being or will be sent to the recipient device (shown as 814). For example, the visual indication may include outputting a static indicator (e.g., a certain color, etc.) or a motion indicator (e.g., blinking or strobing elements, continuous motion, etc.). Output of the visual indication may be configured according to user profile preferences. Alternatively, in response to receiving the signal, the speech-controlled device 110a may output a haptic and/or audible indication (shown as 816). The haptic indication may include the speech-controlled device 110a vibrating and/or a remote device (e.g., a smart watch) in communication with the speech-controlled device 110a vibrating. The remote device and the speech-controlled device 110a may communicate by being located within a single table of user devices associated with the user profile. The audible indication may include computer-generated/TTS-generated speech and/or user-generated speech corresponding to, for example, "your message is being sent" or "your message will be sent momentarily." Like the haptic indication, the audible indication may be output by the speech-controlled device 110a, a remote microphone array, and/or a remote device (e.g., a smart watch). The remote device, the microphone array, and the speech-controlled device 110a may communicate by being located within a single table of user devices associated with the user profile.
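The choice among visual, haptic, and audible indications is context dependent and may also honor user profile preferences (for example, visual for video messages and audible for audio messages, as noted above with respect to FIG. 4). A minimal sketch of that dispatch, under assumed preference names:

```python
# Minimal sketch of choosing an indication type; preference names and the
# fallback rules are assumptions for illustration only.
def choose_indication(comm_type: str, profile_preference: str = None) -> str:
    if profile_preference:              # an explicit user preference wins
        return profile_preference
    if comm_type == "video_message":    # context-dependent defaults
        return "visual"
    if comm_type == "audio_message":
        return "audible"
    return "haptic"

print(choose_indication("audio_message"))            # audible
print(choose_indication("video_message", "haptic"))  # haptic (profile preference)
```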
The server 120 also sends audio data including the message content to the determined recipient device (i.e., the speech-controlled device 110b) (shown as 818). It should be appreciated that steps 814-818 (and other steps of the other figures) may occur in various orders, and may also occur concurrently. The speech-controlled device 110b then outputs audio corresponding to the message content (shown as 522). While the speech-controlled device 110b detects speech responsive to the message content (shown as 820), the speech-controlled device 110b sends a signal indicating such to the server 120 (shown as 822). The server 120 then sends a signal to the speech-controlled device 110a indicating that the speech-controlled device 110b is detecting speech (shown as 824). The server 120 may determine that the detected speech is responsive to the output audio based on, for example, a recipient name indicated in the detected speech, or the speech-controlled devices 110a/110b being part of an instant message exchange that does not require wakeword audio data. Further, in an example, the server 120 may cause the speech-controlled device 110b to output audio asking the user whether the user wants to reply to the received message. The server 120 may then receive audio data from the second speech-controlled device 110b, perform ASR on the audio data to determine text data, determine that the text data includes at least one word indicating an intent to respond (e.g., "yes"), and therefrom determine that audio data received thereafter is a response to the original message. In another example, the server 120 may receive audio data from the second speech-controlled device 110b, use speech processing to determine that an audio signature of the received audio data matches a speech-based speaker ID of the recipient of the original message, and therefrom determine that the audio data received thereafter is a response to the original message. In response to receiving the signal, the speech-controlled device 110a outputs a visual indication representing that the speech-controlled device 110b is detecting speech (shown as 826). For example, the visual indication may include outputting a static indicator (e.g., a certain color, etc.) or a motion indicator (e.g., blinking or strobing elements, continuous motion, etc.). Output of the visual indication may be configured according to user profile preferences. In an example, once the visual indication is no longer output, audio spoken by the recipient in response to the original message may be output by the speech-controlled device 110a. Alternatively, in response to receiving the signal, the speech-controlled device 110a may output a haptic and/or audible indication (shown as 828). The haptic indication may include the speech-controlled device 110a vibrating and/or a remote device (e.g., a smart watch) in communication with the speech-controlled device 110a vibrating. The remote device and the speech-controlled device 110a may communicate by being located within a single table of user devices associated with the user profile. The audible indication may include computer-generated/TTS-generated speech and/or user-generated speech corresponding to, for example, "John Smith is responding to your message" or "John Smith is talking." Like the haptic indication, the audible indication may be output by the speech-controlled device 110a, a remote microphone array, and/or a remote device (e.g., a smart watch). The remote device, the microphone array, and the speech-controlled device 110a may communicate by being located within a single table of user devices associated with the user profile.
FIG. 9 illustrates the signaling of communications output via a user interface of a speech-controlled device. The speech-controlled device 110a captures spoken audio including a wakeword portion and recipient information (shown as 902). The speech-controlled device 110a converts the captured recipient information audio into audio data, and sends the audio data to the server 120 (shown as 904). Alternatively, the speech-controlled device 110a may send audio data corresponding to both the wakeword portion and the recipient information to the server 120. In this example, the server 120 may isolate the recipient information audio data from the wakeword portion audio data, and discard the wakeword portion audio data. The server 120 may perform speech processing (e.g., ASR and NLU) on the recipient information audio data (shown as 906). For example, the server 120 may perform ASR on the recipient information audio data to create recipient information text data, and may perform NLU on the recipient information text data to identify the recipient's name. If the speech-controlled device 110a from which the received audio data originated is associated with multiple users, the server 120 may perform various processes to determine which user spoke the wakeword portion and the recipient information audio (shown as 908).
Using the speech-processed recipient information audio data and knowledge of the speaker of the recipient information audio, the server 120 determines a device of the recipient, using a user profile associated with the speaker of the recipient information audio, to which future data should be sent (shown as 910). If the recipient is only associated with one device in the user profile, that device is the device to which data will be sent. If the recipient is associated with multiple devices in the user profile, various information may be used to determine which recipient device to send data to. For example, a physical location of the recipient may be determined, and the data may be sent to the device closest to the recipient. In another example, it may be determined which device the recipient is presently using, and the data may be sent to the device presently being used. In yet another example, it may be determined which device the recipient is presently using, and the data may be sent to a second device closest to the device presently being used. In a further example, the device determined by the server 120 (i.e., the device to which future data will be sent) may be a distribution device (e.g., a router), where the distribution device determines which of the recipient's multiple devices to send the data to.
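When a recipient is associated with several devices, the selection in step 910 can follow any of the strategies listed above (closest device, device in use, device nearest the one in use, or deferral to a distribution device such as a router). A sketch of the first two strategies, with assumed device-record fields:

```python
# Sketch of choosing among a recipient's multiple devices (step 910);
# the device records and distance fields are assumptions for illustration.
def choose_recipient_device(devices: list) -> dict:
    if len(devices) == 1:
        return devices[0]                   # only one device in the profile
    in_use = [d for d in devices if d.get("in_use")]
    if in_use:
        return in_use[0]                    # device the recipient is presently using
    # Otherwise fall back to the device closest to the recipient.
    return min(devices, key=lambda d: d["distance_to_recipient_m"])

devices = [
    {"id": "kitchen-110b", "in_use": False, "distance_to_recipient_m": 8.0},
    {"id": "bedroom-110c", "in_use": False, "distance_to_recipient_m": 2.5},
]
print(choose_recipient_device(devices)["id"])   # bedroom-110c
```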
The server 120 sends a signal indicating that a message is forthcoming to the determined device of the recipient (shown as 912). The signal may be sent to the recipient device when the server 120 sends message content text data to a TTS component. For illustration purposes, the determined device of the recipient may be the speech-controlled device 110b. The speech-controlled device 110b then outputs an indication representing that a message is forthcoming (shown as 914). The indication output by the speech-controlled device may be a visual indication, an audible indication, and/or a haptic indication as described herein.
The speech-controlled device 110a of the message sender also captures spoken audio including message content (shown as 916). The speech-controlled device 110a converts the message content spoken audio into audio data, and sends the message content audio data to the server 120 (shown as 918). In an example, the speech-controlled device 110b outputs the indication while the speech-controlled device 110a captures the message content audio and while the server 120 receives the message content audio from the speech-controlled device 110a. The server 120 may send the message content audio data to the previously determined recipient device (shown as 920), which outputs audio including the message content (shown as 922). Alternatively, the server 120 may perform processes as described above with respect to step 910 to determine which recipient device to send the message content audio data to. As such, it should be appreciated that, depending upon the situation, the recipient device that outputs the indication of the forthcoming message and the recipient device that outputs the message content may be the same device, or may be different devices.
FIGS. 10A through 10C illustrate examples of visual indications as discussed herein. A visual indication may be output via a light ring 1002 of the speech-controlled device 110. The light ring 1002 may be located anywhere on the speech-controlled device 110 that allows a user of the speech-controlled device 110 to adequately see it. Depending upon the message to be conveyed, different colors may be output via the light ring 1002. For example, the light ring 1002 may emit a green light to indicate that a message is being or will be sent to a recipient device. In another example, the light ring 1002 may emit a blue light to indicate that a recipient device is detecting or capturing spoken audio. It should also be appreciated that the light ring 1002 may emit different shades of a single color to convey different messages. For example, the light ring (illustrated as 1002a in FIG. 10A) may output a dark shade of a color to indicate a first message, the light ring (illustrated as 1002b in FIG. 10B) may output a medium shade of the color to indicate a second message, and the light ring (illustrated as 1002c in FIG. 10C) may output a light shade of the color to indicate a third message. While three shades are illustrated, it should be appreciated by those skilled in the art that more or fewer than three shades of a single color may be implemented, depending upon how many different messages are to be conveyed. Moreover, while the visual indication examples of FIGS. 10A through 10C may be static, they may also appear to move in some fashion. For example, the visual indication may blink, strobe, or continuously move about/along a surface of the device 110.
FIGS. 11A and 11B illustrate motion indications as described herein. As illustrated, the light ring 1002 may be configured so that a portion of the light ring 1002 appears to move around the speech-controlled device 110. Although not illustrated, it should also be appreciated that the light ring 1002 and/or the LEDs 1202/1204 may be configured to blink, strobe, etc.
FIG. 12 illustrates another visual indication as described herein. As illustrated in FIG. 12, a static visual indication may be output via LEDs 1202/1204 or some other similar light-emitting component. The LEDs 1202/1204 may be located anywhere on the speech-controlled device 110 that allows a user of the speech-controlled device 110 to adequately see them. Depending upon the message to be conveyed, different colors may be output via the LEDs 1202/1204. For example, the LEDs 1202/1204 may emit a green light to indicate that a message is being or will be sent to a recipient device. In another example, the LEDs 1202/1204 may emit a blue light to indicate that a recipient device is detecting or capturing spoken audio. It should also be appreciated that the LEDs 1202/1204 may emit different shades of a single color to convey different messages. For example, the LEDs 1202/1204 may output a dark shade of a color to indicate a first message, a medium shade of the color to indicate a second message, and a light shade of the color to indicate a third message. While three shades are described, it should be appreciated by those skilled in the art that more or fewer than three shades of a single color may be implemented, depending upon how many different messages are to be conveyed. It should be appreciated that both the light ring 1002 and the LEDs 1202/1204 may be implemented within the same speech-controlled device 110, and that different variations of the described indications (and other indications) may be used.
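The light ring 1002 and LEDs 1202/1204 can therefore convey different states by color, by shade of a single color, and by motion (blink, strobe, sweep). A small illustrative mapping; the RGB values and animation names are placeholders, not actual device values.

```python
# Illustrative mapping of indicator states to light ring / LED output;
# RGB values and animation names are placeholders only.
INDICATIONS = {
    "message_sending":       {"color": (0, 255, 0),     "animation": "solid"},  # green
    "recipient_capturing":   {"color": (0, 0, 255),     "animation": "solid"},  # blue
    "first_message_state":   {"color": (0, 0, 180),     "animation": "solid"},  # dark shade
    "second_message_state":  {"color": (80, 80, 220),   "animation": "solid"},  # medium shade
    "third_message_state":   {"color": (170, 170, 255), "animation": "solid"},  # light shade
    "listening_no_wakeword": {"color": (0, 0, 255),     "animation": "sweep"},  # motion indication
}

def render(state: str) -> str:
    spec = INDICATIONS[state]
    return f"{state}: rgb{spec['color']} ({spec['animation']})"

for s in ("message_sending", "recipient_capturing", "listening_no_wakeword"):
    print(render(s))
```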
While the examples discussed above describe visual indicators, other indicators, such as audio indicators, haptic indicators, or the like, may also be used to indicate an incoming message, a reply being spoken, etc.
FIG. 13 is a block diagram conceptually illustrating a user device 110 (e.g., the speech-controlled devices 110a and 110b described herein) that may be used with the described system. FIG. 14 is a block diagram conceptually illustrating example components of a remote device, such as the remote server 120, that may assist with ASR processing, NLU processing, or command processing. Multiple such servers 120 may be included in the system, such as one (or more) server(s) 120 for performing ASR, one (or more) server(s) 120 for performing NLU, etc. In operation, each of these devices (or groups of devices) may include computer-readable and computer-executable instructions that reside on the respective device (110/120), as will be discussed further below.
Each of these devices (110/120) may include one or more controllers/processors (1304/1404), each of which may include a central processing unit (CPU) for processing data and computer-readable instructions, and a memory (1306/1406) for storing data and instructions of the respective device. The memories (1306/1406) may individually include volatile random access memory (RAM), non-volatile read only memory (ROM), non-volatile magnetoresistive (MRAM) memory, and/or other types of memory. Each device may also include a data storage component (1308/1408) for storing data and controller/processor-executable instructions. Each data storage component may individually include one or more non-volatile storage types, such as magnetic storage, optical storage, solid-state storage, etc. Each device may also be connected to removable or external non-volatile memory and/or storage (such as a removable memory card, memory key drive, networked storage, etc.) through the respective input/output device interfaces (1302/1402).
Computer instructions for operating each device (110/120) and its various components may be executed by the respective device's controller(s)/processor(s) (1304/1404), using the memory (1306/1406) as temporary "working" storage at runtime. A device's computer instructions may be stored in a non-transitory manner in the non-volatile memory (1306/1406), storage (1308/1408), or an external device. Alternatively, some or all of the executable instructions may be embedded in hardware or firmware on the respective device in addition to or instead of software.
Each device (110/120) includes input/output device interfaces (1302/1402). A variety of components may be connected through the input/output device interfaces (1302/1402), as will be discussed further below. Additionally, each device (110/120) may include an address/data bus (1324/1424) for conveying data among components of the respective device. Each component within a device (110/120) may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus (1324/1424).
Referring to the device 110 of FIG. 13, the device 110 may include a display 1318, which may comprise a touch interface 1319 configured to receive limited touch inputs. Or the device 110 may be "headless" and may rely primarily on spoken commands for input. As a way of indicating to a user that a connection to another device has been opened, the device 110 may be configured with a visual indicator, such as an LED or similar component (not illustrated), that may change color, flash, or otherwise provide a visual indication by the device 110. The device 110 may also include input/output device interfaces 1302 that connect to a variety of components, such as an audio output component like a speaker 101, a wired or wireless headset (not illustrated), or other component capable of outputting audio. The device 110 may also include an audio capture component. The audio capture component may be, for example, a microphone 103 or array of microphones, a wired or wireless headset (not illustrated), etc. The microphone 103 may be configured to capture audio. If an array of microphones is included, the approximate distance to a sound's point of origin may be determined by acoustic localization based on time and amplitude differences between sounds captured by different microphones of the array. The device 110 (using the microphone 103, the wakeword detection module 220, the ASR module 250, etc.) may be configured to determine audio data corresponding to detected audio. The device 110 (using the input/output device interfaces 1302, antenna 1314, etc.) may also be configured to transmit the audio data to the server 120 for further processing, or to process the data using internal components such as the wakeword detection module 220.
For example, via the antenna 1314, the input/output device interfaces 1302 may connect to one or more networks 199 via a wireless local area network (WLAN) (such as WiFi) radio, Bluetooth, and/or a wireless network radio, such as a radio capable of communication with a wireless communication network such as a Long Term Evolution (LTE) network, WiMAX network, 3G network, etc. A wired connection, such as Ethernet, may also be supported. Through the network(s) 199, the speech processing system may be distributed across a networked environment.
The device 110 and/or the server 120 may include an ASR module 250. The ASR module in the device 110 may be of limited or extended capabilities. The ASR module 250 may include the language models 254 stored in the ASR model storage component 252, and an ASR module 250 that performs the automatic speech recognition process. If limited speech recognition is included, the ASR module 250 may be configured to identify a limited number of words, such as keywords detected by the device, whereas extended speech recognition may be configured to recognize a much larger range of words.
The device 110 and/or the server 120 may include a limited or extended NLU module 260. The NLU module in the device 110 may be of limited or extended capabilities. The NLU module 260 may include a named entity recognition module 262, an intent classification module 264, and/or other components. The NLU module 260 may also include a stored knowledge base and/or entity library, or those storages may be located separately.
The device 110 and/or the server 120 may also include a command processor 290 that is configured to execute commands/functions associated with a spoken command as described above.
The device 110 may include a wakeword detection module 220, which may be a separate component or may be included in the ASR module 250. The wakeword detection module 220 receives audio signals and detects occurrences of a particular expression (such as a configured keyword) in the audio. This may include detecting a change in frequencies over a specific period of time where the change in frequencies results in a specific audio signature that the system recognizes as corresponding to the keyword. Keyword detection may include analyzing individual directional audio signals, such as those processed post-beamforming if applicable. Other techniques known in the art of keyword detection (also known as keyword spotting) may also be used. In some embodiments, the device 110 may be configured collectively to identify a set of the directional audio signals in which the wake expression is detected, or in which the wake expression is likely to have occurred.
The wakeword detection module 220 receives captured audio and processes the audio (for example, using model(s) 232) to determine whether the audio corresponds to particular keywords recognizable by the device 110 and/or the system 100. The storage 1308 may store data relating to keywords and functions to enable the wakeword detection module 220 to perform the algorithms and methods described above. The locally stored speech models may be pre-configured based on known information, prior to the device 110 being configured to access the network by the user. For example, the models may be language and/or accent specific to a region where the user device is shipped or predicted to be located, or to the user himself/herself, based on a user profile, etc. In an aspect, the models may be pre-trained using speech or audio data of the user from another device. For example, the user may own another user device that the user operates via spoken commands, and this speech data may be associated with a user profile. The speech data from the other user device may then be leveraged and used to train the locally stored speech models of the device 110 prior to the user device 110 being delivered to the user or configured to access the network by the user. The wakeword detection module 220 may access the storage 1308 and compare the captured audio to the stored models and audio sequences using audio comparison, pattern recognition, keyword spotting, audio signatures, and/or other audio processing techniques.
As noted above, multiple devices may be employed in a single speech processing system. In such a multi-device system, each of the devices may include different components for performing different aspects of the speech processing. The multiple devices may include overlapping components. The components of the device 110 and the server 120, as illustrated in FIGS. 13 and 14, are exemplary, and may be located as a stand-alone device or may be included, in whole or in part, as a component of a larger device or system.
To create output speech, the server 120 may be configured with a text-to-speech ("TTS") module 1410 that transforms text data into audio data representing speech. The audio data may then be sent to the device 110 for playback to the user, thus creating the output speech. The TTS module 1410 may include a TTS storage for converting the input text into speech. The TTS module 1410 may include its own controller(s)/processor(s) and memory, or may use the controller/processor and memory of the server(s) 120 or another device, for example. Similarly, the instructions for operating the TTS module 1410 may be located within the TTS module 1410, within the memory and/or storage of the server(s) 120, or within an external device.
Text input into the TTS module 1410 may be processed to perform text normalization, linguistic analysis, and linguistic prosody generation. During text normalization, the TTS module 1410 processes the text input and generates standard text, converting such things as numbers, abbreviations (such as Apt., St., etc.), and symbols ($, %, etc.) into the equivalent of written out words.
During linguistic analysis, the TTS module 1410 analyzes the language in the normalized text to generate a sequence of phonetic units corresponding to the input text. This process may be referred to as phonetic transcription. Phonetic units include symbolic representations of sound units to be eventually combined and output by the system 100 as speech. Various sound units may be used for dividing text for purposes of speech synthesis. The TTS module 1410 may process speech based on phonemes (individual sounds), half-phonemes, di-phones (the last half of one phoneme coupled with the first half of an adjacent phoneme), bi-phones (two consecutive phonemes), syllables, words, phrases, sentences, or other units. Each word may be mapped to one or more phonetic units. Such mapping may be performed using a language dictionary stored by the system 100, for example in the TTS storage. The linguistic analysis performed by the TTS module 1410 may also identify different grammatical components such as prefixes, suffixes, phrases, punctuation, syntactic boundaries, and the like. Such grammatical components may be used by the TTS module 1410 to craft a natural sounding audio waveform output. The language dictionary may also include letter-to-sound rules and other tools that may be used to pronounce previously unidentified words or letter combinations that may be encountered by the TTS module 1410. Generally, the more information included in the language dictionary, the higher the quality of the speech output.
Based on the linguistic analysis, the TTS module 1410 may then perform linguistic prosody generation, where the phonetic units are annotated with desired prosodic characteristics (also called acoustic features) that indicate how the desired phonetic units are to be pronounced in the eventual output speech. During this stage, the TTS module 1410 may consider and incorporate any prosodic annotations that accompanied the text input. Such acoustic features may include pitch, energy, duration, and the like. Application of acoustic features may be based on prosodic models available to the TTS module 1410. Such prosodic models indicate how specific phonetic units are to be pronounced in certain circumstances. A prosodic model may consider, for example, a phoneme's position in a syllable, a syllable's position in a word, a word's position in a sentence, phrase, or paragraph, neighboring phonetic units, and so on. As with the language dictionary, a prosodic model with more information may produce higher quality speech output than a prosodic model with less information. Further, when a larger portion of a textual work is made available to the TTS module 1410, the TTS module 1410 may assign more robust and complex prosodic characteristics that vary across the portion, making the portion sound more human and resulting in higher quality audio output.
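The following minimal sketch (an editorial illustration, not the disclosed implementation) annotates each phonetic unit with pitch, energy, and duration. The position-based rules stand in for whatever trained prosodic model the system actually uses.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedUnit:
    phoneme: str
    pitch_hz: float
    energy: float
    duration_ms: float

def apply_prosody(phonemes, base_pitch=120.0):
    """Annotate phonetic units with prosodic (acoustic) features.
    The position-based rules here are placeholders for a trained prosodic model."""
    annotated = []
    n = len(phonemes)
    for i, ph in enumerate(phonemes):
        position = i / max(n - 1, 1)                      # 0.0 at start, 1.0 at end
        pitch = base_pitch * (1.1 - 0.2 * position)       # gentle pitch declination
        duration = 90.0 + (40.0 if i == n - 1 else 0.0)   # lengthen the final unit
        annotated.append(AnnotatedUnit(ph, pitch,
                                       energy=1.0 - 0.3 * position,
                                       duration_ms=duration))
    return annotated

for unit in apply_prosody(["HH", "AH", "L", "OW"]):
    print(unit)
```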
The output of this processing, referred to as a symbolic linguistic representation, may include a sequence of phonetic units annotated with prosodic characteristics. This symbolic linguistic representation may then be converted into an audio waveform of speech for output to an audio output device (such as a speaker) and eventual output to a user. The TTS module 1410 may be configured to convert the input text into high-quality, natural-sounding speech in an efficient manner. Such high-quality speech may be configured to sound as much like a human speaker as possible, or may be configured to be understandable to a listener without attempting to mimic a specific human voice.
The TTS module 1410 may perform speech synthesis using one or more different methods. In one method of synthesis, called unit selection and described further below, the TTS module 1410 matches the symbolic linguistic representation against a database of recorded speech, such as a database of a speech corpus. The TTS module 1410 matches the symbolic linguistic representation against spoken audio units in the database. Matching units are selected and concatenated together to form a speech output. Each unit includes an audio waveform corresponding to a phonetic unit, such as a short .wav file of the specific sound, along with a description of the various acoustic features associated with the .wav file (such as its pitch, energy, etc.) and other information, such as where the phonetic unit appears in a word, sentence, or phrase, the neighboring phonetic units, and so on. Using all the information in the unit database, the TTS module 1410 may match units (for example, in the unit database) to the input text to create a natural sounding waveform. The unit database may include multiple examples of phonetic units to provide the system 100 with many different options for concatenating units into speech. One benefit of unit selection is that, depending on the size of the database, a natural sounding speech output may be generated. As described above, the larger the unit database of the speech corpus, the more likely the system will be able to construct natural sounding speech.
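As a hedged illustration of the unit selection idea (the database entries, cost terms, and weights below are invented placeholders), a selector can pick, for each target unit, the candidate with the lowest combined target cost (mismatch with the desired features) and join cost (discontinuity with the previously chosen unit), then concatenate the chosen waveforms:

```python
# Hypothetical unit database: per phoneme, candidate recordings with acoustic features.
UNIT_DB = {
    "AH": [{"wav": "ah_001.wav", "pitch": 118.0}, {"wav": "ah_002.wav", "pitch": 132.0}],
    "HH": [{"wav": "hh_001.wav", "pitch": 125.0}],
}

def select_units(targets):
    """targets: list of (phoneme, desired_pitch) pairs; returns waveforms to concatenate."""
    chosen, prev_pitch = [], None
    for phoneme, desired_pitch in targets:
        candidates = UNIT_DB.get(phoneme, [])
        if not candidates:
            continue
        def cost(unit):
            target_cost = abs(unit["pitch"] - desired_pitch)          # mismatch with target
            join_cost = 0.0 if prev_pitch is None else abs(unit["pitch"] - prev_pitch)
            return target_cost + 0.5 * join_cost                      # illustrative weighting
        best = min(candidates, key=cost)
        chosen.append(best["wav"])
        prev_pitch = best["pitch"]
    return chosen

print(select_units([("HH", 126.0), ("AH", 120.0)]))  # -> ['hh_001.wav', 'ah_001.wav']
```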
In another method of synthesis, called parametric synthesis, parameters such as frequency, volume, and noise are varied by the TTS module 1410 to create an artificial speech waveform output. Parametric synthesis may use an acoustic model and various statistical techniques to match a symbolic linguistic representation with desired output speech parameters. Parametric synthesis may offer the ability to be accurate at high processing speeds, as well as the ability to process speech without the large databases associated with unit selection, but it may also produce an output speech quality that does not match that of unit selection. Unit selection and parametric techniques may be performed individually, combined together, and/or combined with other synthesis techniques to produce speech audio output.
Parametric speech synthesis may be performed as follows. The TTS module 1410 may include an acoustic model, or other models, which may convert a symbolic linguistic representation into a synthetic acoustic waveform of the text input based on audio signal manipulation. The acoustic model includes rules which may be used to assign specific audio waveform parameters to input phonetic units and/or prosodic annotations. The rules may be used to calculate a score representing a likelihood that a particular audio output parameter (such as frequency, volume, etc.) corresponds to the portion of the input symbolic linguistic representation.
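A minimal sketch of such likelihood scoring is shown below; it is an editorial example, not the disclosed model. A simple per-phoneme Gaussian over one output parameter stands in for the acoustic model's rules, and the means and variances are invented.

```python
import math

# Illustrative acoustic model: for each phonetic unit, a Gaussian over an output
# parameter (here, frequency in Hz). Real models (HMM or neural) are far richer,
# but the scoring idea is the same: how likely is this parameter value for this unit?
ACOUSTIC_MODEL = {
    "AH": {"mean_hz": 120.0, "std_hz": 15.0},
    "IY": {"mean_hz": 220.0, "std_hz": 25.0},
}

def parameter_score(phoneme: str, frequency_hz: float) -> float:
    """Return a likelihood score that `frequency_hz` corresponds to `phoneme`."""
    stats = ACOUSTIC_MODEL[phoneme]
    z = (frequency_hz - stats["mean_hz"]) / stats["std_hz"]
    return math.exp(-0.5 * z * z) / (stats["std_hz"] * math.sqrt(2 * math.pi))

print(parameter_score("AH", 125.0))  # relatively high likelihood
print(parameter_score("AH", 220.0))  # relatively low likelihood
```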
As shown in Figure 15, multiple devices (120, 110, 110c-110f) may contain components of the system 100, and the devices may be connected over a network 199. The network 199 may include a local or private network or may include a wide area network such as the Internet. Devices may be connected to the network 199 through either wired or wireless connections. For example, the speech-controlled device 110, a tablet computer 110e, a smart phone 110c, a smart watch 110d, and/or a vehicle 110f may be connected to the network 199 through a wireless service provider, over a WiFi or cellular network connection, or the like. Other devices are included as network-connected support devices, such as a server 120, application developer devices, or others. The support devices may connect to the network 199 through a wired connection or a wireless connection. Networked devices 110 may capture audio using one or more built-in or connected microphones 103 or audio capture devices, with processing performed by ASR, NLU, or other components of the same device or of another device connected via the network 199, such as the ASR 250, NLU 260, etc. of one or more servers 120.
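For orientation only, the sketch below shows one way a networked device might ship captured audio to a remote speech processing server for ASR. The endpoint URL, headers, and response shape are hypothetical stand-ins; the disclosure does not specify this transport.

```python
# Minimal sketch of a device uploading captured audio to a remote ASR/NLU server.
import requests

ASR_ENDPOINT = "https://speech-server.example.com/asr"  # placeholder address (assumption)

def send_captured_audio(audio_bytes: bytes, device_id: str) -> str:
    """Upload raw captured audio; the server performs ASR and returns a transcript."""
    response = requests.post(
        ASR_ENDPOINT,
        data=audio_bytes,
        headers={"Content-Type": "application/octet-stream",
                 "X-Device-Id": device_id},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["transcript"]  # assumed response shape

# Example usage (with stand-in audio bytes):
# text = send_captured_audio(b"\x00\x01...", device_id="speech-device-110")
```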
The concepts disclosed herein may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, speech processing systems, and distributed computing environments.
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers and speech processing should recognize that the components and process steps described herein may be interchanged with other components or steps, or combinations thereof, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
Aspects of the disclosed system may be implemented as a computer-implemented method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform the processes described in the present disclosure. The computer readable storage medium may be implemented by volatile computer memory, non-volatile computer memory, a hard drive, solid-state memory, a flash drive, a removable disk, and/or other media. In addition, components of one or more of the modules and engines may be implemented in firmware or hardware; for example, the acoustic front end 256 comprises, among other things, analog and/or digital filters (e.g., filters configured as firmware for a digital signal processor (DSP)).
The foregoing may also be understood in view of the following clauses.
1. A computer-implemented method comprising:
receiving, from a first speech-controlled device associated with a first user profile, first input audio data including a first wakeword portion and a first command portion;
performing speech processing on the first command portion to determine first text data representing a second name of a second user profile and first message content;
determining, using the first user profile, a second speech-controlled device associated with the second user profile;
sending, at a first time, first output audio data corresponding to the first message content to the second speech-controlled device;
receiving, at a second time after the first time, second input audio data from the second speech-controlled device, the second input audio data including a second wakeword portion and a second command portion;
performing speech processing on the second command portion to determine second text data representing a first name associated with the first user profile and second message content;
sending, at a third time after the second time, second output audio data corresponding to the second message content to the first speech-controlled device;
determining the first time and the second time are within a first threshold period of time;
establishing a messaging connection between the first speech-controlled device and the second speech-controlled device;
sending a signal to the first speech-controlled device to send further audio data for processing without detecting a wakeword portion;
receiving, at a fourth time after the third time, third input audio data from the first speech-controlled device, the third input audio data including third message content but no wakeword portion;
performing speech processing on the third input audio data to determine third text data representing the third message content but not representing the second name of the second user; and
sending, at a fifth time after the fourth time, third output audio data including the third message content to the second speech-controlled device.
2. The computer-implemented method of clause 1, further comprising:
receiving, at a sixth time after the fifth time, fourth input audio data from the second speech-controlled device, the fourth input audio data including fourth message content but not including a wakeword portion or the first name of the first user;
determining the sixth time and the fifth time are within a second threshold period of time; and
in response to the sixth time and the fifth time being within the second threshold period of time, opening a first real-time communication session channel between the first speech-controlled device and the second speech-controlled device, the first real-time communication session channel involving audio data received from the first speech-controlled device and the second speech-controlled device being exchanged without performing speech processing.
3. The computer-implemented method of clause 2, further comprising:
closing the first real-time communication session channel when a communication change trigger occurs, the communication change trigger being at least one of: a third threshold period of time passing without receiving audio data from the first speech-controlled device, detecting a wakeword portion from the first speech-controlled device, receiving a non-communication command from the first speech-controlled device, or receiving further input audio data from the first speech-controlled device, the further input audio data including at least a portion indicating the first real-time communication session channel should be closed.
4. The computer-implemented method of clause 1, further comprising:
receiving image data from the second speech-controlled device;
determining the image data includes a representation of a person;
determining a proximity of the person to the second speech-controlled device based on a position of the representation within the image data; and
establishing a second messaging connection between the first speech-controlled device and the second speech-controlled device, the second messaging connection changing a required wakeword portion of spoken audio from a default wakeword to a name of a recipient of the spoken audio.
5. A system comprising:
at least one processor; and
a memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor to:
receive input audio data from a first device, the input audio data including a wakeword portion and a command portion;
determine text data based on the input audio data;
send a first message to a second device based on the text data;
determine a second message is intended to be sent from the second device to the first device;
determine an amount of time that has elapsed with respect to a first number of messages sent from the first device to the second device and a second number of messages sent from the second device to the first device;
determine the amount of time is below a first threshold period of time; and
send data to the first device, the data causing the first device to send audio data to the at least one processor without the first device detecting a wakeword.
6. The system of clause 5, wherein the instructions further configure the at least one processor to:
determine a second amount of time that has elapsed with respect to a third number of messages sent from the first device to the second device and a fourth number of messages sent from the second device to the first device;
determine the second amount of time is below a second threshold period of time; and
establish a real-time communication session between the first device and the second device, the real-time communication session including exchanging audio data between the first device and the second device without performing speech processing.
7. The system of clause 5, wherein the instructions further configure the at least one processor to:
access a user profile associated with the first device,
wherein determining the elapsed amount of time includes identifying, in the user profile, the first number of messages associated with the second device.
8. The system of clause 5, wherein the instructions further configure the at least one processor to:
receive second input audio data from the first device;
determine the second input audio data includes a user name;
determine, using a user profile associated with the first device, a third device associated with the user name;
determine, using the user profile and based on the second input audio data including the user name, that a real-time communication session is to occur; and
establish the real-time communication session between the first device and the third device.
9. The system of clause 8, wherein the instructions further configure the at least one processor to:
determine at least one of: a second threshold period of time passing without receiving audio data, receiving audio data including a wakeword portion, receiving audio data including a non-communication command, or receiving audio data including at least a portion indicating the real-time communication session should be closed; and
close the real-time communication session.
10. The system of clause 8, wherein the real-time communication session is further caused to occur in response to a first person being within a first proximity of the first device and a second person being within a second proximity of the third device.
11. The system of clause 5, wherein the instructions further configure the at least one processor to:
cause the first device to output an indication when the second device is capturing at least one of audio or text, the indication being at least one of visual, audible, or tactile.
12. The system of clause 5, wherein the instructions further configure the at least one processor to:
cause the first device to output synthesized speech indicating that audio data will be sent to the second device in real time and that a wakeword function is disabled.
13. A computer-implemented method comprising:
receiving input audio data from a first device, the input audio data including a wakeword portion and a command portion;
determining text data based on the input audio data;
sending a first message to a second device based on the text data;
determining a second message is intended to be sent from the second device to the first device;
determining an amount of time that has elapsed with respect to a first number of messages sent from the first device to the second device and a second number of messages sent from the second device to the first device;
determining the amount of time is below a first threshold period of time; and
sending data to the first device, the data causing the first device to send audio data without the first device detecting a wakeword.
14. The computer-implemented method of clause 13, further comprising:
determining a second amount of time that has elapsed with respect to a third number of messages sent from the first device to the second device and a fourth number of messages sent from the second device to the first device;
determining the second amount of time is below a second threshold period of time; and
establishing a real-time communication session between the first device and the second device, the real-time communication session including exchanging audio data between the first device and the second device without performing speech processing.
15. The computer-implemented method of clause 13, further comprising:
accessing a user profile associated with the first device,
wherein determining the elapsed amount of time includes identifying, in the user profile, the first number of messages associated with the second device.
16. The computer-implemented method of clause 13, further comprising:
receiving second input audio data from the first device;
determining the second input audio data includes a user name;
determining, using a user profile associated with the first device, a third device associated with the user name;
determining, using the user profile and based on the second input audio data including the user name, that a real-time communication session is to occur; and
establishing the real-time communication session between the first device and the third device.
17. The computer-implemented method of clause 16, further comprising:
determining at least one of: a second threshold period of time passing without receiving audio data, receiving audio data including a wakeword portion, receiving audio data including a non-communication command, or receiving audio data including at least a portion indicating the real-time communication session should be closed; and
closing the real-time communication session.
18. The computer-implemented method of clause 16, wherein the real-time communication session is further caused to occur in response to a first person being within a first proximity of the first device and a second person being within a second proximity of the third device.
19. The computer-implemented method of clause 13, further comprising:
causing the first device to output an indication when the second device is capturing at least one of audio or text, the indication being at least one of visual, audible, or tactile.
20. The computer-implemented method of clause 13, further comprising:
causing the first device to output synthesized speech indicating that audio data will be sent to the second device in real time and that a wakeword function is disabled.
21. A computer-implemented method comprising:
receiving first input audio data from a first speech-controlled device;
performing speech processing on the first input audio data to determine text data;
determining that a first portion of the text data corresponds to a message recipient name;
determining that a second portion of the text data corresponds to first message content;
sending a first signal to the first speech-controlled device, the first signal causing the first speech-controlled device to output a first visual indication, the first visual indication representing that a message corresponding to the first input audio data is being sent;
determining, using a user profile associated with the first speech-controlled device, a second speech-controlled device associated with the message recipient name;
sending, at a first time, first output audio data corresponding to the first message content to the second speech-controlled device;
receiving, from the second speech-controlled device, a second signal indicating that the second speech-controlled device is detecting speech; and
sending, at a second time after the first time, a third signal to the first speech-controlled device, the third signal causing the first speech-controlled device to output a second visual indication, the second visual indication representing that the second speech-controlled device is detecting speech.
22. The computer-implemented method of clause 21, wherein:
the first visual indication includes a first color and the second visual indication includes the first color with a first motion, the first motion including one of blinking, strobing, or moving along an edge of the first speech-controlled device; and
the first signal further causes the first speech-controlled device to output an audible indication representing that the message corresponding to the first input audio data is being sent.
23. The computer-implemented method of clause 21, further comprising:
causing the second speech-controlled device to output audio asking a user of the second speech-controlled device whether the user wants to reply to the first message content;
receiving second input audio data from the second speech-controlled device;
performing ASR on the second input audio data to determine second text data; and
determining the second text data includes the word "yes".
24. The computer-implemented method of clause 21, wherein determining the second speech-controlled device further comprises:
receiving image data from devices associated with the recipient name in the user profile; and
determining that the image data received from the second speech-controlled device includes a representation of a person.
25. A system comprising:
at least one processor; and
a memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor to:
receive input audio data from a first device;
process the input audio data to determine message content;
send, to a second device at a first time, output audio data corresponding to the message content;
receive, from the second device at a second time after the first time, an indication that the second device has detected speech in reply to the output audio data; and
output, by the first device at a third time after the second time, a visual indicator representing that the second device is receiving a reply to the message content.
26. The system of clause 25, wherein the visual indicator includes at least one of a first color or a first motion.
27. The system of clause 25, wherein the instructions further configure the at least one processor to identify the second device using a user profile associated with the first device.
28. The system of clause 25, wherein the instructions further configure the at least one processor to:
cause the second device to output audio data created by text-to-speech (TTS) processing;
receive second input audio data from the second device;
perform ASR on the second input audio data to determine second text data;
determine the second text data includes the word "yes"; and
determine the speech is in reply to the output audio data based on determining the second text data includes the word "yes".
29. The system of clause 25, wherein the instructions further configure the at least one processor to:
output, by the first device at the third time after the second time, an audible indicator representing that the second device has detected the speech in reply to the output audio data.
30. The system of clause 29, wherein the audible indicator is generated using text-to-speech processing, the text-to-speech processing using speech previously spoken by a user.
31. The system of clause 25, wherein the instructions further configure the at least one processor to:
cause the second device to output audio data created by text-to-speech (TTS) processing;
receive second input audio data from the second device;
determine, using speech-based speaker identification (ID), that the second input audio data corresponds to audio spoken by a recipient of the message content; and
determine the speech is in reply to the output audio data based on the second input audio data corresponding to audio spoken by the recipient of the message content.
32. The system of clause 25, wherein the input audio data includes a wakeword portion and the message content.
33. A computer-implemented method comprising:
receiving input audio data from a first device;
processing the input audio data to determine message content;
sending, to a second device at a first time, output audio data corresponding to the message content;
receiving, from the second device at a second time after the first time, an indication that the second device has detected speech in reply to the output audio data; and
outputting, by the first device at a third time after the second time, a visual indicator representing that the second device is receiving a reply to the message content.
34. The computer-implemented method of clause 33, wherein the visual indicator includes at least one of a first color or a first motion.
35. The computer-implemented method of clause 34, further comprising identifying the second device using a user profile associated with the first device.
36. The computer-implemented method of clause 35, further comprising:
causing the second device to output audio data created by text-to-speech (TTS) processing;
receiving second input audio data from the second device;
performing ASR on the second input audio data to determine second text data; and
determining the second text data includes the word "yes",
wherein the speech is determined to be in reply to the output audio data based on determining the second text data includes the word "yes".
37. The computer-implemented method of clause 33, further comprising:
outputting, by the first device at the third time after the second time, an audible indicator representing that the second device has detected the speech in reply to the output audio data.
38. The computer-implemented method of clause 37, wherein the audible indicator is generated using text-to-speech processing, the text-to-speech processing using speech previously spoken by a user.
39. The computer-implemented method of clause 33, further comprising:
causing the second device to output audio data created by text-to-speech (TTS) processing;
receiving second input audio data from the second device;
determining, using speech-based speaker identification (ID), that the second input audio data corresponds to audio spoken by a recipient of the message content; and
determining the speech is in reply to the output audio data based on the second input audio data corresponding to audio spoken by the recipient of the message content.
40. The computer-implemented method of clause 33, wherein the input audio data includes a wakeword portion and the message content.
41. A computer-implemented method comprising:
during a first period of time:
receiving, from a first speech-controlled device, first input audio data including recipient information;
performing speech processing on the first input audio data to determine first text data, the first text data including a recipient name;
determining, using a user profile associated with the first speech-controlled device, a second speech-controlled device associated with the recipient name; and
causing the second speech-controlled device to output an indication that message content is forthcoming;
during a second period of time:
receiving, from the first speech-controlled device, second input audio data including the message content;
determining, using the user profile associated with the first speech-controlled device, a third speech-controlled device associated with the recipient name; and
causing the third speech-controlled device to output the message content.
42. The computer-implemented method of clause 41, wherein the indication includes at least one of a first color, the first color with a first motion, or first audio, the first motion including one of blinking, strobing, or moving along an edge of the first speech-controlled device, and the first audio being generated using text-to-speech processing.
43. The computer-implemented method of clause 41, further comprising:
performing natural language processing to identify the recipient name; and
sending a signal to the second speech-controlled device, the signal causing the second speech-controlled device to output the indication while second text data corresponding to the second input audio data is sent to a text-to-speech component.
44. The computer-implemented method of clause 41, wherein determining the second speech-controlled device further comprises:
receiving image data from devices associated with the recipient name in the user profile; and
determining that the image data received from the second speech-controlled device includes a representation of a person.
45. A system comprising:
at least one processor; and
a memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor to:
receive, from a first device, first input audio data including recipient information;
determine a second device associated with the recipient information;
cause the second device to output an indication that message content is forthcoming;
receive, from the first device, second input audio data including the message content; and
cause the second device to output the message content.
46. The system of clause 45, wherein determining the second device comprises:
accessing a user profile associated with the first device; and
identifying the recipient information in the user profile.
47. The system of clause 45, wherein determining the second device comprises:
determining a location of the recipient; and
selecting the second device from multiple devices associated with a recipient profile based on the second device being proximate to the recipient.
48. The system of clause 45, wherein determining the second device comprises determining that the second device is currently in use.
49. The system of clause 45, wherein determining the second device comprises:
determining, from multiple devices including the second device, that a third device is currently in use; and
selecting the second device based on a proximity of the second device to the third device.
50. The system of clause 45, wherein the indication includes a color or the color with a motion.
51. The system of clause 50, wherein the indication is output by the second device while the second input audio data is being received from the first device.
52. The system of clause 55, wherein the indication is an audible indication generated using text-to-speech (TTS) processing.
53. A computer-implemented method comprising:
receiving, from a first device, first input audio data including recipient information;
determining a second device associated with the recipient information;
causing the second device to output an indication that message content is forthcoming;
receiving, from the first device, second input audio data including message content; and
causing the second device to output the message content.
54. The computer-implemented method of clause 53, wherein determining the second device comprises:
accessing a user profile associated with the first device; and
identifying the recipient information in the user profile.
55. The computer-implemented method of clause 53, wherein determining the second device comprises:
determining a location of the recipient; and
selecting the second device from multiple devices associated with a recipient profile based on the second device being proximate to the recipient.
56. The computer-implemented method of clause 53, wherein determining the second device comprises determining that the second device is currently in use.
57. The computer-implemented method of clause 53, wherein determining the second device comprises:
determining, from multiple devices including the second device, that a third device is currently in use; and
selecting the second device based on a proximity of the second device to the third device.
58. The computer-implemented method of clause 53, wherein the indication includes a color or the color with a motion.
59. The computer-implemented method of clause 58, wherein the indication is output by the second device while the second input audio data is being received from the first device.
60. The computer-implemented method of clause 53, wherein the indication is an audible indication generated using text-to-speech (TTS) processing.
As used in this disclosure, the term "a" or "one" may include one or more items unless specifically stated otherwise. Further, the phrase "based on" is intended to mean "based at least in part on" unless specifically stated otherwise.
Claims (15)
1. A computer-implemented method comprising:
receiving input audio data from a first device, the input audio data including a wakeword portion and a command portion;
determining text data based on the input audio data;
sending a first message to a second device based on the text data;
determining a second message is intended to be sent from the second device to the first device;
determining an amount of time that has elapsed with respect to a first number of messages sent from the first device to the second device and a second number of messages sent from the second device to the first device;
determining the amount of time is below a first threshold period of time; and
sending data to the first device, the data causing the first device to send audio data without the first device detecting a wakeword.
2. The computer-implemented method of claim 1, comprising:
receiving second input audio data from the first device;
processing the second input audio data to determine message content;
sending, to the second device at a first time, output audio data corresponding to the message content;
receiving, from the second device at a second time after the first time, an indication that the second device has detected speech in reply to the output audio data; and
outputting, by the first device at a third time after the second time, a visual indicator representing that the second device is receiving a reply to the message content.
3. The computer-implemented method of claim 1 or 2, further comprising:
causing the second device to output audio data created by text-to-speech (TTS) processing;
receiving third input audio data from the second device;
performing ASR on the third input audio data to determine second text data; and
determining the second text data includes the word "yes", wherein the speech is determined to be in reply to the output audio data based on determining the second text data includes the word "yes".
4. The computer-implemented method of claim 1 or 2, further comprising:
outputting, at a third time after the second time, an audible indicator representing that the second device has detected speech in reply to the output audio data.
5. The computer-implemented method of claim 4, wherein the audible indicator is generated using text-to-speech processing, the text-to-speech processing using speech previously spoken by a user.
6. The computer-implemented method of claim 1 or 2, further comprising:
causing the second device to output audio data created by text-to-speech (TTS) processing;
receiving fourth input audio data from the second device;
determining, using speech-based speaker identification (ID), that the fourth input audio data corresponds to audio spoken by a recipient of the message content; and
determining the speech is in reply to the output audio data based on the fourth input audio data corresponding to audio spoken by the recipient of the message content.
7. The computer-implemented method of claim 1, further comprising:
receiving, from the first device, second input audio data including recipient information;
determining the second device associated with the recipient information;
causing the second device to output an indication that message content is forthcoming;
receiving, from the first device, third input audio data including message content; and
causing the second device to output the message content.
8. The computer-implemented method of claim 7, wherein determining the second device comprises:
determining a location of the recipient; and
selecting the second device from multiple devices associated with a recipient profile based on the second device being proximate to the recipient.
9. The computer-implemented method of claim 1, further comprising:
determining a second amount of time that has elapsed with respect to a third number of messages sent from the first device to the second device and a fourth number of messages sent from the second device to the first device;
determining the second amount of time is below a second threshold period of time; and
establishing a real-time communication session between the first device and the second device, the real-time communication session including exchanging audio data between the first device and the second device without performing speech processing.
10. The computer-implemented method of claim 1, further comprising:
accessing a user profile associated with the first device,
wherein determining the elapsed amount of time includes identifying, in the user profile, the first number of messages associated with the second device.
11. The computer-implemented method of claim 1, further comprising:
receiving second input audio data from the first device;
determining the second input audio data includes a user name;
determining, using a user profile associated with the first device, a third device associated with the user name;
determining, using the user profile and based on the second input audio data including the user name, that a real-time communication session is to occur; and
establishing the real-time communication session between the first device and the third device.
12. The computer-implemented method of claim 11, further comprising:
determining at least one of: a second threshold period of time passing without receiving audio data, receiving audio data including a wakeword portion, receiving audio data including a non-communication command, or receiving audio data including at least a portion indicating the real-time communication session should be closed; and
closing the real-time communication session.
13. The computer-implemented method of claim 12, wherein the real-time communication session is further caused to occur in response to a first person being within a first proximity of the first device and a second person being within a second proximity of the third device.
14. The computer-implemented method of claim 1, further comprising:
causing the first device to output an indication when the second device is capturing at least one of audio or text, the indication being at least one of visual, audible, or tactile.
15. The computer-implemented method of claim 13, further comprising:
causing the first device to output synthesized speech indicating that audio data will be sent to the second device in real time and that a wakeword function is disabled.
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/254,600 US10580404B2 (en) | 2016-09-01 | 2016-09-01 | Indicator for voice-based communications |
US15/254,600 | 2016-09-01 | ||
US15/254,359 US10074369B2 (en) | 2016-09-01 | 2016-09-01 | Voice-based communications |
US15/254,359 | 2016-09-01 | ||
US15/254,458 | 2016-09-01 | ||
US15/254,458 US10453449B2 (en) | 2016-09-01 | 2016-09-01 | Indicator for voice-based communications |
PCT/US2017/049578 WO2018045154A1 (en) | 2016-09-01 | 2017-08-31 | Voice-based communications |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109791764A true CN109791764A (en) | 2019-05-21 |
Family
ID=59846711
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201780060299.1A Pending CN109791764A (en) | 2016-09-01 | 2017-08-31 | Communication based on speech |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP3507796A1 (en) |
KR (1) | KR20190032557A (en) |
CN (1) | CN109791764A (en) |
WO (1) | WO2018045154A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10692496B2 (en) * | 2018-05-22 | 2020-06-23 | Google Llc | Hotword suppression |
US10701006B2 (en) * | 2018-08-27 | 2020-06-30 | VoiceCTRL Oy | Method and system for facilitating computer-generated communication with user |
CN109658924B (en) * | 2018-10-29 | 2020-09-01 | Baidu Online Network Technology (Beijing) Co., Ltd. | Session message processing method and device and intelligent equipment |
WO2021142040A1 (en) * | 2020-01-06 | 2021-07-15 | Strengths, Inc. | Precision recall in voice computing |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120259633A1 (en) * | 2011-04-07 | 2012-10-11 | Microsoft Corporation | Audio-interactive message exchange |
US8468022B2 (en) * | 2011-09-30 | 2013-06-18 | Google Inc. | Voice control for asynchronous notifications |
US9026176B2 (en) * | 2013-05-12 | 2015-05-05 | Shyh-Jye Wang | Message-triggered voice command interface in portable electronic devices |
US10235996B2 (en) * | 2014-10-01 | 2019-03-19 | XBrain, Inc. | Voice and connection platform |
2017
- 2017-08-31 KR KR1020197005828A patent/KR20190032557A/en active IP Right Grant
- 2017-08-31 EP EP17765015.7A patent/EP3507796A1/en not_active Withdrawn
- 2017-08-31 CN CN201780060299.1A patent/CN109791764A/en active Pending
- 2017-08-31 WO PCT/US2017/049578 patent/WO2018045154A1/en unknown
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102447647A (en) * | 2010-10-13 | 2012-05-09 | Tencent Technology (Shenzhen) Co., Ltd. | Notification method, device and system based on new information |
CN104662600A (en) * | 2012-06-25 | 2015-05-27 | Amazon Technologies, Inc. | Using gaze determination with device input |
CN105027194A (en) * | 2012-12-20 | 2015-11-04 | Amazon Technologies, Inc. | Identification of utterance subjects |
US20140257821A1 (en) * | 2013-03-07 | 2014-09-11 | Analog Devices Technology | System and method for processor wake-up based on sensor data |
CN105556592A (en) * | 2013-06-27 | 2016-05-04 | Amazon Technologies, Inc. | Detecting self-generated wake expressions |
US20150371638A1 (en) * | 2013-08-28 | 2015-12-24 | Texas Instruments Incorporated | Context Aware Sound Signature Detection |
CN105376397A (en) * | 2014-08-07 | 2016-03-02 | NXP B.V. | Low-power environment monitoring and activation triggering for mobile devices through ultrasound echo analysis |
CN105700363A (en) * | 2016-01-19 | 2016-06-22 | Shenzhen Skyworth-RGB Electronic Co., Ltd. | Method and system for waking up smart home equipment voice control device |
Non-Patent Citations (2)
Title |
---|
TOBI DELBRUCK, et al.: "Fully integrated 500uW speech detection wake-up circuit", Proceedings of 2010 IEEE International Symposium on Circuits and Systems *
YANG Xu, et al.: "Multi-channel audio acquisition system based on a Sunplus single-chip microcomputer", Automation & Instrumentation *
Also Published As
Publication number | Publication date |
---|---|
KR20190032557A (en) | 2019-03-27 |
WO2018045154A1 (en) | 2018-03-08 |
EP3507796A1 (en) | 2019-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12100396B2 (en) | Indicator for voice-based communications | |
US10074369B2 (en) | Voice-based communications | |
US11908472B1 (en) | Connected accessory for a voice-controlled device | |
US10453449B2 (en) | Indicator for voice-based communications | |
US11776540B2 (en) | Voice control of remote device | |
US10365887B1 (en) | Generating commands based on location and wakeword | |
US10326869B2 (en) | Enabling voice control of telephone device | |
US11763808B2 (en) | Temporary account association with voice-enabled devices | |
US11184412B1 (en) | Modifying constraint-based communication sessions | |
US10714085B2 (en) | Temporary account association with voice-enabled devices | |
CN109155132A (en) | Speaker verification method and system | |
CN109074806A (en) | Distributed audio output is controlled to realize voice output | |
US10148912B1 (en) | User interface for communications systems | |
US11798559B2 (en) | Voice-controlled communication requests and responses | |
CN109791764A (en) | Communication based on speech | |
US10143027B1 (en) | Device selection for routing of communications | |
CN116917984A (en) | Interactive content output | |
US11856674B1 (en) | Content-based light illumination | |
US10854196B1 (en) | Functional prerequisites and acknowledgments | |
US11172527B2 (en) | Routing of communications to a device | |
CN117882131A (en) | Multiple wake word detection | |
US11176930B1 (en) | Storing audio commands for time-delayed execution | |
WO2019236745A1 (en) | Temporary account association with voice-enabled devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20190521 |