CN109791764A - Communication based on speech - Google Patents
- Publication number
- CN109791764A (application CN201780060299.1A)
- Authority
- CN
- China
- Prior art keywords
- equipment
- speech
- audio data
- message
- text
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/04—Real-time or near real-time messaging, e.g. instant messaging [IM]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/07—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
- H04L51/18—Commands or executable codes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/42229—Personal communication services, i.e. services related to one subscriber independent of his terminal and/or location
- H04M3/42263—Personal communication services, i.e. services related to one subscriber independent of his terminal and/or location where the same subscriber uses different terminals, i.e. nomadism
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M7/00—Arrangements for interconnection between switching centres
- H04M7/0024—Services and arrangements where telephone services are combined with data services
- H04M7/0042—Services and arrangements where telephone services are combined with data services where the data service is a text-based messaging service
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/14—Delay circuits; Timers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2203/00—Aspects of automatic or semi-automatic exchanges
- H04M2203/65—Aspects of automatic or semi-automatic exchanges related to applications where calls are combined with other types of communication
- H04M2203/652—Call initiation triggered by text message
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- General Engineering & Computer Science (AREA)
- Telephonic Communication Services (AREA)
Abstract
Systems, methods, and devices for escalating voice-based interactions via speech-controlled devices are described. A speech-controlled device captures audio including a wakeword portion and a payload portion for sending to a server that relays messages between speech-controlled devices. In response to determining the occurrence of an escalation event, such as repeated messages between the same two devices, the system may automatically alter the mode of the speech-controlled devices, for example so that a wakeword is no longer needed, the desired recipient no longer needs to be stated, or the two speech-controlled devices are connected automatically in a voice-chat mode. In response to determining the occurrence of a further escalation event, the system may initiate a real-time call between the speech-controlled devices.
Description
Cross-Reference to Related Application Data
This application claims priority to U.S. Patent Application No. 15/254,359, entitled "Voice-Based Communications," filed on September 1, 2016, in the name of Christo Frank Devaraj et al.

This application also claims priority to U.S. Patent Application No. 15/254,458, entitled "Indicator for Voice-Based Communications," filed on September 1, 2016, in the name of Christo Frank Devaraj et al.

This application also claims priority to U.S. Patent Application No. 15/254,600, entitled "Indicator for Voice-Based Communications," filed on September 1, 2016, in the name of Christo Frank Devaraj et al.

The above applications are incorporated herein by reference in their entirety.
Background
Speech recognition systems have progressed to the point where humans can interact with computing devices using speech. Such systems employ techniques to identify the words spoken by a human user based on the various qualities of a received audio input. Speech recognition combined with natural language understanding processing techniques enables speech-based user control of a computing device to perform tasks based on the user's spoken commands. The combination of speech recognition and natural language understanding processing is referred to herein as "speech processing." Speech processing may also involve converting a user's speech into text data, which may then be provided to various text-based software applications.

Speech processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions.
Brief Description of the Drawings
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

Figure 1A illustrates a system for altering voice-based interactions via speech-controlled devices.

Figure 1B illustrates a system for outputting a signal to a user via a speech-controlled device during message exchange.

Figure 2 is a conceptual diagram of a speech processing system.

Figure 3 is a conceptual diagram of a multi-domain architecture approach to natural language understanding.

Figure 4 illustrates data stored in association with user profiles.

Figures 5A through 5D are a signal flow diagram illustrating the alteration of a voice-based interaction via speech-controlled devices.

Figures 6A and 6B are a signal flow diagram illustrating the alteration of a voice-based interaction via speech-controlled devices.

Figure 7 is a signal flow diagram illustrating the alteration of a voice-based interaction via speech-controlled devices.

Figures 8A and 8B are a signal flow diagram illustrating signaling output via a user interface of a speech-controlled device.

Figure 9 is a signal flow diagram illustrating signaling output via a user interface of a speech-controlled device.

Figures 10A through 10C illustrate example signals output to a user by a speech-controlled device.

Figures 11A and 11B illustrate example signals output to a user by a speech-controlled device.

Figure 12 illustrates an example signal output to a user by a speech-controlled device.

Figure 13 is a block diagram conceptually illustrating example components of a speech-controlled device according to embodiments of the present disclosure.

Figure 14 is a block diagram conceptually illustrating example components of a server according to embodiments of the present disclosure.

Figure 15 illustrates an example of a computer network for use with the system of the present disclosure.
Detailed Description
Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics concerned with transforming audio data associated with speech into text representative of that speech. Similarly, natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from text input containing natural language. ASR and NLU are often used together as part of a speech processing system.

ASR and NLU can be computationally expensive. That is, significant computing resources may be needed to process ASR and NLU within a reasonable time frame. Because of this, a distributed computing environment may be used when performing speech processing. A typical distributed environment may involve a local or other type of client device having one or more microphones configured to capture sounds from a user speaking and to convert those sounds into an audio signal. The audio signal may then be sent to a remote device for further processing, such as converting the audio signal into an ultimate command. The command may then be executed by a combination of the remote and user devices, depending on the command itself.
In certain configurations, a speech processing system may be configured to communicate spoken messages between devices. That is, a first device may capture an utterance commanding the system to send a message to a recipient associated with a second device. In response, a user of the second device may speak an utterance that is captured by the second device and then sent to the system for processing and delivery back to the user of the first device. In this manner, a speech-controlled system may facilitate spoken message exchange between devices.

One drawback to such message exchange, however, is that for each spoken interaction with the system the user may need to speak both a wakeword (to "wake up" the user's device) and the recipient of the message so that the system knows how to route the message included in the utterance. This conventional arrangement can introduce friction into the interaction between the users and the system, particularly when two users are exchanging multiple messages between them.
The present disclosure provides techniques for altering voice-based interactions via speech-controlled devices. A speech-controlled device captures audio that includes a wakeword portion and a payload portion, for sending to a server that relays messages between speech-controlled devices. In response to determining the occurrence of a communication alteration trigger (such as repeated messages between the same two devices), the system may automatically alter the mode of the speech-controlled devices, for example so that a wakeword is no longer needed, the desired recipient no longer needs to be stated, or the two speech-controlled devices are connected automatically in a voice-chat mode. When the mode of the speech-controlled devices changes, the system may use different protocols to manage how the system exchanges messages and other data between the devices. For example, when the system switches from exchanging voice messages between the devices to initiating a synchronous call between the devices (e.g., a telephone call), the system may stop using a messaging protocol and activate or invoke a real-time protocol (e.g., Voice over Internet Protocol (VoIP)). In response to determining the occurrence of a further communication alteration trigger, the system may initiate a real-time synchronous call between the speech-controlled devices. Various examples of communication alteration triggers, and their processing by the system, are described below. Communication alteration triggers may be determined by the system described herein based on satisfaction of configured thresholds. That is, the system may be configured to alter a communication exchange without receiving an explicit instruction from a user to do so.
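By way of illustration only, the following is a minimal sketch (not taken from the disclosure) of how a server might evaluate the kind of message-count and time-window communication alteration triggers described above; the threshold values, mode names, and class interface are illustrative assumptions.

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds; the disclosure leaves exact values to system configuration.
MESSAGE_COUNT_THRESHOLD = 4    # messages exchanged between the same two devices
TIME_WINDOW_SECONDS = 120      # window within which those messages must occur

MODES = ["wakeword_and_recipient", "wakeword_only", "voice_chat", "real_time_call"]


class EscalationTracker:
    """Tracks message exchange between device pairs and decides when to alter the mode."""

    def __init__(self):
        self.history = defaultdict(deque)              # (device_a, device_b) -> timestamps
        self.mode = defaultdict(lambda: MODES[0])      # current mode per device pair

    def record_message(self, sender: str, recipient: str) -> str:
        pair = tuple(sorted((sender, recipient)))
        now = time.time()
        window = self.history[pair]
        window.append(now)
        # Drop messages that fall outside the time window.
        while window and now - window[0] > TIME_WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MESSAGE_COUNT_THRESHOLD:
            self._escalate(pair)
            window.clear()
        return self.mode[pair]

    def _escalate(self, pair):
        current = MODES.index(self.mode[pair])
        if current < len(MODES) - 1:
            self.mode[pair] = MODES[current + 1]


tracker = EscalationTracker()
print(tracker.record_message("device_110a", "device_110b"))
```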
The present disclosure also provides techniques for outputting visual (or audible, tactile, etc.) indications regarding voice-based interactions. Such indications may use the user interface of a first device to provide feedback indicating that an input component (e.g., a microphone) of a second device is in the process of receiving user input (such as a reply to a message sent from the first user's device). After the server sends message content to a recipient's speech-controlled device, the server may receive, from the recipient's speech-controlled device, an indication that the device is detecting speech. In response, the server may then cause the first speech-controlled device to output a visual indication representing that the recipient's speech-controlled device is detecting speech. It may be appreciated that, in this way, the visual indications may keep users of the speech-controlled devices from "talking over" each other (i.e., may prevent the users of the speech-controlled devices from speaking messages at the same time).
Figure 1A shows a system 100 configured to alter voice-based interactions between speech-controlled devices. Although Figure 1A and the following figures/discussion illustrate the operation of the system 100 in a particular order, the steps described may be performed in a different order (and certain steps may be removed or added) without departing from the intent of the disclosure. As shown in Figure 1A, the system 100 may include one or more speech-controlled devices 110a and 110b local to a first user 5 and a second user 7, respectively. The system 100 also includes one or more networks 199 and one or more servers 120 connected to the devices 110a and 110b across the network 199. The server 120 (which may be one or more different physical devices) may be capable of performing traditional speech processing (ASR, NLU, query parsing, etc.) as described herein. A single server may be capable of performing all of the speech processing, or multiple servers may combine to perform the speech processing. Further, the server 120 may be configured to execute certain commands, such as answering queries spoken by the first user 5 and/or the second user 7. In addition, certain speech detection or command execution functions may be performed by the devices 110a and 110b.

As shown in Figure 1A, a user 5 may speak an utterance (represented by input audio 11). The input audio 11 may be captured by one or more microphones 103a of the device 110a and/or by a microphone array (not illustrated) separate from the device 110a. The microphone array may be connected to the device 110a such that, when the microphone array receives the input audio 11, the microphone array sends audio data corresponding to the input audio 11 to the device 110a. Alternatively, the microphone array may be connected to a companion application of a mobile computing device (not illustrated), such as a smart phone or tablet. In this example, when the microphone array captures the input audio 11, the microphone array sends audio data corresponding to the input audio 11 to the companion application, which forwards the audio data to the device 110a. If the device 110a captures the input audio 11, the device 110a may convert the input audio 11 into audio data and send the audio data to the server 120. Alternatively, if the device 110a receives audio data corresponding to the input audio 11 from the microphone array or the companion application, the device 110a may simply forward the received audio data to the server 120.
The server 120 initially exchanges (150) messages between the speech-controlled devices in response to receiving audio data that includes a wakeword portion and a payload portion. The payload portion may include recipient information and message content. Such message exchange may occur using a message domain and associated protocols as detailed herein. The server 120 exchanges messages in this manner until the server 120 determines (152) the occurrence of a first communication alteration trigger. Illustrative communication alteration triggers include whether a threshold number of message exchanges between the first speech-controlled device 110a and the second speech-controlled device 110b is met or exceeded, a threshold number of message exchanges occurring within a threshold amount of time, or the users of the two speech-controlled devices 110a/110b being simultaneously within threshold distances of their respective devices. After determining the occurrence of the first communication alteration trigger, the server 120 exchanges (154) messages between the same speech-controlled devices in response to receiving audio data that includes payload data (e.g., message content data). The exchange of messages may occur using a messaging domain and associated protocols as detailed herein. The server 120 exchanges messages using the messaging domain until the server 120 determines (156) the occurrence of a second communication alteration trigger. After determining the occurrence of the second communication alteration trigger, the server 120 then initiates (158) a real-time call between the speech-controlled devices. Initiating the real-time call may involve using a real-time call domain as detailed herein and associated real-time protocols. A real-time communication session/call may involve exchanging audio data between the devices (within operating parameters) as the audio data is received.

Alternatively, after determining (152) the first communication alteration trigger, the server 120 may directly initiate (158) the real-time call. This may occur under different configured conditions, such as where the communication alteration trigger is premised on a certain recipient. For example, a user profile associated with the originating speech-controlled device 110a may indicate that communications with "mom" are to occur via real-time calls. Thus, if an original message is directed to "mom," the server 120 may facilitate a real-time call in response to determining that the recipient of the first message is "mom."
According to various embodiments, the server 120 may cause one or both of the speech-controlled devices to output a visual indication using the user interface of the respective device, where the visual indication represents which domain is being used to exchange communications/messages. For example, a light of the speech-controlled device may emit blue when a wakeword is needed, may emit green when a wakeword is no longer needed, and may emit yellow when a real-time call is being facilitated.

In addition to altering speech-based calls into speech-based exchanges as described above, the teachings above may also be used in the context of video communications. For example, if two individuals are exchanging video messages, the techniques described herein may be used to alter the exchange of video messages into a video call. In another example, if it is determined, while exchanging speech-based messages, that an individual is within a camera's field of view, the system may be configured to alter the communication into a video call based on the individual being within the camera's field of view. Thus, the teachings below regarding detecting speech, capturing audio, and the like may also apply to detecting video, capturing video, and the like.
Each speech-controlled device may have more than one user. The system 100 may use speech-based speaker IDs or user IDs to identify the speaker of captured audio. Each speaker ID or user ID may be a voice signature that enables the system to determine which user of a device is speaking. This is beneficial because, when a communication alteration trigger involves a single user of a device, it allows the system to alter communications as described herein. The speaker ID or user ID may be used to determine who is speaking and to automatically identify the speaker's user profile for downstream processing. For example, if a first user of a device speaks a message and a second user of the device thereafter speaks a message, the system may distinguish the two users based on their voice signatures, thereby preventing the system from determining a single communication alteration trigger based on messages spoken by different users.
Figure 1B shows a system for outputting a signal to a user via a device user interface during message exchange, to indicate that the recipient's device has detected responsive speech. As shown in Figure 1B, the system receives (160) input audio from a first speech-controlled device 110a. The system then determines (162) that the input audio corresponds to message content intended for a second speech-controlled device 110b. The system then sends (164) the message content to the second speech-controlled device 110b. The system then detects (166) speech using the second speech-controlled device 110b, and causes (168) the first speech-controlled device 110a to output an indicator, where the indicator represents that the second device is detecting speech, where the speech may be responsive to the message content, thereby notifying the user of the first speech-controlled device 110a that a reply may be forthcoming. The indicator may be visual, audible, or tactile. In an example, the indicator may be visual for devices that support video.
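The following is a minimal sketch, under assumed message shapes and method names, of how a server-side relay could propagate a "recipient is speaking" indication back to the sender's device, as in the flow of Figure 1B; the transport object and indicator value are illustrative assumptions.

```python
# Hypothetical indicator type; the disclosure mentions visual, audible, and tactile indicators.
INDICATOR_SPEECH_DETECTED = "recipient_speech_detected"


class MessageRelay:
    """Relays message content and speech-detection indications between two devices."""

    def __init__(self, transport):
        self.transport = transport   # assumed object exposing send(device_id, payload)
        self.last_sender = {}        # recipient device -> device that last messaged it

    def deliver_message(self, sender_id: str, recipient_id: str, message_audio: bytes):
        # Steps (160)-(164): receive message content and send it to the recipient device.
        self.last_sender[recipient_id] = sender_id
        self.transport.send(recipient_id, {"type": "message", "audio": message_audio})

    def on_speech_detected(self, device_id: str):
        # Step (166): the recipient device reports that it is detecting speech.
        original_sender = self.last_sender.get(device_id)
        if original_sender is not None:
            # Step (168): cause the original sender's device to output an indicator.
            self.transport.send(original_sender,
                                {"type": "indicator", "value": INDICATOR_SPEECH_DETECTED})
```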
Further details of altering voice-based interactions are discussed below, following a discussion of the overall speech processing system of Figure 2. Figure 2 is a conceptual diagram of how a spoken utterance is traditionally processed, allowing a system to capture and execute commands spoken by a user, such as spoken commands that may follow a wakeword. The various components illustrated may be located on the same or on different physical devices. Communication between the various components illustrated in Figure 2 may occur directly or across a network 199. An audio capture component, such as the microphone 103 of the device 110, captures audio 11 corresponding to a spoken utterance. The device 110, using a wakeword detection module 220, then processes the audio, or audio data corresponding to the audio, to determine if a keyword (such as a wakeword) is detected in the audio. Following detection of the wakeword, the device sends audio data 111 corresponding to the utterance to a server 120 that includes an ASR module 250. The audio data 111 may be output from an acoustic front end (AFE) 256 located on the device 110 prior to transmission, or the audio data 111 may be in a different form for processing by a remote AFE 256, such as the AFE 256 located with the ASR module 250.

The wakeword detection module 220 works in conjunction with other components of the device 110, for example a microphone (not illustrated), to detect keywords in the audio 11. For example, the device 110 may convert the audio 11 into audio data and process the audio data with the wakeword detection module 220 to determine whether speech is detected, and if so, whether the audio data comprising the speech matches an audio signature and/or model corresponding to a particular keyword.
The device 110 may use various techniques to determine whether the audio data includes speech. Some embodiments may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in an audio input based on various quantitative aspects of the audio input, such as the spectral slope between one or more frames of the audio input, the energy levels of the audio input in one or more spectral bands, the signal-to-noise ratios of the audio input in one or more spectral bands, or other quantitative aspects. In other embodiments, the device 110 may implement a limited classifier configured to distinguish speech from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other embodiments, Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) techniques may be applied to compare the audio input to one or more acoustic models stored on the device, which acoustic models may include models corresponding to speech, noise (such as environmental or background noise), or silence. Still other techniques may be used to determine whether speech is present in the audio input.
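A minimal sketch, assuming a simple energy-plus-signal-to-noise heuristic over 10 ms frames, of the kind of frame-level VAD decision described above; the thresholds and frame sizes are illustrative and not taken from the disclosure.

```python
import numpy as np

FRAME_LEN = 160            # 10 ms at 16 kHz (illustrative)
ENERGY_THRESHOLD = 1e-3    # illustrative
SNR_THRESHOLD_DB = 6.0     # illustrative


def frame_has_speech(frame: np.ndarray, noise_floor: float) -> bool:
    """Very rough VAD decision for a single frame of float samples in [-1, 1]."""
    energy = float(np.mean(frame ** 2))
    snr_db = 10.0 * np.log10(energy / max(noise_floor, 1e-10))
    return energy > ENERGY_THRESHOLD and snr_db > SNR_THRESHOLD_DB


def detect_speech(samples: np.ndarray) -> bool:
    """Returns True if any frame of the input audio looks like speech."""
    if len(samples) < FRAME_LEN:
        return False
    noise_floor = float(np.mean(samples[:FRAME_LEN] ** 2))   # assume leading frame is noise
    frames = [samples[i:i + FRAME_LEN] for i in range(0, len(samples) - FRAME_LEN, FRAME_LEN)]
    return any(frame_has_speech(f, noise_floor) for f in frames)
```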
Once speech is detected in the audio received by the device 110 (or separately from speech detection), the device 110 may use the wakeword detection module 220 to perform wakeword detection to determine when a user intends to speak a command to the device 110. This process may also be referred to as keyword detection, with the wakeword being a specific example of a keyword. Specifically, keyword detection is typically performed without performing linguistic analysis, textual analysis, or semantic analysis. Instead, incoming audio (or audio data) is analyzed to determine if specific characteristics of the audio match preconfigured acoustic waveforms, audio signatures, or other data to determine if the incoming audio "matches" stored audio data corresponding to a keyword.

Thus, the wakeword detection module 220 may compare audio data to stored models or data to detect a wakeword. One approach for wakeword detection applies general large-vocabulary continuous speech recognition (LVCSR) systems to decode the audio signals, with wakeword searching conducted in the resulting lattices or confusion networks. LVCSR decoding may require relatively high computational resources. Another approach for wakeword spotting builds hidden Markov models (HMMs) for each key wakeword word and for non-wakeword speech signals respectively. The non-wakeword speech includes other spoken words, background noise, and the like. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search for the best path in the decoding graph, and the decoding output is further processed to make the decision on keyword presence. This approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework. In another embodiment, the wakeword spotting system may be built on deep neural network (DNN)/recurrent neural network (RNN) structures directly, without an HMM involved. Such a system may estimate the posteriors of wakewords with context information, either by stacking frames within a context window for a DNN or by using an RNN. Posterior threshold tuning or smoothing is then applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
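A minimal sketch of the posterior-smoothing-and-threshold decision step mentioned above, assuming an external acoustic model that already emits a per-frame wakeword posterior; the smoothing window and threshold values are illustrative.

```python
from collections import deque

SMOOTHING_WINDOW = 30       # frames over which posteriors are averaged (illustrative)
WAKEWORD_THRESHOLD = 0.8    # smoothed posterior required to trigger (illustrative)


class WakewordDecider:
    """Smooths per-frame wakeword posteriors and applies a trigger threshold."""

    def __init__(self):
        self.posteriors = deque(maxlen=SMOOTHING_WINDOW)

    def step(self, frame_posterior: float) -> bool:
        """frame_posterior: P(wakeword | acoustic features of this frame), from a DNN/RNN."""
        self.posteriors.append(frame_posterior)
        smoothed = sum(self.posteriors) / len(self.posteriors)
        return smoothed >= WAKEWORD_THRESHOLD


decider = WakewordDecider()
for posterior in [0.1, 0.2, 0.9, 0.95, 0.97, 0.99]:
    if decider.step(posterior):
        print("wakeword detected; begin sending audio data 111 to server 120")
        break
```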
Once the wakeword is detected, the local device 110 may "wake" and begin transmitting audio data 111 corresponding to the input audio 11 to the server 120 for speech processing. The audio data corresponding to that audio may be sent to the server 120 for routing to a recipient device, or may be sent to the server for speech processing for interpretation of the included speech (either for purposes of enabling voice communications and/or for purposes of executing a command in the speech). The audio data 111 may include data corresponding to the wakeword, or the portion of the audio data corresponding to the wakeword may be removed by the local device 110 prior to sending. Further, as described herein, the local device 110 may "wake" upon detection of speech/spoken audio above a threshold. Upon receipt by the server 120, an ASR module 250 may convert the audio data 111 into text. The ASR transcribes the audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. A spoken utterance in the audio data is input to a processor configured to perform ASR, which then interprets the utterance based on the similarity between the utterance and pre-established language models 254 stored in an ASR model storage (252c). For example, the ASR process may compare the input audio data with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.

The different ways a spoken utterance may be interpreted (i.e., the different hypotheses) may each be assigned a probability or confidence score representing the likelihood that a particular set of words matches those spoken in the utterance. The confidence score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model 253 stored in the ASR model storage 252), and the likelihood that a particular word matching the sounds would be included at the specific location in the sentence (e.g., using a language or grammar model). Thus each potential textual interpretation of the spoken utterance (hypothesis) is associated with a confidence score. Based on the considered factors and the assigned confidence scores, the ASR process 250 outputs the most likely text recognized in the audio data. The ASR process may also output multiple hypotheses in the form of a lattice or an N-best list, with each hypothesis corresponding to a confidence score or other score (such as a probability score, etc.).
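For illustration only, an N-best list like the one described above might be represented as a simple list of hypothesis/score pairs; the hypotheses and confidence values below are made up.

```python
# Hypothetical ASR N-best output for one utterance; scores are illustrative confidences.
n_best = [
    {"hypothesis": "call mom", "confidence": 0.91},
    {"hypothesis": "call tom", "confidence": 0.72},
    {"hypothesis": "all mom",  "confidence": 0.15},
]

best_hypothesis = max(n_best, key=lambda h: h["confidence"])
print(best_hypothesis["hypothesis"])   # -> "call mom"
```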
The device or devices performing the ASR processing may include an acoustic front end (AFE) 256 and a speech recognition engine 258. The acoustic front end (AFE) 256 transforms the audio data from the microphone into data for processing by the speech recognition engine. The speech recognition engine 258 compares the speech recognition data with acoustic models 253, language models 254, and other data models and information for recognizing the speech conveyed in the audio data. The AFE may reduce noise in the audio data and divide the digitized audio data into frames representing time intervals, for which the AFE determines a number of values (called features) representing the qualities of the audio data, along with a set of those values (called a feature vector) representing the features/qualities of the audio data within the frame. Many different features may be determined, as known in the art, and each feature represents some quality of the audio that may be useful for ASR processing. A number of approaches may be used by the AFE to process the audio data, such as mel-frequency cepstral coefficients (MFCCs), perceptual linear predictive (PLP) techniques, neural network feature vector techniques, linear discriminant analysis, semi-tied covariance matrices, or other approaches known to those of skill in the art.
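A minimal sketch of the framing-plus-MFCC feature extraction step described above, using librosa as an assumed helper library; the frame sizes and feature count are illustrative choices rather than values from the disclosure.

```python
import numpy as np
import librosa   # assumed available; any MFCC implementation would serve

SAMPLE_RATE = 16000
FRAME_LENGTH = 400   # 25 ms frames (illustrative)
HOP_LENGTH = 160     # 10 ms hop (illustrative)
NUM_FEATURES = 13    # MFCCs per frame (illustrative)


def extract_feature_vectors(samples: np.ndarray) -> np.ndarray:
    """Returns one feature vector (here, 13 MFCCs) per audio frame."""
    mfccs = librosa.feature.mfcc(
        y=samples, sr=SAMPLE_RATE, n_mfcc=NUM_FEATURES,
        n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH,
    )
    return mfccs.T   # shape: (num_frames, NUM_FEATURES)
```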
The speech recognition engine 258 may process the output from the AFE 256 with reference to information stored in a speech/model storage (252). Alternatively, post-front-end processed data (such as feature vectors) may be received by the device executing ASR processing from another source besides the internal AFE. For example, the device 110 may process audio data into feature vectors (for example using an on-device AFE 256) and transmit that information to the server across the network 199 for ASR processing. Feature vectors may arrive at the server encoded, in which case they may be decoded prior to processing by the processor executing the speech recognition engine 258.

The speech recognition engine 258 attempts to match received feature vectors to language phonemes and words as known in the stored acoustic models 253 and language models 254. The speech recognition engine 258 computes recognition scores for the feature vectors based on acoustic information and language information. The acoustic information is used to calculate an acoustic score representing a likelihood that the intended sound represented by a group of feature vectors matches a language phoneme. The language information is used to adjust the acoustic score by considering what sounds and/or words are used in context with each other, thereby improving the likelihood that the ASR process will output speech results that make sense grammatically. The specific models used may be general models, or may be models corresponding to a particular domain, such as music, banking, etc.

The speech recognition engine 258 may use a number of techniques to match feature vectors to phonemes, for example using Hidden Markov Models (HMMs) to determine probabilities that feature vectors may match phonemes. Sounds received may be represented as paths between states of the HMM, and multiple paths may represent multiple possible text matches for the same sound.
Following ASR processing, the ASR results may be sent by the speech recognition engine 258 to other processing components, which may be local to the device performing ASR and/or distributed across the network(s) 199. For example, ASR results in the form of a single textual representation of the speech, an N-best list including multiple hypotheses and respective scores, a lattice, etc. may be sent to a server, such as server 120, for natural language understanding (NLU) processing, such as conversion of the text into commands for execution, either by the device 110, by the server 120, or by another device (such as a server running a specific application like a search engine, etc.).

The device performing NLU processing 260 (e.g., server 120) may include various components, including potentially dedicated processor(s), memory, storage, etc. The device configured for NLU processing may include a named entity recognition (NER) module 262 and an intent classification (IC) module 264, a result ranking and distribution module 266, and NLU storage 273. The NLU process may also utilize gazetteer information (284a-284n) stored in an entity library storage 282. The gazetteer information may be used for entity resolution, for example matching ASR results with different entities (such as song titles, contact names, etc.). Gazetteers may be linked to users (for example, a particular gazetteer may be associated with a specific user's music collection), may be linked to certain domains (such as shopping), or may be organized in a variety of other ways.

The NLU process takes textual input (such as that processed from ASR 250 based on the utterance 11) and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 260 interprets a text string to derive an intent or a desired action from the user, as well as the pertinent pieces of information in the text that allow a device (e.g., device 110) to complete that action. For example, if a spoken utterance is processed using ASR 250 and outputs the text "call mom," the NLU process may determine that the user intended to activate a telephone in his/her device and to initiate a call with a contact matching the entity "mom."

The NLU may process several textual inputs related to the same utterance. For example, if the ASR 250 outputs N text segments (as part of an N-best list), the NLU may process all N outputs to obtain NLU results.
The NLU process may be configured to parse, tag, and annotate text as part of NLU processing. For example, for the text "call mom," "call" may be tagged as a command (to execute a phone call), and "mom" may be tagged as a specific entity and target of the command (and the telephone number for the entity corresponding to "mom" stored in a contact list may be included in the annotated result).
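For illustration, the parsed/tagged/annotated result for "call mom" described above might be represented as a simple structure like the following; the field names and values are illustrative, not taken from the disclosure.

```python
# Hypothetical annotated NLU result for the utterance "call mom".
annotated_result = {
    "domain": "communication",
    "intent": "initiate_call",
    "command": "call",
    "entities": [
        {
            "slot": "recipient",
            "value": "mom",
            # Resolved from the user's contact list during annotation.
            "resolved_contact": {"name": "Mom", "phone_number": "+1-555-0100"},
        },
    ],
}
```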
To correctly perform NLU processing of speech input, the NLU process 260 may be configured to determine a "domain" of the utterance so as to determine and narrow down which services offered by an endpoint device (e.g., server 120 or device 110) may be relevant. For example, an endpoint device may offer services relating to interactions with a telephone service, a contact list service, a calendar/scheduling service, a music player service, etc. Words in a single text query may implicate more than one service, and some services may be functionally linked (e.g., both a telephone service and a calendar service may utilize data from a contact list).

The named entity recognition module 262 receives a query in the form of ASR results and attempts to identify relevant grammars and lexical information that may be used to construe meaning. To do so, the named entity recognition module 262 may begin by identifying potential domains that may relate to the received query. The NLU storage 273 includes a database of device domains (274a-274n) identifying domains associated with specific devices. For example, the device 110 may be associated with domains for music, telephony, calendaring, contact lists, and device-specific communications, but not video. In addition, the entity library may include database entries about specific services on a specific device, indexed by device ID, user ID, household ID, or some other indicator.

A domain may represent a discrete set of activities having a common theme, such as "shopping," "music," "calendaring," etc. As such, each domain may be associated with a particular language model and/or grammar database (276a-276n), a particular set of intents/actions (278a-278n), and a particular personalized lexicon (286). Each gazetteer (284a-284n) may include domain-indexed lexical information associated with a particular user and/or device. For example, Gazetteer A (284a) includes domain-indexed lexical information 286aa to 286an. A user's music-domain lexical information might include album titles, artist names, and song names, for example, whereas a user's contact-list lexical information might include the names of contacts. Since every user's music collection and contact list is presumably different, this personalized information improves entity resolution.

A query is processed applying the rules, models, and information applicable to each identified domain. For example, if a query potentially implicates both communications and music, the query will be NLU-processed using the grammar models and lexical information for communications, and will also be processed using the grammar models and lexical information for music. The responses to the query produced by each set of models are scored (as discussed further below), with the overall highest-ranked result from all applied domains ordinarily selected as the correct result.

An intent classification (IC) module 264 parses the query to determine an intent or intents for each identified domain, where the intent corresponds to the action to be performed that is responsive to the query. Each domain is associated with a database (278a-278n) of words linked to intents. For example, a music intent database may link words and phrases such as "quiet," "volume off," and "mute" to a "mute" intent. The IC module 264 identifies potential intents for each identified domain by comparing words in the query to the words and phrases in the intents database 278.
In order to generate a particular interpreted response, the NER 262 applies the grammar models and lexical information associated with the respective domain. Each grammar model 276 includes the names of entities (i.e., nouns) commonly found in speech about the particular domain (i.e., generic terms), whereas the lexical information 286 from the gazetteer 284 is personalized to the user and/or the device. For instance, a grammar model associated with the shopping domain may include a database of words commonly used when people discuss shopping.

The intents identified by the IC module 264 are linked to domain-specific grammar frameworks (included in 276) with "slots" or "fields" to be filled. For example, if "play music" is an identified intent, one or more grammar (276) frameworks may correspond to sentence structures such as "Play {Artist Name}," "Play {Album Name}," "Play {Song Name}," "Play {Song Name} by {Artist Name}," etc. However, to make recognition more flexible, these frameworks would ordinarily not be structured as sentences, but rather based on associating slots with grammatical tags.
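A minimal sketch of how such intent-linked frameworks might be represented, with slot names drawn from the examples above; the data-structure shape itself is an illustrative assumption.

```python
# Hypothetical grammar frameworks linked to the "play music" intent.
play_music_frameworks = [
    {"intent": "play_music", "slots": ["artist_name"]},
    {"intent": "play_music", "slots": ["album_name"]},
    {"intent": "play_music", "slots": ["song_name"]},
    {"intent": "play_music", "slots": ["song_name", "artist_name"]},
]


def frameworks_for(intent: str):
    """Returns the candidate slot sets the slot-filling stage will try to satisfy."""
    return [f["slots"] for f in play_music_frameworks if f["intent"] == intent]


print(frameworks_for("play_music"))
```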
For example, the NER module 260 may parse the query to identify words as subject, object, verb, preposition, etc., based on grammar rules and models, prior to recognizing named entities. The identified verb may be used by the IC module 264 to identify the intent, which the NER module 262 then uses to identify frameworks. A framework for an intent of "play" may specify a list of slots/fields applicable to play the identified "object" and any object modifier (e.g., a prepositional phrase), such as {Artist Name}, {Album Name}, {Song Name}, etc. The NER module 260 then searches the corresponding fields in the domain-specific and personalized lexicon(s), attempting to match words and phrases in the query tagged as a grammatical object or object modifier with those identified in the database(s).

This process includes semantic tagging, which is the labeling of a word or combination of words according to their type/semantic meaning. Parsing may be performed using heuristic grammar rules, or an NER model may be constructed using techniques such as hidden Markov models, maximum entropy models, log linear models, conditional random fields (CRF), and the like.

For instance, a query of "play mother's little helper by the rolling stones" might be parsed and tagged as {Verb}: "Play," {Object}: "mother's little helper," {Object Preposition}: "by," and {Object Modifier}: "the rolling stones." At this point in the process, "Play" is identified as a verb based on a word database associated with the music domain, which the IC module 264 will determine corresponds to the "play music" intent. No determination has yet been made as to the meaning of "mother's little helper" and "the rolling stones," but based on grammar rules and models, it is determined that these phrases relate to the grammatical object of the query.

The frameworks linked to the intent are then used to determine what database fields should be searched to determine the meaning of these phrases, such as searching a user's gazetteer for similarity with the framework slots. So a framework for the "play music" intent might indicate an attempt to resolve the identified object based on {Artist Name}, {Album Name}, and {Song Name}, and another framework for the same intent might indicate an attempt to resolve the object modifier based on {Artist Name}, and to resolve the object based on {Album Name} and {Song Name} linked to the identified {Artist Name}. If the search of the gazetteer does not resolve a slot/field using gazetteer information, the NER module 262 may search a database of generic words associated with the domain (in the NLU storage 273). So for instance, if the query was "play songs by the rolling stones," after failing to determine an album name or song name called "songs" by "the rolling stones," the NER 262 may search the domain vocabulary for the word "songs." In the alternative, generic words may be checked before the gazetteer information, or both may be tried, potentially producing two different results.

The comparison process used by the NER module 262 may classify (i.e., score) how closely a database entry compares to a tagged query word or phrase, how closely the grammatical structure of the query corresponds to the applied grammatical framework, and whether the database indicates a relationship between an entry and information identified to fill other slots of the framework.
The NER module 262 may also use contextual operational rules to fill slots. For example, if a user had previously requested to pause a particular song and thereafter requested that the voice-controlled device "please un-pause my music," the NER module 262 may apply an inference-based rule to fill the slot associated with the name of the song that the user currently wishes to play, namely the song that was playing at the time the user requested to pause the music.

The results of NLU processing may be tagged to attribute meaning to the query. So, for instance, "play mother's little helper by the rolling stones" might produce a result of: {domain} Music, {intent} Play Music, {artist name} "rolling stones," {media type} SONG, and {song title} "mother's little helper." As another example, "play songs by the rolling stones" might produce: {domain} Music, {intent} Play Music, {artist name} "rolling stones," and {media type} SONG.

The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on the same or a separate server 120 as part of the system 100. The destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on the device 110 or in a music playing appliance, configured to execute a music playing command. If the NLU output includes a search request, the destination command processor 290 may include a search engine processor, such as one located on a search server, configured to execute a search command.
The NLU operation of system as described herein can use the form of multiple domain framework, such as that is more shown in Fig. 3
Domain framework.In multiple domain framework, each domain (its may include define one group of intentions of more major concept such as music, books and
Entity time slot) individually constructed, and the operation that NLU is operated is being executed to text (such as the text exported from ASR component 250)
When operation during for NLU component 260 use.Each domain can have the component of special configuration to execute each of NLU operation
A step.For example, message field 302 (domain A) can have NER component 262-A, identify which time slot (that is, the portion of input text
Point) it can correspond to special entity relevant to the domain.NER component 262-A can be used machine learning model, such as domain is specific
Condition random field (CRF) corresponds to the entity type of textual portions to identify corresponding to the part of entity and identification.For example, right
In text " telling john smith, I says hello to him ", by that can be recognized for the NER 262-A of the training of message field 302
The part [john smith] of text corresponds to entity.Message field 302 can also have portion intent classifier (IC) of their own
Part 264-A determines the intention of text, it is assumed that text is in the domain by defined.It is specific that such as domain can be used in IC component
The model of maximum entropy classifiers etc identifies the intention of text.Message field 302 can also have the time slot filling part of their own
Part 310-A, can using rule or other instruction with by from previous stage label or token be standardized as intention/time slot
It indicates.Accurate conversion is likely to be dependent on domain and (for example, for domain of travelling, refers to that the text reference on " Boston airport " can turn
It is changed to the standard BOS three-letter codes for indicating airport).Message field 302 can also have the entity resolution component 312- of their own
A, can be used to specifically identify and identify in being passed to text with reference to authoritative source (such as domain specific knowledge library), the authority source
The accurate physical quoted in entity reference.Specific intended/time slot combination can also be tied to particular source, and the spy then can be used
Source is determined to parse text (such as order by providing information or executing in response to user query).From entity resolution component
The output of 312-A may include order, information or other NLU result datas, it is indicated that the specific NLU in domain processing how to handle text with
And system should how response text, according to the special domain.
As illustrated in FIG. 3, multiple domains may operate substantially in parallel, with different domain-specific components. In addition, each domain may implement certain protocols when exchanging messages or other communications. For example, Domain B 304, for real-time calls, may have its own NER component 262-B, IC component 264-B, slot filler component 310-B, and entity resolution component 312-B. The system may include additional domains not described herein. The same text that is input into the NLU pipeline for Domain A 302 may also be input into the NLU pipeline for Domain B 304, where the components for Domain B 304 operate on the text as if the text related to Domain B, and so on for the different NLU pipelines for the different domains. Each domain-specific NLU pipeline creates its own domain-specific NLU results, for example NLU results A (for Domain A), NLU results B (for Domain B), NLU results C (for Domain C), and so on.
Such a multi-domain architecture results in narrowly defined intents and slots that are particular for each specific domain. This is due, in part, to the different models and components (such as the domain-specific NER component, IC component, etc., and related models) being trained to operate only for the designated domain. Further, the separation of domains results in similar actions being represented separately across the domains even if the action is in common. For example, "next song," "next book," and "next" may all be indicators of the same action, but will be defined differently in different domains due to the domain-specific processing restrictions.
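The parallel, per-domain processing described above can be pictured as the same text being fanned out to independent domain pipelines whose domain-specific results are then compared. The sketch below is illustrative only; the function and class names (message_domain, NluResult, etc.) are hypothetical stand-ins for the NER 262, IC 264, slot filler 310, and entity resolution 312 components, not the actual implementation.

```python
# Minimal sketch of a multi-domain NLU pass; every name below is illustrative.
from dataclasses import dataclass, field

@dataclass
class NluResult:
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)
    score: float = 0.0

def message_domain(text: str) -> NluResult:
    # Stand-in for NER 262-A / IC 264-A / slot filler 310-A / resolver 312-A.
    if text.startswith("tell "):
        recipient, _, content = text[len("tell "):].partition(" I said ")
        return NluResult("message", "SendMessage",
                         {"recipient": recipient, "content": content}, 0.9)
    return NluResult("message", "Unknown", {}, 0.1)

def call_domain(text: str) -> NluResult:
    # Stand-in for the real-time call domain (Domain B 304).
    if "call" in text:
        return NluResult("call", "StartCall", {"recipient": text.split()[-1]}, 0.9)
    return NluResult("call", "Unknown", {}, 0.1)

def run_all_domains(text: str) -> NluResult:
    # The same text runs through every domain pipeline "in parallel";
    # the highest-scoring domain-specific result is kept.
    results = [pipeline(text) for pipeline in (message_domain, call_domain)]
    return max(results, key=lambda r: r.score)

print(run_all_domains("tell John Smith I said hello"))
```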
The server 120 may also include data regarding user accounts, shown by the user profile storage 402 illustrated in FIG. 4. The user profile storage may be located proximate to the server 120, or may otherwise be in communication with various components, for example over the network 199. The user profile storage 402 may include a variety of information related to individual users, accounts, etc. that interact with the system 100. For illustration, as shown in FIG. 4, the user profile storage 402 may include data regarding the devices associated with particular individual user accounts 404. In an example, the user profile storage 402 is a cloud-based storage. Such data may include device identifier (ID) and internet protocol (IP) address information for different devices, as well as names of the devices and locations of the devices and users. The user profile storage may additionally include communication alteration triggers specific to each device, indication preferences of each device, and the like. In an example, the type of indication to be output by each device may not be stored in the user profile. Rather, the type of indication may be context dependent. For example, if the system is exchanging video messages, the indication may be visual. For further example, if the system is exchanging audio messages, the indication may be audible.
Each user profile may store one or more communication alteration paths. Moreover, each communication alteration path may include either a single communication alteration trigger or multiple communication alteration triggers that indicate when an alteration of communications should occur. It should be appreciated that N communication alteration paths having M communication alteration triggers may be stored within a single user profile. Each communication alteration path may be unique to a different individual with whom the user communicates. For example, one communication alteration path may be used when the user communicates with the user's mother, another communication alteration path may be used when the user communicates with the user's spouse, etc. Each communication alteration path may also be unique to a type of communication (e.g., audio message passing, video message passing, etc.). Each communication alteration path may also be unique to the types of devices involved in the communication. For example, the user may have a first communication alteration path for devices configured in the user's vehicle, a second communication alteration path for devices configured in the user's bedroom, etc.
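One way to picture the N-paths/M-triggers arrangement is as a small configuration structure keyed by contact, communication type, and device type. This is only a sketch under assumed names (AlterationPath, triggers, target); the actual layout of the user profile storage 402 is not specified at this level of detail.

```python
# Illustrative sketch of communication alteration paths in a user profile;
# field names are assumptions, not the actual schema of storage 402.
from dataclasses import dataclass, field

@dataclass
class AlterationPath:
    contact: str                  # who the user is communicating with
    comm_type: str                # e.g. "audio_message", "video_message"
    device_type: str              # e.g. "car", "bedroom"
    triggers: list = field(default_factory=list)   # one or more triggers
    target: str = "real_time_call"                  # what to alter into

user_profile = {
    "paths": [
        AlterationPath("mother", "audio_message", "bedroom",
                       triggers=["message_count>=5"], target="real_time_call"),
        AlterationPath("spouse", "audio_message", "car",
                       triggers=["message_count>=3", "both_users_near_device"],
                       target="no_wakeword_messaging"),
    ]
}

def select_path(contact, comm_type, device_type):
    # A single profile may hold N paths with M triggers; pick the matching one.
    for path in user_profile["paths"]:
        if (path.contact, path.comm_type, path.device_type) == (contact, comm_type, device_type):
            return path
    return None

print(select_path("mother", "audio_message", "bedroom"))
```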
Some or all of the communication alteration paths of a user profile may be dynamic. That is, a communication alteration path may depend upon external signals. An illustrative external signal includes proximity to a device. For example, one communication alteration path may be used when communicating with the user's mother while the user's mother is not proximate to her device, and a second communication alteration path may be used when communicating with the user's mother while the user's mother is proximate to her device. For example, the speech-controlled device 110 may capture one or more images and send corresponding image data to the server 120. The server 120 may determine that the image data includes a representation of a human. The server 120 may also determine a proximity of the human to the device 110 based on a location of the representation of the human within the image data. Dynamic selection of communication alteration paths may also be influenced by machine learning. For example, a communication alteration path may be configured to alter communications into a real-time call when the user communicates with the user's mother after a certain time at night. The system may then determine that the user alters communications a certain percentage of the time within a threshold amount of time. Based on this determination, the system may suggest that the user modify/update the communication alteration path so that message passing is altered into a real-time call sooner.
Each communication escalation path may include one or more communication alterations. One type of communication alteration involves eliminating the need for a wakeword portion, so that spoken audio need only include a command (e.g., an utterance causing the system to send a message) and message content. A second type of communication alteration involves eliminating the need for both the wakeword portion and the command, so that spoken audio need only include message content. A third type of communication alteration involves replacing the default wakeword, so that the name of the message recipient (e.g., Mom, John, etc.) becomes the wakeword. A fourth type of communication alteration is altering a message exchange into a real-time call.
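The four alteration types can be summarized by the progressively smaller set of spoken elements a user must still include, or by an outright switch to a call. The enum below is a hypothetical summary for illustration only.

```python
# Hypothetical summary of the four communication alteration types above.
from enum import Enum

class Alteration(Enum):
    DROP_WAKEWORD = 1               # utterance = command + message content
    DROP_WAKEWORD_AND_COMMAND = 2   # utterance = message content only
    RECIPIENT_NAME_AS_WAKEWORD = 3  # "Mom", "John", etc. replaces the default wakeword
    ESCALATE_TO_REAL_TIME_CALL = 4  # message exchange becomes a real-time call

def required_speech(alteration: Alteration) -> list:
    # What the speaker still has to say after each alteration is applied
    # (the type-3 entry is an assumption about what remains required).
    return {
        Alteration.DROP_WAKEWORD: ["command", "message content"],
        Alteration.DROP_WAKEWORD_AND_COMMAND: ["message content"],
        Alteration.RECIPIENT_NAME_AS_WAKEWORD: ["recipient name", "message content"],
        Alteration.ESCALATE_TO_REAL_TIME_CALL: [],  # free-form speech, no structure
    }[alteration]

for a in Alteration:
    print(a.name, "->", required_speech(a))
```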
FIGS. 5A through 5D illustrate the altering of speech-based interactions through speech-controlled devices. The first speech-controlled device 110a captures spoken audio including a wakeword portion and a payload portion (shown as 502). For example, the speech-controlled device 110a may be in a sleep mode until detection of a spoken wakeword, which triggers the speech-controlled device 110a to wake and capture audio (which may include the spoken wakeword and speech thereafter) for processing and sending to the server 120. The speech-controlled device 110a sends audio data corresponding to the captured spoken audio to the server 120 (shown as 504).
The server 120 performs ASR on the received audio data to determine text (shown as 506). The server 120 may determine the wakeword portion and the payload portion of the text, and performs NLU on the payload portion (shown as 508). Performing the NLU processing may include the server 120 tagging recipient information of the payload portion (shown as 510), tagging message content information of the payload portion (shown as 512), and tagging the entire payload portion with a "send message" intent tag (shown as 514). For example, the payload portion of the received audio data may correspond to the text "tell John Smith I said hello." According to this example, the server 120 may tag "John Smith" as recipient information, may tag "said hello" as message content information, and may tag the utterance with a "send message" intent tag. The message domain 302 may be used to tag the payload portion with the message intent tag and/or may cause the system to perform further message passing commands using, for example, a message passing command processor 290.
Using the tagged recipient information, the server 120 determines a device associated with the recipient (e.g., speech-controlled device 110b) (shown as 516 in FIG. 5B). To determine the recipient device, the server 120 may use a user profile associated with the speech-controlled device 110a and/or the user that spoke the initial audio. For example, the server 120 may access a table of the user profile to match text therein corresponding to the tagged recipient information (i.e., "John Smith"). Once matching text is identified, the server 120 may identify a recipient device associated with the matching text in the table.
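The lookup at step 516 amounts to matching the tagged recipient string against a table in the sender's user profile and returning the associated device. A minimal sketch, with an assumed table layout:

```python
# Minimal sketch of the recipient-device lookup at step 516; the table
# layout and device identifiers below are assumptions for illustration.
profile_contacts = {
    "john smith": {"device_id": "speech-controlled-device-110b", "ip": "203.0.113.7"},
    "mom":        {"device_id": "speech-controlled-device-110c", "ip": "203.0.113.9"},
}

def find_recipient_device(tagged_recipient: str):
    entry = profile_contacts.get(tagged_recipient.lower())
    return entry["device_id"] if entry else None

print(find_recipient_device("John Smith"))   # speech-controlled-device-110b
```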
The server 120 also uses a domain of the server 120, and associated protocol(s), associated with the "send message" intent tag to generate output audio data (shown as 518). The output audio data may include the spoken audio received from the speech-controlled device 110a. Alternatively, the output audio data may include computer-generated text-to-speech (TTS) audio data generated based on text of the message content received from the speech-controlled device 110a. The server 120 sends the output audio data to the recipient device (shown as 520), which outputs audio of the audio data to the recipient (shown as 522). In an example, the recipient's speech-controlled device 110b may not output the audio data until it detects a command to do so from the recipient. Such a command may be an utterance of the recipient corresponding to "What are my messages?", "Do I have any messages?", or the like.
The server 120 performs message passing between the first speech-controlled device 110a and the second speech-controlled device 110b as described in detail above with respect to steps 502-522 of FIGS. 5A and 5B (e.g., through a message domain) (shown as 524), until the server 120 determines the occurrence of a communication alteration trigger (shown as 526). The communication alteration trigger may cause the server 120 to use another domain, and corresponding protocol(s), different from the domain used to perform the earlier communications/processes, to perform subsequent communications/processes. Alternatively, the system may adjust the processing of future messages (such as no longer requiring certain spoken data, such as the wakeword or an indication of the recipient). The determined communication alteration trigger may take on any number of forms. The communication alteration trigger may be based on whether a threshold number of message exchanges between the first speech-controlled device 110a and the second speech-controlled device 110b is met or exceeded. For example, the threshold number of message exchanges may be configured by the user of either speech-controlled device 110a/110b, and may be indicated in the respective user profile. It should be appreciated that the threshold number of message exchanges associated with the user profile of the first speech-controlled device 110a may be different from the threshold number of message exchanges associated with the user profile of the second speech-controlled device 110b. In this situation, the threshold used by the server 120 to determine when the communication alteration should occur may be the threshold that is met or exceeded first (i.e., the threshold having the lesser number of required message exchanges). The communication alteration trigger may also or alternatively be based on a threshold number of message exchanges occurring within a threshold amount of time. For example, the threshold number of message exchanges and/or the threshold amount of time may be configured by the user of either speech-controlled device 110a/110b, and may be indicated in the respective user profile. It should be appreciated that the threshold number of message exchanges and the threshold amount of time associated with the user profile of the first speech-controlled device 110a may be different from those associated with the user profile of the second speech-controlled device 110b. In this situation, the threshold used by the server 120 to determine when the communication alteration should occur may be the threshold that is met or exceeded first. The communication alteration trigger may also or alternatively be based on the users of both speech-controlled devices 110a/110b simultaneously being within threshold distances of their respective devices. It should be appreciated that the alteration of communications may occur based on the satisfaction of a single communication alteration trigger. It should also be appreciated that the alteration of communications may occur based on the satisfaction of more than one communication alteration trigger.
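A first alteration trigger may therefore combine an exchange-count threshold, a time-window threshold, and a proximity condition, using whichever configured threshold is met first. The check below is a simplified sketch of that logic; the field names and example values are assumptions for illustration.

```python
# Simplified sketch of evaluating a communication alteration trigger;
# thresholds and field names are assumptions for illustration only.
import time

def trigger_fired(state: dict, profile_a: dict, profile_b: dict) -> bool:
    # Use whichever party's configured threshold is met first (i.e. the smaller one).
    count_threshold = min(profile_a["exchange_threshold"], profile_b["exchange_threshold"])
    if state["exchange_count"] >= count_threshold:
        return True
    # Threshold number of exchanges within a threshold amount of time.
    window = min(profile_a["window_seconds"], profile_b["window_seconds"])
    recent = [t for t in state["exchange_times"] if time.time() - t <= window]
    if len(recent) >= count_threshold:
        return True
    # Both users simultaneously within a threshold distance of their devices.
    return state["user_a_near_device"] and state["user_b_near_device"]

state = {"exchange_count": 5, "exchange_times": [],
         "user_a_near_device": False, "user_b_near_device": True}
profile_a = {"exchange_threshold": 5, "window_seconds": 300}
profile_b = {"exchange_threshold": 7, "window_seconds": 600}
print(trigger_fired(state, profile_a, profile_b))   # True: count threshold of 5 met
```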
Once the one or more communication alteration triggers are determined, depending upon implementation, the server 120 reconfigures so as to no longer require utterances from the first/second speech-controlled devices to include a wakeword portion or recipient information in the received audio data (shown as 528). This may be accomplished, for example, using the message domain 302 and associated protocol(s). Further, the reconfiguration occurring at step 528 may indicate that the speech-controlled device 110b is to output received communications without first detecting speech corresponding to a command to do so. The server 120 may also send a signal to one or both of the speech-controlled devices 110a/110b indicating that communications between the first speech-controlled device 110a and the second speech-controlled device 110b are being altered (shown as 530). A speech-controlled device may output an indication representing that the device is "listening" in an attempt to capture message content. A speech-controlled device may additionally output an indication representing that the recipient's device is capturing spoken message content. The speech-controlled device 110a and/or the speech-controlled device 110b may then output a signal indicating that wakeword audio is no longer needed (shown as 532 in FIG. 5C). The signal output by one or both of the speech-controlled devices 110a/110b may be a static indication or a motion indication, as described below.
Thereafter, the speech-controlled device 110a captures spoken audio from the user including only payload information (shown as 534), and sends audio data corresponding to the payload information to the server 120 (shown as 536). The server 120 performs ASR on the received audio data to determine text (shown as 538), and performs NLU processing on the payload information text (shown as 540). Performing the NLU processing may include the server 120 tagging recipient information of the payload information text, tagging message content information of the payload information text, and tagging the entire payload information text with an instant message intent tag. For example, the payload information of the received audio data may state "when will you finish the project?". According to this example, the server 120 may tag "when will you finish the project" as message content information, and may tag the utterance with a "send instant message" intent tag. Tagging the payload information text with the message intent tag may cause the server 120 to perform downstream processes using the message domain 302. By not requiring recipient information to be present in the input audio, the server 120 may assume that the recipient device is the same recipient device used in the earlier communications, without the server 120 again determining the recipient device.
The server 120 also uses a domain of the server 120, and associated protocol(s), associated with the "send instant message" intent tag to generate output audio data (shown as 542). For example, the message domain 302 may be associated with the instant message intent tag. The output audio data may include the spoken audio received from the speech-controlled device 110a. Alternatively, the output audio data may include computer-generated text-to-speech (TTS) audio data generated based on the spoken audio received from the speech-controlled device 110a. The server 120 sends the output audio data to the recipient device (i.e., the speech-controlled device 110b) (shown as 544), which outputs audio of the audio data to the recipient (shown as 546 in FIG. 5D). As described above, the reconfiguration occurring at step 528 may indicate that the speech-controlled device 110b is to output received communications without first receiving a command from the user to do so. As such, it should be appreciated that the speech-controlled device 110b may output the audio data to the recipient at step 546 without first receiving a command to do so. That is, the speech-controlled device 110b may automatically play the audio data.
The server 120 performs instant message passing between the first speech-controlled device 110a and the second speech-controlled device 110b as described in detail above with respect to steps 534-546 of FIGS. 5C and 5D (e.g., through an instant message domain and without the need for wakeword audio data) (shown as 548), until the server 120 determines the occurrence of another communication alteration trigger (shown as 550). The determined second communication alteration trigger may take on any number of forms. Like the first communication alteration trigger, the second communication alteration trigger may be based on whether a threshold number of message exchanges between the first speech-controlled device 110a and the second speech-controlled device 110b is met or exceeded, based on a threshold number of message exchanges occurring within a threshold amount of time, and/or based on the users of both speech-controlled devices 110a/110b simultaneously being within threshold distances of their respective devices. The thresholds used to determine the first communication alteration trigger and the second communication alteration trigger may be the same (e.g., each requiring 5 message exchanges) or different (e.g., the first communication alteration occurring after 5 message exchanges using the message domain 302 and the second communication alteration occurring after 7 message exchanges using the message domain 302). A single counter, not reset after the first communication alteration, may be used to determine the message exchanges for each communication alteration trigger. According to the preceding example, the first communication alteration may occur after the counter reaches 5 message exchanges (i.e., 5 message exchanges using the message domain 302), and the second communication alteration may occur after the counter reaches 12 message exchanges (i.e., 7 further message exchanges using the message domain 302). Alternatively, different counters, or a single counter that is reset after the first communication alteration, may be used to determine the message exchanges for each communication alteration. According to the preceding example, the first communication alteration may occur after the counter reaches 5 message exchanges (i.e., 5 message exchanges using the message domain 302), the counter may then be reset to zero, and the second communication alteration may occur after the counter reaches 7 message exchanges (i.e., 7 message exchanges using the message domain 302). The threshold distances of the users from the speech-controlled devices 110a/110b may be the same or different for the first communication alteration and the second communication alteration. Moreover, like the first communication alteration, the second communication alteration may occur based on the satisfaction of a single communication alteration trigger or more than one communication alteration trigger.
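The two counting schemes described above (one running counter versus a counter reset after the first alteration) differ only in when the count is cleared. A small sketch using the 5-and-7 exchange example from the text; the function names are illustrative.

```python
# Sketch of the two counter schemes for successive alteration triggers,
# using the 5 / 7 exchange example; names are illustrative only.
def alterations_single_counter(total_exchanges: int) -> int:
    # One counter that is never reset: first alteration at 5, second at 12.
    if total_exchanges >= 12:
        return 2
    if total_exchanges >= 5:
        return 1
    return 0

def alteration_due_reset_counter(exchanges_since_last_alteration: int,
                                 already_altered: bool) -> bool:
    # Counter reset to zero after the first alteration: 5 before the first
    # alteration, then 7 more before the second.
    threshold = 7 if already_altered else 5
    return exchanges_since_last_alteration >= threshold

print(alterations_single_counter(12))          # 2
print(alteration_due_reset_counter(7, True))   # True
```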
Once the second communication alteration trigger is determined, depending upon implementation, the server 120 reconfigures so as to use a domain, and associated protocol(s), for establishing a real-time call between the speech-controlled device 110a and the speech-controlled device 110b (shown as 552). For example, such a domain may be the real-time call domain 304. A real-time call, as used herein, refers to a call between the speech-controlled devices 110a/110b facilitated by the server 120, where a direct communication channel may be opened between the speech-controlled devices. For example, during a real-time call the system may send audio data from the first speech-controlled device 110a to the second speech-controlled device 110b without performing speech processing (such as ASR or NLU) on the audio data, so that the user of the first speech-controlled device 110a may speak "directly" with the user of the second speech-controlled device 110b. Alternatively, the system may perform speech processing (such as ASR or NLU) to check for commands intended for the system while still passing the audio data back and forth between the devices 110a/110b. For example, the real-time call may be terminated as discussed below with respect to FIG. 7.
The server 120 may send a signal to one or both of the speech-controlled devices 110a/110b indicating that the real-time call has been established (shown as 554). The speech-controlled device 110a and/or the speech-controlled device 110b then outputs a signal indicating that the user may simply speak as if he/she were performing a point-to-point call (shown as 556). A real-time or point-to-point call/communication, as used herein, refers to a call between the speech-controlled devices 110a/110b facilitated by the server 120. That is, a real-time or point-to-point call is a communication in which audio is simply captured by a device, sent to the server as audio data, and the server simply sends the received audio data to the recipient device, which outputs the audio without first receiving a command to do so. The signal output by one or both of the speech-controlled devices 110a/110b may be a static indication or a motion indication, as described below. The system then performs a real-time communication session (shown as 558). The real-time communication session may be performed by the system until a downgrade trigger is determined (as detailed herein).
Various types of protocols may be used by the system when performing communications between speech-controlled devices, to control data size, transmission speed, and the like. For example, a first protocol may be used to control the exchange of communications that need to include a wakeword portion and recipient content. A second protocol may be used to control the exchange of communications that do not need a wakeword portion but still need recipient content. A third protocol may be used to control the exchange of communications that do not include an NLU intent. That is, the third protocol may be used when neither a wakeword portion nor recipient content is needed, because the system assumes the recipient based on previously occurring message exchanges. A real-time protocol, such as VoIP, may be used when performing real-time calls between the speech-controlled devices.
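The protocol choice can thus be summarized as a function of which spoken elements are still required and whether the exchange has become a real-time call. The mapping below is a hypothetical restatement of this paragraph; only VoIP is named in the text, and the other protocol labels are placeholders.

```python
# Hypothetical restatement of the protocol selection described above;
# only VoIP is named in the text, the other labels are placeholders.
def select_protocol(needs_wakeword: bool, needs_recipient: bool,
                    real_time_call: bool) -> str:
    if real_time_call:
        return "VoIP"           # real-time protocol for simultaneous calls
    if needs_wakeword and needs_recipient:
        return "protocol-1"     # wakeword + recipient content required
    if needs_recipient:
        return "protocol-2"     # no wakeword, recipient content still required
    return "protocol-3"         # neither; recipient assumed from prior exchanges

print(select_protocol(True, True, False))    # protocol-1
print(select_protocol(False, False, False))  # protocol-3
print(select_protocol(False, False, True))   # VoIP
```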
FIGS. 6A and 6B illustrate the altering of speech-based interactions through speech-controlled devices based on the intended recipient of a message. The first speech-controlled device 110a captures spoken audio including a wakeword portion and a payload portion (shown as 502). For example, the speech-controlled device 110a may be in a sleep mode until detection of a spoken wakeword, which triggers the speech-controlled device 110a to wake and capture audio including the spoken wakeword and speech thereafter. The speech-controlled device 110a sends audio data corresponding to the captured spoken audio to the server 120 (shown as 504).
The server 120 performs ASR on the received audio data to determine text (shown as 506). The server 120 determines the wakeword portion and the payload portion of the text, and performs NLU on the payload portion (shown as 508). Performing the NLU processing may include the server 120 tagging recipient information of the payload portion (shown as 510), tagging message content information of the payload portion (shown as 512), and tagging the entire payload portion with a "send message" intent tag (shown as 514). For example, the payload portion of the received audio data may state "tell Mom I said I'll be there soon." According to this example, the server 120 may tag "Mom" as recipient information, may tag "I'll be there soon" as message content information, and may associate the utterance with a "send message" intent tag. As described above, communication alteration paths and communication alteration triggers may be configured via the user profile. According to this embodiment, the server 120 may determine a communication alteration based on the intended recipient of the message. For example, using the tagged recipient information, the server 120 may access the user profile of the speech-controlled device 110a and determine a communication alteration path indicating that communications with "Mom" are to be performed via real-time call (shown as 602 in FIG. 6B). Thereafter, the server 120 reconfigures so as to use a domain, and associated protocol(s), for establishing a real-time call between the speech-controlled device 110a and the speech-controlled device 110b (shown as 552). For example, such a domain may be the real-time call domain 304. The server 120 may send a signal to one or both of the speech-controlled devices 110a/110b indicating that the real-time call has been established (shown as 554). The speech-controlled device 110a and/or the speech-controlled device 110b then outputs a signal indicating that the user may simply speak as if he/she were performing a point-to-point call (shown as 556). The signal output by one or both of the speech-controlled devices 110a/110b may be a static indication or a motion indication, as described below. The system then performs a real-time communication session (shown as 558). The real-time communication session may be performed by the system until another communication alteration trigger is determined (as detailed herein).
FIG. 7 illustrates the altering of speech-based interactions through speech-controlled devices. The server 120 exchanges communications between the speech-controlled devices 110a/110b through a domain, and associated protocol(s), associated with real-time calls (shown as 702), until the server 120 determines the occurrence of a communication alteration trigger (shown as 704). For example, such a domain may be the real-time call domain 304. The communication alteration trigger may take on various forms. The communication alteration trigger may be based on the user of either speech-controlled device 110a/110b multitasking (i.e., causing the server 120 to perform a task unrelated to the real-time call). The communication alteration trigger may also or alternatively be based on a threshold length of inactivity being met or exceeded (e.g., determining that no exchanges have occurred for n minutes). The communication alteration trigger may also or alternatively be based on a user instruction (e.g., the user of either speech-controlled device 110a/110b stating, for example, "close the call," "stop the call," "end the call"). The communication alteration trigger may also or alternatively be based on instructions originating from the users of both speech-controlled devices 110a/110b (e.g., the users saying "goodbye," "bye," etc. to each other within a threshold number of seconds). Moreover, the communication alteration trigger may also or alternatively be based on the server 120 detecting a wakeword within the exchanges of the real-time call. The alteration of communications may occur based on determining that one or more than one communication alteration trigger is satisfied.
After determining the alteration should occur, the server 120 ceases the real-time call (shown as 706) and sends a signal indicating as much to one or both of the speech-controlled devices 110a/110b (shown as 708). The speech-controlled device 110a and/or the speech-controlled device 110b then outputs a signal indicating that the real-time call has ceased (shown as 710). The signal output by one or both of the speech-controlled devices 110a/110b may be a static indication or a motion indication, as described below. Altering the communication may involve ceasing all communications between the speech-controlled devices 110a/110b at that point in time. Alternatively, altering the communication may involve altering the communication into a second form of communication different from the real-time call. For example, the second form of communication may involve the server 120 performing instant messaging between the first speech-controlled device 110a and the second speech-controlled device 110b, as described in detail above with respect to steps 534-546 of FIGS. 5C and 5D (shown as 548), until the server 120 determines the occurrence of a communication alteration trigger.
FIGS. 8A and 8B illustrate the signaling of communications output via a user interface of a speech-controlled device. The speech-controlled device 110a captures spoken audio (shown as 802), compiles the captured spoken audio into audio data, and sends the audio data to the server 120 (shown as 504).
The server 120 performs ASR on the audio data to determine text (e.g., "tell John Smith I said hello") (shown as 506), and performs NLU on the text (shown as 804). The server 120 locates tagged recipient information (i.e., "John Smith") within the NLU-processed text (shown as 806), and therefrom determines a recipient device (shown as 808). For example, the server 120 may access a user profile associated with the speech-controlled device 110a and/or its user. Using the user profile, the server 120 may locate text within a table corresponding to the recipient information (i.e., "John Smith"), and may identify recipient device information associated with the recipient information in the table. The server 120 also determines tagged message content (e.g., "hello") within the NLU-processed text (shown as 810).
The server 120 sends, to the speech-controlled device 110a from which the original spoken audio data originated, a signal indicating that the message content is being or will be sent to the recipient device (i.e., the speech-controlled device 110b) (shown as 812). In response to receiving the signal, the speech-controlled device 110a outputs a visual indication representing that the message content (i.e., hello) is being or will be sent to the recipient device (shown as 814). For example, the visual indication may include outputting a static indicator (e.g., a certain color, etc.) or a motion indicator (e.g., blinking or strobing elements, continuous motion, etc.). Output of the visual indication may be configured according to user profile preferences. Alternatively, in response to receiving the signal, the speech-controlled device 110a may output a haptic and/or audible indication (shown as 816). The haptic indication may include the speech-controlled device 110a vibrating and/or a remote device (e.g., a smart watch) in communication with the speech-controlled device 110a vibrating. The remote device and the speech-controlled device 110a may communicate by being located within a single table of user devices associated with the user profile. The audible indication may include computer-generated/TTS-generated speech and/or user-generated speech corresponding to, for example, "your message is being sent" or "your message will be sent momentarily." Like the haptic indication, the audible indication may be output by the speech-controlled device 110a, a remote microphone array, and/or a remote device (e.g., a smart watch). The remote device, the microphone array, and the speech-controlled device 110a may communicate by being located within a single table of user devices associated with the user profile.
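The choice among visual, haptic, and audible indications is context dependent and may also honor user profile preferences (for example, visual for video messages and audible for audio messages, as noted above with respect to FIG. 4). A minimal sketch of that dispatch, under assumed preference names:

```python
# Minimal sketch of choosing an indication type; preference names and the
# fallback rules are assumptions for illustration only.
def choose_indication(comm_type: str, profile_preference: str = None) -> str:
    if profile_preference:              # an explicit user preference wins
        return profile_preference
    if comm_type == "video_message":    # context-dependent defaults
        return "visual"
    if comm_type == "audio_message":
        return "audible"
    return "haptic"

print(choose_indication("audio_message"))            # audible
print(choose_indication("video_message", "haptic"))  # haptic (profile preference)
```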
The server 120 also sends audio data including the message content to the determined recipient device (i.e., the speech-controlled device 110b) (shown as 818). It should be appreciated that steps 814-818 (and other steps of the other figures) may occur in various orders, and may also occur concurrently. The speech-controlled device 110b then outputs audio corresponding to the message content (shown as 522). While the speech-controlled device 110b detects speech responsive to the message content (shown as 820), the speech-controlled device 110b sends a signal indicating such to the server 120 (shown as 822). The server 120 then sends a signal to the speech-controlled device 110a indicating that the speech-controlled device 110b is detecting speech (shown as 824). The server 120 may determine that the detected speech is responsive to the output audio based on, for example, a recipient name indicated in the detected speech, or the speech-controlled devices 110a/110b being part of an instant message exchange that does not require wakeword audio data. Further, in an example, the server 120 may cause the speech-controlled device 110b to output audio asking the user whether the user wants to reply to the received message. The server 120 may then receive audio data from the second speech-controlled device 110b, perform ASR on the audio data to determine text data, determine that the text data includes at least one word indicating an intent to respond (e.g., "yes"), and therefrom determine that audio data received thereafter is a response to the original message. In another example, the server 120 may receive audio data from the second speech-controlled device 110b, use speech processing to determine that an audio signature of the received audio data matches a speech-based speaker ID of the recipient of the original message, and therefrom determine that the audio data received thereafter is a response to the original message. In response to receiving the signal, the speech-controlled device 110a outputs a visual indication representing that the speech-controlled device 110b is detecting speech (shown as 826). For example, the visual indication may include outputting a static indicator (e.g., a certain color, etc.) or a motion indicator (e.g., blinking or strobing elements, continuous motion, etc.). Output of the visual indication may be configured according to user profile preferences. In an example, once the visual indication is no longer output, audio spoken by the recipient in response to the original message may be output by the speech-controlled device 110a. Alternatively, in response to receiving the signal, the speech-controlled device 110a may output a haptic and/or audible indication (shown as 828). The haptic indication may include the speech-controlled device 110a vibrating and/or a remote device (e.g., a smart watch) in communication with the speech-controlled device 110a vibrating. The remote device and the speech-controlled device 110a may communicate by being located within a single table of user devices associated with the user profile. The audible indication may include computer-generated/TTS-generated speech and/or user-generated speech corresponding to, for example, "John Smith is responding to your message" or "John Smith is talking." Like the haptic indication, the audible indication may be output by the speech-controlled device 110a, a remote microphone array, and/or a remote device (e.g., a smart watch). The remote device, the microphone array, and the speech-controlled device 110a may communicate by being located within a single table of user devices associated with the user profile.
FIG. 9 illustrates the signaling of communications output via a user interface of a speech-controlled device. The speech-controlled device 110a captures spoken audio including a wakeword portion and recipient information (shown as 902). The speech-controlled device 110a converts the captured recipient information audio into audio data, and sends the audio data to the server 120 (shown as 904). Alternatively, the speech-controlled device 110a may send audio data corresponding to both the wakeword portion and the recipient information to the server 120. In this example, the server 120 may isolate the recipient information audio data from the wakeword portion audio data, and discard the wakeword portion audio data. The server 120 may perform speech processing (e.g., ASR and NLU) on the recipient information audio data (shown as 906). For example, the server 120 may perform ASR on the recipient information audio data to create recipient information text data, and may perform NLU on the recipient information text data to identify the recipient's name. If the speech-controlled device 110a from which the received audio data originated is associated with multiple users, the server 120 may perform various processes to determine which user spoke the wakeword portion and the recipient information audio (shown as 908).
Using the speech-processed recipient information audio data and knowledge of the speaker of the recipient information audio, the server 120 determines a device of the recipient, using a user profile associated with the speaker of the recipient information audio, to which future data should be sent (shown as 910). If the recipient is only associated with one device in the user profile, that device is the device to which data will be sent. If the recipient is associated with multiple devices in the user profile, various information may be used to determine which recipient device to send data to. For example, a physical location of the recipient may be determined, and the data may be sent to the device closest to the recipient. In another example, it may be determined which device the recipient is presently using, and the data may be sent to the device presently being used. In yet another example, it may be determined which device the recipient is presently using, and the data may be sent to a second device closest to the device presently being used. In a further example, the device determined by the server 120 (i.e., the device to which future data will be sent) may be a distribution device (e.g., a router), where the distribution device determines which of the recipient's multiple devices to send the data to.
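When a recipient is associated with several devices, the selection in step 910 can follow any of the strategies listed above (closest device, device in use, device nearest the one in use, or deferral to a distribution device such as a router). A sketch of the first two strategies, with assumed device-record fields:

```python
# Sketch of choosing among a recipient's multiple devices (step 910);
# the device records and distance fields are assumptions for illustration.
def choose_recipient_device(devices: list) -> dict:
    if len(devices) == 1:
        return devices[0]                   # only one device in the profile
    in_use = [d for d in devices if d.get("in_use")]
    if in_use:
        return in_use[0]                    # device the recipient is presently using
    # Otherwise fall back to the device closest to the recipient.
    return min(devices, key=lambda d: d["distance_to_recipient_m"])

devices = [
    {"id": "kitchen-110b", "in_use": False, "distance_to_recipient_m": 8.0},
    {"id": "bedroom-110c", "in_use": False, "distance_to_recipient_m": 2.5},
]
print(choose_recipient_device(devices)["id"])   # bedroom-110c
```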
The server 120 sends a signal indicating that a message is forthcoming to the determined device of the recipient (shown as 912). The signal may be sent to the recipient device when the server 120 sends message content text data to a TTS component. For illustration purposes, the determined device of the recipient may be the speech-controlled device 110b. The speech-controlled device 110b then outputs an indication representing that a message is forthcoming (shown as 914). The indication output by the speech-controlled device may be a visual indication, an audible indication, and/or a haptic indication as described herein.
The speech-controlled device 110a of the message sender also captures spoken audio including message content (shown as 916). The speech-controlled device 110a converts the message content spoken audio into audio data, and sends the message content audio data to the server 120 (shown as 918). In an example, the speech-controlled device 110b outputs the indication while the speech-controlled device 110a captures the message content audio and while the server 120 receives the message content audio from the speech-controlled device 110a. The server 120 may send the message content audio data to the previously determined recipient device (shown as 920), which outputs audio including the message content (shown as 922). Alternatively, the server 120 may perform processes as described above with respect to step 910 to determine which recipient device to send the message content audio data to. As such, it should be appreciated that, depending upon the situation, the recipient device that outputs the indication of the forthcoming message and the recipient device that outputs the message content may be the same device, or may be different devices.
FIGS. 10A through 10C illustrate examples of visual indications as discussed herein. A visual indication may be output via a light ring 1002 of the speech-controlled device 110. The light ring 1002 may be located anywhere on the speech-controlled device 110 that allows a user of the speech-controlled device 110 to adequately see it. Depending upon the message to be conveyed, different colors may be output via the light ring 1002. For example, the light ring 1002 may emit a green light to indicate that a message is being or will be sent to a recipient device. In another example, the light ring 1002 may emit a blue light to indicate that a recipient device is detecting or capturing spoken audio. It should also be appreciated that the light ring 1002 may emit different shades of a single color to convey different messages. For example, the light ring (illustrated as 1002a in FIG. 10A) may output a dark shade of a color to indicate a first message, the light ring (illustrated as 1002b in FIG. 10B) may output a medium shade of the color to indicate a second message, and the light ring (illustrated as 1002c in FIG. 10C) may output a light shade of the color to indicate a third message. While three shades are illustrated, it should be appreciated by those skilled in the art that more or fewer than three shades of a single color may be implemented, depending upon how many different messages are to be conveyed. Moreover, while the visual indication examples of FIGS. 10A through 10C may be static, they may also appear to move in some fashion. For example, the visual indication may blink, strobe, or continuously move about/along a surface of the device 110.
FIGS. 11A and 11B illustrate motion indications as described herein. As illustrated, the light ring 1002 may be configured so that a portion of the light ring 1002 appears to move around the speech-controlled device 110. Although not illustrated, it should also be appreciated that the light ring 1002 and/or the LEDs 1202/1204 may be configured to blink, strobe, etc.
FIG. 12 illustrates another visual indication as described herein. As illustrated in FIG. 12, a static visual indication may be output via LEDs 1202/1204 or some other similar light-emitting component. The LEDs 1202/1204 may be located anywhere on the speech-controlled device 110 that allows a user of the speech-controlled device 110 to adequately see them. Depending upon the message to be conveyed, different colors may be output via the LEDs 1202/1204. For example, the LEDs 1202/1204 may emit a green light to indicate that a message is being or will be sent to a recipient device. In another example, the LEDs 1202/1204 may emit a blue light to indicate that a recipient device is detecting or capturing spoken audio. It should also be appreciated that the LEDs 1202/1204 may emit different shades of a single color to convey different messages. For example, the LEDs 1202/1204 may output a dark shade of a color to indicate a first message, a medium shade of the color to indicate a second message, and a light shade of the color to indicate a third message. While three shades are described, it should be appreciated by those skilled in the art that more or fewer than three shades of a single color may be implemented, depending upon how many different messages are to be conveyed. It should be appreciated that both the light ring 1002 and the LEDs 1202/1204 may be implemented within the same speech-controlled device 110, and that different variations of the described indications (and other indications) may be used.
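The light ring 1002 and LEDs 1202/1204 can therefore convey different states by color, by shade of a single color, and by motion (blink, strobe, sweep). A small illustrative mapping; the RGB values and animation names are placeholders, not actual device values.

```python
# Illustrative mapping of indicator states to light ring / LED output;
# RGB values and animation names are placeholders only.
INDICATIONS = {
    "message_sending":       {"color": (0, 255, 0),     "animation": "solid"},  # green
    "recipient_capturing":   {"color": (0, 0, 255),     "animation": "solid"},  # blue
    "first_message_state":   {"color": (0, 0, 180),     "animation": "solid"},  # dark shade
    "second_message_state":  {"color": (80, 80, 220),   "animation": "solid"},  # medium shade
    "third_message_state":   {"color": (170, 170, 255), "animation": "solid"},  # light shade
    "listening_no_wakeword": {"color": (0, 0, 255),     "animation": "sweep"},  # motion indication
}

def render(state: str) -> str:
    spec = INDICATIONS[state]
    return f"{state}: rgb{spec['color']} ({spec['animation']})"

for s in ("message_sending", "recipient_capturing", "listening_no_wakeword"):
    print(render(s))
```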
While the examples discussed above describe visual indicators, other indicators, such as audio indicators, haptic indicators, or the like, may also be used to indicate an incoming message, a reply being spoken, etc.
FIG. 13 is a block diagram conceptually illustrating a user device 110 (e.g., the speech-controlled devices 110a and 110b described herein) that may be used with the described system. FIG. 14 is a block diagram conceptually illustrating example components of a remote device, such as the remote server 120, that may assist with ASR processing, NLU processing, or command processing. Multiple such servers 120 may be included in the system, such as one (or more) server(s) 120 for performing ASR, one (or more) server(s) 120 for performing NLU, etc. In operation, each of these devices (or groups of devices) may include computer-readable and computer-executable instructions that reside on the respective device (110/120), as will be discussed further below.
Each of these devices (110/120) may include one or more controllers/processors (1304/1404), each of which may include a central processing unit (CPU) for processing data and computer-readable instructions, and a memory (1306/1406) for storing data and instructions of the respective device. The memories (1306/1406) may individually include volatile random access memory (RAM), non-volatile read only memory (ROM), non-volatile magnetoresistive (MRAM) memory, and/or other types of memory. Each device may also include a data storage component (1308/1408) for storing data and controller/processor-executable instructions. Each data storage component may individually include one or more non-volatile storage types, such as magnetic storage, optical storage, solid-state storage, etc. Each device may also be connected to removable or external non-volatile memory and/or storage (such as a removable memory card, memory key drive, networked storage, etc.) through the respective input/output device interfaces (1302/1402).
Computer instructions for operating each device (110/120) and its various components may be executed by the respective device's controller(s)/processor(s) (1304/1404), using the memory (1306/1406) as temporary "working" storage at runtime. A device's computer instructions may be stored in a non-transitory manner in the non-volatile memory (1306/1406), storage (1308/1408), or an external device. Alternatively, some or all of the executable instructions may be embedded in hardware or firmware on the respective device in addition to or instead of software.
Each device (110/120) includes input/output device interfaces (1302/1402). A variety of components may be connected through the input/output device interfaces (1302/1402), as will be discussed further below. Additionally, each device (110/120) may include an address/data bus (1324/1424) for conveying data among components of the respective device. Each component within a device (110/120) may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus (1324/1424).
Referring to the device 110 of FIG. 13, the device 110 may include a display 1318, which may comprise a touch interface 1319 configured to receive limited touch inputs. Or the device 110 may be "headless" and may rely primarily on spoken commands for input. As a way of indicating to a user that a connection to another device has been opened, the device 110 may be configured with a visual indicator, such as an LED or similar component (not illustrated), that may change color, flash, or otherwise provide a visual indication by the device 110. The device 110 may also include input/output device interfaces 1302 that connect to a variety of components, such as an audio output component like a speaker 101, a wired or wireless headset (not illustrated), or other component capable of outputting audio. The device 110 may also include an audio capture component. The audio capture component may be, for example, a microphone 103 or array of microphones, a wired or wireless headset (not illustrated), etc. The microphone 103 may be configured to capture audio. If an array of microphones is included, the approximate distance to a sound's point of origin may be determined by acoustic localization based on time and amplitude differences between sounds captured by different microphones of the array. The device 110 (using the microphone 103, the wakeword detection module 220, the ASR module 250, etc.) may be configured to determine audio data corresponding to detected audio. The device 110 (using the input/output device interfaces 1302, antenna 1314, etc.) may also be configured to transmit the audio data to the server 120 for further processing, or to process the data using internal components such as the wakeword detection module 220.
For example, via the antenna 1314, the input/output device interfaces 1302 may connect to one or more networks 199 via a wireless local area network (WLAN) (such as WiFi) radio, Bluetooth, and/or a wireless network radio, such as a radio capable of communication with a wireless communication network such as a Long Term Evolution (LTE) network, WiMAX network, 3G network, etc. A wired connection, such as Ethernet, may also be supported. Through the network(s) 199, the speech processing system may be distributed across a networked environment.
The device 110 and/or the server 120 may include an ASR module 250. The ASR module in the device 110 may be of limited or extended capabilities. The ASR module 250 may include the language models 254 stored in the ASR model storage component 252, and an ASR module 250 that performs the automatic speech recognition process. If limited speech recognition is included, the ASR module 250 may be configured to identify a limited number of words, such as keywords detected by the device, whereas extended speech recognition may be configured to recognize a much larger range of words.
The device 110 and/or the server 120 may include a limited or extended NLU module 260. The NLU module in the device 110 may be of limited or extended capabilities. The NLU module 260 may include a named entity recognition module 262, an intent classification module 264, and/or other components. The NLU module 260 may also include a stored knowledge base and/or entity library, or those storages may be located separately.
The device 110 and/or the server 120 may also include a command processor 290 that is configured to execute commands/functions associated with a spoken command as described above.
The device 110 may include a wakeword detection module 220, which may be a separate component or may be included in the ASR module 250. The wakeword detection module 220 receives audio signals and detects occurrences of a particular expression (such as a configured keyword) in the audio. This may include detecting a change in frequencies over a specific period of time where the change in frequencies results in a specific audio signature that the system recognizes as corresponding to the keyword. Keyword detection may include analyzing individual directional audio signals, such as those processed post-beamforming if applicable. Other techniques known in the art of keyword detection (also known as keyword spotting) may also be used. In some embodiments, the device 110 may be configured collectively to identify a set of the directional audio signals in which the wake expression is detected, or in which the wake expression is likely to have occurred.
The wakeword detection module 220 receives captured audio and processes the audio (for example, using model(s) 232) to determine whether the audio corresponds to particular keywords recognizable by the device 110 and/or the system 100. The storage 1308 may store data relating to keywords and functions to enable the wakeword detection module 220 to perform the algorithms and methods described above. The locally stored speech models may be pre-configured based on known information, prior to the device 110 being configured to access the network by the user. For example, the models may be language and/or accent specific to a region where the user device is shipped or predicted to be located, or to the user himself/herself, based on a user profile, etc. In an aspect, the models may be pre-trained using speech or audio data of the user from another device. For example, the user may own another user device that the user operates via spoken commands, and this speech data may be associated with a user profile. The speech data from the other user device may then be leveraged and used to train the locally stored speech models of the device 110 prior to the user device 110 being delivered to the user or configured to access the network by the user. The wakeword detection module 220 may access the storage 1308 and compare the captured audio to the stored models and audio sequences using audio comparison, pattern recognition, keyword spotting, audio signatures, and/or other audio processing techniques.
As noted above, multiple devices may be employed in a single speech processing system. In such a multi-device system, each of the devices may include different components for performing different aspects of the speech processing. The multiple devices may include overlapping components. The components of the device 110 and the server 120, as illustrated in FIGS. 13 and 14, are exemplary, and may be located as a stand-alone device or may be included, in whole or in part, as a component of a larger device or system.
To create output speech, the server 120 may be configured with a text-to-speech ("TTS") module 1410 that transforms text data into audio data representing speech. The audio data may then be sent to the device 110 for playback to the user, thus creating the output speech. The TTS module 1410 may include a TTS storage for converting the input text into speech. The TTS module 1410 may include its own controller(s)/processor(s) and memory, or may use the controller/processor and memory of the server(s) 120 or another device, for example. Similarly, the instructions for operating the TTS module 1410 may be located within the TTS module 1410, within the memory and/or storage of the server(s) 120, or within an external device.
Text input into the TTS module 1410 may be processed to perform text normalization, linguistic analysis, and linguistic prosody generation. During text normalization, the TTS module 1410 processes the text input and generates standard text, converting such things as numbers, abbreviations (such as Apt., St., etc.), and symbols ($, %, etc.) into the equivalent of written out words.
During linguistic analysis, the TTS module 1410 analyzes the language in the normalized text to generate a sequence of phonetic units corresponding to the input text. This process may be referred to as phonetic transcription. Phonetic units include symbolic representations of sound units to be eventually combined and output by the system 100 as speech. Various sound units may be used for dividing text for purposes of speech synthesis. The TTS module 1410 may process speech based on phonemes (individual sounds), half-phonemes, di-phones (the last half of one phoneme coupled with the first half of an adjacent phoneme), bi-phones (two consecutive phonemes), syllables, words, phrases, sentences, or other units. Each word may be mapped to one or more phonetic units. Such mapping may be performed using a language dictionary stored by the system 100, for example in the TTS storage. The linguistic analysis performed by the TTS module 1410 may also identify different grammatical components such as prefixes, suffixes, phrases, punctuation, syntactic boundaries, and the like. Such grammatical components may be used by the TTS module 1410 to craft a natural sounding audio waveform output. The language dictionary may also include letter-to-sound rules and other tools that may be used to pronounce previously unidentified words or letter combinations that may be encountered by the TTS module 1410. Generally, the more information included in the language dictionary, the higher the quality of the speech output.
Based on the linguistic analysis, the TTS module 1410 may then perform linguistic prosody generation, where the phonetic units are annotated with desired prosodic characteristics (also called acoustic features) that indicate how the desired phonetic units are to be pronounced in the eventual output speech. During this stage, the TTS module 1410 may consider and incorporate any prosodic annotations that accompanied the text input. Such acoustic features may include pitch, energy, duration, and the like. Application of acoustic features may be based on prosodic models available to the TTS module 1410. Such prosodic models indicate how specific phonetic units are to be pronounced in certain circumstances. A prosodic model may consider, for example, a phoneme's position in a syllable, a syllable's position in a word, a word's position in a sentence, phrase, or paragraph, neighboring phonetic units, and so on. As with the language dictionary, a prosodic model with more information may produce higher quality speech output than a prosodic model with less information. Further, when a larger portion of a textual work is made available to the TTS module 1410, the TTS module 1410 may assign more robust and complex prosodic characteristics that vary across the portion, making the portion sound more human and resulting in higher quality audio output.
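The following minimal sketch (an editorial illustration, not the disclosed implementation) annotates each phonetic unit with pitch, energy, and duration. The position-based rules stand in for whatever trained prosodic model the system actually uses.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedUnit:
    phoneme: str
    pitch_hz: float
    energy: float
    duration_ms: float

def apply_prosody(phonemes, base_pitch=120.0):
    """Annotate phonetic units with prosodic (acoustic) features.
    The position-based rules here are placeholders for a trained prosodic model."""
    annotated = []
    n = len(phonemes)
    for i, ph in enumerate(phonemes):
        position = i / max(n - 1, 1)                      # 0.0 at start, 1.0 at end
        pitch = base_pitch * (1.1 - 0.2 * position)       # gentle pitch declination
        duration = 90.0 + (40.0 if i == n - 1 else 0.0)   # lengthen the final unit
        annotated.append(AnnotatedUnit(ph, pitch,
                                       energy=1.0 - 0.3 * position,
                                       duration_ms=duration))
    return annotated

for unit in apply_prosody(["HH", "AH", "L", "OW"]):
    print(unit)
```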
The output of this processing, referred to as a symbolic linguistic representation, may include a sequence of phonetic units annotated with prosodic characteristics. This symbolic linguistic representation may then be converted into an audio waveform of speech for output to an audio output device (such as a speaker) and eventual output to a user. The TTS module 1410 may be configured to convert the input text into high-quality, natural-sounding speech in an efficient manner. Such high-quality speech may be configured to sound as much like a human speaker as possible, or may be configured to be understandable to a listener without attempting to mimic a specific human voice.
The TTS module 1410 may perform speech synthesis using one or more different methods. In one method of synthesis, called unit selection and described further below, the TTS module 1410 matches the symbolic linguistic representation against a database of recorded speech, such as a database of a speech corpus. The TTS module 1410 matches the symbolic linguistic representation against spoken audio units in the database. Matching units are selected and concatenated together to form a speech output. Each unit includes an audio waveform corresponding to a phonetic unit, such as a short .wav file of the specific sound, along with a description of the various acoustic features associated with the .wav file (such as its pitch, energy, etc.) and other information, such as where the phonetic unit appears in a word, sentence, or phrase, the neighboring phonetic units, and so on. Using all the information in the unit database, the TTS module 1410 may match units (for example, in the unit database) to the input text to create a natural sounding waveform. The unit database may include multiple examples of phonetic units to provide the system 100 with many different options for concatenating units into speech. One benefit of unit selection is that, depending on the size of the database, a natural sounding speech output may be generated. As described above, the larger the unit database of the speech corpus, the more likely the system will be able to construct natural sounding speech.
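As a hedged illustration of the unit selection idea (the database entries, cost terms, and weights below are invented placeholders), a selector can pick, for each target unit, the candidate with the lowest combined target cost (mismatch with the desired features) and join cost (discontinuity with the previously chosen unit), then concatenate the chosen waveforms:

```python
# Hypothetical unit database: per phoneme, candidate recordings with acoustic features.
UNIT_DB = {
    "AH": [{"wav": "ah_001.wav", "pitch": 118.0}, {"wav": "ah_002.wav", "pitch": 132.0}],
    "HH": [{"wav": "hh_001.wav", "pitch": 125.0}],
}

def select_units(targets):
    """targets: list of (phoneme, desired_pitch) pairs; returns waveforms to concatenate."""
    chosen, prev_pitch = [], None
    for phoneme, desired_pitch in targets:
        candidates = UNIT_DB.get(phoneme, [])
        if not candidates:
            continue
        def cost(unit):
            target_cost = abs(unit["pitch"] - desired_pitch)          # mismatch with target
            join_cost = 0.0 if prev_pitch is None else abs(unit["pitch"] - prev_pitch)
            return target_cost + 0.5 * join_cost                      # illustrative weighting
        best = min(candidates, key=cost)
        chosen.append(best["wav"])
        prev_pitch = best["pitch"]
    return chosen

print(select_units([("HH", 126.0), ("AH", 120.0)]))  # -> ['hh_001.wav', 'ah_001.wav']
```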
In another method of synthesis, called parametric synthesis, parameters such as frequency, volume, and noise are varied by the TTS module 1410 to create an artificial speech waveform output. Parametric synthesis may use an acoustic model and various statistical techniques to match a symbolic linguistic representation with desired output speech parameters. Parametric synthesis may offer the ability to be accurate at high processing speeds, as well as the ability to process speech without the large databases associated with unit selection, but it may also produce an output speech quality that does not match that of unit selection. Unit selection and parametric techniques may be performed individually, combined together, and/or combined with other synthesis techniques to produce speech audio output.
Parametric speech synthesis may be performed as follows. The TTS module 1410 may include an acoustic model, or other models, which may convert a symbolic linguistic representation into a synthetic acoustic waveform of the text input based on audio signal manipulation. The acoustic model includes rules which may be used to assign specific audio waveform parameters to input phonetic units and/or prosodic annotations. The rules may be used to calculate a score representing a likelihood that a particular audio output parameter (such as frequency, volume, etc.) corresponds to the portion of the input symbolic linguistic representation.
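A minimal sketch of such likelihood scoring is shown below; it is an editorial example, not the disclosed model. A simple per-phoneme Gaussian over one output parameter stands in for the acoustic model's rules, and the means and variances are invented.

```python
import math

# Illustrative acoustic model: for each phonetic unit, a Gaussian over an output
# parameter (here, frequency in Hz). Real models (HMM or neural) are far richer,
# but the scoring idea is the same: how likely is this parameter value for this unit?
ACOUSTIC_MODEL = {
    "AH": {"mean_hz": 120.0, "std_hz": 15.0},
    "IY": {"mean_hz": 220.0, "std_hz": 25.0},
}

def parameter_score(phoneme: str, frequency_hz: float) -> float:
    """Return a likelihood score that `frequency_hz` corresponds to `phoneme`."""
    stats = ACOUSTIC_MODEL[phoneme]
    z = (frequency_hz - stats["mean_hz"]) / stats["std_hz"]
    return math.exp(-0.5 * z * z) / (stats["std_hz"] * math.sqrt(2 * math.pi))

print(parameter_score("AH", 125.0))  # relatively high likelihood
print(parameter_score("AH", 220.0))  # relatively low likelihood
```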
As shown in Figure 15, multiple devices (120, 110, 110c-110f) may contain components of the system 100, and the devices may be connected over a network 199. The network 199 may include a local or private network or may include a wide area network such as the Internet. Devices may be connected to the network 199 through either wired or wireless connections. For example, the speech-controlled device 110, a tablet computer 110e, a smart phone 110c, a smart watch 110d, and/or a vehicle 110f may be connected to the network 199 through a wireless service provider, over a WiFi or cellular network connection, or the like. Other devices are included as network-connected support devices, such as a server 120, application developer devices, or others. The support devices may connect to the network 199 through a wired connection or a wireless connection. Networked devices 110 may capture audio using one or more built-in or connected microphones 103 or audio capture devices, with processing performed by ASR, NLU, or other components of the same device or of another device connected via the network 199, such as the ASR 250, NLU 260, etc. of one or more servers 120.
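For orientation only, the sketch below shows one way a networked device might ship captured audio to a remote speech processing server for ASR. The endpoint URL, headers, and response shape are hypothetical stand-ins; the disclosure does not specify this transport.

```python
# Minimal sketch of a device uploading captured audio to a remote ASR/NLU server.
import requests

ASR_ENDPOINT = "https://speech-server.example.com/asr"  # placeholder address (assumption)

def send_captured_audio(audio_bytes: bytes, device_id: str) -> str:
    """Upload raw captured audio; the server performs ASR and returns a transcript."""
    response = requests.post(
        ASR_ENDPOINT,
        data=audio_bytes,
        headers={"Content-Type": "application/octet-stream",
                 "X-Device-Id": device_id},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["transcript"]  # assumed response shape

# Example usage (with stand-in audio bytes):
# text = send_captured_audio(b"\x00\x01...", device_id="speech-device-110")
```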
The concepts disclosed herein may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, speech processing systems, and distributed computing environments.
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers and speech processing should recognize that the components and process steps described herein may be interchanged with other components or steps, or combinations thereof, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
Aspects of the disclosed system may be implemented as a computer-implemented method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform the processes described in the present disclosure. The computer readable storage medium may be implemented by volatile computer memory, non-volatile computer memory, a hard drive, solid-state memory, a flash drive, a removable disk, and/or other media. In addition, components of one or more of the modules and engines may be implemented in firmware or hardware; for example, the acoustic front end 256 comprises, among other things, analog and/or digital filters (e.g., filters configured as firmware for a digital signal processor (DSP)).
The foregoing may also be understood in view of the following clauses.
1. A computer-implemented method comprising:
receiving, from a first speech-controlled device associated with a first user profile, first input audio data including a first wakeword portion and a first command portion;
performing speech processing on the first command portion to determine first text data representing a second name of a second user profile and first message content;
determining, using the first user profile, a second speech-controlled device associated with the second user profile;
sending, at a first time, first output audio data corresponding to the first message content to the second speech-controlled device;
receiving, at a second time after the first time, second input audio data from the second speech-controlled device, the second input audio data including a second wakeword portion and a second command portion;
performing speech processing on the second command portion to determine second text data representing a first name associated with the first user profile and second message content;
sending, at a third time after the second time, second output audio data corresponding to the second message content to the first speech-controlled device;
determining the first time and the second time are within a first threshold period of time;
establishing a messaging connection between the first speech-controlled device and the second speech-controlled device;
sending a signal to the first speech-controlled device to send further audio data for processing without detecting a wakeword portion;
receiving, at a fourth time after the third time, third input audio data from the first speech-controlled device, the third input audio data including third message content but no wakeword portion;
performing speech processing on the third input audio data to determine third text data representing the third message content but not representing the second name of the second user; and
sending, at a fifth time after the fourth time, third output audio data including the third message content to the second speech-controlled device.
2. The computer-implemented method of clause 1, further comprising:
receiving, at a sixth time after the fifth time, fourth input audio data from the second speech-controlled device, the fourth input audio data including fourth message content but not including a wakeword portion or the first name of the first user;
determining the sixth time and the fifth time are within a second threshold period of time; and
in response to the sixth time and the fifth time being within the second threshold period of time, opening a first real-time communication session channel between the first speech-controlled device and the second speech-controlled device, the first real-time communication session channel involving audio data received from the first speech-controlled device and the second speech-controlled device being exchanged without performing speech processing.
3. The computer-implemented method of clause 2, further comprising:
closing the first real-time communication session channel when a communication change trigger occurs, the communication change trigger being at least one of: a third threshold period of time passing without receiving audio data from the first speech-controlled device, detecting a wakeword portion from the first speech-controlled device, receiving a non-communication command from the first speech-controlled device, or receiving further input audio data from the first speech-controlled device, the further input audio data including at least a portion indicating the first real-time communication session channel should be closed.
4. The computer-implemented method of clause 1, further comprising:
receiving image data from the second speech-controlled device;
determining the image data includes a representation of a person;
determining a proximity of the person to the second speech-controlled device based on a position of the representation within the image data; and
establishing a second messaging connection between the first speech-controlled device and the second speech-controlled device, the second messaging connection changing a required wakeword portion of spoken audio from a default wakeword to a name of a recipient of the spoken audio.
5. A system comprising:
at least one processor; and
a memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor to:
receive input audio data from a first device, the input audio data including a wakeword portion and a command portion;
determine text data based on the input audio data;
send a first message to a second device based on the text data;
determine a second message is intended to be sent from the second device to the first device;
determine an amount of time that has elapsed with respect to a first number of messages sent from the first device to the second device and a second number of messages sent from the second device to the first device;
determine the amount of time is below a first threshold period of time; and
send data to the first device, the data causing the first device to send audio data to the at least one processor without the first device detecting a wakeword.
6. The system of clause 5, wherein the instructions further configure the at least one processor to:
determine a second amount of time that has elapsed with respect to a third number of messages sent from the first device to the second device and a fourth number of messages sent from the second device to the first device;
determine the second amount of time is below a second threshold period of time; and
establish a real-time communication session between the first device and the second device, the real-time communication session including exchanging audio data between the first device and the second device without performing speech processing.
7. The system of clause 5, wherein the instructions further configure the at least one processor to:
access a user profile associated with the first device,
wherein determining the elapsed amount of time includes identifying, in the user profile, the first number of messages associated with the second device.
8. The system of clause 5, wherein the instructions further configure the at least one processor to:
receive second input audio data from the first device;
determine the second input audio data includes a user name;
determine, using a user profile associated with the first device, a third device associated with the user name;
determine, using the user profile and based on the second input audio data including the user name, that a real-time communication session is to occur; and
establish the real-time communication session between the first device and the third device.
9. The system of clause 8, wherein the instructions further configure the at least one processor to:
determine at least one of: a second threshold period of time passing without receiving audio data, receiving audio data including a wakeword portion, receiving audio data including a non-communication command, or receiving audio data including at least a portion indicating the real-time communication session should be closed; and
close the real-time communication session.
10. The system of clause 8, wherein the real-time communication session is further caused to occur in response to a first person being within a first proximity of the first device and a second person being within a second proximity of the third device.
11. The system of clause 5, wherein the instructions further configure the at least one processor to:
cause the first device to output an indication when the second device is capturing at least one of audio or text, the indication being at least one of visual, audible, or tactile.
12. The system of clause 5, wherein the instructions further configure the at least one processor to:
cause the first device to output synthesized speech indicating that audio data will be sent to the second device in real time and that a wakeword function is disabled.
13. A computer-implemented method comprising:
receiving input audio data from a first device, the input audio data including a wakeword portion and a command portion;
determining text data based on the input audio data;
sending a first message to a second device based on the text data;
determining a second message is intended to be sent from the second device to the first device;
determining an amount of time that has elapsed with respect to a first number of messages sent from the first device to the second device and a second number of messages sent from the second device to the first device;
determining the amount of time is below a first threshold period of time; and
sending data to the first device, the data causing the first device to send audio data without the first device detecting a wakeword.
14. The computer-implemented method of clause 13, further comprising:
determining a second amount of time that has elapsed with respect to a third number of messages sent from the first device to the second device and a fourth number of messages sent from the second device to the first device;
determining the second amount of time is below a second threshold period of time; and
establishing a real-time communication session between the first device and the second device, the real-time communication session including exchanging audio data between the first device and the second device without performing speech processing.
15. The computer-implemented method of clause 13, further comprising:
accessing a user profile associated with the first device,
wherein determining the elapsed amount of time includes identifying, in the user profile, the first number of messages associated with the second device.
16. The computer-implemented method of clause 13, further comprising:
receiving second input audio data from the first device;
determining the second input audio data includes a user name;
determining, using a user profile associated with the first device, a third device associated with the user name;
determining, using the user profile and based on the second input audio data including the user name, that a real-time communication session is to occur; and
establishing the real-time communication session between the first device and the third device.
17. The computer-implemented method of clause 16, further comprising:
determining at least one of: a second threshold period of time passing without receiving audio data, receiving audio data including a wakeword portion, receiving audio data including a non-communication command, or receiving audio data including at least a portion indicating the real-time communication session should be closed; and
closing the real-time communication session.
18. The computer-implemented method of clause 16, wherein the real-time communication session is further caused to occur in response to a first person being within a first proximity of the first device and a second person being within a second proximity of the third device.
19. The computer-implemented method of clause 13, further comprising:
causing the first device to output an indication when the second device is capturing at least one of audio or text, the indication being at least one of visual, audible, or tactile.
20. The computer-implemented method of clause 13, further comprising:
causing the first device to output synthesized speech indicating that audio data will be sent to the second device in real time and that a wakeword function is disabled.
21. A computer-implemented method comprising:
receiving first input audio data from a first speech-controlled device;
performing speech processing on the first input audio data to determine text data;
determining that a first portion of the text data corresponds to a message recipient name;
determining that a second portion of the text data corresponds to first message content;
sending a first signal to the first speech-controlled device, the first signal causing the first speech-controlled device to output a first visual indication, the first visual indication representing that a message corresponding to the first input audio data is being sent;
determining, using a user profile associated with the first speech-controlled device, a second speech-controlled device associated with the message recipient name;
sending, at a first time, first output audio data corresponding to the first message content to the second speech-controlled device;
receiving, from the second speech-controlled device, a second signal indicating that the second speech-controlled device is detecting speech; and
sending, at a second time after the first time, a third signal to the first speech-controlled device, the third signal causing the first speech-controlled device to output a second visual indication, the second visual indication representing that the second speech-controlled device is detecting speech.
22. The computer-implemented method of clause 21, wherein:
the first visual indication includes a first color and the second visual indication includes the first color with a first motion, the first motion including one of blinking, strobing, or moving along an edge of the first speech-controlled device; and
the first signal further causes the first speech-controlled device to output an audible indication representing that the message corresponding to the first input audio data is being sent.
23. The computer-implemented method of clause 21, further comprising:
causing the second speech-controlled device to output audio asking a user of the second speech-controlled device whether the user wants to reply to the first message content;
receiving second input audio data from the second speech-controlled device;
performing ASR on the second input audio data to determine second text data; and
determining the second text data includes the word "yes".
24. The computer-implemented method of clause 21, wherein determining the second speech-controlled device further comprises:
receiving image data from devices associated with the recipient name in the user profile; and
determining that the image data received from the second speech-controlled device includes a representation of a person.
25. A system comprising:
at least one processor; and
a memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor to:
receive input audio data from a first device;
process the input audio data to determine message content;
send, to a second device at a first time, output audio data corresponding to the message content;
receive, from the second device at a second time after the first time, an indication that the second device has detected speech in reply to the output audio data; and
output, by the first device at a third time after the second time, a visual indicator representing that the second device is receiving a reply to the message content.
26. The system of clause 25, wherein the visual indicator includes at least one of a first color or a first motion.
27. The system of clause 25, wherein the instructions further configure the at least one processor to identify the second device using a user profile associated with the first device.
28. The system of clause 25, wherein the instructions further configure the at least one processor to:
cause the second device to output audio data created by text-to-speech (TTS) processing;
receive second input audio data from the second device;
perform ASR on the second input audio data to determine second text data;
determine the second text data includes the word "yes"; and
determine the speech is in reply to the output audio data based on determining the second text data includes the word "yes".
29. The system of clause 25, wherein the instructions further configure the at least one processor to:
output, by the first device at the third time after the second time, an audible indicator representing that the second device has detected the speech in reply to the output audio data.
30. The system of clause 29, wherein the audible indicator is generated using text-to-speech processing, the text-to-speech processing using speech previously spoken by a user.
31. The system of clause 25, wherein the instructions further configure the at least one processor to:
cause the second device to output audio data created by text-to-speech (TTS) processing;
receive second input audio data from the second device;
determine, using speech-based speaker identification (ID), that the second input audio data corresponds to audio spoken by a recipient of the message content; and
determine the speech is in reply to the output audio data based on the second input audio data corresponding to audio spoken by the recipient of the message content.
32. The system of clause 25, wherein the input audio data includes a wakeword portion and the message content.
33. A computer-implemented method comprising:
receiving input audio data from a first device;
processing the input audio data to determine message content;
sending, to a second device at a first time, output audio data corresponding to the message content;
receiving, from the second device at a second time after the first time, an indication that the second device has detected speech in reply to the output audio data; and
outputting, by the first device at a third time after the second time, a visual indicator representing that the second device is receiving a reply to the message content.
34. The computer-implemented method of clause 33, wherein the visual indicator includes at least one of a first color or a first motion.
35. The computer-implemented method of clause 34, further comprising identifying the second device using a user profile associated with the first device.
36. The computer-implemented method of clause 35, further comprising:
causing the second device to output audio data created by text-to-speech (TTS) processing;
receiving second input audio data from the second device;
performing ASR on the second input audio data to determine second text data; and
determining the second text data includes the word "yes",
wherein the speech is determined to be in reply to the output audio data based on determining the second text data includes the word "yes".
37. The computer-implemented method of clause 33, further comprising:
outputting, by the first device at the third time after the second time, an audible indicator representing that the second device has detected the speech in reply to the output audio data.
38. The computer-implemented method of clause 37, wherein the audible indicator is generated using text-to-speech processing, the text-to-speech processing using speech previously spoken by a user.
39. The computer-implemented method of clause 33, further comprising:
causing the second device to output audio data created by text-to-speech (TTS) processing;
receiving second input audio data from the second device;
determining, using speech-based speaker identification (ID), that the second input audio data corresponds to audio spoken by a recipient of the message content; and
determining the speech is in reply to the output audio data based on the second input audio data corresponding to audio spoken by the recipient of the message content.
40. The computer-implemented method of clause 33, wherein the input audio data includes a wakeword portion and the message content.
41. A computer-implemented method comprising:
during a first period of time:
receiving, from a first speech-controlled device, first input audio data including recipient information;
performing speech processing on the first input audio data to determine first text data, the first text data including a recipient name;
determining, using a user profile associated with the first speech-controlled device, a second speech-controlled device associated with the recipient name; and
causing the second speech-controlled device to output an indication that message content is forthcoming;
during a second period of time:
receiving, from the first speech-controlled device, second input audio data including the message content;
determining, using the user profile associated with the first speech-controlled device, a third speech-controlled device associated with the recipient name; and
causing the third speech-controlled device to output the message content.
42. The computer-implemented method of clause 41, wherein the indication includes at least one of a first color, the first color with a first motion, or first audio, the first motion including one of blinking, strobing, or moving along an edge of the first speech-controlled device, and the first audio being generated using text-to-speech processing.
43. The computer-implemented method of clause 41, further comprising:
performing natural language processing to identify the recipient name; and
sending a signal to the second speech-controlled device, the signal causing the second speech-controlled device to output the indication while second text data corresponding to the second input audio data is sent to a text-to-speech component.
44. The computer-implemented method of clause 41, wherein determining the second speech-controlled device further comprises:
receiving image data from devices associated with the recipient name in the user profile; and
determining that the image data received from the second speech-controlled device includes a representation of a person.
45. A system comprising:
at least one processor; and
a memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor to:
receive, from a first device, first input audio data including recipient information;
determine a second device associated with the recipient information;
cause the second device to output an indication that message content is forthcoming;
receive, from the first device, second input audio data including the message content; and
cause the second device to output the message content.
46. The system of clause 45, wherein determining the second device comprises:
accessing a user profile associated with the first device; and
identifying the recipient information in the user profile.
47. The system of clause 45, wherein determining the second device comprises:
determining a location of the recipient; and
selecting the second device from multiple devices associated with a recipient profile based on the second device being proximate to the recipient.
48. The system of clause 45, wherein determining the second device comprises determining that the second device is currently in use.
49. The system of clause 45, wherein determining the second device comprises:
determining, from multiple devices including the second device, that a third device is currently in use; and
selecting the second device based on a proximity of the second device to the third device.
50. The system of clause 45, wherein the indication includes a color or the color with a motion.
51. The system of clause 50, wherein the indication is output by the second device while the second input audio data is being received from the first device.
52. The system of clause 55, wherein the indication is an audible indication generated using text-to-speech (TTS) processing.
53. A computer-implemented method comprising:
receiving, from a first device, first input audio data including recipient information;
determining a second device associated with the recipient information;
causing the second device to output an indication that message content is forthcoming;
receiving, from the first device, second input audio data including message content; and
causing the second device to output the message content.
54. The computer-implemented method of clause 53, wherein determining the second device comprises:
accessing a user profile associated with the first device; and
identifying the recipient information in the user profile.
55. The computer-implemented method of clause 53, wherein determining the second device comprises:
determining a location of the recipient; and
selecting the second device from multiple devices associated with a recipient profile based on the second device being proximate to the recipient.
56. The computer-implemented method of clause 53, wherein determining the second device comprises determining that the second device is currently in use.
57. The computer-implemented method of clause 53, wherein determining the second device comprises:
determining, from multiple devices including the second device, that a third device is currently in use; and
selecting the second device based on a proximity of the second device to the third device.
58. The computer-implemented method of clause 53, wherein the indication includes a color or the color with a motion.
59. The computer-implemented method of clause 58, wherein the indication is output by the second device while the second input audio data is being received from the first device.
60. The computer-implemented method of clause 53, wherein the indication is an audible indication generated using text-to-speech (TTS) processing.
As used in this disclosure, the term "a" or "one" may include one or more items unless specifically stated otherwise. Further, the phrase "based on" is intended to mean "based at least in part on" unless specifically stated otherwise.
Claims (15)
1. A computer-implemented method comprising:
receiving input audio data from a first device, the input audio data including a wakeword portion and a command portion;
determining text data based on the input audio data;
sending a first message to a second device based on the text data;
determining a second message is intended to be sent from the second device to the first device;
determining an amount of time that has elapsed with respect to a first number of messages sent from the first device to the second device and a second number of messages sent from the second device to the first device;
determining the amount of time is below a first threshold period of time; and
sending data to the first device, the data causing the first device to send audio data without the first device detecting a wakeword.
2. The computer-implemented method of claim 1, comprising:
receiving second input audio data from the first device;
processing the second input audio data to determine message content;
sending, to the second device at a first time, output audio data corresponding to the message content;
receiving, from the second device at a second time after the first time, an indication that the second device has detected speech in reply to the output audio data; and
outputting, by the first device at a third time after the second time, a visual indicator representing that the second device is receiving a reply to the message content.
3. The computer-implemented method of claim 1 or 2, further comprising:
causing the second device to output audio data created by text-to-speech (TTS) processing;
receiving third input audio data from the second device;
performing ASR on the third input audio data to determine second text data; and
determining the second text data includes the word "yes", wherein the speech is determined to be in reply to the output audio data based on determining the second text data includes the word "yes".
4. The computer-implemented method of claim 1 or 2, further comprising:
outputting, at a third time after the second time, an audible indicator representing that the second device has detected speech in reply to the output audio data.
5. The computer-implemented method of claim 4, wherein the audible indicator is generated using text-to-speech processing, the text-to-speech processing using speech previously spoken by a user.
6. The computer-implemented method of claim 1 or 2, further comprising:
causing the second device to output audio data created by text-to-speech (TTS) processing;
receiving fourth input audio data from the second device;
determining, using speech-based speaker identification (ID), that the fourth input audio data corresponds to audio spoken by a recipient of the message content; and
determining the speech is in reply to the output audio data based on the fourth input audio data corresponding to audio spoken by the recipient of the message content.
7. The computer-implemented method of claim 1, further comprising:
receiving, from the first device, second input audio data including recipient information;
determining the second device associated with the recipient information;
causing the second device to output an indication that message content is forthcoming;
receiving, from the first device, third input audio data including message content; and
causing the second device to output the message content.
8. The computer-implemented method of claim 7, wherein determining the second device comprises:
determining a location of the recipient; and
selecting the second device from multiple devices associated with a recipient profile based on the second device being proximate to the recipient.
9. The computer-implemented method of claim 1, further comprising:
determining a second amount of time that has elapsed with respect to a third number of messages sent from the first device to the second device and a fourth number of messages sent from the second device to the first device;
determining the second amount of time is below a second threshold period of time; and
establishing a real-time communication session between the first device and the second device, the real-time communication session including exchanging audio data between the first device and the second device without performing speech processing.
10. The computer-implemented method of claim 1, further comprising:
accessing a user profile associated with the first device,
wherein determining the elapsed amount of time includes identifying, in the user profile, the first number of messages associated with the second device.
11. The computer-implemented method of claim 1, further comprising:
receiving second input audio data from the first device;
determining the second input audio data includes a user name;
determining, using a user profile associated with the first device, a third device associated with the user name;
determining, using the user profile and based on the second input audio data including the user name, that a real-time communication session is to occur; and
establishing the real-time communication session between the first device and the third device.
12. The computer-implemented method of claim 11, further comprising:
determining at least one of: a second threshold period of time passing without receiving audio data, receiving audio data including a wakeword portion, receiving audio data including a non-communication command, or receiving audio data including at least a portion indicating the real-time communication session should be closed; and
closing the real-time communication session.
13. The computer-implemented method of claim 12, wherein the real-time communication session is further caused to occur in response to a first person being within a first proximity of the first device and a second person being within a second proximity of the third device.
14. The computer-implemented method of claim 1, further comprising:
causing the first device to output an indication when the second device is capturing at least one of audio or text, the indication being at least one of visual, audible, or tactile.
15. The computer-implemented method of claim 13, further comprising:
causing the first device to output synthesized speech indicating that audio data will be sent to the second device in real time and that a wakeword function is disabled.
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/254,600 US10580404B2 (en) | 2016-09-01 | 2016-09-01 | Indicator for voice-based communications |
US15/254,600 | 2016-09-01 | ||
US15/254,359 US10074369B2 (en) | 2016-09-01 | 2016-09-01 | Voice-based communications |
US15/254,359 | 2016-09-01 | ||
US15/254,458 | 2016-09-01 | ||
US15/254,458 US10453449B2 (en) | 2016-09-01 | 2016-09-01 | Indicator for voice-based communications |
PCT/US2017/049578 WO2018045154A1 (en) | 2016-09-01 | 2017-08-31 | Voice-based communications |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109791764A true CN109791764A (en) | 2019-05-21 |
Family
ID=59846711
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201780060299.1A Pending CN109791764A (en) | 2016-09-01 | 2017-08-31 | Communication based on speech |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP3507796A1 (en) |
KR (1) | KR20190032557A (en) |
CN (1) | CN109791764A (en) |
WO (1) | WO2018045154A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10692496B2 (en) * | 2018-05-22 | 2020-06-23 | Google Llc | Hotword suppression |
US10701006B2 (en) * | 2018-08-27 | 2020-06-30 | VoiceCTRL Oy | Method and system for facilitating computer-generated communication with user |
CN109658924B (en) * | 2018-10-29 | 2020-09-01 | Baidu Online Network Technology (Beijing) Co., Ltd. | Session message processing method and device and intelligent equipment |
WO2021142040A1 (en) * | 2020-01-06 | 2021-07-15 | Strengths, Inc. | Precision recall in voice computing |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120259633A1 (en) * | 2011-04-07 | 2012-10-11 | Microsoft Corporation | Audio-interactive message exchange |
US8468022B2 (en) * | 2011-09-30 | 2013-06-18 | Google Inc. | Voice control for asynchronous notifications |
US9026176B2 (en) * | 2013-05-12 | 2015-05-05 | Shyh-Jye Wang | Message-triggered voice command interface in portable electronic devices |
US10235996B2 (en) * | 2014-10-01 | 2019-03-19 | XBrain, Inc. | Voice and connection platform |
2017
- 2017-08-31 KR KR1020197005828A patent/KR20190032557A/en active IP Right Grant
- 2017-08-31 EP EP17765015.7A patent/EP3507796A1/en not_active Withdrawn
- 2017-08-31 CN CN201780060299.1A patent/CN109791764A/en active Pending
- 2017-08-31 WO PCT/US2017/049578 patent/WO2018045154A1/en unknown
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102447647A (en) * | 2010-10-13 | 2012-05-09 | Tencent Technology (Shenzhen) Co., Ltd. | Notification method, device and system based on new information |
CN104662600A (en) * | 2012-06-25 | 2015-05-27 | Amazon Technologies, Inc. | Using gaze determination with device input |
CN105027194A (en) * | 2012-12-20 | 2015-11-04 | Amazon Technologies, Inc. | Identification of utterance subjects |
US20140257821A1 (en) * | 2013-03-07 | 2014-09-11 | Analog Devices Technology | System and method for processor wake-up based on sensor data |
CN105556592A (en) * | 2013-06-27 | 2016-05-04 | Amazon Technologies, Inc. | Detecting self-generated wake expressions |
US20150371638A1 (en) * | 2013-08-28 | 2015-12-24 | Texas Instruments Incorporated | Context Aware Sound Signature Detection |
CN105376397A (en) * | 2014-08-07 | 2016-03-02 | NXP B.V. | Low-power environment monitoring and activation triggering for mobile devices through ultrasound echo analysis |
CN105700363A (en) * | 2016-01-19 | 2016-06-22 | Shenzhen Skyworth-RGB Electronic Co., Ltd. | Method and system for waking up smart home equipment voice control device |
Non-Patent Citations (2)
Title |
---|
TOBI DELBRUCK, et al.: "Fully integrated 500uW speech detection wake-up circuit", Proceedings of 2010 IEEE International Symposium on Circuits and Systems *
YANG Xu, et al.: "Multi-channel audio acquisition system based on a Sunplus single-chip microcomputer", Automation & Instrumentation *
Also Published As
Publication number | Publication date |
---|---|
KR20190032557A (en) | 2019-03-27 |
WO2018045154A1 (en) | 2018-03-08 |
EP3507796A1 (en) | 2019-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12100396B2 (en) | Indicator for voice-based communications | |
US10074369B2 (en) | Voice-based communications | |
US11908472B1 (en) | Connected accessory for a voice-controlled device | |
US10453449B2 (en) | Indicator for voice-based communications | |
US11776540B2 (en) | Voice control of remote device | |
US10365887B1 (en) | Generating commands based on location and wakeword | |
US10326869B2 (en) | Enabling voice control of telephone device | |
US11763808B2 (en) | Temporary account association with voice-enabled devices | |
US11184412B1 (en) | Modifying constraint-based communication sessions | |
US10714085B2 (en) | Temporary account association with voice-enabled devices | |
CN109155132A (en) | Speaker verification method and system | |
CN109074806A (en) | Distributed audio output is controlled to realize voice output | |
US10148912B1 (en) | User interface for communications systems | |
US11798559B2 (en) | Voice-controlled communication requests and responses | |
CN109791764A (en) | Communication based on speech | |
US10143027B1 (en) | Device selection for routing of communications | |
CN116917984A (en) | Interactive content output | |
US11856674B1 (en) | Content-based light illumination | |
US10854196B1 (en) | Functional prerequisites and acknowledgments | |
US11172527B2 (en) | Routing of communications to a device | |
CN117882131A (en) | Multiple wake word detection | |
US11176930B1 (en) | Storing audio commands for time-delayed execution | |
WO2019236745A1 (en) | Temporary account association with voice-enabled devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20190521 |