CN110121744A - Processing speech from distributed microphones - Google Patents
- Publication number: CN110121744A (application CN201780075396.8A)
- Authority
- CN
- China
- Prior art keywords
- audio signal
- equipment
- microphone
- derived
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L21/0208: Speech enhancement (e.g. noise reduction or echo cancellation); noise filtering
- G10L2021/02165: Noise filtering characterised by the method used for estimating noise; two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
- G10L25/78: Detection of presence or absence of voice signals
Abstract
The present invention provides a system having a plurality of microphones positioned at different locations and a modification system in communication with the microphones. The modification system is configured to derive a plurality of audio signals from the plurality of microphones; compute a confidence score for each derived audio signal; and, based on the computed confidence scores, use one derived audio signal to modify another audio signal.
Description
Background
This disclosure relates to processing speech from distributed microphones.
Current speech recognition systems assume that one microphone or microphone array is listening to a user speak and taking action based on the speech. The action may include local speech recognition and response, cloud-based recognition and response, or a combination of these. In some cases, a "wake-up word" is recognized locally, and further processing is provided remotely based on the wake-up word.
Distributed loudspeaker systems may coordinate the playback of audio at multiple loudspeakers located around a home, so that the sound playback is synchronized between locations.
Summary of the invention
In general, in one aspect, a system includes a plurality of microphones positioned at different locations and a scheduling system in communication with the microphones. The scheduling system derives a plurality of audio signals from the plurality of microphones, computes a confidence score for each derived audio signal, and compares the computed confidence scores. Based on the comparison, the scheduling system selects at least one of the derived audio signals for further processing.
Implementations may include one or more of the following, in any combination. The scheduling system may include a plurality of local processors, each connected to at least one of the microphones. The scheduling system may include at least a first local processor and at least a second processor available to the first processor over a network. Computing the confidence score for each derived audio signal may include computing a confidence in one or more of: whether the signal may include speech, whether a wake-up word may be included in the signal, which wake-up word may be included in the signal, the quality of the speech included in the signal, the identity of a user whose voice may be recorded in the signal, and the location of the user relative to the microphone locations. Computing the confidence score for each derived audio signal may also include determining whether the audio signal appears to contain an utterance and whether the utterance includes a wake-up word. Computing the confidence score for each derived audio signal may further include identifying which of a plurality of wake-up words is included in the speech. Computing the confidence score for each derived audio signal may also include determining a degree of confidence that the speech includes a wake-up word.
Computing the confidence score for each derived audio signal may include comparing one or more of: the timing with which the microphones detected the sound corresponding to each audio signal, the signal strength of the derived audio signals, the signal-to-noise ratio of the derived audio signals, the spectral content of the derived audio signals, and the reverberation in the derived audio signals. Computing the confidence score for each derived audio signal may include computing, for each audio signal, the distance between the apparent source of the audio signal and at least one of the microphones. Computing the confidence score for each derived audio signal may include computing the location of the source of each audio signal relative to the microphone locations. Computing the location of the source of each audio signal may include triangulating the location based on computed distances between each source and at least two of the microphones.
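The position calculation described above can be illustrated with a minimal sketch. It assumes a two-dimensional layout, exact distances, and three non-collinear microphones (with only two microphones, two candidate positions remain); the function name `trilaterate` and the sample coordinates are hypothetical and do not come from the disclosure.

```python
import math

def trilaterate(mics, dists):
    """Estimate a 2-D source position from three microphone positions
    and the source-to-microphone distances (illustrative assumption)."""
    (x1, y1), (x2, y2), (x3, y3) = mics
    d1, d2, d3 = dists
    # Subtracting the first circle equation from the other two removes the
    # quadratic terms and leaves a 2x2 linear system A [x, y]^T = b.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("microphones are collinear; position is ambiguous")
    # Solve the system with Cramer's rule.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y

# Usage: a source at (1, 1) heard by microphones at three known positions.
mics = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
dists = [math.hypot(1.0 - mx, 1.0 - my) for mx, my in mics]
estimate = trilaterate(mics, dists)
```

In practice the distances would themselves be estimates (for example from signal strength or arrival time), so a least-squares fit over more microphones would likely be more robust than this exact solve.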
The scheduling system may transmit at least a portion of the selected signal or signals to a speech processing system to provide the further processing. Transmitting the selected audio signal or signals may include selecting at least one speech processing system from a plurality of speech processing systems. At least one of the plurality of speech processing systems may include a speech recognition service provided over a wide-area network. At least one of the plurality of speech processing systems may include a speech recognition process executing on the same processor on which the scheduling system is executing. The selection of the speech processing system may be based on one or more of preferences associated with a user, the computed confidence scores, or the context in which the audio signals were derived. The context may include one or more of an identification of a user who may be speaking, which of the plurality of microphones produced the selected derived audio signal, the location of the user relative to the microphone locations, the operating state of other devices in the system, and the time of day. The selection of the speech processing system may be based on the resources available to the speech processing system.
Comparing the computed confidence scores may include determining that at least two selected audio signals appear to contain utterances from at least two different users. Determining that the selected audio signals appear to contain utterances from at least two different users may be based on one or more of voice identification, the locations of the users relative to the locations of the microphones, which of the microphones produced each of the selected audio signals, the use of different wake-up words in the two selected audio signals, and visual identification of the users. The scheduling system may also send the selected audio signals corresponding to the two different users to two different selected speech processing systems. The selected audio signals may be assigned to the selected speech processing systems based on one or more of preferences of the users, load balancing of the speech processing systems, the context of the selected audio signals, and the use of different wake-up words in the two selected audio signals. The scheduling system may also send the selected audio signals corresponding to the two different users to the same speech processing system as two separate processing requests.
Comparing the computed confidence scores may include determining that at least two received audio signals appear to represent the same utterance. Determining that the selected audio signals represent the same utterance may be based on one or more of voice identification, the location of the source of the audio signals relative to the microphone locations, which of the microphones produced each of the selected audio signals, the arrival times of the audio signals, correlation between the audio signals or between the outputs of microphone array elements, pattern matching, and visual identification of the person speaking. The scheduling system may send only one of the audio signals that appear to represent the same utterance to the speech processing system. The scheduling system may also send both of the audio signals that appear to represent the same utterance to the speech processing system.
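One of the listed cues, correlation between audio signals, can be sketched as follows. This is a minimal illustration, assuming equal-length sample sequences at a common sample rate; the function names, the lag window, and the 0.7 threshold are hypothetical choices, not values from the disclosure.

```python
import math

def peak_normalized_xcorr(a, b, max_lag):
    """Return (best_lag, peak) of the normalized cross-correlation of two
    equal-length sequences, searching lags in [-max_lag, max_lag]."""
    def centered(x):
        mean = sum(x) / len(x)
        return [v - mean for v in x]
    a, b = centered(a), centered(b)
    norm = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
    if norm == 0:
        return 0, 0.0
    best_lag, best = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        # A positive lag means b is a delayed copy of a: a[i] lines up with b[i + lag].
        s = sum(a[i] * b[i + lag] for i in range(len(a)) if 0 <= i + lag < len(b))
        c = s / norm
        if c > best:
            best_lag, best = lag, c
    return best_lag, best

def same_utterance(a, b, max_lag=50, threshold=0.7):
    """Treat two signals as the same utterance if their correlation peak is high.
    The threshold is an assumed tuning parameter."""
    _, peak = peak_normalized_xcorr(a, b, max_lag)
    return peak >= threshold

# Usage: b is a copy of a delayed by 7 samples; c is an unrelated tone burst.
a = [0.0] * 20 + [math.sin(0.4 * i) for i in range(60)] + [0.0] * 40
b = [0.0] * 7 + a[:-7]
c = [0.0] * 20 + [math.sin(1.3 * i) for i in range(60)] + [0.0] * 40
lag, peak = peak_normalized_xcorr(a, b, 50)
```

The lag at the correlation peak also gives a rough estimate of the difference in arrival time between the two microphones, which is another of the cues listed above.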
The scheduling system may also transmit at least one selected audio signal to each of at least two speech processing systems, receive a response from each of the speech processing systems, and determine an order in which to output the responses.
The scheduling system may also transmit at least two selected audio signals to at least one speech processing system, receive a response from the speech processing system corresponding to each transmitted signal, and determine an order in which to output the responses. The scheduling system may be further configured to receive a response to the further processing and to output the response using an output device. The output device may not correspond to the microphone that captured the audio. The output device may not be located at any of the locations where the microphones are located. The output device may include one or more of a loudspeaker, headphones, a wearable audio device, a display, a video screen, or a household appliance. Upon receiving multiple responses to the further processing, the scheduling system may determine an order in which to output the responses by combining the responses into a single output. Upon receiving multiple responses to the further processing, the scheduling system may determine an order in which to output the responses by selecting fewer than all of the responses for output, or by sending different responses to different output devices. The number of derived audio signals may not be equal to the number of microphones. At least one of the microphones may include a microphone array. The system may also include non-audio input devices. The non-audio input devices may include one or more of accelerometers, presence detectors, cameras, wearable sensors, or user interface devices.
In general, in one aspect, a system includes a plurality of devices positioned at different locations, and a scheduling system in communication with the devices. The scheduling system receives a response from a speech processing system in response to a previously transmitted request, determines the relevance of the response to each of the devices, and, based on the determination, forwards the response to at least one of the devices.
Implementations may include one or more of the following, in any combination. At least one of the devices may include an audio output device, and forwarding the response may cause that device to output an audio signal corresponding to the response. The audio output device may include one or more of a loudspeaker, headphones, or a wearable audio device. At least one of the devices may include a display, a video screen, or a household appliance. The previously transmitted request may have been transmitted from a third location not associated with any of the locations of the plurality of devices. The response may be a first response, and the scheduling system may also receive a response from a second speech processing system. The scheduling system may also forward the first response to a first one of the devices and forward the second response to a second one of the devices. The scheduling system may also forward both the first response and the second response to a first one of the devices. The scheduling system may also forward only one of the first response and the second response to any of the devices.
Determining the relevance of the response may include determining which of the devices is associated with the previously transmitted request. Determining the relevance of the response may include determining which of the devices may be closest to a user associated with the previously transmitted request. Determining the relevance of the response may be based on preferences associated with a user of the claimed system. Determining the relevance of the response may include determining the context of the previously transmitted request. The context may include one or more of an identification of a user who may be associated with the request, which of a plurality of microphones may be associated with the request, the location of the user relative to the device locations, the operating state of other devices in the system, and the time of day. Determining the relevance of the response may include determining the capabilities or resource availability of the devices.
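The relevance determination described above can be sketched as a simple scoring function. This is only an illustration under stated assumptions: the `Device` fields, the scoring formula, and the 0.5 bonus for a device that is already in use are hypothetical and are not specified by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    capabilities: frozenset       # kinds of response the device can deliver
    distance_to_user_m: float     # assumed to come from the location scoring
    in_use: bool = False          # operating state, one of the context cues

def relevance(device, response_kind):
    """Score how relevant a response is to a device (hypothetical weighting)."""
    if response_kind not in device.capabilities:
        return 0.0                # the device cannot deliver this response at all
    score = 1.0 / (1.0 + device.distance_to_user_m)  # closer devices score higher
    if device.in_use:
        score += 0.5              # favor a device the user is already using
    return score

def route_response(devices, response_kind):
    """Return the name of the most relevant device, or None if none qualifies."""
    best = max(devices, key=lambda d: relevance(d, response_kind), default=None)
    if best is None or relevance(best, response_kind) == 0.0:
        return None
    return best.name

# Usage: the headphones are in another room; the loudspeaker is nearby and in use.
devices = [
    Device("microphone array", frozenset(), 1.0),
    Device("headphones", frozenset({"audio"}), 6.0),
    Device("loudspeaker", frozenset({"audio"}), 2.0, in_use=True),
    Device("smartphone", frozenset({"audio", "display"}), 3.0),
]
```

With these sample values, an audio response is routed to the loudspeaker and a display response to the smartphone; a response kind that no device supports yields no target.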
A plurality of output devices may be positioned at different output device locations, and the scheduling system may also receive a response from the speech processing system in response to the transmitted request, determine the relevance of the response to each of the output devices, and, based on the determination, forward the response to at least one of the output devices. At least one of the output devices may include an audio output device, and forwarding the response causes that device to output an audio signal corresponding to the response. The audio output device may include one or more of a loudspeaker, headphones, or a wearable audio device. At least one of the output devices may include a display, a video screen, or a household appliance. Determining the relevance of the response may include determining a relationship between the output devices and the microphone associated with the selected audio signal. Determining the relevance of the response may include determining which of the output devices may be closest to the source of the selected audio signal. Determining the relevance of the response may include determining the context in which the audio signal was derived. The context may include one or more of an identification of a user who may be speaking, which of the plurality of microphones produced the selected derived audio signal, the location of the user relative to the microphone locations and the device locations, the operating state of other devices in the system, and the time of day. Determining the relevance of the response may include determining the capabilities or resource availability of the output devices.
In general, in one aspect, a system includes a plurality of microphones positioned at different microphone locations, a plurality of loudspeakers positioned at different loudspeaker locations, and a scheduling system in communication with the microphones and the loudspeakers. The scheduling system derives a plurality of voice signals from the plurality of microphones; computes, for each derived voice signal, a confidence score that it includes a wake-up word; compares the computed confidence scores; and, based on the comparison, selects at least one of the derived voice signals and transmits at least a portion of the selected signal or signals to a speech processing system. The scheduling system receives a response from the speech processing system in response to the transmission, determines the relevance of the response to each of the loudspeakers, and, based on the determination, forwards the response to at least one of the loudspeakers for output.
In general, in another aspect, a system includes a plurality of microphones positioned at different locations and a modification system in communication with the microphones. The modification system is configured to derive a plurality of audio signals from the plurality of microphones; compute a confidence score for each derived audio signal; and, based on the computed confidence scores, use one derived audio signal to modify another audio signal.
Computing the confidence score for each derived audio signal may include computing a confidence in whether the derived audio signal includes speech and whether the derived audio signal includes non-speech sounds. Computing the confidence score for each derived audio signal may include determining whether the derived audio signal is a speech signal. Using one derived audio signal to modify another audio signal may include filtering a first audio signal with a second audio signal. Filtering the first audio signal with the second audio signal may include using the second audio signal as the reference for an adaptive filter applied to the first audio signal. The number of derived audio signals may differ from the number of microphones.
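Using the second audio signal as the reference of an adaptive filter can be sketched with a normalized least-mean-squares (NLMS) noise canceller. NLMS is one common adaptive filter and is chosen here only for illustration; the disclosure does not name a specific algorithm. The tap count, step size, and synthetic signals below are assumptions.

```python
import math
import random

def nlms_cancel(primary, reference, taps=8, mu=0.1, eps=1e-8):
    """Remove from `primary` the component that is linearly predictable from
    `reference`, using an NLMS adaptive filter. Returns the error signal,
    which approximates the speech in the primary channel."""
    w = [0.0] * taps          # adaptive filter weights
    buf = [0.0] * taps        # most recent reference samples, newest first
    out = []
    for d, x in zip(primary, reference):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))     # noise estimate
        e = d - y                                      # cleaned sample
        power = sum(bi * bi for bi in buf) + eps
        # Normalized LMS update: step scaled by the reference power.
        w = [wi + mu * e * bi / power for wi, bi in zip(w, buf)]
        out.append(e)
    return out

# Usage with synthetic data: the first signal is speech plus filtered noise,
# the second signal is the noise alone (for example, a microphone aimed at a noise source).
rng = random.Random(0)
n = 4000
speech = [0.1 * math.sin(0.05 * i) for i in range(n)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]
primary = [speech[i] + 0.8 * noise[i] + (0.3 * noise[i - 1] if i else 0.0)
           for i in range(n)]
cleaned = nlms_cancel(primary, noise)
```

The sketch relies on the reference containing little of the desired speech; if speech leaks into the reference, an adaptive canceller of this kind will also attenuate the speech.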
At least one of the microphones may include a microphone array. A first microphone array may be spatially focused on a first sound target. A second microphone array may be spatially focused on a second sound target. The first sound target may include a human voice. The second sound target may include a noise source.
A first microphone may be part of a first device and a second microphone may be part of a second device, and a first audio signal may be derived from the first microphone and a second audio signal may be derived from the second microphone. The second device may transmit the second audio signal to the first device. The first device may use the second audio signal to modify the first audio signal. The first device may use the second audio signal to reduce noise in the first audio signal.
A first microphone and a second microphone may be part of a first device. A first audio signal may be derived from the first microphone and a second audio signal may be derived from the second microphone. The second audio signal may be used to reduce noise in the first audio signal. The plurality of microphones may be part of a first device. The first device may spatially focus its plurality of microphones on a first separate sound source and a second separate sound source, with a first audio signal derived from the first sound source and a second audio signal derived from the second sound source. The second audio signal may be used to reduce noise in the first audio signal.
In general, in another aspect, a system includes a plurality of microphones positioned at different locations, in which a first microphone is part of a first device and a second microphone is part of a second device. The first device is operated to derive a first audio signal from the first microphone, the second device is operated to derive a second audio signal from the second microphone, and the second device is adapted to transmit the second audio signal to the first device. A modification system that is part of the first device is responsive to the first audio signal and the second audio signal, and uses the second audio signal to reduce noise in the first audio signal.
In general, in another aspect, a system includes a plurality of microphones that are part of a first device, including a first microphone and a second microphone. The first device is operated to derive a first audio signal from the first microphone and a second audio signal from the second microphone. A modification system is part of the first device and is responsive to the first audio signal and the second audio signal, and uses the second audio signal to reduce noise in the first audio signal.
In general, in another aspect, a system includes a plurality of microphones that are part of a first device, in which the first device spatially focuses its plurality of microphones on a first separate sound source and a second separate sound source. The first device is operated to derive a first audio signal from the first sound source and a second audio signal from the second sound source. A modification system is part of the first device and is responsive to the first audio signal and the second audio signal, and uses the second audio signal to reduce noise in the first audio signal.
Advantages include detecting a spoken command at multiple locations and providing a single response to the command. Advantages also include providing a response to a spoken command at a location that is more relevant to the user than the location at which the command was detected.
All examples and features mentioned above can be combined in any technically possible way. Other features and advantages will be apparent from the detailed description and from the claims.
Brief description of the drawings
Fig. 1 shows a system layout of microphones and of devices that may respond to voice commands received by the microphones.
Fig. 2 shows a system in which one audio signal can be used to modify another audio signal.
Detailed description
As more and more devices implement voice-controlled user interfaces (VUIs), a problem arises in which multiple devices may detect the same spoken command and attempt to handle it, resulting in problems ranging from redundant responses to contradictory actions being taken at different points of action. Similarly, if a spoken command can result in output or action by multiple devices, it may be ambiguous which device should take action. In some VUIs, a special phrase, referred to as a "wake-up word," "wake word," or "keyword," is used to activate the speech recognition features of the VUI: the device implementing the VUI is always listening for the wake-up word, and when it hears the wake-up word, it parses whatever spoken command follows it. This is done to conserve processing resources by not parsing every sound that is detected, and it can help disambiguate which system was the target of the command. But if multiple systems are listening for the same wake-up word, for example because the wake-up word is associated with a service provider rather than with an individual piece of hardware, the problem remains of determining which device should handle the command.
Fig. 1 shows an exemplary system 100 in which one or more of a stand-alone microphone array 102, a smartphone 104, a loudspeaker 106, and a set of headphones 108 each have microphones that detect a user's speech. (To avoid confusion, we refer to the person speaking as the "user" and to device 106 as the "loudspeaker"; a discrete thing spoken by the user is an "utterance.") In addition, "sound," "noise," and similar words refer to audible acoustic energy. An "audio signal" refers to an electrical or optical signal that represents such sound, which may be generated by a microphone or other electronic device and may be converted back into audible acoustic energy by a loudspeaker. Each device that detects the utterance 110 transmits what it heard as an audio signal to the scheduling system 112. Where a device has multiple microphones, it may combine the signals rendered by the individual microphones to render a single combined audio signal, or it may transmit the signal rendered by each microphone.
The scheduling system 112 may be a cloud-based service to which each of the devices is separately connected, a local service running on one of the same devices or on an associated device, a distributed service running cooperatively on some or all of the devices themselves, or any combination of these or similar architectures. Owing to their different microphone designs and their different proximity to the user, each device may hear the utterance 110 differently, if at all. For example, the stand-alone microphone array 102 may have high-quality beamforming capability that allows it to hear the utterance clearly regardless of where the user is located, whereas the headphones 108 and the smartphone 104 each have highly directional near-field microphones that clearly pick up the user's voice only if the user is wearing the headphones or holding the phone up to their face, respectively. Meanwhile, the loudspeaker 106 may have a simple omnidirectional microphone that detects speech well when the user is close to and facing the loudspeaker, but otherwise produces a low-quality signal.
Based on these and similar factors, the scheduling system 112 computes a confidence score for each audio signal (this may include the devices themselves scoring their own detection before sending what they heard, and sending that score along with the corresponding audio signal). Based on a comparison of the confidence scores with each other, with a baseline, or both, the scheduling system 112 selects one or more of the audio signals for further processing. This may include performing speech recognition locally and taking direct action, or transmitting the audio signal over a network 114, such as the Internet or any private network, to another service provider. For example, if one device produces an audio signal with high confidence that the signal includes the wake-up word "OK Google," that audio signal may be sent to Google's cloud-based speech recognition system for processing. Where the audio signal is transmitted to a remote service, the wake-up word may be included along with any utterance that follows it, or only the utterance may be sent.
The confidence score may be based on a large number of factors and may indicate confidence in more than one parameter. For example, the score may indicate a degree of confidence about which wake-up word was used (and/or whether one was used at all), or about the location of the user relative to the microphone. The score may also indicate a degree of confidence that the audio signal is of high quality. In one example, the scheduling system may score the audio signals from two devices such that both have a high confidence score that a particular wake-up word was used, but one has low confidence in the quality of its audio signal while the other has high confidence in the quality of its audio signal. The audio signal with the high confidence score for signal quality would be selected for further processing.
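The two-device example above can be sketched as a small selection routine. The record fields, the 0.8 baseline, and the device names are hypothetical; the sketch only assumes that each signal carries a wake-up word confidence and a signal quality confidence.

```python
from dataclasses import dataclass

@dataclass
class ScoredSignal:
    device: str
    wake_word_confidence: float   # confidence that the wake-up word was used
    quality_confidence: float     # confidence that the audio is of high quality

def select_signal(signals, wake_word_baseline=0.8):
    """Keep signals whose wake-up word confidence meets the baseline, then
    pick the one with the best quality confidence. Returns None if no
    signal meets the baseline."""
    candidates = [s for s in signals if s.wake_word_confidence >= wake_word_baseline]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.quality_confidence)

# Usage: both devices are confident about the wake-up word, but only one
# has a high-quality signal.
signals = [
    ScoredSignal("loudspeaker", wake_word_confidence=0.93, quality_confidence=0.35),
    ScoredSignal("microphone array", wake_word_confidence=0.91, quality_confidence=0.88),
]
chosen = select_signal(signals)
```

Filtering on the wake-up word first and ranking on quality second mirrors the example in the text; a real system might instead combine the factors into a single weighted score.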
When more than one device transmits an audio signal, one of the key factors in determining confidence is whether the audio signals represent the same utterance or two (or more) different utterances. The scoring itself may be based on factors such as signal level, signal-to-noise ratio (SNR), the amount of reverberation in the signal, the spectral content of the signal, user identification, knowledge about the user's location relative to the microphones, or the relative timing of the audio signals at two or more of the devices. Location-related and user-identity-related scoring may be based on the audio signals themselves and on external data, such as vision systems, wearable trackers worn by the user, and the identity of the device providing the signal. For example, if a smartphone is the source of the audio signal, the confidence score that the owner of that smartphone is the user whose voice was heard would be high. The user's location may be determined based on the strength and timing of the audio signals received at multiple locations, or at multiple microphones within an array at a single location.
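As an illustration of one scoring factor named above, the following sketch estimates SNR in decibels from a frame containing speech and a noise-only frame, then maps it to a quality confidence between 0 and 1. The linear mapping and its 0 dB and 30 dB limits are assumptions made for the example, not values from the disclosure.

```python
import math

def mean_power(frame):
    """Average power of a frame of samples."""
    return sum(v * v for v in frame) / len(frame)

def snr_db(speech_frame, noise_frame):
    """Ratio of the power in a frame containing speech to the power in a
    noise-only frame, in decibels."""
    return 10.0 * math.log10(mean_power(speech_frame) / mean_power(noise_frame))

def quality_confidence(snr, low_db=0.0, high_db=30.0):
    """Map an SNR to a 0..1 confidence, clamped at the assumed limits."""
    return min(1.0, max(0.0, (snr - low_db) / (high_db - low_db)))

# Usage: a frame with ten times the noise amplitude is 20 dB above the noise floor.
speech_frame = [1.0, -1.0, 1.0, -1.0]
noise_frame = [0.1, -0.1, 0.1, -0.1]
snr = snr_db(speech_frame, noise_frame)
confidence = quality_confidence(snr)
```

A confidence of this kind could then be combined with the other listed factors, such as reverberation or relative timing, when the signals from several devices are compared.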
In addition to determining which wake-up word was used and which signal is best, the scoring may provide additional context that informs how the audio signal should be handled. For example, if the confidence scores indicate that the user is facing the loudspeaker, then it may be that the VUI associated with the loudspeaker should be used rather than one associated with the smartphone. Context may include such things as which user is speaking, where the user is located and facing relative to the devices, what activity the user is engaged in (e.g., exercising, cooking, watching TV), the time of day, or what other devices are in use (including devices other than those providing audio signals).
In some cases, the scoring indicates that more than one command was heard. For example, two devices may each have high confidence that they heard different wake-up words, or that they heard different users speaking. In that case, the dispatch system may send two requests: one request to each system for which a wake-up word was used, or two different requests to a single system that both users invoked. In other cases, more than one audio signal may be sent, for example to obtain more than one response, to let the remote system decide which signal to use, or to improve voice recognition by combining the signals. In addition to selecting an audio signal for further processing, the scoring may also lead to other user feedback. For example, a light may flash on whichever device was selected, so that the user knows the command was received.
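The case in which more than one command is heard can be sketched as grouping the scored signals by wake-up word and user, then keeping the best signal in each group. This is only an illustrative sketch: the dictionary keys (`device`, `wake_word`, `user`, `score`) and the threshold `min_score` are hypothetical names introduced for the example.

```python
def plan_requests(candidates, min_score=0.6):
    """Pick one audio signal per distinct (wake word, user) pair.

    Each candidate is a dict with hypothetical keys:
    'device', 'wake_word', 'user', and 'score' (0..1).
    """
    best_per_command = {}
    for c in candidates:
        if c["score"] < min_score:
            continue  # not confident enough that a command was heard
        key = (c["wake_word"], c["user"])
        current = best_per_command.get(key)
        if current is None or c["score"] > current["score"]:
            best_per_command[key] = c
    return list(best_per_command.values())
```

When every confident candidate shares the same wake-up word and user, the function returns a single request; two wake-up words or two users yield two requests, matching the behavior described above.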
Similar considerations arise when the dispatch system receives a response from whatever service or system it sent the audio signal to for processing. In many cases, the context of the utterance will also inform the handling of the response. For example, the response may be sent to the device from which the selected audio signal was received. In other cases, the response may be sent to a different device. For example, if the audio signal from the stand-alone microphone array 102 was selected, but the response returned by the VUI is to start playing an audio file, the response should be handled by the headphones 108 or the loudspeaker 106. If the response is to display information, the smartphone 104 or some other device with a screen would be used to deliver the response. If the microphone array audio signal was selected because the scoring indicated that it had the best signal quality, additional scoring may have indicated that the user is not using the headphones 108 but is in the same room as the loudspeaker 106, so the loudspeaker is the likely target for the response. Other capabilities of the devices would also be considered. For example, although only audio devices are shown, voice commands may address other systems, such as lighting or home automation systems. Hence, if the response to the utterance is to turn off the lights, the dispatch system may conclude that it refers to the lights in the room where the strongest audio signal was detected. Other possible output devices include displays, screens (e.g., the screen on the smartphone, or a television monitor), household appliances, door locks, and the like. In some examples, the context is provided to the remote system, and the remote system targets a specific output device based on the combination of the utterance and the context.
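The response handling described above can be illustrated with a small routing sketch. The response labels (`audio`, `display`, `lights`), the device dictionary keys, and the preference order (a device already in use first, then a device in the room with the strongest signal) are assumptions made for the example rather than rules stated in the description.

```python
def route_response(response_kind, devices, strongest_room):
    """Choose an output device for a response, or None if no device qualifies.

    response_kind: 'audio', 'display', or 'lights' (hypothetical labels).
    devices: list of dicts with hypothetical keys 'name', 'capabilities'
             (a set of strings), 'room', and 'in_use' (bool).
    strongest_room: room where the strongest audio signal was detected.
    """
    capable = [d for d in devices if response_kind in d["capabilities"]]
    if not capable:
        return None

    def preference(d):
        # Prefer a device the user is already using, then one in the user's room.
        return (d["in_use"], d["room"] == strongest_room)

    return max(capable, key=preference)
```

With the headphones idle and the loudspeaker in the room where the utterance was strongest, an audio response goes to the loudspeaker, and a lighting response goes to the lights in that same room.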
As described above, the dispatch system may be a single computer or a distributed system. The speech processing provided may similarly be provided by a single computer or a distributed system, coextensive with or separate from the dispatch system. Each may be located entirely locally to the devices, entirely in the cloud, or split between the two. They may be integrated into one or all of the devices. The various tasks described (scoring signals, detecting wake-up words, sending a signal to another system for processing, parsing the signal for a command, handling the command, generating a response, determining which device should handle the response, etc.) may be combined together or broken down into multiple sub-tasks. Each of the tasks and sub-tasks may be performed by a different device or combination of devices, locally or in a cloud-based or other remote system.
When we refer to microphones, we include microphone arrays, and we do not intend any restriction on particular microphone technology, topology, or signal processing. Similarly, references to loudspeakers and headphones should be understood to include any audio output device: televisions, home theater systems, doorbells, wearable speakers, etc.
Fig. 2 shows a second exemplary system 200 with smart speaker 1 (202) and smart speaker 2 (204). A smart speaker is a type of intelligent personal assistant that includes one or more microphones and one or more speakers and has processing and communication capabilities. One example of a smart speaker is the Amazon Echo. Alternatively, devices 202 and 204 may be devices that do not function as "smart speakers" but still have one or more microphones, processing capability, and communication capability. Examples of such alternative devices may include portable wireless speakers, such as a Bose wireless speaker. In some examples, two or more devices in combination (such as an Amazon Echo Dot and a Bose speaker) provide the smart speaker. System 200 also includes modification system 206. Modification system 206 is configured to derive (or receive) a plurality of audio signals from the input signals from the microphones in device 202 and/or device 204. Modification system 206 is also configured to compute a confidence score for each derived audio signal and, based on the confidence scores, to use one audio signal to modify another audio signal. The functionality of modification system 206 can be part of one or both of devices 202 and 204, and/or it can be part of a separate device that can communicate with devices 202 and 204, and/or it can be a cloud-based device or service. The cloud-based aspect is indicated by network 208. As indicated by line 203, devices 202 and 204 can communicate with each other. In a home environment, this communication is typically (but not necessarily) wireless, for example via Wi-Fi using a router. An alternative is to use direct wireless or wired communication, such as Bluetooth or a LAN.
One or more microphones of each of devices 202 and 204 detect sound from user 210 (speech) and/or from noise source 212. In general, the first device picks up the user's speech more strongly than the other device does, while the other device picks up the noise more strongly than the first device does. There are many ways in which the audio signals from devices 202 and 204 can be processed to compute a confidence as to whether a signal is based on or includes speech, or is based on or includes undesired sound (generally termed "noise" herein). One such way is to use a voice activity detector (VAD) in each of devices 202 and 204. A VAD can distinguish whether or not a sound is speech. Where system 200 is used to reduce the noise content of an audio signal that includes speech, an audio signal based on received sound that does not trigger the VAD can be considered undesired noise, and an audio signal based on received sound that does trigger the VAD can be considered to be (or at least to include) the desired speech.
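By way of illustration only, the VAD-based labelling can be sketched as follows. The detector itself is passed in as a callable, since the description does not specify one; the function name `label_streams` and the `speech_fraction` threshold are assumptions introduced for the example.

```python
def label_streams(frames_by_device, vad, speech_fraction=0.5):
    """Label each device's stream as 'speech' or 'noise'.

    frames_by_device: dict mapping a device id to a list of audio frames.
    vad: callable(frame) -> bool; a stand-in for whatever detector the device runs.
    A stream counts as speech when at least `speech_fraction` of its frames
    trigger the VAD (the threshold is an illustrative assumption).
    """
    labels = {}
    for device, frames in frames_by_device.items():
        if not frames:
            labels[device] = "noise"  # nothing received, so nothing triggered the VAD
            continue
        hits = sum(1 for frame in frames if vad(frame))
        labels[device] = "speech" if hits / len(frames) >= speech_fraction else "noise"
    return labels
```

A stream labelled `noise` is then a candidate reference for cleaning up a stream labelled `speech`.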
As indicated by dashed lines 221-224, in this non-limiting example device 202 is closer to user 210 than it is to noise source 212, and device 204 is closer to noise source 212 than it is to user 210. The system can include the ability to determine whether a device is closer to a desired sound source (e.g., the user) or to an undesired sound source (e.g., a noise source). Modification system 206 can make this determination. As described above, the determination may be made in any technically feasible manner, such as by comparing the times at which the microphones detect the sound, or by comparing the signal strengths, the signal-to-noise ratios, the spectral content, or the reverberation of the derived audio signals. In one example, device 202 in many cases picks up speech from user 210 more strongly than it picks up sound from noise source 212 (because it is closer to user 210), and the opposite holds for device 204. In this case, modification system 206 can determine that device 202 is closer to user 210 and that device 204 is closer to noise source 212. Modification system 206 can calculate the distance between sound sources 210 and/or 212 and devices 202 and/or 204. Modification system 206 can calculate the position of sound sources 210 and/or 212. In one non-limiting example, the position can be calculated by triangulation.
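One of the options listed above, comparing the times at which the microphones detect a sound, can be sketched with a brute-force cross-correlation. This is a minimal sketch under stated assumptions: the two streams share a sample clock, the delay is a whole number of samples, and the speed of sound is taken as roughly 343 m/s. The function names are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate value in air at room temperature


def estimate_lag(a, b, max_lag):
    """Return the lag in samples (positive when b is a delayed copy of a)."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += x * b[j]  # correlation of a with b shifted by `lag`
        if total > best_val:
            best_val, best_lag = total, lag
    return best_lag


def path_difference_m(lag_samples, sample_rate_hz):
    """Convert an arrival-time lag into a difference in source-to-microphone distance."""
    return lag_samples / sample_rate_hz * SPEED_OF_SOUND_M_S
```

A positive lag means the sound reached the microphone that produced `b` later, so the source is closer to the microphone that produced `a`. Path differences from several microphone pairs are the kind of input a triangulation step would use.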
The quality of the audio signal that includes the desired sound (speech) can be improved by using the audio signal derived from the noise source to modify the audio signal derived from the source that most strongly receives the speech. Thus, the audio signal derived from device 204 (which most strongly picks up noise source 212) is used to modify the audio signal derived from device 202 (which most strongly picks up the speech of user 210). The improvement in signal quality can be achieved by using modification system 206 to filter the speech-based audio signal by the noise-based audio signal. For example, the audio stream from device 204 can be used as the reference for an adaptive filter applied to the audio stream from device 202, to further reduce the noise that device 202 receives from noise source 212. Adaptive filtering of audio signals is well known in the art and therefore will not be discussed further here.
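For readers unfamiliar with the technique, the following is a minimal sketch of one common adaptive filter, normalized least mean squares (NLMS), used as a noise canceller. The description does not name a specific algorithm, so the choice of NLMS, the function name, and the parameter values (`taps`, `mu`, `eps`) are assumptions for illustration.

```python
def nlms_cancel(primary, reference, taps=8, mu=0.5, eps=1e-8):
    """Subtract the part of `primary` that is predictable from `reference`.

    primary: samples from the device that best hears the speech (e.g., device 202).
    reference: samples from the device that best hears the noise (e.g., device 204).
    Returns the error signal, i.e. the primary signal with the estimated noise removed.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate
        power = eps + sum(h * h for h in history)
        # Normalized LMS update: step size scaled by the reference power.
        weights = [w + mu * error * h / power for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned
```

The sketch assumes the two streams are time-aligned and sampled at the same rate; in practice the reference would also contain some speech, which limits how aggressively the filter can adapt.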
In this example, devices 202 and 204 can be at different locations in a common area, such as a room in a home or a business meeting room. In one case, a common area can be considered to be any area in which devices 202 and 204 both pick up some sound from noise source 212. When devices 202 and 204 are smart speakers, or other devices that include one or more microphones along with processing and communication capabilities, user 210 may speak a command intended for one or both of devices 202 and 204. At the same time, a television or a refrigerator may be running, or one of devices 202 and 204 may be playing music. Any such non-speech sound (termed "noise") can interfere with the proper reception and use of the voice command. Accordingly, reducing noise in the desired signal (the signal carrying the speech/voice command) helps improve the functionality of the smart speaker or other device that most strongly receives the speech.
The multiple (two or more) microphones at different locations may comprise one or more microphones of each of two or more distinct devices (e.g., two devices each with one or more microphones), or may comprise multiple microphones of a single device. In the first case, the multiple microphones of each device can be spatially focused on a desired sound source (the user or the noise source), for example by beamforming. When a single device includes the multiple microphones that are used, beamforming can be used to point one beam at the noise source and a different beam at the target source (the user). When the same microphones are used for both beams, the beams can be sequential, or, if the device has a sufficient number of microphones, the beams can be concurrent.
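As a simple illustration of beamforming, the sketch below implements a delay-and-sum beamformer with whole-sample delays. The description does not specify a beamforming method, so delay-and-sum, the function name, and the integer-delay simplification are assumptions for the example.

```python
def delay_and_sum(channels, delays):
    """Steer a beam by advancing each channel by its delay and averaging.

    channels: list of equal-length sample lists, one per microphone.
    delays: per-channel integer delays in samples for the chosen direction
            (how much later the sound reaches that microphone).
    Samples that fall outside a channel are treated as zero.
    """
    n = len(channels[0])
    out = []
    for i in range(n):
        total = 0.0
        for channel, delay in zip(channels, delays):
            j = i + delay
            if 0 <= j < n:
                total += channel[j]
        out.append(total / len(channels))
    return out
```

Calling the function twice on the same channels, once with the delays for the user's direction and once with the delays for the noise source's direction, yields the two beams described above.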
In the situation shown in Fig. 2, devices 202 and 204 may each communicate wirelessly with each other and with modification system 206. In many cases, system 206 will be implemented using the processing of one of devices 202 or 204, so there is no separate device that includes system 206. Another alternative is to implement system 206 in a remote device (e.g., in cloud 208). In one scenario, device 204, which picks up the noise, streams its processed audio signal to device 202. Device 202 then uses the incoming noise-based audio stream as the reference in an adaptive filter to reduce the noise content of the audio signal from device 202, which includes the desired speech.
Embodiments of the systems and methods described above comprise computer components and computer-implemented steps that will be apparent to those skilled in the art. For example, those skilled in the art will understand that instructions for performing the computer-implemented steps may be stored as computer-executable instructions on a computer-readable medium such as a floppy disk, hard disk, optical disk, flash ROM, nonvolatile ROM, or RAM. Furthermore, those skilled in the art will understand that the computer-executable instructions may be executed on a variety of processors, such as microprocessors, digital signal processors, gate arrays, etc. For ease of exposition, not every step or element of the systems and methods described above is described herein as part of a computer system, but those skilled in the art will recognize that each step or element may have a corresponding computer system or software component. Such computer system and/or software components are therefore enabled by describing their corresponding steps or elements (that is, their functionality), and are within the scope of this disclosure.
A number of embodiments have been described. Nevertheless, it will be understood that additional modifications may be made without departing from the scope of the inventive concepts described herein, and, accordingly, other embodiments are within the scope of the following claims.
Claims (24)
1. A system, comprising:
a plurality of microphones positioned at different locations; and
a modification system in communication with the microphones and configured to:
derive a plurality of audio signals from the plurality of microphones,
compute a confidence score for each derived audio signal, and
based on the computed confidence scores, use one derived audio signal to modify another audio signal.
2. The system according to claim 1, wherein computing the confidence score for each derived audio signal comprises computing a confidence as to whether the derived audio signal includes speech and whether the derived audio signal includes non-speech sound.
3. The system according to claim 1, wherein computing the confidence score for each derived audio signal comprises determining whether the derived audio signal is a speech signal.
4. The system according to claim 1, wherein using one derived audio signal to modify another audio signal comprises filtering a first audio signal by a second audio signal.
5. The system according to claim 4, wherein filtering the first audio signal by the second audio signal comprises using the second audio signal as a reference for an adaptive filter applied to the first audio signal.
6. The system according to claim 1, wherein the number of derived audio signals differs from the number of microphones.
7. The system according to claim 1, wherein at least one of the microphones comprises a microphone array.
8. The system according to claim 7, wherein a first microphone array is spatially focused on a first sound target.
9. The system according to claim 8, wherein a second microphone array is spatially focused on a second sound target.
10. The system according to claim 9, wherein the first sound target comprises a human voice.
11. The system according to claim 10, wherein the second sound target comprises a noise source.
12. The system according to claim 1, wherein a first microphone is part of a first device and a second microphone is part of a second device, and wherein a first audio signal is derived from the first microphone and a second audio signal is derived from the second microphone.
13. The system according to claim 12, wherein the second device transmits the second audio signal to the first device.
14. The system according to claim 13, wherein the first device uses the second audio signal to modify the first audio signal.
15. The system according to claim 14, wherein the first device uses the second audio signal to reduce noise in the first audio signal.
16. The system according to claim 1, wherein a first microphone and a second microphone are both part of a first device.
17. The system according to claim 16, wherein a first audio signal is derived from the first microphone and a second audio signal is derived from the second microphone.
18. The system according to claim 17, wherein the second audio signal is used to reduce noise in the first audio signal.
19. The system according to claim 1, wherein the plurality of microphones are part of a first device.
20. The system according to claim 19, wherein the first device spatially focuses the plurality of microphones of the first device on a first separate sound source and a second separate sound source, wherein a first audio signal is derived from the first sound source and a second audio signal is derived from the second sound source.
21. The system according to claim 20, wherein the second audio signal is used to reduce noise in the first audio signal.
22. A system, comprising:
a plurality of microphones positioned at different locations, wherein a first microphone is part of a first device and a second microphone is part of a second device;
wherein the first device is operated to derive a first audio signal from the first microphone, the second device is operated to derive a second audio signal from the second microphone, and the second device is adapted to transmit the second audio signal to the first device; and
a modification system that is part of the first device and is responsive to the first audio signal and the second audio signal, wherein the modification system uses the second audio signal to reduce noise in the first audio signal.
23. A system, comprising:
a plurality of microphones that are part of a first device, including a first microphone and a second microphone;
wherein the first device is operated to derive a first audio signal from the first microphone and a second audio signal from the second microphone; and
a modification system that is part of the first device and is responsive to the first audio signal and the second audio signal, wherein the modification system uses the second audio signal to reduce noise in the first audio signal.
24. A system, comprising:
a plurality of microphones that are part of a first device;
wherein the first device spatially focuses the plurality of microphones of the first device on a first separate sound source and a second separate sound source, wherein a first audio signal is derived from the first sound source and a second audio signal is derived from the second sound source;
wherein the first device is operated to derive the first audio signal from the first sound source and the second audio signal from the second sound source; and
a modification system that is part of the first device and is responsive to the first audio signal and the second audio signal, wherein the modification system uses the second audio signal to reduce noise in the first audio signal.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2017/053177 WO2019059939A1 (en) | 2017-09-25 | 2017-09-25 | Processing speech from distributed microphones |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110121744A true CN110121744A (en) | 2019-08-13 |
Family
ID=60043303
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201780075396.8A Pending CN110121744A (en) | 2017-09-25 | 2017-09-25 | Handle the voice from distributed microphone |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP3539128A1 (en) |
JP (1) | JP2019537071A (en) |
CN (1) | CN110121744A (en) |
WO (1) | WO2019059939A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111048067A (en) * | 2019-11-11 | 2020-04-21 | 云知声智能科技股份有限公司 | Microphone response method and device |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140301558A1 (en) * | 2013-03-13 | 2014-10-09 | Kopin Corporation | Dual stage noise reduction architecture for desired signal extraction |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008236077A (en) * | 2007-03-16 | 2008-10-02 | Kobe Steel Ltd | Target sound extracting apparatus, target sound extracting program |
US9113240B2 (en) * | 2008-03-18 | 2015-08-18 | Qualcomm Incorporated | Speech enhancement using multiple microphones on multiple devices |
JP5724125B2 (en) * | 2011-03-30 | 2015-05-27 | 株式会社国際電気通信基礎技術研究所 | Sound source localization device |
JP5958218B2 (en) * | 2011-09-15 | 2016-07-27 | 株式会社Jvcケンウッド | Noise reduction device, voice input device, wireless communication device, and noise reduction method |
US10229697B2 (en) * | 2013-03-12 | 2019-03-12 | Google Technology Holdings LLC | Apparatus and method for beamforming to obtain voice and noise signals |
US10026399B2 (en) * | 2015-09-11 | 2018-07-17 | Amazon Technologies, Inc. | Arbitration between voice-enabled devices |
-
2017
- 2017-09-25 JP JP2019530671A patent/JP2019537071A/en active Pending
- 2017-09-25 CN CN201780075396.8A patent/CN110121744A/en active Pending
- 2017-09-25 WO PCT/US2017/053177 patent/WO2019059939A1/en unknown
- 2017-09-25 EP EP17781254.2A patent/EP3539128A1/en not_active Withdrawn
Also Published As
Publication number | Publication date |
---|---|
JP2019537071A (en) | 2019-12-19 |
WO2019059939A1 (en) | 2019-03-28 |
EP3539128A1 (en) | 2019-09-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10149049B2 (en) | Processing speech from distributed microphones | |
CN109155130A (en) | Handle the voice from distributed microphone | |
EP3122066B1 (en) | Audio enhancement via opportunistic use of microphones | |
US11922095B2 (en) | Device selection for providing a response | |
CN107465974B (en) | Sound signal detector | |
CN107465970B (en) | Apparatus for voice communication | |
US10089980B2 (en) | Sound reproduction method, speech dialogue device, and recording medium | |
JP2023542968A (en) | Hearing enhancement and wearable systems with localized feedback | |
US20210225374A1 (en) | Method and system of environment-sensitive wake-on-voice initiation using ultrasound | |
CN113228710A (en) | Sound source separation in hearing devices and related methods | |
CN110121744A (en) | Handle the voice from distributed microphone | |
EP4184507A1 (en) | Headset apparatus, teleconference system, user device and teleconferencing method | |
US11882415B1 (en) | System to select audio from multiple connected devices | |
US20230035531A1 (en) | Audio event data processing | |
WO2023010011A1 (en) | Processing of audio signals from multiple microphones | |
WO2023010012A1 (en) | Audio event data processing | |
EP4005249A1 (en) | Estimating user location in a system including smart audio devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20190813 |