CN110121744A - Processing speech from distributed microphones - Google Patents


Info

Publication number
CN110121744A
Authority
CN
China
Prior art keywords
audio signal
equipment
microphone
derived
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201780075396.8A
Other languages
Chinese (zh)
Inventor
A·莫吉米
D·克里斯特
W·贝拉迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bose Corp
Original Assignee
Bose Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bose Corp
Publication of CN110121744A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention provides a system having multiple microphones positioned at different locations and a modification system in communication with the microphones. The modification system is configured to derive multiple audio signals from the microphones; calculate a confidence score for each derived audio signal; and, based on the calculated confidence scores, use one derived audio signal to modify another audio signal.

Description

Processing speech from distributed microphones
Background
This disclosure relates to processing speech from distributed microphones.
Current speech recognition systems assume that a single microphone or microphone array is listening to the user speak and takes action based on that speech. The action may include local speech recognition and response, cloud-based recognition and response, or a combination of these. In some cases, a "wake-up word" is recognized locally, and further processing is provided remotely based on the wake-up word.
Distributed loudspeaker systems can coordinate audio playback at multiple loudspeakers located around a home, so that playback is synchronized between locations.
Summary
In general, in one aspect, a system includes multiple microphones positioned at different locations and a scheduling system in communication with the microphones. The scheduling system derives multiple audio signals from the microphones, calculates a confidence score for each derived audio signal, and compares the calculated confidence scores. Based on the comparison, the scheduling system selects at least one of the derived audio signals for further processing.
Implementations may include one or more of the following, in any combination. The scheduling system may include multiple local processors, each connected to at least one of the microphones. The scheduling system may include at least a first local processor and at least a second processor available to the first processor over a network. Calculating the confidence score for each derived audio signal may include calculating a confidence in one or more of: whether the signal may include speech, whether the signal may include a wake-up word, which wake-up word the signal may include, the quality of the speech in the signal, the identity of the user whose voice may be recorded in the signal, and the position of the user relative to the microphone positions. Calculating the confidence score for each derived audio signal may also include determining whether the audio signal appears to include an utterance and whether the utterance includes a wake-up word. Calculating the confidence score for each derived audio signal may further include identifying which of multiple wake-up words is included in the speech. Calculating the confidence score for each derived audio signal may also include determining a confidence that the speech includes a wake-up word.
Calculating the confidence score for each derived audio signal may include comparing one or more of: the timing with which the microphones detected the sound corresponding to each audio signal, the signal strength of the derived audio signals, the signal-to-noise ratio of the derived audio signals, the spectral content of the derived audio signals, and the reverberation in the derived audio signals. Calculating the confidence score for each derived audio signal may include calculating, for each audio signal, the distance between the apparent source of the audio signal and at least one of the microphones. Calculating the confidence score for each derived audio signal may include calculating the position of each audio signal's source relative to the microphone positions. Calculating the position of each audio signal's source may include triangulating the position based on the calculated distances between the source and at least two of the microphones.
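The position calculation described above can be sketched as follows. This is only an illustration under stated assumptions: it works in two dimensions, uses three microphones so that the solution is unique, and assumes the source-to-microphone distances have already been estimated. The function name `trilaterate` and its parameters are hypothetical and do not come from the disclosure.

```python
def trilaterate(mics, dists):
    """Estimate a 2-D source position from distances to three microphones.

    mics  -- three (x, y) microphone positions (assumed known)
    dists -- the three estimated source-to-microphone distances
    """
    (x1, y1), (x2, y2), (x3, y3) = mics
    d1, d2, d3 = dists
    # Subtracting the circle equation of microphone 1 from those of
    # microphones 2 and 3 leaves a 2x2 linear system in (x, y).
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("microphones are collinear; position is ambiguous")
    # Solve the system with Cramer's rule.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y
```

With only two microphones the same equations generally leave two candidate positions, so a practical system would need extra information (for example, signal strength or room geometry) to choose between them.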
The scheduling system may transmit at least part of the selected signal or signals to a speech processing system for further processing. Transmitting the selected audio signal or signals may include selecting at least one speech processing system from multiple speech processing systems. At least one of the speech processing systems may include a speech recognition service provided over a wide area network. At least one of the speech processing systems may include a speech recognition process executed on the same processor that executes the scheduling system. The selection of the speech processing system may be based on one or more of a preference associated with the user, the calculated confidence scores, and the context in which the audio signals were derived. The context may include one or more of an identification of the user who may be speaking, which of the microphones produced the selected derived audio signals, the position of the user relative to the microphone positions, the operating state of other devices in the system, and the time of day. The selection of the speech processing system may be based on the resources available to the speech processing systems.
Comparing the calculated confidence scores may include determining that at least two selected audio signals appear to include utterances from at least two different users. This determination may be based on one or more of voice recognition, the positions of the users relative to the positions of the microphones, which of the microphones produced each selected audio signal, the use of different wake-up words in the two selected audio signals, and visual identification of the users. The scheduling system may also send the selected audio signals corresponding to the two different users to two different selected speech processing systems. The selected audio signals may be assigned to the selected speech processing systems based on one or more of the users' preferences, load balancing of the speech processing systems, the context of the selected audio signals, and the use of different wake-up words in the two selected audio signals. The scheduling system may also send the selected audio signals corresponding to the two different users to the same speech processing system as two separate processing requests.
Comparing the calculated confidence scores may include determining that at least two received audio signals appear to represent the same utterance. This determination may be based on one or more of voice recognition, the position of the audio signal source relative to the microphone positions, which of the microphones produced each selected audio signal, the arrival time of the audio signals, correlation between the audio signals or between the outputs of microphone array elements, pattern matching, and visual identification of the person speaking. The scheduling system may send only one of the audio signals that appear to represent the same utterance to the speech processing system. The scheduling system may instead send both of the audio signals that appear to represent the same utterance to the speech processing system. The scheduling system may also transmit at least one selected audio signal to each of at least two speech processing systems, receive a response from each of the speech processing systems, and determine an order in which to output the responses.
The scheduling system may also transmit at least two selected audio signals to at least one speech processing system, receive a response from the speech processing system corresponding to each transmitted signal, and determine an order in which to output the responses. The scheduling system may be further configured to receive a response to the further processing and output the response using an output device. The output device may not correspond to the microphone that captured the audio. The output device may not be located at any of the positions where the microphones are located. The output device may include one or more of a loudspeaker, headphones, a wearable audio device, a display, a video screen, or a household appliance. Upon receiving multiple responses to the further processing, the scheduling system may determine the order in which to output the responses by combining the responses into a single output. Upon receiving multiple responses to the further processing, the scheduling system may determine the order in which to output the responses by selecting fewer than all of the responses to output, or by sending different responses to different output devices. The number of derived audio signals may differ from the number of microphones. At least one of the microphones may include a microphone array. The system may further include non-audio input devices. The non-audio input devices may include one or more of an accelerometer, a presence detector, a camera, a wearable sensor, or a user interface device.
In general, in one aspect, a system includes multiple devices positioned at different locations and a scheduling system in communication with the devices. The scheduling system receives a response from a speech processing system in response to a previously transmitted request, determines the relevance of the response to each of the devices, and, based on that determination, forwards the response to at least one of the devices.
Implementations may include one or more of the following, in any combination. At least one of the devices may include an audio output device, and forwarding the response may cause that device to output audio signals corresponding to the response. The audio output device may include one or more of a loudspeaker, headphones, or a wearable audio device. At least one of the devices may include a display, a video screen, or a household appliance. The previously transmitted request may have been transmitted from a third location not associated with any of the locations of the devices. The response may be a first response, and the scheduling system may also receive a response from a second speech processing system. The scheduling system may forward the first response to a first one of the devices and the second response to a second one of the devices. The scheduling system may forward both the first response and the second response to the first one of the devices. The scheduling system may forward only one of the first response and the second response to any of the devices.
Determining the relevance of the response may include determining which of the devices is associated with the previously transmitted request. Determining the relevance of the response may include determining which of the devices may be closest to the user associated with the previously transmitted request. Determining the relevance of the response may be based on preferences associated with a user of the claimed system. Determining the relevance of the response may include determining the context of the previously transmitted request. The context may include one or more of an identification of the user who may be associated with the request, which of multiple microphones may be associated with the request, the position of the user relative to the device locations, the operating state of other devices in the system, and the time of day. Determining the relevance of the response may include determining the capabilities or resource availability of the devices.
Multiple output devices may be positioned at different output device locations, and the scheduling system may also receive a response from the speech processing system in response to the transmitted request, determine the relevance of the response to each of the output devices, and, based on that determination, forward the response to at least one of the output devices. At least one of the output devices may include an audio output device, and forwarding the response may cause that device to output audio signals corresponding to the response. The audio output device may include one or more of a loudspeaker, headphones, or a wearable audio device. At least one of the output devices may include a display, a video screen, or a household appliance. Determining the relevance of the response may include determining a relationship between the output devices and the microphones associated with the selected audio signals. Determining the relevance of the response may include determining which of the output devices may be closest to the source of the selected audio signals. Determining the relevance of the response may include determining the context in which the audio signals were derived. The context may include one or more of an identification of the user who may be speaking, which of the microphones produced the selected derived audio signals, the position of the user relative to the microphone positions and the device locations, the operating state of other devices in the system, and the time of day. Determining the relevance of the response may include determining the capabilities or resource availability of the output devices.
In general, in one aspect, a system includes multiple microphones positioned at different microphone locations, multiple loudspeakers positioned at different loudspeaker locations, and a scheduling system in communication with the microphones and the loudspeakers. The scheduling system derives multiple voice signals from the microphones; calculates, for each derived voice signal, a confidence score that the signal includes a wake-up word; compares the calculated confidence scores; and, based on the comparison, selects at least one of the derived voice signals and transmits at least part of the selected signal or signals to a speech processing system. The scheduling system receives a response from the speech processing system in response to the transmission, determines the relevance of the response to each of the loudspeakers, and, based on that determination, forwards the response to at least one of the loudspeakers for output.
In general, in another aspect, a system includes multiple microphones positioned at different locations and a modification system in communication with the microphones. The modification system is configured to derive multiple audio signals from the microphones; calculate a confidence score for each derived audio signal; and, based on the calculated confidence scores, use one derived audio signal to modify another audio signal.
Calculating the confidence score for each derived audio signal may include calculating a confidence in whether the derived audio signal includes speech and whether it includes non-speech sounds. Calculating the confidence score for each derived audio signal may include determining whether the derived audio signal is a voice signal. Using one derived audio signal to modify another audio signal may include filtering a first audio signal using a second audio signal. Filtering the first audio signal using the second audio signal may include using the second audio signal as an adaptive filter reference for the first audio signal. The number of derived audio signals may differ from the number of microphones.
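As an illustration of using a second audio signal as the reference of an adaptive filter applied to a first audio signal, the sketch below uses a normalized least-mean-squares (NLMS) canceller. The disclosure does not name a particular adaptive algorithm, so NLMS is an assumption chosen for simplicity, and the function name `nlms_cancel` and its parameters (`taps`, `mu`, `eps`) are hypothetical.

```python
def nlms_cancel(primary, reference, taps=8, mu=0.1, eps=1e-8):
    """Reduce noise in `primary` using `reference` as the adaptive filter input.

    primary   -- samples of the first audio signal (speech plus noise)
    reference -- samples of the second audio signal (mostly noise)
    Returns the error signal, which approximates the speech in `primary`.
    """
    weights = [0.0] * taps   # adaptive filter coefficients
    history = [0.0] * taps   # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        # Filter output: current estimate of the noise present in `primary`.
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate
        # Normalized LMS update; eps avoids division by zero on silence.
        power = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / power
                   for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned
```

The filter can only remove the part of the first signal that is correlated with the reference, which is why a reference that contains mostly noise and little speech is the useful case.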
At least one of the microphones may include a microphone array. A first microphone array may be spatially focused on a first sound target. A second microphone array may be spatially focused on a second sound target. The first sound target may include a human voice. The second sound target may include a noise source.
A first microphone may be part of a first device and a second microphone may be part of a second device, with the first audio signal derived from the first microphone and the second audio signal derived from the second microphone. The second device may transmit the second audio signal to the first device. The first device may use the second audio signal to modify the first audio signal. The first device may use the second audio signal to reduce noise in the first audio signal.
The first microphone and the second microphone may both be part of a first device. The first audio signal may be derived from the first microphone and the second audio signal from the second microphone. The second audio signal may be used to reduce noise in the first audio signal. Multiple microphones may be part of a first device. The first device may spatially focus its multiple microphones on a first separate sound source and a second separate sound source, with the first audio signal derived from the first sound source and the second audio signal derived from the second sound source. The second audio signal may be used to reduce noise in the first audio signal.
In general, in another aspect, a system includes multiple microphones positioned at different locations, where a first microphone is part of a first device and a second microphone is part of a second device. The first device operates to derive a first audio signal from the first microphone, the second device operates to derive a second audio signal from the second microphone, and the second device is adapted to transmit the second audio signal to the first device. A modification system that is part of the first device is responsive to the first audio signal and the second audio signal, and uses the second audio signal to reduce noise in the first audio signal.
In general, in another aspect, a system includes multiple microphones that are part of a first device, including a first microphone and a second microphone. The first device operates to derive a first audio signal from the first microphone and a second audio signal from the second microphone. A modification system is part of the first device and is responsive to the first audio signal and the second audio signal, and uses the second audio signal to reduce noise in the first audio signal.
In general, in another aspect, a system includes multiple microphones that are part of a first device, where the first device spatially focuses its multiple microphones on a first separate sound source and a second separate sound source. The first device operates to derive a first audio signal from the first sound source and a second audio signal from the second sound source. A modification system is part of the first device and is responsive to the first audio signal and the second audio signal, and uses the second audio signal to reduce noise in the first audio signal.
Advantages include detecting a spoken command at multiple locations and providing a single response to the command. Advantages also include providing the response to a spoken command at a location that is more relevant to the user than the location where the command was detected.
All examples and features mentioned above can be combined in any technically possible way. Other features and advantages will be apparent from the detailed description and from the claims.
Brief description of the drawings
Fig. 1 shows a system layout of microphones and of devices that can respond to voice commands received by the microphones.
Fig. 2 shows a system in which one audio signal can be used to modify another audio signal.
Detailed description
As more and more devices implement voice-controlled user interfaces (VUIs), a problem arises: multiple devices may detect the same spoken command and attempt to process it, leading to problems ranging from redundant responses to contradictory actions taken at different points of operation. Similarly, if a spoken command could result in output or action by multiple devices, which device should act may be ambiguous. In some VUIs, a special phrase called a "wake-up word," "wake word," or "keyword" is used to activate the speech recognition features of the VUI: the device implementing the VUI always listens for the wake-up word, and when it hears it, it parses whatever spoken command follows. This is done to conserve processing resources by not parsing every sound that is detected, and it can help disambiguate which system is the target of a command. But if multiple systems are listening for the same wake-up word, for example because the wake-up word is associated with a service provider rather than with an individual piece of hardware, the problem remains of determining which device should process the command.
Fig. 1 shows an exemplary system 100 in which one or more of a stand-alone microphone array 102, a smartphone 104, a loudspeaker 106, and a set of headphones 108 each have microphones that detect a user's speech. (To avoid confusion, we refer to the person speaking as the "user" and to device 106 as the "loudspeaker"; discrete things spoken by the user are "utterances.") In addition, "sound," "noise," and similar words refer to audible acoustic energy. "Audio signal" refers to an electrical or optical signal that represents such sound; it may be generated by a microphone or other electronic device and may be converted back to audible acoustic energy by a loudspeaker. Each device that detects the utterance 110 transmits what it heard as an audio signal to the scheduling system 112. Where a device has multiple microphones, it may combine the signals rendered by the individual microphones to render a single combined audio signal, or it may transmit the signal rendered by each microphone.
The scheduling system 112 may be a cloud-based service to which each of the devices is individually connected, a local service running on one of the same devices or on an associated device, a distributed service running cooperatively on some or all of the devices themselves, or any combination of these or similar architectures. Because of their different microphone designs and their different proximity to the user, each device may hear the utterance 110 differently, if at all. For example, the stand-alone microphone array 102 may have a high-quality beamforming capability that lets it clearly hear an utterance wherever the user is located, while the headphones 108 and the smartphone 104 each have highly directional near-field microphones that pick up the user's voice clearly only if the user is wearing the headphones or holding the phone up to their face, respectively. Meanwhile, the loudspeaker 106 may have a simple omnidirectional microphone that detects speech well when the user is close to and facing the loudspeaker, but otherwise produces a low-quality signal.
Based on these and similar factors, the scheduling system 112 calculates a confidence score for each audio signal (this may include the devices themselves scoring their own detections before sending what they heard, and sending that score along with the corresponding audio signal). Based on a comparison of the confidence scores with each other and/or with a baseline, the scheduling system 112 selects one or more of the audio signals for further processing. This may include performing speech recognition locally and taking direct action, or transmitting the audio signal over a network 114 (such as the Internet or any private network) to another service provider. For example, if one device produces an audio signal with high confidence that the signal includes the wake-up word "OK Google," that audio signal may be sent to Google's cloud-based speech recognition system for processing. When the audio signal is transmitted to a remote service, the wake-up word may be included along with whatever utterance follows it, or only the utterance may be sent.
The confidence score may be based on a large number of factors and may indicate confidence in more than one parameter. For example, the score may indicate confidence in which wake-up word was used (and/or whether one was used at all) or in where the user was located relative to the microphone. The score may also indicate confidence in whether the audio signal is of high quality. In one example, the scheduling system may score audio signals from two devices: both have a high confidence score that a particular wake-up word was used, but one has low confidence in the quality of its audio signal while the other has high confidence. The audio signal with the high confidence score for signal quality would be selected for further processing.
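The selection logic in the two preceding paragraphs can be sketched roughly as below. This is a minimal illustration, assuming each signal already carries two confidence values between 0 and 1; the class `ScoredSignal`, the function `select_signals`, and the `baseline` and `margin` thresholds are hypothetical names and values, not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass
class ScoredSignal:
    device: str             # which device produced the audio signal
    wake_word_conf: float   # confidence that a wake-up word was used
    quality_conf: float     # confidence that the audio is of high quality


def select_signals(signals, baseline=0.5, margin=0.1):
    """Select the audio signals to pass on for further processing."""
    # Compare against a baseline: drop signals unlikely to contain a wake-up word.
    candidates = [s for s in signals if s.wake_word_conf >= baseline]
    if not candidates:
        return []
    # Compare the remaining scores with each other: keep those whose quality
    # confidence is within `margin` of the best candidate.
    best = max(s.quality_conf for s in candidates)
    return [s for s in candidates if best - s.quality_conf <= margin]
```

For instance, given a phone and a microphone array that both score high on the wake-up word but differ in quality confidence, only the array's signal is returned.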
When more than one device transmits an audio signal, one of the key things to determine confidence in is whether the audio signals represent the same utterance or two (or more) different utterances. The scoring itself may be based on factors such as signal level, signal-to-noise ratio (SNR), the amount of reverberation in the signal, the spectral content of the signal, user identification, knowledge of the user's position relative to the microphones, or the relative timing of the audio signals at two or more of the devices. Position-related and user-identity-related scoring may be based both on the audio signals themselves and on external data, such as vision systems, wearable trackers worn by the user, and the identity of the device providing the signal. For example, if a smartphone is the source of the audio signal, the confidence score that the owner of that smartphone is the user whose voice was heard will be high. The user's position may be determined based on the strength and timing of the audio signals received at multiple microphones, either at multiple locations or in an array at a single location.
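One way to test whether two audio signals represent the same utterance, and to recover their relative timing, is normalized cross-correlation. The sketch below is an assumption about how such a check might look, not the method of the disclosure; the names `best_alignment` and `same_utterance` and the `threshold` value are hypothetical, and both signals are assumed to share a sample rate.

```python
import math


def best_alignment(a, b, max_lag):
    """Return (lag, score) for the shift of b relative to a, in samples,
    that maximizes the normalized cross-correlation of the two signals."""
    energy = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if energy == 0.0:
        return 0, 0.0
    best_lag, best_score = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += x * b[j]
        score = total / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


def same_utterance(a, b, max_lag=50, threshold=0.6):
    """Guess whether a and b capture the same utterance.

    Returns (is_same, lag); a positive lag means the sound reached
    the microphone that produced b later than the one that produced a.
    """
    lag, score = best_alignment(a, b, max_lag)
    return score >= threshold, lag
```

The lag found this way is the kind of relative timing that could feed a position estimate, while the peak score serves as a crude confidence that the two devices heard the same thing.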
In addition to determining which wake-up word was used and which signal is best, the scoring may provide additional context that informs how the audio signal should be handled. For example, if the confidence scores indicate that the user was facing the loudspeaker, then the VUI associated with the loudspeaker should probably be used rather than the VUI associated with the smartphone. The context may include such things as which user was speaking, where the user was located and facing relative to the devices, what activity the user was engaged in (e.g., exercising, cooking, watching TV), the time of day, or which other devices were in use (including devices other than those providing audio signals).
In some cases, the scoring indicates that more than one command was heard. For example, two devices may each have high confidence that they heard different wake-up words, or that they heard different users speaking. In that case, the scheduling system may send two requests: one request to each system for which a wake-up word was used, or two different requests to a single system invoked by both users. In other cases, more than one of the audio signals may be transmitted, for example to obtain more than one response, to let the remote system decide which signal to use, or to improve voice recognition by combining the signals. In addition to selecting audio signals for further processing, the scoring may also lead to other user feedback. For example, a light may flash on whichever device was selected, so that the user knows the command was received.
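The case of several commands heard at once can be sketched as grouping detections by wake-up word and planning one request per word. This is an illustrative assumption: the function `plan_requests`, the tuple layout of `detections`, and the `threshold` value are hypothetical, and the second wake-up word in the usage note is a placeholder.

```python
def plan_requests(detections, threshold=0.7):
    """Plan one request per confidently detected wake-up word.

    detections -- iterable of (device, wake_word, confidence) tuples
    Returns a dict mapping each wake-up word to the device whose audio
    signal should be sent to the system that word invokes.
    """
    best = {}
    for device, word, conf in detections:
        if conf < threshold:
            continue  # not confident enough that this word was heard
        # Keep only the most confident detection of each wake-up word.
        if word not in best or conf > best[word][1]:
            best[word] = (device, conf)
    return {word: device for word, (device, _) in best.items()}
```

Called with detections of "OK Google" on a loudspeaker and a phone plus a second, placeholder wake-up word on a microphone array, the function returns two entries, one per word, each naming the device with the most confident detection.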
Similar considerations arise when the scheduling system receives a response from whatever service or system it sent the audio signal to for processing. In many cases, the context around the utterance will also inform the handling of the response. For example, the response may be sent to the device from which the selected audio signal was received. In other cases, the response may be sent to a different device. For example, if the audio signal from the stand-alone microphone array 102 was selected, but the response returned from the VUI is to play an audio file, the response should be handled by the headphones 108 or the loudspeaker 106. If the response is to display information, the smartphone 104 or some other device with a screen would be used to deliver the response. If the microphone array's audio signal was selected because the scoring indicated it had the best signal quality, additional scoring may have indicated that the user was not using the headphones 108 but was in the same room as the loudspeaker 106, so the loudspeaker is the likely target for the response. Other capabilities of the devices will also be considered; for example, although only audio devices are shown, voice commands may address other systems, such as lighting or home automation systems. Thus, if the response to the utterance is to turn off the lights, the scheduling system may conclude that this refers to the lights in the room where the strongest audio signal was detected. Other possible output devices include displays, screens (e.g., the screen on a smartphone, or a television monitor), household appliances, door locks, and the like. In some examples, the context is provided to the remote system, and the remote system targets a specific output device based on the combination of the utterance and the context.
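The response routing described above can be sketched as a capability filter followed by a context-based preference. This is a simplified illustration under stated assumptions: each device is described by a dictionary with hypothetical `name`, `capabilities`, `room`, and `available` fields, and `route_response` is a hypothetical function name.

```python
def route_response(kind, devices, source_room):
    """Choose the output device for a response of the given kind.

    kind        -- capability the response needs, e.g. "audio" or "display"
    devices     -- list of dicts with "name", "capabilities", "room", "available"
    source_room -- room in which the selected audio signal was captured
    Returns the chosen device name, or None if no device can handle it.
    """
    # Capability check: only devices that can render this kind of response
    # and are currently usable (for example, headphones that are being worn).
    capable = [d for d in devices
               if kind in d["capabilities"] and d["available"]]
    if not capable:
        return None
    # Context check: prefer a device in the room where the user was heard.
    return max(capable, key=lambda d: d["room"] == source_room)["name"]
```

With a microphone array that has no output capability, headphones that are not being worn, a loudspeaker in the same room as the user, and a smartphone elsewhere, an audio response goes to the loudspeaker and a display response goes to the smartphone.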
As described above, the scheduling system may be a single computer or a distributed system. The speech processing may similarly be provided by a single computer or a distributed system, which may be coextensive with or separate from the scheduling system. Each may be located entirely local to the devices, entirely in the cloud, or split between the two. They may be integrated into one or all of the devices. The various tasks (scoring signals, detecting wake-up words, sending a signal to another system for processing, parsing the signal for a command, handling the command, generating a response, determining which device should handle the response, and so on) may be combined together or split into multiple sub-tasks. Each task and sub-task may be performed by a different device or combination of devices, locally or in a cloud-based or other remote system.
When we refer to microphones, we include microphone arrays, and we do not intend any limitation to a particular microphone technology, topology, or signal processing. Similarly, references to loudspeakers and earphones should be understood to include any audio output device, such as televisions, home theater systems, doorbells, wearable loudspeakers, and so on.
Fig. 2 shows a second exemplary system 200 with smart speaker 1 (202) and smart speaker 2 (204). A smart speaker is a type of intelligent personal assistant that includes one or more microphones and one or more loudspeakers and has processing capability and communication capability. An example of a smart speaker is the Amazon Echo. Alternatively, devices 202 and 204 may be devices that do not function as "smart speakers" but still have one or more microphones, processing capability, and communication capability. Examples of such alternative devices include portable wireless loudspeakers, such as a Bose wireless speaker. In some examples, two or more devices in combination (such as an Amazon Echo Dot and a Bose loudspeaker) provide a smart speaker. System 200 also includes a modification system 206. Modification system 206 is configured to derive (or receive) multiple audio signals from the input signals of the microphones in device 202 and/or device 204. Modification system 206 is also configured to calculate a confidence score for each derived audio signal and, based on the confidence scores, to use one audio signal to modify another audio signal. The function of modification system 206 may be part of one or both of devices 202 and 204, and/or part of a separate device that can communicate with devices 202 and 204, and/or a cloud-based device or service. The cloud-based aspect is indicated by network 208. As indicated by line 203, devices 202 and 204 can communicate with each other. In a home environment, this communication is usually (but not necessarily) wireless, for example via Wi-Fi using a router. An alternative is direct wireless or wired communication, such as Bluetooth or a LAN.
One or more microphones of each of devices 202 and 204 detect sound from user 210 (utterances) and/or from noise source 212. Typically, one device picks up the user's utterance more strongly than the other device, and the other device picks up the noise more strongly than the first. There are many ways in which the audio signals from devices 202 and 204 can be processed to calculate a confidence as to whether a signal is based on or includes an utterance, and whether a signal is based on or includes undesired sound (generally referred to herein as "noise"). One such way is to use a voice activity detector (VAD) in each of devices 202 and 204. A VAD can distinguish whether a sound is an utterance. Where system 200 is used to reduce the noise content of an audio signal that includes an utterance, an audio signal based on received sound that does not trigger the VAD is treated as undesired noise, and an audio signal based on received sound that does trigger the VAD is treated as (or at least as including) the desired utterance.
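One way to turn per-device VAD decisions into the confidence described above can be sketched as follows. This is a minimal sketch under stated assumptions: the VAD itself is not implemented and its frame-by-frame decisions are taken as given booleans, the confidence is assumed to be simply the fraction of frames that triggered the VAD, and the names `speech_confidence` and `assign_roles` are hypothetical.

```python
def speech_confidence(vad_flags):
    """Fraction of frames in which the (externally supplied) VAD was triggered."""
    flags = list(vad_flags)
    if not flags:
        return 0.0
    return sum(1 for f in flags if f) / len(flags)


def assign_roles(vad_flags_by_device):
    """Pick the device most likely carrying the utterance and the one most
    likely carrying noise, from per-device VAD decisions.

    Returns (primary, reference, scores), where `primary` has the highest
    speech confidence and `reference` has the lowest.
    """
    scores = {name: speech_confidence(flags)
              for name, flags in vad_flags_by_device.items()}
    primary = max(scores, key=scores.get)
    reference = min(scores, key=scores.get)
    return primary, reference, scores
```

With only one device in the input, the same name is returned for both roles, so a caller would need at least two signals before using one to modify the other.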
As indicated by dashed lines 221-224, in this non-limiting example device 202 is closer to user 210 than it is to noise source 212, and device 204 is closer to noise source 212 than it is to user 210. The system may include the ability to determine whether a device is closer to a desired sound source (for example, the user) or to an undesired sound source (for example, the noise source). Modification system 206 can make this determination. As described above, the determination may be made in any technically feasible manner, such as by comparing the timing at which the microphones detect the sound, by comparing the signal strengths of the derived audio signals, by comparing their signal-to-noise ratios, by comparing their spectral content, or by comparing the reverberation present in them. In many cases, device 202 picks up the utterance from user 210 more strongly than it picks up sound from noise source 212 (because it is closer to user 210), and the opposite holds for device 204. In this case, modification system 206 can determine that device 202 is closer to user 210 and that device 204 is closer to noise source 212. Modification system 206 may calculate the distance between sound source 210 and/or 212 and device 202 and/or 204. Modification system 206 may also calculate the position of sound source 210 and/or 212. In a non-limiting example, the position may be calculated by triangulation.
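Of the comparisons listed above, the comparison of signal strengths is the simplest to illustrate. The sketch below assumes that each device supplies a short, non-silent segment of samples captured while the same source was active, and it treats the device with the highest RMS level as the closest one. The function names are hypothetical, and a real system would also have to account for differences in microphone gain.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_difference_db(segment_a, segment_b):
    """Level of segment_a relative to segment_b in dB.

    A positive value means the first device picked up the source more
    strongly. Both segments are assumed to be non-silent.
    """
    return 20.0 * math.log10(rms(segment_a) / rms(segment_b))


def closest_device(segment_by_device):
    """Name of the device whose segment has the highest RMS level."""
    return max(segment_by_device, key=lambda name: rms(segment_by_device[name]))
```

Running the comparison once on a segment in which the user is speaking and once on a segment that contains only noise would give the two determinations described above: which device is closer to user 210 and which is closer to noise source 212.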
The quality of an audio signal that includes the desired sound (the utterance) can be improved by using the audio signal derived from the noise source to modify the audio signal derived from the source that most strongly receives the utterance. Thus, the audio signal derived from device 204 (which most strongly picks up noise source 212) is used to modify the audio signal derived from device 202 (which most strongly picks up the utterance of user 210). The improvement in signal quality can be achieved by using modification system 206 to filter the utterance-based audio signal by the noise-based audio signal. For example, the audio stream from device 204 can be used as the reference of an adaptive filter applied to the audio stream from device 202, to further reduce the noise that device 202 receives from noise source 212. Adaptive filtering of audio signals is well known in the art and therefore is not discussed further here.
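Although the disclosure does not specify a particular adaptive filter, the idea of using one stream as the reference for filtering another can be illustrated with a normalized least-mean-squares (NLMS) noise canceller, which is one common choice. The sketch below is an assumption-laden illustration, not the patented implementation: the function name `nlms_cancel`, the number of taps, and the step size are hypothetical, and the two streams are assumed to be time-aligned and sampled at the same rate.

```python
def nlms_cancel(primary, reference, taps=8, mu=0.5, eps=1e-8):
    """Reduce noise in `primary` using `reference` as the adaptive-filter input.

    primary:   samples from the device that best picks up the utterance
               (utterance plus a filtered copy of the noise).
    reference: samples from the device that best picks up the noise.
    Returns the error signal, which is the primary signal with the part
    predictable from the reference removed.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        # Current estimate of the noise component present in the primary sample.
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate
        # Normalized LMS update: step size scaled by the reference energy.
        energy = eps + sum(h * h for h in history)
        weights = [w + mu * error * h / energy for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned
```

In this arrangement the utterance is left largely intact only to the extent that it does not also appear in the reference stream, which is why the reference is taken from the device that picks up the noise most strongly.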
In this example, devices 202 and 204 may be located at different positions in a common area, such as a room in a home or a business conference room. In one case, a common area can be considered to be any area in which devices 202 and 204 both pick up some sound from noise source 212. When devices 202 and 204 are smart speakers, or other devices that include one or more microphones together with processing capability and communication capability, user 210 may speak a command intended for one or both of devices 202 and 204. At the same time, a television or refrigerator may be running, or one of devices 202 and 204 may be playing music. Any such non-speech sound (referred to as "noise") may interfere with proper reception and use of the voice command. Reducing the noise in the desired signal (the signal that carries the utterance or voice command) therefore helps improve the functioning of the smart speaker or other device that most strongly receives the utterance.
The multiple (two or more) microphones at different locations may include one or more microphones of each of two or more distinct devices (for example, two devices that each have one or more microphones), or may include multiple microphones of a single device. In the first case, the multiple microphones of each device can be spatially focused on a desired sound source (the user or the noise source), for example by beamforming. When a single device includes the multiple microphones that are used, beamforming can be used to point one beam at the noise source and a different beam at the target source (the user). When the same microphones are used for both beams, the beams may be formed sequentially, or, if the device has a sufficient number of microphones, the beams may be formed in parallel.
In the situation shown in Fig. 2, devices 202 and 204 can each communicate wirelessly with each other and with modification system 206. In many cases, the processing of system 206 is performed using one of devices 202 or 204, so there is no separate device that includes system 206. Another alternative is to implement system 206 in a remote device (for example, in the cloud 208). In one scenario, device 204, which picks up the noise, streams its processed audio signal to device 202. Device 202 then uses the incoming noise-based audio stream as the reference in an adaptive filter to reduce the noise content of the audio signal from device 202, which includes the desired utterance.
Embodiments of the systems and methods described above include computer components and computer-implemented steps that will be apparent to those skilled in the art. For example, those skilled in the art will appreciate that instructions for executing the computer-implemented steps may be stored as computer-executable instructions on a computer-readable medium (such as a floppy disk, hard disk, optical disc, flash ROM, non-volatile ROM, or RAM). In addition, those skilled in the art will appreciate that the computer-executable instructions may be executed on a variety of processors, such as microprocessors, digital signal processors, gate arrays, and the like. For ease of description, not every step or element of the systems and methods described above is described herein as part of a computer system, but those skilled in the art will recognize that each step or element may have a corresponding computer system or software component. Such computer systems and/or software components are therefore enabled by the description of their corresponding steps or elements (that is, their functionality), and are within the scope of this disclosure.
A number of embodiments have been described. It will be understood, however, that additional modifications may be made without departing from the scope of the inventive concepts described herein, and accordingly, other embodiments are within the scope of the following claims.

Claims (24)

1. A system, comprising:
a plurality of microphones positioned at different locations; and
a modification system in communication with the microphones and configured to:
derive a plurality of audio signals from the plurality of microphones,
calculate a confidence score for each derived audio signal, and
based on the calculated confidence scores, use one derived audio signal to modify another audio signal.
2. The system according to claim 1, wherein calculating the confidence score for each derived audio signal comprises calculating a confidence as to whether the derived audio signal includes speech and whether the derived audio signal includes non-speech sound.
3. The system according to claim 1, wherein calculating the confidence score for each derived audio signal comprises determining whether the derived audio signal is a speech signal.
4. The system according to claim 1, wherein using one derived audio signal to modify another audio signal comprises filtering a first audio signal by a second audio signal.
5. The system according to claim 4, wherein filtering the first audio signal by the second audio signal comprises using the second audio signal as a reference of an adaptive filter for the first audio signal.
6. The system according to claim 1, wherein the number of derived audio signals differs from the number of microphones.
7. The system according to claim 1, wherein at least one of the microphones comprises a microphone array.
8. The system according to claim 7, wherein a first microphone array is spatially focused on a first sound target.
9. The system according to claim 8, wherein a second microphone array is spatially focused on a second sound target.
10. The system according to claim 9, wherein the first sound target comprises a human voice.
11. The system according to claim 10, wherein the second sound target comprises a noise source.
12. The system according to claim 1, wherein a first microphone is part of a first device and a second microphone is part of a second device, and wherein a first audio signal is derived from the first microphone and a second audio signal is derived from the second microphone.
13. The system according to claim 12, wherein the second device transmits the second audio signal to the first device.
14. The system according to claim 13, wherein the first device uses the second audio signal to modify the first audio signal.
15. The system according to claim 14, wherein the first device uses the second audio signal to reduce noise in the first audio signal.
16. The system according to claim 1, wherein a first microphone and a second microphone are both part of a first device.
17. The system according to claim 16, wherein a first audio signal is derived from the first microphone and a second audio signal is derived from the second microphone.
18. The system according to claim 17, wherein the second audio signal is used to reduce noise in the first audio signal.
19. The system according to claim 1, wherein the plurality of microphones are part of a first device.
20. The system according to claim 19, wherein the first device spatially focuses the plurality of microphones of the first device on a first separate sound source and a second separate sound source, wherein a first audio signal is derived from the first sound source and a second audio signal is derived from the second sound source.
21. The system according to claim 20, wherein the second audio signal is used to reduce noise in the first audio signal.
22. A system, comprising:
a plurality of microphones positioned at different locations, wherein a first microphone is part of a first device and a second microphone is part of a second device;
wherein the first device is operated to derive a first audio signal from the first microphone, the second device is operated to derive a second audio signal from the second microphone, and the second device is adapted to transmit the second audio signal to the first device; and
a modification system that is part of the first device and is responsive to the first audio signal and the second audio signal, wherein the modification system uses the second audio signal to reduce noise in the first audio signal.
23. A system, comprising:
a plurality of microphones that are part of a first device, including a first microphone and a second microphone;
wherein the first device is operated to derive a first audio signal from the first microphone and a second audio signal from the second microphone; and
a modification system that is part of the first device and is responsive to the first audio signal and the second audio signal, wherein the modification system uses the second audio signal to reduce noise in the first audio signal.
24. A system, comprising:
a plurality of microphones that are part of a first device;
wherein the first device spatially focuses the plurality of microphones of the first device on a first separate sound source and a second separate sound source, wherein a first audio signal is derived from the first sound source and a second audio signal is derived from the second sound source;
wherein the first device is operated to derive the first audio signal from the first sound source and the second audio signal from the second sound source; and
a modification system that is part of the first device and is responsive to the first audio signal and the second audio signal, wherein the modification system uses the second audio signal to reduce noise in the first audio signal.
CN201780075396.8A 2017-09-25 2017-09-25 Handle the voice from distributed microphone Pending CN110121744A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2017/053177 WO2019059939A1 (en) 2017-09-25 2017-09-25 Processing speech from distributed microphones

Publications (1)

Publication Number Publication Date
CN110121744A true CN110121744A (en) 2019-08-13

Family

ID=60043303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780075396.8A Pending CN110121744A (en) 2017-09-25 2017-09-25 Handle the voice from distributed microphone

Country Status (4)

Country Link
EP (1) EP3539128A1 (en)
JP (1) JP2019537071A (en)
CN (1) CN110121744A (en)
WO (1) WO2019059939A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111048067A (en) * 2019-11-11 2020-04-21 云知声智能科技股份有限公司 Microphone response method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140301558A1 (en) * 2013-03-13 2014-10-09 Kopin Corporation Dual stage noise reduction architecture for desired signal extraction

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008236077A (en) * 2007-03-16 2008-10-02 Kobe Steel Ltd Target sound extracting apparatus, target sound extracting program
US9113240B2 (en) * 2008-03-18 2015-08-18 Qualcomm Incorporated Speech enhancement using multiple microphones on multiple devices
JP5724125B2 (en) * 2011-03-30 2015-05-27 株式会社国際電気通信基礎技術研究所 Sound source localization device
JP5958218B2 (en) * 2011-09-15 2016-07-27 株式会社Jvcケンウッド Noise reduction device, voice input device, wireless communication device, and noise reduction method
US10229697B2 (en) * 2013-03-12 2019-03-12 Google Technology Holdings LLC Apparatus and method for beamforming to obtain voice and noise signals
US10026399B2 (en) * 2015-09-11 2018-07-17 Amazon Technologies, Inc. Arbitration between voice-enabled devices

Also Published As

Publication number Publication date
JP2019537071A (en) 2019-12-19
WO2019059939A1 (en) 2019-03-28
EP3539128A1 (en) 2019-09-18

Similar Documents

Publication Publication Date Title
US10149049B2 (en) Processing speech from distributed microphones
CN109155130A (en) Handle the voice from distributed microphone
EP3122066B1 (en) Audio enhancement via opportunistic use of microphones
US11922095B2 (en) Device selection for providing a response
CN107465974B (en) Sound signal detector
CN107465970B (en) Apparatus for voice communication
US10089980B2 (en) Sound reproduction method, speech dialogue device, and recording medium
JP2023542968A (en) Hearing enhancement and wearable systems with localized feedback
US20210225374A1 (en) Method and system of environment-sensitive wake-on-voice initiation using ultrasound
CN113228710A (en) Sound source separation in hearing devices and related methods
CN110121744A (en) Handle the voice from distributed microphone
EP4184507A1 (en) Headset apparatus, teleconference system, user device and teleconferencing method
US11882415B1 (en) System to select audio from multiple connected devices
US20230035531A1 (en) Audio event data processing
WO2023010011A1 (en) Processing of audio signals from multiple microphones
WO2023010012A1 (en) Audio event data processing
EP4005249A1 (en) Estimating user location in a system including smart audio devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190813