CN115868178A - Audio system and method for voice activity detection

Info

Publication number
CN115868178A
CN115868178A
Authority
CN
China
Prior art keywords
signal, combination, user, comparing, frequency band
Prior art date
Legal status
Pending (assumed status; not a legal conclusion)
Application number
CN202180050288.1A
Other languages
Chinese (zh)
Inventor
D·G·莫顿
P·托雷斯
姚翔恩
Current Assignee
Bose Corp
Original Assignee
Bose Corp
Application filed by Bose Corp
Publication of CN115868178A

Classifications

    • H04R1/1083 Earpieces/earphones: reduction of ambient noise
    • G10L21/0232 Noise filtering for speech enhancement: processing in the frequency domain
    • G10L25/78 Detection of presence or absence of voice signals
    • H04R3/005 Circuits for combining the signals of two or more microphones
    • G10L2021/02166 Microphone arrays; beamforming
    • G10L2025/786 Voice detection by threshold decision: adaptive threshold
    • H04R1/406 Directional characteristic obtained by combining a number of identical microphone transducers
    • H04R2201/107 Monophonic and stereophonic headphones with microphone for two-way hands-free communication
    • H04R2430/03 Synergistic effects of band splitting and sub-band processing

Abstract

Audio systems, methods, and processor instructions are provided that detect voice activity of a user and provide an output voice signal. The systems, methods, and instructions receive a plurality of microphone signals and combine the plurality of microphone signals according to a first combination and a second combination. The first combination produces a main signal with an enhanced response in the direction of the user's mouth, and the second combination produces a reference signal with a reduced response in the direction of the user's mouth. The main signal and the reference signal are added and subtracted to produce a sum signal and a difference signal, respectively. The sum signal is compared to the difference signal, and an output speech signal is provided based on the comparison.

Description

Audio system and method for voice activity detection
Background
Various audio devices, such as headphones and earphones, are used in numerous environments for a variety of purposes, examples of which include entertainment (such as gaming or listening to music), productivity (such as telephone calls), and professional uses (such as airline communications or studio monitoring), among others. Different environments and purposes may have different requirements for fidelity, noise isolation, noise reduction, voice pick-up, and the like. Various echo and noise cancellation and reduction systems and methods, as well as other processing systems and methods, may be included to improve the accuracy of communication when providing an output signal of the user's voice or speech.
Some such systems and methods exhibit improved performance when they have a reliable indication that the user of the device is actively speaking. For example, certain systems and methods may change various processes, such as filter coefficients, adaptation rates, or reference signal selection, upon reliably determining that the user is speaking. The enhanced capabilities of these systems and methods may allow the user's speech to be more clearly separated or isolated from other noise in the output audio signal, further enabling applications such as voice communication and speech recognition, including speech recognition for communication, e.g., a voice-to-text application for Short Message Service (SMS) messaging or a Virtual Personal Assistant (VPA) application.
Accordingly, there is a need to reliably detect when a user is speaking, and the present application relates to such detection, referred to herein generally as Voice Activity Detection (VAD).
Disclosure of Invention
Aspects and examples relate to audio systems and methods that pick up a user's voice from one or more microphone signals and reduce other acoustic components (such as background noise and other talkers), enhancing the user's voice component relative to the other acoustic components. More particularly, aspects and examples relate to methods and systems for reliably detecting when a user is speaking, i.e., voice activity detection.
According to one aspect, a method of detecting voice activity of a user is provided and includes: receiving a plurality of microphone signals; combining the plurality of microphone signals according to a first combination to produce a main signal having an enhanced response in a direction of the user's mouth; combining the plurality of microphone signals according to a second combination to produce a reference signal having a reduced response in a direction of the user's mouth; adding the main signal and the reference signal to produce a sum signal; subtracting one of the main signal or the reference signal from the other of the main signal or the reference signal to produce a difference signal; comparing the sum signal to the difference signal; and providing an output speech signal based on the comparison.
In various examples, the first combination may be a Minimum Variance Distortionless Response (MVDR) combination. The second combination may be a delay and subtract combination.
According to some examples, comparing the sum signal to the difference signal includes determining at least one of an energy, amplitude, or envelope of each of the sum signal and the difference signal and comparing those quantities. The comparison may further include comparing a ratio or difference of the quantities to a threshold, or multiplying the energy, amplitude, or envelope of one signal by a factor and comparing the result to the corresponding energy, amplitude, or envelope of the other signal.
In various examples, comparing the sum signal to the difference signal includes comparing the sum signal to the difference signal in a first frequency band and in a second frequency band, the second frequency band being different from the first frequency band. In some examples, the first frequency band may include frequencies in the range of 200Hz-400Hz, and the second frequency band may include frequencies in the range of 500Hz-700 Hz.
Some examples may include processing a speech signal with an adaptive filter and modifying the adaptive filter based on the comparison. Altering the adaptive filter may include changing coefficients of the adaptive filter, changing an adaptation rate, changing a step size, freezing the adaptation, or disabling the adaptive filter.
According to another aspect, an audio system is provided that includes a plurality of microphones and a controller coupled to the plurality of microphones. The controller is configured to: receive a plurality of microphone signals from the plurality of microphones; combine the plurality of microphone signals according to a first combination to produce a main signal having an enhanced response in a direction of the user's mouth; combine the plurality of microphone signals according to a second combination to produce a reference signal having a reduced response in a direction of the user's mouth; add the main signal and the reference signal to produce a sum signal; subtract one of the main signal or the reference signal from the other of the main signal or the reference signal to produce a difference signal; compare the sum signal with the difference signal; and provide an output speech signal based on the comparison.
In some examples, the first combination may be a Minimum Variance Distortionless Response (MVDR) combination and the second combination may be a delay and subtract combination.
In various examples, comparing the sum signal to the difference signal includes determining at least one of an energy, amplitude, or envelope of each of the sum signal and the difference signal and comparing the at least one of an energy, amplitude, or envelope of the sum signal and the difference signal.
In various examples, comparing the sum signal to the difference signal includes comparing the sum signal to the difference signal in a first frequency band and in a second frequency band, the second frequency band different from the first frequency band. For example, in some examples, the first frequency band may include frequencies in the range of 200Hz-400Hz, and the second frequency band may include frequencies in the range of 500Hz-700 Hz.
In some examples, providing the speech signal based on the comparison may include processing the speech signal with an adaptive filter and altering the adaptive filter based on the comparison. Altering the adaptive filter may include changing coefficients of the adaptive filter, changing an adaptation rate, changing a step size, freezing the adaptation, or disabling the adaptive filter.
According to yet another aspect, a non-transitory computer-readable medium is provided having instructions encoded thereon that, when executed by a suitable processor (or processors), cause the processor to perform a method comprising: receiving a plurality of microphone signals; combining the plurality of microphone signals according to a first combination to produce a main signal having an enhanced response in a direction of the user's mouth; combining the plurality of microphone signals according to a second combination to produce a reference signal having a reduced response in a direction of the user's mouth; adding the main signal and the reference signal to produce a sum signal; subtracting one of the main signal or the reference signal from the other of the main signal or the reference signal to produce a difference signal; comparing the sum signal to the difference signal; and providing an output speech signal based on the comparison.
In various examples, the first combination may be a Minimum Variance Distortionless Response (MVDR) combination. The second combination may be a delay and subtract combination.
According to some examples, comparing the sum signal to the difference signal includes determining at least one of an energy, amplitude, or envelope of each of the sum signal and the difference signal and comparing those quantities. The comparison may further include comparing a ratio or difference of the quantities to a threshold, or multiplying the energy, amplitude, or envelope of one signal by a factor and comparing the result to the corresponding energy, amplitude, or envelope of the other signal.
In various examples, comparing the sum signal to the difference signal includes comparing the sum signal to the difference signal in a first frequency band and in a second frequency band, the second frequency band being different from the first frequency band. In some examples, the first frequency band may include frequencies in the range of 200Hz-400Hz, and the second frequency band may include frequencies in the range of 500Hz-700 Hz.
Some examples may include processing a speech signal with an adaptive filter and modifying the adaptive filter based on the comparison. Altering the adaptive filter may include changing coefficients of the adaptive filter, changing an adaptation rate, changing a step size, freezing the adaptation, or disabling the adaptive filter.
Other aspects, examples, and advantages of these exemplary aspects and examples are discussed in detail below. Examples disclosed herein may be combined with other examples in any manner consistent with at least one of the principles disclosed herein, and references to "an example," "some examples," "an alternative example," "various examples," "one example," etc., are not necessarily mutually exclusive and are intended to mean that a particular feature, structure, or characteristic described may be included in at least one example. The appearances of such terms herein are not necessarily all referring to the same example.
Drawings
Various aspects of at least one example are discussed below with reference to the accompanying drawings, which are not intended to be drawn to scale. The accompanying drawings are included to provide illustration and a further understanding of the various aspects and examples, and are incorporated in and constitute a part of this specification, but are not intended as a definition of the limits of the invention. In the drawings, like or nearly like components that are illustrated in various figures may be represented by like numerals. For purposes of clarity, not every component may be labeled in every drawing. In the drawings:
FIG. 1 is a perspective view of an exemplary earbud;
FIG. 2 is a schematic illustration of an environment in which the exemplary earbud of FIG. 1 may be used;
FIG. 3 is a schematic diagram of an exemplary noise reduction system for enhancing a user's speech signal among other acoustic signals;
FIG. 4 is a schematic diagram of an exemplary system for detecting voice activity of a user;
FIG. 5 is a schematic diagram of another exemplary system for detecting voice activity of a user; and
FIG. 6 is a flow diagram of an exemplary voice activity detection method.
Detailed Description
Aspects of the present disclosure relate to audio systems and methods that support picking up the voice of a user (e.g., a wearer) of a headphone, earphone, or the like by reliably detecting the user's voice activity (e.g., detecting when the user speaks). Conventional Voice Activity Detection (VAD) systems and methods may receive or construct a primary signal configured or arranged to include a user speech component, and receive or construct a reference signal configured or arranged to exclude (or reduce) the user speech component. The signal envelope, amplitude, or energy of the primary signal is compared to that of the reference signal, and if the primary signal exceeds a threshold relative to the reference signal, it is determined that the user is speaking. Such systems and methods typically output a binary flag (e.g., VAD = 0, 1) to indicate whether the user is speaking. The flag may advantageously be applied in other parts of the audio system, for example, to freeze or slow the adaptation of an adaptive filter of the system's noise cancellation and/or echo cancellation processing. The application of VAD indications may encompass numerous other actions or effects that are outside the scope of this disclosure but will be apparent to those skilled in the art.
Conventional VAD systems and methods such as those described above may suffer reduced performance when the audio system is near a boundary condition (e.g., an acoustically reflective environment, such as a nearby wall, and/or the user's arms, hands, etc. placed near the headphone or earphone). In essence, acoustic reflections of the user's speech from boundary conditions may enter the reference signal, thus reducing the differential signal energy between the main signal (intended to include the user's speech) and the reference signal (intended to exclude the user's speech). Aspects and examples described herein accommodate this phenomenon and enhance the reliability of voice activity detection when a user approaches or creates boundary conditions (e.g., acoustically reflective objects or surfaces in relative proximity).
Obtaining a user speech signal with reduced noise components and/or echo components may enhance speech-based features or functions that can be provided as part of an audio system or other associated equipment, such as communication systems (cellular, radio, aeronautical), entertainment systems (gaming), voice recognition applications (voice-to-text, virtual personal assistants), and other systems and applications that process audio, particularly voice or speech. Examples disclosed herein may be coupled to or connected with other systems by wired or wireless means, or may be independent of other systems or equipment.
Headphones, earphones, headsets, and various other personal audio system form factors (e.g., in-ear transducers, earbuds, neck- or shoulder-worn devices, and other head-worn devices with integrated audio, such as eyeglasses) are consistent with the various aspects and examples herein.
In general, acoustic reflections from nearby environmental boundaries (e.g., surfaces and objects) may cause a significant degradation of conventional VAD performance in a single-sided (e.g., left or right) audio system as compared to a binaural (left and right) audio system, because the additional signal information available between the left and right sides is absent in the single-sided case. Accordingly, the aspects and examples disclosed herein may be particularly suitable for single-sided audio systems and methods. Nonetheless, the described aspects and examples are also applicable to binaural systems and methods.
Examples disclosed herein may be combined with other examples in any manner consistent with at least one of the principles disclosed herein, and references to "an example," "some examples," "an alternative example," "various examples," "one example," etc. are not necessarily mutually exclusive and are intended to mean that a particular feature, structure, or characteristic described may be included in at least one example. The appearances of such terms herein are not necessarily all referring to the same example.
It is to be understood that the examples of the methods and apparatus discussed herein are not limited in their application to the details of construction and the arrangement of components set forth in the following description or illustrated in the drawings. The methods and apparatus can be implemented in other examples and can be operated or performed in various ways. Examples of specific implementations are provided herein for illustrative purposes only and are not intended to be limiting. Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising," "having," "containing," "involving," and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. References to "or" may be understood to be inclusive such that any term described using "or" may indicate any single one, more than one, or all of the stated terms. Any reference to front and back, right and left, top and bottom, upper and lower, and vertical and horizontal is for convenience of description, and is not intended to limit the present systems and methods or their components to any one positional or spatial orientation.
Fig. 1 shows an example of an earbud 100 comprising an ear tip 110, an acoustic transducer (a loudspeaker, internal and therefore not shown) for producing an acoustic output from, e.g., an audio signal, and one or more microphones 120. While the example earbud 100 is shown for a right ear, a left-ear example can also be provided, for example, in a symmetrical or mirror-image manner, and/or various examples can include a pair of left and right earbuds. In general, the ear tip 110 includes an acoustic channel and an end having a feature, such as an "umbrella," configured to provide a degree of acoustic sealing near the ear canal of a user (e.g., wearer) of the earbud 100. The ear tip also includes a retention and stabilization feature, such as two arms connected at the distal end, to retain the earbud 100 in the ear of the user when in use. Other examples may include different support structures to hold one or more earpieces near the user's ears, for example, an open-ear audio device that may be incorporated into eyeglasses or other devices and/or structures worn on or around the head, neck, and/or ears.
The earbud 100 is shown with two microphones 120, a more forward microphone 120F and a more rearward microphone 120R (collectively, 120). In other examples, more microphones may be included and may be arranged at various locations. The microphones 120 are located at differing positions so that they do not receive identical acoustic signals. Various combinations of two or more microphone signals may be advantageously compared to detect whether a user is speaking, to provide a speech signal representative of the user's speech, to remove or reduce noise components and/or echo components from the speech signal, and for various other signal processing and/or communication functions and features.
Although a microphone is shown and labeled with a reference numeral, in some examples the visual element shown in the figures may represent an acoustic port where acoustic signals enter to ultimately reach a microphone that is internal and not physically visible from the outside. In examples, one or more of the microphones 120 may be immediately interior to the acoustic port or may be displaced some distance from the acoustic port, with an acoustic waveguide between the acoustic port and the associated microphone.
The signals from the microphones 120 are combined in various ways to advantageously steer the beams and nulls in a manner that maximizes the user's voice in one instance to provide a primary signal and minimizes the user's voice in another instance to provide a reference signal. Thus, the reference signal may represent ambient noise and may be provided as a reference to the adaptive filter of the noise reduction subsystem. Such a noise reduction system may modify the main signal to reduce components related to the reference signal (e.g., a noise-related signal), and the noise reduction subsystem provides an output signal that approximates the user's speech signal, with reduced noise content.
In various examples, signals may be advantageously processed in different sub-bands to enhance the effectiveness of noise reduction or other signal processing. The generation of a signal in which the user speech component is enhanced and the other components are reduced is generally referred to herein as speech pickup, speech selection, speech isolation, voice enhancement, and the like. As used herein, the terms "speech," "voice," "conversation," and variations thereof may be used interchangeably without regard to whether such voice involves the use of vocal cords.
Fig. 2 illustrates an exemplary environment 200 in which a user 210 (shown as a top view of the user's head) may wear an audio device (such as the earbud 100) near an acoustically reflective surface 220 (such as a wall). For certain acoustic frequencies, especially frequencies at which the distance d (230) of the earbud 100 from the reflective surface 220 is less than a quarter wavelength, indirect acoustic energy reflected from the acoustically reflective surface 220 may arrive substantially in phase with direct acoustic energy reaching the microphones 120. Accordingly, when various signal processing of one or more microphone signals, or combinations of microphone signals, depends on the directionality of various components in the microphone signals, such signal processing may exhibit reduced performance. For example, voice activity detectors, noise reduction systems, echo reduction systems, and the like, particularly those that rely on combinations of microphone signals to enhance or reduce acoustic signals from certain directions (e.g., beamformers and null formers, or array processing generally), may exhibit reduced performance, such as when signal content intended to be excluded by such a combination is instead included because it is reflected by the reflective surface 220. In various examples, an acoustically reflective surface (such as the reflective surface 220) may be a wall, a corner, a half wall, furniture or another object, a headrest, or the user's hands (such as when making a gesture, reaching for the earbud 100, or placing the hands behind the head).
FIG. 3 is a block diagram of an exemplary noise reduction system 300 that processes microphone signals to produce an output signal in which the user's speech components are enhanced relative to background noise and other talkers. A set of multiple microphones 302 (such as the microphones 120 of figs. 1-2) converts acoustic energy into electronic signals 304 and provides the signals 304 to each of two array processors 306, 308. The signals 304 may be in analog form; alternatively, one or more analog-to-digital converters (ADCs) (not shown) may first convert the microphone outputs so that the signals 304 are in digital form. The array processors 306, 308 apply array processing techniques, such as phased-array and delay-and-add techniques, and may utilize Minimum Variance Distortionless Response (MVDR) and Linearly Constrained Minimum Variance (LCMV) techniques, to adjust the responsiveness of the set of microphones 302 so as to enhance or reject acoustic signals from various directions.
Beamforming enhances acoustic signals from a particular direction or range of directions, while null forming reduces or rejects acoustic signals from a particular direction or range of directions. The first array processor 306 is a beamformer that maximizes the acoustic response of the set of microphones 302 in the direction of the user's mouth (e.g., pointing in front of and below the earbud 100) and provides a main signal 310. Due to the beamforming array processor 306, the main signal 310 includes higher user speech signal energy than any of the individual microphone signals 304. The main signal 310, as the output of the first array processor 306, may be considered equivalent to the output of a directional microphone directed at the user's mouth.
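By way of illustration only (the following sketch is not part of the patent disclosure), MVDR weights of the kind the first array processor 306 might apply can be computed per frequency bin as w = R^-1 d / (d^H R^-1 d); the steering vector, noise covariance, and all numerical values below are assumed for the example.

    import numpy as np

    def mvdr_weights(R, d):
        # MVDR beamformer weights for one frequency bin:
        #   w = R^-1 d / (d^H R^-1 d)
        # R: (M, M) noise covariance across M microphones
        # d: (M,) steering vector toward the user's mouth (assumed known)
        Rinv_d = np.linalg.solve(R, d)
        return Rinv_d / (np.conj(d) @ Rinv_d)

    # Illustrative two-microphone example at a single analysis frequency.
    fs, f = 16000.0, 300.0        # sample rate and bin frequency (assumed)
    tau = 1e-4                    # assumed extra travel time to the rear mic (s)
    d = np.array([1.0, np.exp(-2j * np.pi * f * tau)])
    R = np.eye(2) + 0.1 * np.ones((2, 2))   # toy noise covariance estimate
    w = mvdr_weights(R, d)
    x = np.array([1.0 + 0.2j, 0.9 + 0.1j])  # one toy STFT frame (front, rear)
    main_bin = np.conj(w) @ x               # beamformer output for this bin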
The second array processor 308 is a null former that directs a null toward the user's mouth and provides a reference signal 312. Because the null points at the user's mouth, the reference signal 312 includes minimal (if any) signal energy from the user's speech. Thus, the reference signal 312 is substantially composed of components due to background noise and other sound sources that are not the user's voice; that is, the reference signal 312 is related to the acoustic environment apart from the user's speech. The reference signal 312, as output by the second array processor 308, may be considered equivalent to the output of a microphone directed at the surrounding environment (anywhere except the user's mouth).
The main signal 310 includes a user speech component and a noise component (e.g., background noise, other talkers, etc.), while the reference signal 312 includes substantially only a noise component under normal circumstances. If the reference signal 312 were exactly identical to the noise component of the main signal 310, the noise component could be removed by simply subtracting the reference signal 312 from the main signal 310. In practice, however, the reference signal 312 is related to and indicative of the noise component of the main signal 310 but not exactly equal to it, as will be understood by those skilled in the art. Thus, adaptive filtering may be used to remove at least some of the noise components from the main signal 310 by using the reference signal 312 as an indication of the noise components.
Many adaptive filter methods known in the art are designed to remove components associated with the reference signal. For example, some examples include Normalized Least Mean Square (NLMS) adaptive filters. The output of the adaptive filter 314 is a speech estimate signal 316 that represents an approximation of the user's speech signal.
Exemplary adaptive filters 314 may include various types in conjunction with various adaptive techniques (e.g., NLMS). The operation of an adaptive filter typically involves a digital filter that receives a reference signal related to an unwanted component of the main signal. The digital filter attempts to generate, from the reference signal, an estimate of the unwanted component of the main signal. By definition, the unwanted component of the main signal is a noise component, so the digital filter's estimate of it is a noise estimate. If the digital filter generates a good noise estimate, the noise component can be effectively removed from the main signal by simply subtracting the noise estimate. On the other hand, if the digital filter does not generate a good estimate of the noise component, such subtraction may be ineffective or may degrade the main signal, e.g., increase the noise. Thus, an adaptive algorithm operates in parallel with the digital filter and adjusts the digital filter, for example, by changing weights or filter coefficients. In some examples, the adaptive algorithm may monitor the main signal when it is known to contain only a noise component (i.e., when the user is not speaking) and adjust the digital filter to generate a noise estimate that matches the main signal under that condition. The adaptive algorithm may determine when the user is not speaking through various means. In at least one example, the system enforces a pause or silence period after speech enhancement is triggered. For example, the user may be required to press a button or speak a wake-up command and then pause until the system indicates that it is ready. During the required pause, the adaptive algorithm monitors the main signal, which does not include any user speech, and adapts the filter to the background noise. Then, when the user speaks, the digital filter generates a good noise estimate that is subtracted from the main signal to generate a speech estimate, e.g., the speech estimate signal 316.
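As an informal illustration of the adaptive filtering described above, a time-domain NLMS noise canceller may be sketched as follows; this is a minimal sketch under assumed parameters (filter length, step size), not the patent's implementation.

    import numpy as np

    def nlms_noise_cancel(primary, reference, taps=64, mu=0.1, eps=1e-8):
        # primary:   main signal = user speech + noise
        # reference: noise-correlated reference with little or no user speech
        # Returns a speech estimate: primary minus the adaptive noise estimate.
        w = np.zeros(taps)                       # digital filter coefficients
        speech_est = np.zeros(len(primary))
        for n in range(taps, len(primary)):
            x = reference[n - taps:n][::-1]      # recent reference samples
            noise_est = w @ x                    # filter's noise estimate
            e = primary[n] - noise_est           # error = speech estimate
            speech_est[n] = e
            w += mu * e * x / (x @ x + eps)      # NLMS coefficient update
        return speech_est

In such a sketch, freezing adaptation when the user speaks amounts to skipping the coefficient update while a VAD flag is set.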
Additionally, and according to examples herein, a voice activity detector 400, 500 (VAD) may be operable to detect when a user is speaking or not speaking. Figs. 4 and 5 each illustrate the operation of an exemplary voice activity detection algorithm. In the example of fig. 4, two microphones 120 are used, but in other examples additional microphones may be used. Similar to the noise reduction system 300 of fig. 3, the VAD 400 combines the microphone signals 404 according to a first combination 406 to produce a main signal 410 and combines the microphone signals according to a second combination 408 to produce a reference signal 412. In some examples, the main signal 410 may be, but is not required to be, the same signal as the main signal 310. Likewise, in some examples, the reference signal 412 may be, but is not required to be, the same signal as the reference signal 312.
The first combination 406 may be an array process that combines the microphone signals 404 to have an enhanced response in the direction of the user's mouth, thereby producing a main signal 410 with an enhanced speech component when the user speaks. According to some examples, the first combination 406 may be an MVDR beamformer. The main signal 410, as the output of the first combination 406, may be considered equivalent to the output of a directional microphone directed at the user's mouth.
The second combination 408 may be an array process that combines the microphone signals 404 to have a reduced response in the direction of the user's mouth, thereby producing a reference signal 412 having a reduced speech component (and thus an enhanced noise component representative of the surrounding environment). In some examples, the second combination 408 may be a null former having a null (or low) response in the direction of the user's mouth. The reference signal 412, which is the output of the second combination 408, can be considered equivalent to the output of a microphone pointed at the ambient environment (anywhere except the user's mouth).
According to at least one example, the second combination 408 may be a delay-and-subtract combination of the microphone signals 404. Referring to the earbud 100 of figs. 1 and 2, the front microphone 120F is closer to the user's mouth than the rear microphone 120R when the earbud is properly worn. Thus, the user's voice reaches the front microphone 120F first and the rear microphone 120R afterward. Delaying the signal from the front microphone 120F by an appropriate amount of time (to time-align the two microphone signals) and subtracting one of the resulting signals from the other can therefore cancel the user's speech component. Thus, in this example, the reference signal 412 has a reduced user speech component.
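A minimal sketch of such a delay-and-subtract combination follows; the delay value is an assumed, fixed mouth-geometry delay, whereas a real device might calibrate it.

    import numpy as np

    def delay_and_subtract(front, rear, delay_samples):
        # Delay the front (mouth-closer) microphone so the user's speech
        # time-aligns with the rear microphone, then subtract; speech from
        # the mouth direction largely cancels, leaving a noise reference.
        # Assumes delay_samples >= 1 (an assumed mouth-geometry delay).
        delayed = np.concatenate([np.zeros(delay_samples),
                                  front[:-delay_samples]])
        return rear - delayed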
With continued reference to the VAD 400 of fig. 4, a comparator 414 compares the main signal 410 with the reference signal 412. When the user is not speaking, the main signal 410 and the reference signal 412 may have some relationship to each other, e.g., their relative energies may be substantially constant; but if the user starts speaking, the energy in the main signal 410 may increase significantly (because it includes the user's speech) while that of the reference signal 412 may not (because it rejects the user's speech). In a sense, the reference signal 412 may indicate the acoustic environment (e.g., its noise level), from which the comparator 414 may "expect" a baseline signal level in the main signal; if the main signal 410 exceeds the baseline level, it is likely because the user is speaking. Thus, the comparator 414 can determine whether the user is speaking and provide an output 416 indicating detected (or undetected) voice activity. According to various examples, the output 416 may have two states (e.g., a logical one or zero) to indicate whether the user is speaking. Other examples may provide other forms of the output 416.
According to various examples, the comparator 414 may compare any one or more of the energy, amplitude, envelope, or other properties of the signals being compared. Further, the comparator 414 may compare the signals to each other and/or may compare the threshold to any of the signals and/or to any of the ratios or differences of the signals (e.g., ratios or differences of the energy, amplitude, envelope, etc. of the signals). In various examples, comparator 414 may include smoothing, time averaging, or low pass filtering of the signal. In various examples, comparator 414 may compare within a limited frequency band or sub-band.
In some examples, it may be desirable for the comparator 414 to take a ratio of signal energies (or amplitudes, envelopes, etc.) and compare the ratio to a threshold. Instead of strictly calculating the ratio (which may consume significant computational resources), some examples may equivalently scale one of the signal properties by multiplying it by a factor and then compare the scaled property directly with the comparable property of the other signal. For example, in some examples, when the signal energy of the main signal 410 exceeds the energy of the reference signal 412 by some amount, say 20%, the comparator 414 may determine that the user is speaking and output VAD = 1 (voice detected). In some examples, the comparator 414 may determine the signal energies, calculate their ratio, and compare the ratio to a threshold of 1.2 (e.g., representing 20% higher). In other examples, however, the comparator 414 may equivalently multiply one of the signal energies by 1.2 and compare the result directly with the other signal energy; the computational cost of a multiplication may be lower than that of computing the ratio of two signal energies.
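A trivial sketch of that multiply-and-compare form follows; the 1.2 factor reflects the illustrative 20% margin above and is not a prescribed value.

    def voice_detected(main_energy, ref_energy, factor=1.2):
        # Equivalent to (main_energy / ref_energy) > factor, without the
        # division; factor = 1.2 encodes the illustrative 20% margin.
        return main_energy > factor * ref_energy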
The ability to detect voice activity may serve as a central control for various audio systems, in particular audio systems that include voice pick-up and other processing to provide an outgoing user voice signal. For example, an audio system may include one or more subsystems that perform adaptive processing when the user is not speaking but need to freeze adaptation when the user begins speaking (e.g., the noise reduction system 300 of fig. 3). The various subsystems may alter their operation differently depending on whether the user is speaking and/or may terminate their operation when the user is speaking. For example, in some examples, when the user is not speaking, the outgoing user voice signal may be paused, such as by operating in a half-duplex mode to save energy and/or bandwidth; the VAD lets the system know to resume transmission. For these and other reasons, effective voice activity detection is essential. In particular, if the VAD fails, the user's speech component may be processed as if it were noise, and the adaptation processing may disadvantageously operate to remove it.
The exemplary VAD 400 of fig. 4 relies on the reference signal 412 having a reduced user speech component. In the event that the user is near an acoustically reflective surface (such as a wall or other object), or the user's hands are near the microphones (hands behind the head, a hand raised to hold the earbud 100, etc.), however, the user's voice may reflect off the nearby surface and provide a secondary (indirect) source of the user's voice at the microphones 120. In such cases, the second combination 408 may be less effective at rejecting the user speech component; rather, the reference signal 412 may include portions of the user's voice reflected from the nearby surface. The VAD 400 may then fail to detect speech, at least in part because both the reference signal 412 and the main signal 410 increase as the user begins to speak, which may leave the difference between the signals compared by the comparator 414 insufficient to determine that the user is speaking.
For example, if the user is near a wall, there may be a significant reflection of the user's voice that is not rejected by the second combination 408. Furthermore, such speech energy in reference signal 412 may also be in reference signal 312 of, for example, a noise reduction system (see fig. 3), which may cause adaptive processing of the noise reduction system to attempt to remove speech.
Referring to fig. 5, another exemplary VAD 500 is shown. The VAD 500 is similar to the VAD 400 but includes additional processing to account for correlated energy, due to nearby reflective surfaces, between a first combination 506 (e.g., an MVDR beamformer) and a second combination 508 (e.g., a delay-and-subtract null former) of the microphone signals 504. When the user is near an acoustically reflective surface, the indirect (reflected) speech may be substantially in phase with the user's direct speech (e.g., at low frequencies, where the surface is about 1/4 wavelength or less from the user). The second combination 508 may fail to reject this reflected user speech energy because it does not come from the direction of the user's mouth and thus does not arrive with the appropriate time difference for the delay-and-subtract to cancel it. The VAD 500 addresses this problem by performing an addition and a subtraction between the main signal 510 and the reference signal 512, and comparing the resulting sum and difference signals instead of the main and reference signals.
As described above, the first combination 506 includes the user's speech in the main signal 510. When the user is near a wall or other source of reflections, lower-frequency speech will be reflected into the microphone signals 504 and not rejected (or reduced) by the second combination 508, and thus the reference signal 512 also has a component of the user's speech. For various frequency subbands, such as those for which the reflection source is 1/4 wavelength or less away, the speech component in the reference signal 512 may be substantially in phase with the speech component in the main signal 510. Thus, adding the main signal 510 and the reference signal 512 (to produce the sum signal 518) enhances the in-phase low-frequency band energy, while subtracting one of the main signal 510 and the reference signal 512 from the other (to produce the difference signal 520) cancels, or at least significantly reduces, the in-phase low-frequency band energy. Accordingly, in the appropriate low-frequency portion of the signal spectrum, the sum signal 518 will be much larger than the difference signal 520.
In various examples, the sums and differences may be complex additions and complex subtractions, respectively, in the frequency domain (e.g., operating on phase and amplitude information). In other examples, the addition and subtraction may be performed in the time domain.
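By way of illustration, a frequency-domain version of the addition and subtraction may be sketched as follows; the STFT array shapes are assumptions.

    import numpy as np

    def sum_and_difference(main_stft, ref_stft):
        # main_stft, ref_stft: (frames, bins) complex STFTs. In-phase speech
        # that leaks into the reference via a nearby reflection adds
        # constructively in the sum and cancels in the difference.
        return main_stft + ref_stft, main_stft - ref_stft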
According to various examples, the sum and difference may be calculated for multiple low-frequency intervals (and various combinations of the intervals), and relative energy levels may be compared across one or more of the frequency intervals. In some examples, the VAD 500 determines the energy of each of the sum signal 518 and the difference signal 520 within the relevant frequency interval and may apply a low-pass filter to smooth each energy envelope. The relative levels in the frequency interval are then compared to a threshold. If the threshold is exceeded, a boundary may be present that interferes with the VAD beamformer. Accordingly, the VAD 500 may provide the output signal 516 as a logical TRUE, which may be interpreted as an indication that the user is speaking in the presence of boundary interference (a nearby reflective surface).
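A minimal sketch of this band-energy comparison follows, assuming STFT inputs; the smoothing coefficient and threshold are illustrative choices rather than values taken from the disclosure.

    import numpy as np

    def boundary_vad(sum_stft, diff_stft, band, alpha=0.9, thresh_db=6.0):
        # sum_stft, diff_stft: (frames, bins) complex STFTs of the sum and
        # difference signals; band selects the frequency interval of
        # interest (e.g., bins covering 200 Hz - 400 Hz).
        # Returns a per-frame flag: True = voice detected near a boundary.
        e_sum = (np.abs(sum_stft[:, band]) ** 2).sum(axis=1)
        e_diff = (np.abs(diff_stft[:, band]) ** 2).sum(axis=1)
        s = d = 1e-12                                 # smoothed envelopes
        flags = np.zeros(len(e_sum), dtype=bool)
        for t in range(len(e_sum)):
            s = alpha * s + (1 - alpha) * e_sum[t]    # one-pole low-pass
            d = alpha * d + (1 - alpha) * e_diff[t]
            flags[t] = 10 * np.log10(s / d) > thresh_db
        return flags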
In various examples, several frequency bins may be analyzed together and/or separately, because the reflection path length varies, producing some in-phase and some out-of-phase reflections depending on distance. For example, if the user places their hands behind their head, the hands are closer to the microphone array than a wall typically would be, so higher frequency bins may be in phase. The user's hands may reflect less low-frequency energy than a wall but, owing to the generally closer distance, more high-frequency energy. Thus, in some examples, a nearby wall may be detected by significant in-phase content between the main signal and the reference signal at frequencies in the range of 200 Hz to 400 Hz, while the user's hands may be detected nearby by significant in-phase content at frequencies in the range of 500 Hz to 700 Hz.
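Continuing the sketch above, the two intervals named in the text could be examined separately; the STFT parameters here are assumptions.

    import numpy as np

    fs, nfft = 16000, 512                        # assumed STFT parameters
    freqs = np.fft.rfftfreq(nfft, 1 / fs)
    wall_band = (freqs >= 200) & (freqs <= 400)  # nearby wall: lower band
    hand_band = (freqs >= 500) & (freqs <= 700)  # nearby hand: higher band

    # Reusing boundary_vad() from the previous sketch on each interval:
    # vad_wall = boundary_vad(sum_stft, diff_stft, wall_band)
    # vad_hand = boundary_vad(sum_stft, diff_stft, hand_band)
    # near_boundary_voice = vad_wall | vad_hand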
Fig. 6 illustrates a method 600 of detecting voice activity of a user in proximity to an acoustically reflective surface (such as may be implemented by the VAD 500 of fig. 5). The method 600 receives a plurality of microphone signals (step 610) and combines the microphone signals according to a first combination (step 620) to provide a main signal and according to a second combination (step 630) to provide a reference signal. The first combination is configured to provide a main signal having an enhanced component representative of the user's speech, while the second combination is configured to provide a reference signal having a reduced component representative of the user's speech. In some examples, the first combination may also be configured to provide a main signal having reduced non-speech components (such as ambient noise), while the second combination is configured to provide a reference signal having enhanced non-speech components, such as a noise reference signal representative of ambient noise.
When the microphone signals include reflected acoustic energy from a nearby surface, such as a wall or the user's hand (e.g., near the microphones), there may be substantially in-phase user speech content in the reference signal. Such user voice content in the reference signal may cause a conventional voice activity detector to falsely conclude that the user is not speaking, which may cause other subsystems to perform poorly. For example, a conventional noise (or echo) reduction subsystem with adaptive filter processing (see, e.g., the system 300 of fig. 3) may freeze adaptation when the user is speaking, and a failure to detect that the user is speaking may cause such a subsystem to begin adapting to the user speech content when it should not, as such systems typically adapt the filter to the noise (or echo) content. Even where a conventional voice activity detector accurately detects voice activity, if other subsystems use the reference signal as a noise reference, the user voice content in the reference signal may result in poor performance of those subsystems. It is therefore important to detect when the reference signal (erroneously) includes speech content, e.g., due to a nearby reflective surface.
As described above, for certain frequency bins based on distance to the reflecting surface, the speech content in the reference signal caused by nearby reflecting surfaces may be in phase with the speech content in the main signal. The closer the reflecting surfaces are, the stronger the reflection (e.g., amplitude) and the higher the frequency range over which the reflection will be in phase.
With continued reference to fig. 6, to detect in-phase user speech content in the reference signal, the method 600 adds the main signal and the reference signal (step 640) to provide a sum signal, and subtracts one from the other (calculates the difference between them) (step 650) to provide a difference signal. If significant user speech content is present in the reference signal in phase with the main signal, these in-phase components add (are enhanced) in the sum signal and subtract (are cancelled or reduced) in the difference signal. The method 600 then compares (step 660) the sum signal to the difference signal, potentially across various frequency ranges or intervals. A sufficient difference (in energy, amplitude, etc.) between the sum signal and the difference signal at certain frequencies, ranges, or intervals means that the main signal and the reference signal contain in-phase components, which further indicates, based on the frequencies at which the in-phase components occur, that a reflective surface is nearby and is causing the reference signal to contain a user speech component. Accordingly, and as discussed above, a conventional voice activity detector may be unreliable in this case, and thus the method 600 indicates that voice activity is detected (step 670), e.g., VAD = 1.
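Gathering the steps of method 600 into a single illustrative sketch (a simple microphone average stands in for the MVDR beamformer of step 620, and the delay, analysis band, and threshold are assumptions):

    import numpy as np

    def method_600(front, rear, delay, band, nfft=512, hop=256):
        # front, rear: the two microphone signals (step 610: receive).
        # Assumes delay >= 1 and signals longer than nfft.
        main = 0.5 * (front + rear)             # step 620: beamformer stand-in
        ref = rear.copy()                       # step 630: delay-and-subtract
        ref[delay:] -= front[:-delay]
        s = main + ref                          # step 640: sum signal
        d = main - ref                          # step 650: difference signal
        win = np.hanning(nfft)                  # step 660: band-energy compare
        starts = range(0, len(s) - nfft, hop)
        S = np.array([np.fft.rfft(win * s[i:i + nfft]) for i in starts])
        D = np.array([np.fft.rfft(win * d[i:i + nfft]) for i in starts])
        e_sum = (np.abs(S[:, band]) ** 2).sum(axis=1)
        e_diff = (np.abs(D[:, band]) ** 2).sum(axis=1)
        return e_sum > 4.0 * e_diff             # step 670: VAD flag per frame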
As also discussed above, other subsystems may alter their operation based on an indication of voice activity, such as by freezing adaptive filters of, for example, noise reduction, echo reduction, and/or other subsystems. In some examples, when method 600 (or system 500) indicates voice activity, noise reduction, echo reduction, or other subsystems may cease operation. In various examples, when method 600 (or system 500) indicates voice activity, a primary signal (such as any of primary signals 310, 410, 510 of fig. 3, 4, or 5, respectively) may be provided as an estimated voice signal to be provided as an output voice signal (with or without additional processing). In other words, the lack of an indication of voice activity (or no voice activity) (e.g., VAD = 0) may cause other subsystems to stop processing or to stop providing an output voice signal. Thus, in general, various examples of audio systems and methods according to those described herein may include various subsystems, the operation of which may or may not depend on a binary indication of voice activity (e.g., VAD = 0/1), such as by adjusting, altering, freezing, stopping, or starting various processes based on an output indication of the voice activity detection method 600 or system 500.
As discussed above, the exemplary systems 100, 300, 400, 500 and their associated subsystems may operate in the digital domain and may include analog-to-digital converters (not shown). In addition, the components and processes included in the exemplary systems may achieve better performance when operating on narrowband signals rather than wideband signals. Thus, certain examples may include subband filtering to allow processing of one or more subbands; for example, beamforming, null forming, adaptive filtering, signal combining (adding, subtracting), signal comparison, voice activity detection, and spectral enhancement may each exhibit enhanced performance when operating on individual subbands. In some examples, the subbands may be recombined after such processing to produce an output signal. In some examples, the microphone signals 304, 404, 504 may be filtered to remove content outside the typical spectrum of human speech; alternatively, the exemplary subsystems may operate only on subbands within the spectrum associated with human speech and ignore subbands outside that spectrum. Additionally, although the exemplary systems are discussed with reference to only a single set of microphones 120, 302, in some examples there may be additional sets of microphones, e.g., one set on the left and another on the right, and the various aspects and examples herein may be applied to, or combined with, such additional sets of microphones.
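As a final illustrative sketch (window, hop, and band edges are assumptions), subband splitting and a speech-range bin mask might be implemented as:

    import numpy as np

    def split_subbands(x, nfft=512, hop=256):
        # STFT subband analysis so each stage (beamforming, null forming,
        # adaptive filtering, comparison) can operate per subband.
        win = np.hanning(nfft)
        starts = range(0, len(x) - nfft, hop)
        return np.array([np.fft.rfft(win * x[i:i + nfft]) for i in starts])

    def speech_band_mask(fs=16000, nfft=512, lo=100.0, hi=7000.0):
        # Boolean mask keeping only bins in an assumed speech range,
        # so subbands outside the speech spectrum can be ignored.
        freqs = np.fft.rfftfreq(nfft, 1 / fs)
        return (freqs >= lo) & (freqs <= hi)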
In various examples and combinations, one or more of the above systems and methods may be used to capture a user's voice and isolate or enhance the user's voice from background noise, echoes, and other talkers. Any of the described systems and methods, and variations thereof, may be implemented at different reliability levels based on, for example, microphone quality, microphone placement, acoustic ports, form factor/frame design, thresholds, selection of adaptive algorithms, spectral algorithms, and other algorithms, weighting factors, window sizes, etc., and other criteria that may accommodate different applications and operating parameters.
Many, if not all, of the functions, methods, and/or components of the systems and methods disclosed herein according to the various aspects and examples can be implemented or performed in a Digital Signal Processor (DSP) and/or other circuitry (analog or digital) adapted to perform signal processing and other functions according to the aspects and examples disclosed herein. Additionally or alternatively, microprocessors, logic controllers, logic circuits, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), general purpose computing processors, microcontrollers, etc., or any combination of these, may be suitable and may include analog or digital circuit components and/or other components relative to any particular implementation. The functions and components disclosed herein may operate in the digital domain, the analog domain, or a combination of the two, and although the description of an analog-to-digital converter (ADC) or a digital-to-analog converter (DAC) is absent in the various figures, certain examples include an ADC and/or a DAC, where appropriate. Any suitable hardware and/or software (including firmware, etc.) can be configured to implement or realize the components of the aspects and examples disclosed herein, and various implementations of the aspects and examples can include components and/or functions in addition to those disclosed. Various embodiments may include stored instructions for a digital signal processor and/or other circuitry to cause the circuitry to perform, at least in part, the functions described herein.
Having thus described several aspects of at least one example, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be part of this disclosure and are intended to be within the scope of the invention. Accordingly, the foregoing description and drawings are by way of example only, and the scope of the invention should be determined from proper construction of the appended claims and equivalents thereof.

Claims (20)

1. A method of detecting voice activity of a user, the method comprising:
receiving a plurality of microphone signals;
combining the plurality of microphone signals according to a first combination to produce a primary signal having an enhanced response in a direction of the user's mouth;
combining the plurality of microphone signals according to a second combination to produce a reference signal having a reduced response in a direction of the user's mouth;
adding the primary signal and the reference signal to produce a sum signal;
subtracting one of the primary signal or the reference signal from the other of the primary signal or the reference signal to produce a difference signal;
comparing the sum signal to the difference signal; and
providing an output speech signal based on the comparison.
2. The method of claim 1, wherein the first combination is a Minimum Variance Distortionless Response (MVDR) combination.
3. The method of claim 1, wherein the second combination is a delay and subtract combination.
4. The method of claim 1, wherein comparing the sum signal to the difference signal comprises determining and comparing at least one of an energy, an amplitude, or an envelope of the sum signal and the difference signal.
5. The method of claim 4, wherein comparing the at least one of an energy, amplitude, or envelope of the sum and difference signals comprises comparing at least one of a ratio or a difference to a threshold, or multiplying at least one of the energy, amplitude, or envelope by a factor and comparing the multiplied energy, amplitude, or envelope to the other energy, amplitude, or envelope.
6. The method of claim 1, wherein comparing the sum signal to the difference signal comprises comparing the sum signal to the difference signal in a first frequency band and in a second frequency band, the second frequency band being different from the first frequency band.
7. The method of claim 6, wherein the first frequency band comprises frequencies in the range of 200Hz-400Hz, and the second frequency band comprises frequencies in the range of 500Hz-700 Hz.
8. The method of claim 1, further comprising processing a speech signal with an adaptive filter and altering the adaptive filter based on the comparison.
9. An audio system, the audio system comprising:
a plurality of microphones; and
a controller coupled to the plurality of microphones and configured to:
receive a plurality of microphone signals from the plurality of microphones,
combine the plurality of microphone signals according to a first combination to produce a primary signal having an enhanced response in a direction of the user's mouth,
combine the plurality of microphone signals according to a second combination to produce a reference signal having a reduced response in a direction of the user's mouth,
add the primary signal and the reference signal to produce a sum signal,
subtract one of the primary signal or the reference signal from the other of the primary signal or the reference signal to produce a difference signal,
compare the sum signal to the difference signal, and
provide an output speech signal based on the comparison.
10. The audio system of claim 9, wherein the first combination is a Minimum Variance Distortionless Response (MVDR) combination and the second combination is a delay and subtract combination.
11. The audio system of claim 9, wherein comparing the sum signal to the difference signal comprises determining and comparing at least one of an energy, amplitude, or envelope of the sum signal and the difference signal.
12. The audio system of claim 9, wherein comparing the sum signal to the difference signal comprises comparing the sum signal to the difference signal in a first frequency band and in a second frequency band, the second frequency band being different from the first frequency band.
13. The audio system of claim 12, wherein the first frequency band comprises frequencies in the range of 200Hz-400Hz, and the second frequency band comprises frequencies in the range of 500Hz-700 Hz.
14. The audio system of claim 9, wherein providing the output speech signal based on the comparison comprises processing the speech signal with an adaptive filter and altering the adaptive filter based on the comparison.
15. A non-transitory computer-readable medium having encoded thereon instructions that, when executed by a processor, cause the processor to perform a method comprising:
receiving a plurality of microphone signals;
combining the plurality of microphone signals according to a first combination to produce a primary signal having an enhanced response in a direction of the user's mouth;
combining the plurality of microphone signals according to a second combination to produce a reference signal having a reduced response in a direction of the user's mouth;
adding the primary signal and the reference signal to produce a sum signal;
subtracting one of the primary signal or the reference signal from the other of the primary signal or the reference signal to produce a difference signal;
comparing the sum signal to the difference signal; and
providing an output speech signal based on the comparison.
16. The non-transitory computer-readable medium of claim 15, wherein the first combination is a Minimum Variance Distortionless Response (MVDR) combination and the second combination is a delay and subtract combination.
17. The non-transitory computer-readable medium of claim 15, wherein comparing the sum signal to the difference signal comprises determining and comparing at least one of an energy, amplitude, or envelope of the sum signal and the difference signal.
18. The non-transitory computer-readable medium of claim 15, wherein comparing the sum signal to the difference signal comprises comparing the sum signal to the difference signal in a first frequency band and in a second frequency band, the second frequency band different from the first frequency band.
19. The non-transitory computer-readable medium of claim 18, wherein the first frequency band includes frequencies in the range of 200Hz-400Hz, and the second frequency band includes frequencies in the range of 500Hz-700 Hz.
20. The non-transitory computer-readable medium of claim 15, wherein providing the output speech signal based on the comparison comprises processing the speech signal with an adaptive filter and altering the adaptive filter based on the comparison.
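Claims 8, 14, and 20 recite processing a speech signal with an adaptive filter that is altered based on the comparison. As one non-limiting reading, the following is a minimal sketch in which adaptation of a normalized LMS noise canceller is frozen while voice activity is detected; the freeze policy, tap count, and step size are assumptions for illustration, since the claims do not specify how the filter is altered.

import numpy as np

def nlms_gated_by_vad(primary, reference, voiced, num_taps=32, mu=0.1, eps=1e-8):
    # Predict noise in the primary signal from the reference signal and
    # subtract it. `voiced` is a per-sample boolean array (e.g., the
    # per-frame detector flags expanded to the sample rate). Adaptation is
    # paused during detected speech so the user's voice is not cancelled.
    w = np.zeros(num_taps)
    output = np.zeros_like(primary, dtype=float)
    for n in range(num_taps, len(primary)):
        x = reference[n - num_taps:n][::-1]   # most-recent-first history
        e = primary[n] - np.dot(w, x)         # enhanced (noise-reduced) sample
        output[n] = e
        if not voiced[n]:                     # alter the filter only between speech
            w += mu * e * x / (np.dot(x, x) + eps)
    return output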
CN202180050288.1A 2020-08-17 2021-08-12 Audio system and method for voice activity detection Pending CN115868178A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US16/995,134 US11482236B2 (en) 2020-08-17 2020-08-17 Audio systems and methods for voice activity detection
US16/995,134 2020-08-17
PCT/US2021/045739 WO2022040011A1 (en) 2020-08-17 2021-08-12 Audio systems and methods for voice activity detection

Publications (1)

Publication Number Publication Date
CN115868178A (en) 2023-03-28

Family

ID=77640767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180050288.1A Pending CN115868178A (en) 2020-08-17 2021-08-12 Audio system and method for voice activity detection

Country Status (3)

Country Link
US (2) US11482236B2 (en)
CN (1) CN115868178A (en)
WO (1) WO2022040011A1 (en)

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009151578A2 (en) * 2008-06-09 2009-12-17 The Board Of Trustees Of The University Of Illinois Method and apparatus for blind signal recovery in noisy, reverberant environments
US10186277B2 (en) * 2015-03-19 2019-01-22 Intel Corporation Microphone array speech enhancement
US10242696B2 (en) * 2016-10-11 2019-03-26 Cirrus Logic, Inc. Detection of acoustic impulse events in voice applications
US10475471B2 (en) * 2016-10-11 2019-11-12 Cirrus Logic, Inc. Detection of acoustic impulse events in voice applications using a neural network
US9930447B1 (en) * 2016-11-09 2018-03-27 Bose Corporation Dual-use bilateral microphone array
US9843861B1 (en) * 2016-11-09 2017-12-12 Bose Corporation Controlling wind noise in a bilateral microphone array
US10366708B2 (en) 2017-03-20 2019-07-30 Bose Corporation Systems and methods of detecting speech activity of headphone user
US10311889B2 (en) 2017-03-20 2019-06-04 Bose Corporation Audio signal processing for noise reduction
US10499139B2 (en) * 2017-03-20 2019-12-03 Bose Corporation Audio signal processing for noise reduction
US10264354B1 (en) * 2017-09-25 2019-04-16 Cirrus Logic, Inc. Spatial cues from broadside detection
US10096328B1 (en) * 2017-10-06 2018-10-09 Intel Corporation Beamformer system for tracking of speech and noise in a dynamic environment
CN112424863B (en) * 2017-12-07 2024-04-09 Hed科技有限责任公司 Voice perception audio system and method
US10438605B1 (en) 2018-03-19 2019-10-08 Bose Corporation Echo control in binaural adaptive noise cancellation systems in headsets
US11062727B2 (en) * 2018-06-13 2021-07-13 Ceva D.S.P Ltd. System and method for voice activity detection
EP3675517B1 (en) * 2018-12-31 2021-10-20 GN Audio A/S Microphone apparatus and headset
US10964314B2 (en) * 2019-03-22 2021-03-30 Cirrus Logic, Inc. System and method for optimized noise reduction in the presence of speech distortion using adaptive microphone array
US11328740B2 (en) * 2019-08-07 2022-05-10 Magic Leap, Inc. Voice onset detection
US11917384B2 (en) * 2020-03-27 2024-02-27 Magic Leap, Inc. Method of waking a device using spoken voice commands

Also Published As

Publication number Publication date
US20220051686A1 (en) 2022-02-17
US20230040975A1 (en) 2023-02-09
WO2022040011A1 (en) 2022-02-24
US11482236B2 (en) 2022-10-25
US11688411B2 (en) 2023-06-27

Similar Documents

Publication Publication Date Title
JP7108071B2 (en) Audio signal processing for noise reduction
US10339952B2 (en) Apparatuses and systems for acoustic channel auto-balancing during multi-channel signal extraction
US7983907B2 (en) Headset for separation of speech signals in a noisy environment
US7464029B2 (en) Robust separation of speech signals in a noisy environment
CN111902866A (en) Echo control in a binaural adaptive noise cancellation system in a headphone
US9633670B2 (en) Dual stage noise reduction architecture for desired signal extraction
JP2013532308A (en) System, method, device, apparatus and computer program product for audio equalization
EP3422736B1 (en) Pop noise reduction in headsets having multiple microphones
KR20090050372A (en) Noise cancelling method and apparatus from the mixed sound
CN112334972A (en) Real-time detection of feedback instability
US10249323B2 (en) Voice activity detection for communication headset
CA2798282A1 (en) Wind suppression/replacement component for use with electronic systems
US11854565B2 (en) Wrist wearable apparatuses and methods with desired signal extraction
US10299027B2 (en) Headset with reduction of ambient noise
US11688411B2 (en) Audio systems and methods for voice activity detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination