EP3516653B1 - Apparatus and method for generating noise estimates - Google Patents

Apparatus and method for generating noise estimates

Info

Publication number
EP3516653B1
EP3516653B1 EP16784821.7A EP16784821A
Authority
EP
European Patent Office
Prior art keywords
noise
frequency
spectral
cut
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP16784821.7A
Other languages
German (de)
French (fr)
Other versions
EP3516653A1 (en)
Inventor
Wenyu Jin
Wei Xiao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of EP3516653A1
Application granted
Publication of EP3516653B1
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain


Description

  • This invention relates to an apparatus and a method for generating noise estimates.
  • Voice telecommunication is an increasingly essential part of daily life. Noise is a critical issue for voice telecommunications and is inevitable in the real world. Noise reduction (NR) technologies can be applied to enhance the intelligibility of voice communications. The majority of existing NR methods are optimized for near-end speech enhancement. These methods work well, for example, when a mobile phone is used in "hand-held" mode. Hand-held scenarios are generally easy to handle, due to high signal-to-noise ratios (SNR). NR methods are vulnerable to "hands-free" scenarios, however. These often involve low SNRs due to distant sound pickup. Complex noise variations in particular can undermine system performance in hands-free mode. Reduction of this "non-stationary" noise is difficult to achieve.
  • Noise reduction methods based on single channel noise estimation can usually only deal with stationary noise scenarios and are vulnerable to non-stationary noise and interferers. Better differentiation between speech and noise can be achieved using multiple microphones. Using multiple microphones also facilitates accurate estimation of complex noise conditions and can lead to effective non-stationary noise suppression.
  • Examples of existing techniques that explore the possibility of noise estimation using multiple microphone arrays include techniques described in: "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms" by R. Zelinski (Proc. ICASSP-88, vol. 5, 1988, pp. 2578-2581) and "Microphone array post-filter based on noise field coherence" by McCowan et al. (IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, 2003, pp. 709-716). These techniques assume the noise is either spatially white (incoherent) or fully diffuse and cannot deal with time-varying noise and interference sources. They are also ineffective at low frequencies when the sound source is close to the microphone. Speech and noise signals show similar coherence properties under those conditions, meaning that it is not possible to determine one from the other on the basis of coherence alone.
  • One technique that recognises speech and noise cannot be distinguished from each other based on coherence alone at low frequencies is described in "Dual microphone noise PSD estimation for mobile phones in hands-free position exploiting the coherence and speech presence probability" by Nelke et al. (Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference). This paper proposes a solution in which the noise power is estimated from a single microphone at low frequencies and from the coherence between signals from multiple microphones at higher frequencies. The final noise estimate combines the low and high frequency estimates at a fixed frequency threshold. This leads to vulnerability: the coherence models for speech and noise inevitably overlap in complex low-SNR scenarios, and the fixed frequency threshold is not always effective at separating the regions of overlap from those of non-overlap. It also leads to high complexity for real-time implementations, as the noise estimation is based on a multi-channel adaptive coherence model that requires adaptation of both a speech coherence model and a noise coherence model.
  • US 2008/0159559 A1 describes a post-filter for a microphone array which is based on a transition frequency determined in accordance with a distance between microphones. A wind noise reduction device is described in US 2008/0317261 A1 . US 2014/0161271 A1 concerns a noise eliminating device and US 2016/0078856 A1 also addresses eliminating noise.
  • It is an object of the invention to provide concepts for generating more accurate noise estimates.
  • The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
  • The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:
    • Figure 1 shows a noise estimator according to an embodiment of the present invention;
    • Figure 2 shows an example of a process for estimating noise;
    • Figure 3 shows a more detailed example of a noise estimator according to an embodiment of the invention;
    • Figure 4 is a flowchart showing a more detailed process for estimating noise; and
    • Figure 5 shows simulation results comparing a fixed cut-off frequency procedure with an adaptive cut-off frequency procedure in accordance with an embodiment of the present invention.
  • An example of a noise estimator is shown in Figure 1. An overview of an operation of the noise estimator is shown in Figure 2. The noise estimator 100 comprises an estimator 101 and an adaptation unit 102. The estimator is configured to receive an audio signal that is detected by microphone 103 (step S201). In some implementations, the estimator is configured to receive audio signals that are detected by multiple microphones 104. The microphones may be part of the same device as the noise estimator. That device could be, for example, a mobile phone, smart phone, landline telephone, tablet, laptop, teleconferencing equipment or any generic user equipment, particularly user equipment that is commonly used to capture speech signals.
  • The audio signal represents sounds that have been captured by a microphone. An audio signal will often be formed from a component that is wanted (which will usually be speech) and a component that is not wanted (which will usually be noise). Estimating the unwanted component means that it can be removed from the audio signal. Each microphone will capture its own version of sounds in the surrounding environment, and those versions will tend to differ from each other depending on differences between the microphones themselves and on the respective positions of the microphones relative to the sound sources. If the sounds in the environment include speech and noise, each microphone will typically capture an audio signal that is representative of both speech and noise. Similarly, if the sounds in the environment just include noise (e.g. during pauses in speech), each microphone will capture an audio signal that represents just that noise. Sounds in the surrounding environment will typically be reflected differently in each individual audio signal. In some circumstances, these differences can be exploited to estimate the noise signal.
  • The estimator (101) is configured to generate an overall estimate of noise in the audio signal (steps 202 to 204). The estimated noise can then be removed from the audio signal by another part of the device. The estimator is configured to generate the estimate based on one or more of the audio signals captured by the microphones. (In some implementations, the audio signals that are captured by the microphones are pre-processed before being input into the estimator. Such pre-processed signals are also covered by the general term "audio signals" used herein.) Each audio signal can be considered as being formed from a series of complex sinusoidal functions. Each of those sinusoidal functions is a spectral component of the audio signal. Typically, each spectral component is associated with a particular frequency, phase and amplitude. The audio signal can be disassembled into its respective spectral components by a Fourier analysis.
  • The estimator 101 aims to form an overall noise estimate by generating a spectral noise estimate for each spectral component in the audio signal. In Figure 1, the estimator comprises a low-frequency estimator 105 and a high frequency estimator 106. The low-frequency estimator 105 is configured to generate spectral noise estimates for the spectral components of the audio signal that are below a cut-off frequency. Those spectral noise estimates will form a low frequency section of the overall noise estimate. The low frequency estimator achieves this by applying a first estimation technique to the audio signal to generate spectral noise estimates that are associated with frequencies below a cut-off frequency (step S202). The high frequency estimator 106 is configured to generate spectral noise estimates for the spectral components of the audio signal that are above the cut-off frequency. Those spectral estimates will form a higher frequency section of the overall noise estimate. The high frequency estimator achieves this by applying a second estimation technique to the audio signal to generate spectral noise estimates that are associated with frequencies above the cut-off frequency (step S203).
  • The estimator also comprises a combine module 107 that is configured to form the overall noise estimate by combining the spectral noise estimates that are output by the low and high frequency estimators. The combine module forms the overall noise estimate to have spectral noise estimates that are output by the low frequency estimator below the cut-off frequency and spectral noise estimates that are output by the high frequency estimator above the cut-off frequency (step S204). In some embodiments, the low and high frequency estimators will both be configured to generate spectral noise estimates across the whole frequency range of the audio signal. The combine module will then just select the appropriate spectral noise estimate to use for each frequency bin in the overall noise estimate, with that selection depending on the cut-off frequency.
  • The estimator 101 also comprises an adaptation unit 102. The adaptation unit is configured to adjust the cut-off frequency. The adaptation unit makes this adjustment to account for changes in the respective coherence properties of the speech and noise signals that are reflected in the audio signal (step S205). The coherence properties of the noise signal generally vary in dependence on frequency. At low frequencies, speech and noise tend to show similar degrees of coherence, whereas at higher frequencies noise is often incoherent while speech is coherent. Coherence properties can also be affected by the distance between a sound source and a microphone: noise and speech show particularly similar coherence properties at low frequencies when the microphone and the sound source are close together. The respective coherence properties displayed by the noise and speech signals will thus tend to vary with time, particularly in mobile and/or hands-free scenarios where one or more sound sources (such as someone talking) may move with respect to the microphone. One option is to track the coherence properties of both speech and noise. However, in practice, it is the noise coherence that particularly changes. Consequently, changes between the respective coherence properties of the speech and noise signals can be monitored by tracking the coherence properties of just the noise.
  • Adjusting the cut-off frequency so as to adapt to changes in the coherence properties of the noise signal that are represented in the audio signal may be advantageous because it enables the estimator to generate the overall noise estimate using techniques that work well for the particular coherence properties that are prevalent in the noise on either side of the cut-off frequency, and to alter that cut-off frequency to account for changes in those coherence properties with time. This is particularly useful for the complex noise scenarios that occur when user equipment is used in hands-free mode.
  • The structure shown in Figure 1 (and all the block apparatus diagrams included herein) is intended to correspond to a number of functional blocks. This is for illustrative purposes only. Figure 1 is not intended to define a strict division between different parts of hardware on a chip or between different programs, procedures or functions in software. In some embodiments, some or all of the signal processing techniques described herein are likely to be performed wholly or partly in hardware. This particularly applies to techniques incorporating repetitive arithmetic operations, such as Fourier transforms, auto- and cross-correlations, and pseudo-inverses. In some implementations, at least some of the functional blocks are likely to be implemented wholly or partly by a processor acting under software control. Any such software is suitably stored on a non-transitory machine readable storage medium. The processor could, for example, be a DSP of a mobile phone, smart phone, landline telephone, tablet, laptop, teleconferencing equipment or any generic user equipment with speech processing capability.
  • A more detailed example of a noise estimator is shown in Figure 3. The system is configured to receive multiple audio signals X1 to XM (301). Each of these audio signals represents a recording from a specific microphone. The number of microphones can thus be denoted M. Each channel is provided with a segmentation/windowing module 302. These modules are followed by transform units 303 configured to convert the windowed signals into the frequency domain.
  • The transform units 303 are configured to implement the Fast Fourier Transform (FFT) to derive the short-term Fourier transform (STFT) coefficients for each input channel. These coefficients represent spectral components of the input signal. The STFT is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time. The STFT may be computed by dividing the audio signal into short segments of equal length and then computing the Fourier transform separately on each short segment. The result is the Fourier spectrum for each short segment of the audio signal, giving the signal processor the changing frequency spectra of the audio signal as a function of time. Each spectral component thus has an amplitude and a time extension. The length of the FFT can be denoted N. N represents a number of frequency bins, with the STFT essentially decomposing the original audio signal into those frequency bins.
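  • The following Python sketch illustrates one way the segmentation/windowing and transform stages (302, 303) could be realised. It is an illustration only, not the patent's prescribed implementation: the 512-sample frame length, 50% hop and Hann window are assumptions.

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """Per-channel STFT: window short segments of equal length and
    FFT each one. Returns shape (num_frames, frame_len // 2 + 1),
    i.e. one row of spectral components per frame."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([
        x[t * hop : t * hop + frame_len] * window
        for t in range(num_frames)
    ])
    # rfft keeps the N/2 + 1 non-redundant frequency bins
    return np.fft.rfft(frames, axis=-1)

# One STFT per microphone channel X_1 ... X_M:
# X = [stft_frames(channel) for channel in mic_signals]
```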
  • The outputs from the transform units 303 are input into the estimator, shown generally at 304. In the example of Figure 3, the low frequency estimator is implemented by "SPP Based NE" unit 305 (which will be referred to as SPP unit 305 hereafter). The low frequency estimator is configured to generate the spectral noise estimates below the cut-off frequency. The high frequency estimator is implemented by the "Noise Coherence/Covariance" modelling unit 306 and the "MMSE Optimal NE Solver" 307 (which will be respectively referred to as modelling unit 306 and optimiser 307 hereafter). The high frequency estimator is configured to generate the spectral noise estimates above the cut-off frequency.
  • The low frequency estimator 305 and the high frequency estimator 306, 307 process the outputs from the transform units using different noise estimation techniques. The low frequency estimator suitably uses a technique that is adapted to the respective coherence properties of the noise signal and the speech signal that are expected to predominate in the audio signal below the cut-off frequency. In most embodiments this means that the low frequency estimator will apply an estimation technique that is adapted to a scenario in which the coherence of both signals is high and similar to the coherence of the other. In the example of Figure 3 the low frequency estimator is configured to generate its spectral noise estimates based on a single microphone signal. The high frequency estimator will similarly apply an estimation technique that is adapted to a coherence of the noise signal and the speech signal that is expected to predominate in the audio signal above the cut-off frequency. The noise and speech signals are generally expected to show different coherence properties above the cut-off frequency, with the noise signal becoming less coherent than below the cut-off frequency. A more accurate noise estimate may be obtained by combining signals from multiple microphones under these conditions, so the high frequency estimator may be configured to receive audio signals from multiple microphones.
  • Suitably the noise estimates that are output by the low frequency estimator 305 and the high frequency estimator 306, 307 take the form of power spectral densities (PSDs). A PSD represents the noise as a series of coefficients. Each coefficient represents an estimated power of the noise in an audio signal for a respective frequency bin. The coefficient in each frequency bin can be considered a spectral noise estimate. The frequency bins suitably replicate the frequency bins into which the audio signals were decomposed by transform units 303. The outputs of the low frequency estimator and the high frequency estimator thus represent spectral noise estimates for each spectral component of the audio signal.
  • The two sets of coefficients are input into the "Estimate Selection" unit 308. This estimate selection unit combines the functionality of combine module 107 and adaptation unit 102 shown in Figure 1. The estimate selection unit is configured to choose between the coefficients that are output by the low frequency estimator and the high frequency estimator in dependence on frequency. To form parts of the overall noise estimate that are below the cut-off frequency, the adaptation unit chooses the coefficients output by SPP unit 305. To form parts of the overall noise estimate that are above the cut-off frequency, the estimate selection unit chooses the coefficients output by the combination of the modelling unit 306 and the optimiser 307. The estimate selection unit also monitors a coherence of the noise signal by means of the audio signal, and uses this to adapt the cut-off frequency.
  • The low frequency estimator may use any suitable estimation technique to generate spectral noise estimates that are below a cut-off frequency. One option would be an MMSE-based spectral noise power estimation technique. Another option is soft decision voice activity detection. This is the technique implemented by SPP unit 305, which is configured to implement a single-channel SPP-based method (where "SPP" stands for Speech Presence Probability). SPP maintains a quick noise tracking capability, results in less noise power overestimation and is computationally less expensive than other options.
  • SPP module 305 is configured to receive an audio signal from one microphone. For devices that have multiple microphones that are not necessarily the same (e.g. smartphones), the SPP unit 305 is preferably configured to receive the single channel that corresponds to the device's "primary" microphone.
  • At higher frequencies, noise estimation is suitably based on a multi-channel adaptive coherence model. Model adaptation unit 306 is configured to update a noise coherence model and a noise covariance model in dependence on signals input from multiple microphones. Optimiser 307 takes the outputs of the model adaptation unit and generates the optimum noise estimate for higher frequency sections of the overall noise estimate given those outputs.
  • An example of an estimation process that may be performed by the noise estimator shown in Figure 3 is shown in Figure 4 and described in detail below. In step S401 the incoming signals 301 are received from multiple microphones. In step S402, those signals are segmented/windowed (by segmentation/windowing units 302) and converted into the frequency domain (by transform units 303). The probability of speech presence in the current frame is then detected by SPP unit 305 (step S403) using the function:

    $$\rho_{\tau,\omega} = \left(1 + (1+\xi_{\mathrm{opt}})\exp\!\left(-\frac{|X_1(\tau,\omega)|^2}{\hat{\Phi}_{N,\mathrm{SPP}}(\tau-1,\omega)}\cdot\frac{\xi_{\mathrm{opt}}}{\xi_{\mathrm{opt}}+1}\right)\right)^{-1}\tag{1}$$

    where $\rho_{\tau,\omega}$ represents the probability of speech presence in frame $\tau$ and frequency bin $\omega$, $X_1$ is the audio signal received by SPP unit 305, $\xi_{\mathrm{opt}}$ is a fixed, optimal a priori signal-to-noise ratio and $\hat{\Phi}_{N,\mathrm{SPP}}(\tau-1,\omega)$ is the noise estimate of the previous frame. $\rho_{\tau,\omega}$ takes a value between 0 and 1, where 1 indicates speech presence.
  • The SPP unit (305) also updates its estimated noise PSD as the weighted sum of the current noisy frame and the previous estimate (step S404):

    $$\hat{\Phi}_{N,\mathrm{SPP}}(\tau,\omega) = \rho_{\tau,\omega}\,\hat{\Phi}_{N,\mathrm{SPP}}(\tau-1,\omega) + (1-\rho_{\tau,\omega})\,|X_1(\tau,\omega)|^2\tag{2}$$
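  • A minimal sketch of equations (1) and (2) follows. The numeric value of ξopt is an assumption (a fixed a priori SNR of roughly 15 dB is a common choice in the SPP literature); the patent only states that ξopt is fixed.

```python
import numpy as np

XI_OPT = 10 ** (15 / 10)  # assumed fixed a priori SNR (~15 dB)

def spp_update(X1_frame, noise_psd_prev):
    """Equations (1) and (2): speech presence probability and recursive
    noise PSD update for one frame of the primary channel."""
    snr_post = np.abs(X1_frame) ** 2 / noise_psd_prev
    # Equation (1): posterior speech presence probability per bin
    rho = 1.0 / (1.0 + (1.0 + XI_OPT) *
                 np.exp(-snr_post * XI_OPT / (XI_OPT + 1.0)))
    # Equation (2): weighted sum of previous estimate and current frame
    noise_psd = rho * noise_psd_prev + (1.0 - rho) * np.abs(X1_frame) ** 2
    return rho, noise_psd
```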
  • The speech presence probability calculation also triggers the updating of the noise coherence and covariance models by modelling unit 306, since these models are preferably updated in the absence of speech.
  • The model adaptation unit (306) is configured to track two qualities of the noise comprised in the incoming microphone signals: its coherence and its covariance (step S405).
  • The model adaptation unit is configured to track noise coherence using a model that is based on a coherence function. The coherence function characterises a noise field by representing the coherence between two signals at points p and q. The magnitude of the output of the noise coherence function is always less than or equal to one (i.e. $|\Upsilon_{\omega,pq}| \le 1$). Essentially the output represents a normalized measure of the correlation that exists between signals at two discrete points in a noise field. The noise coherence function between the jth and kth microphones can be initialised with the diffuse noise model:

    $$\Upsilon_{\omega,pq} = \operatorname{sinc}\!\left(\frac{2\pi f\,d_{pq}}{c}\right)\tag{3}$$

    where f is the frequency, $d_{pq}$ is the distance between points p and q, c is the speed of sound, and $\omega$ is an index representing the relevant frequency bin. The relevant distance in this scenario is between the jth and kth microphones, so the subscripts j and k will be substituted for p and q hereafter.
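  • As a worked illustration of equation (3): NumPy's sinc is the normalised form sin(πx)/(πx), so the factor of π in the argument of equation (3) is absorbed by passing 2fd/c. The speed-of-sound value is an assumption.

```python
import numpy as np

def diffuse_init(freqs_hz, d_pq, c=343.0):
    """Equation (3): diffuse-field coherence between two points a
    distance d_pq apart. np.sinc(z) = sin(pi z)/(pi z), so passing
    z = 2 f d / c yields sin(2 pi f d / c) / (2 pi f d / c)."""
    return np.sinc(2.0 * freqs_hz * d_pq / c)
```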
  • The model adaptation unit (306) updates the coherence model as the microphone signals are received:

    $$\Upsilon_{pq}(\tau,\omega) = \alpha_{\gamma}\,\Upsilon_{pq}(\tau-1,\omega) + (1-\alpha_{\gamma})\,\frac{\Phi_{jk}(\tau,\omega)}{\sqrt{\Phi_{jj}(\tau,\omega)\,\Phi_{kk}(\tau,\omega)}}, \quad \text{when } \rho(\tau,\omega) < 0.1\tag{4}$$

    where $\tau$ is the frame index, $\omega$ is the frequency bin, and $\Phi_{jj}(\tau,\omega)$, $\Phi_{kk}(\tau,\omega)$ and $\Phi_{jk}(\tau,\omega)$ are the recursively smoothed auto- and cross-correlated PSDs of the audio signals from the jth and kth microphones respectively. $\rho(\tau,\omega)$ is the a posteriori SPP index for the current frame and is provided to model adaptation unit 306 by SPP unit 305. $\rho(\tau,\omega)$ acts as the threshold for $\Upsilon_{pq}(\tau,\omega)$ to be updated: in practice it is preferable to only update $\Upsilon_{pq}(\tau,\omega)$ in periods where speech is absent. A suitable value for the smoothing factor $\alpha_{\gamma}$ might be 0.95.
  • The auto- and cross-correlated PSDs that are input into equation (4) can be calculated by recursive smoothing of the input signals:

    $$\Phi_{jk} = \alpha\,\Phi_{jk} + (1-\alpha)\,X_j X_k^{*}\tag{5}$$

    where $X_j \in \mathbb{C}^{\frac{N+2}{2}\times 1}$ and $X_k \in \mathbb{C}^{\frac{N+2}{2}\times 1}$ are the FFT coefficient vectors of the jth and kth channels for the current frame, the products being taken bin-wise, and $\alpha$ is a smoothing factor.
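  • The updates of equations (4) and (5) could be sketched as below. The PSD smoothing factor α = 0.9 is an assumption; αγ = 0.95 follows the suggestion above, and the ρ < 0.1 speech-absence condition follows equation (4).

```python
import numpy as np

ALPHA = 0.9         # assumed PSD smoothing factor
ALPHA_GAMMA = 0.95  # coherence smoothing factor suggested above

def smooth_psds(phi_jj, phi_kk, phi_jk, Xj, Xk, alpha=ALPHA):
    """Equation (5): recursive smoothing of auto- and cross-PSDs,
    bin-wise over the FFT coefficient vectors of channels j and k."""
    phi_jj = alpha * phi_jj + (1 - alpha) * np.abs(Xj) ** 2
    phi_kk = alpha * phi_kk + (1 - alpha) * np.abs(Xk) ** 2
    phi_jk = alpha * phi_jk + (1 - alpha) * Xj * np.conj(Xk)
    return phi_jj, phi_kk, phi_jk

def update_coherence(gamma_prev, phi_jj, phi_kk, phi_jk, rho):
    """Equation (4): update the coherence model only in bins where
    speech is judged absent (rho < 0.1)."""
    instantaneous = phi_jk / np.sqrt(phi_jj * phi_kk)
    gamma = ALPHA_GAMMA * gamma_prev + (1 - ALPHA_GAMMA) * instantaneous
    return np.where(rho < 0.1, gamma, gamma_prev)
```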
  • The model adaptation unit (306) is also configured to actively update a noise covariance matrix (also in step S405). For each narrow frequency band, the noise covariance matrix $R_{nn}$ (M×M) is recursively updated using:

    $$\hat{R}_{nn} = \alpha\,R_{nn} + (1-\alpha)\,x^{T}\,\mathrm{conj}(x), \quad \text{when } \rho < 0.1\tag{6}$$

    where x (1×M) represents the STFT coefficients of the input signals from all of the microphones in respect of frequency bin n.
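  • A corresponding sketch of equation (6), again with an assumed smoothing factor:

```python
import numpy as np

def update_covariance(Rnn_prev, x, rho_bin, alpha=0.9):
    """Equation (6): recursive M x M noise covariance update for one
    frequency bin; x is the 1 x M vector of STFT coefficients and the
    update runs only when speech is judged absent in that bin."""
    if rho_bin < 0.1:
        x = x.reshape(1, -1)
        # x^T conj(x) is the M x M outer product of the channel spectra
        return alpha * Rnn_prev + (1 - alpha) * (x.T @ np.conj(x))
    return Rnn_prev
```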
  • The model adaptation unit (306) is thus configured to establish the coherence and covariance models and update them as audio signals are received from the microphones.
  • Provided that the covariance model $R_{nn}$ and the adaptive coherence model $\Upsilon_{pq}$ are derived, they are linked as follows:

    $$R = \Phi\,\sigma\tag{7}$$

    where $R = \begin{bmatrix}\mathrm{diag}(R_{nn})\\ \mathrm{odiag}(R_{nn})\end{bmatrix}$ ($P^2$×1) and $\sigma = \begin{bmatrix}\sigma_c^2\\ \sigma_w^2\end{bmatrix}$, in which $\sigma_c^2$ represents the variance of the coherent noise and $\sigma_w^2$ is the variance of the incoherent noise. diag and odiag represent the diagonal and off-diagonal elements respectively, written in vector form:
    • diag($R_{nn}$) = [$R_{nn}$(1,1), ..., $R_{nn}$(P,P)]$^T$ and
    • odiag($R_{nn}$) = [$R_{nn}$(1,2), ..., $R_{nn}$(1,P), $R_{nn}$(2,1), ..., $R_{nn}$(P,P−1)]$^T$
    • $\Phi$ ($P^2$×2) is derived from the adaptive coherence models between the multiple pairs of microphones:

    $$\Phi = \begin{bmatrix}\mathrm{diag}(\Upsilon) & \mathbf{1}\\ \mathrm{odiag}(\Upsilon) & \mathbf{0}\end{bmatrix}\tag{8}$$

    where $\Upsilon = \begin{bmatrix}\Upsilon_{11} & \cdots & \Upsilon_{1P}\\ \vdots & \ddots & \vdots\\ \Upsilon_{P1} & \cdots & \Upsilon_{PP}\end{bmatrix}$ and $\mathbf{1} = 1_{P\times 1}$.
  • The updated models are used to generate a further noise estimate using an optimal least squares solution (step S406). The values of R and $\Phi$ are suitably transferred from the model adaptation unit (306) to the optimiser (307). In the example of Figure 3, the optimiser is configured to generate the noise estimate for higher frequencies by searching for an optimal least-squares solution to equation (7) in the minimum mean square error (MMSE) sense. This optimal solution in the MMSE sense is given by:

    $$\sigma = \mathrm{real}\left(\Phi^{\dagger} R\right)\tag{9}$$

    where $\Phi^{\dagger}$ is the Moore-Penrose pseudo-inverse of $\Phi$.
  • The overall noise PSD estimate is:

    $$\hat{\Phi}_c = \sigma_c^2 + \sigma_w^2\tag{10}$$
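  • Equations (7) to (10) amount to stacking the covariance entries into a vector, building $\Phi$ from the coherence matrix, and solving a two-unknown least-squares problem per frequency bin. A sketch, assuming the odiag ordering given above and that $\Upsilon_{pp} = 1$ on the diagonal:

```python
import numpy as np

def mmse_noise_estimate(Rnn, Gamma):
    """Equations (7)-(10): stack diag/odiag of the covariance into R,
    build Phi from the coherence matrix, and solve in the MMSE sense."""
    P = Rnn.shape[0]
    off = ~np.eye(P, dtype=bool)          # off-diagonal mask, row-major
    R = np.concatenate([np.diag(Rnn), Rnn[off]])        # P^2 x 1
    Phi = np.column_stack([
        np.concatenate([np.diag(Gamma), Gamma[off]]),   # coherent part
        np.concatenate([np.ones(P), np.zeros(P * P - P)]),  # incoherent
    ])                                                  # P^2 x 2
    # Equation (9): Moore-Penrose pseudo-inverse, real part retained
    sigma_c2, sigma_w2 = np.real(np.linalg.pinv(Phi) @ R)
    # Equation (10): overall high-frequency noise PSD for this bin
    return sigma_c2 + sigma_w2
```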
  • This approach decomposes the noise estimation problem into a series of linear equations in which every microphone signal is compared with every other microphone signal. This gives a better solution than current methods, which only compare signals from a single pair of microphones. One option for extending current methods to multiple microphones would be to pair those microphones off and estimate the overall noise by averaging the noise estimates generated by each pair of microphones. This approach is not preferred, however, as each microphone signal is then only compared with one other microphone signal. Comparing all of the microphone signals against each other results in a more accurate noise estimate.
  • The estimate selector 308 is configured to form the overall noise estimate. It receives the estimates generated by both the SPP unit (305) and the optimiser (307) ($\hat{\Phi}_S$ and $\hat{\Phi}_C$ respectively) and combines them to form the overall noise estimate (step S407).
  • Finally, the cut-off frequency is adaptively adjusted so that the two noise estimates can be combined into an overall noise estimate (also in step S407). In order to converge both the low and high frequency estimated coefficients into the final noise estimate more effectively, estimate selector 308 is configured to adaptively adjust the split frequency between the single microphone noise estimate and the multi microphone noise estimate based on the updating model in equation (4). The following scheme is one option for setting the cut-off frequency that controls the combination of the two estimated noise PSDs:

    $$\hat{\Phi}_n(\tau,\omega) = \begin{cases}\hat{\Phi}_s(\tau,\omega), & \omega < \min(f_{12},\ldots,f_{pq})\\ \hat{\Phi}_c(\tau,\omega), & \omega \ge \min(f_{12},\ldots,f_{pq})\end{cases}\tag{11}$$

    where $f_{pq}$ represents the frequency at which the magnitude squared value of the updated coherence function in equation (4) for the pqth microphone pair has some predetermined value. A suitable value might be, for example, 0.5. $f_{pq}$ varies according to the adaptive coherence model $\Upsilon$. The split frequency is selected to be the lowest frequency among the various microphone pairs at which the magnitude squared value of the coherence function has the predetermined value. This ensures that the appropriate noise estimate is selected for the speech and noise coherence properties experienced at different frequencies, meaning that problems caused by similarity and overlap between the speech and noise coherence properties can be consistently avoided for each channel.
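  • A sketch of the selection scheme of equation (11), assuming the predetermined magnitude-squared value of 0.5 mentioned above and that coherence decays with frequency, so the first bin at or below the threshold is taken for each pair:

```python
import numpy as np

def split_bin(Gamma_pairs, threshold=0.5):
    """Find the cut-off bin: for each microphone pair, the first bin
    where |coherence|^2 falls to the threshold; take the lowest."""
    cut_bins = []
    for gamma in Gamma_pairs:            # one coherence vector per pair
        below = np.nonzero(np.abs(gamma) ** 2 <= threshold)[0]
        cut_bins.append(below[0] if below.size else len(gamma))
    return min(cut_bins)

def combine_estimates(phi_s, phi_c, cut):
    """Equation (11): SPP estimate below the cut-off, MMSE estimate
    at and above it."""
    return np.concatenate([phi_s[:cut], phi_c[cut:]])
```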
  • Given the estimated noise PSD $\hat{\Phi}_n$, noise reduction can be achieved using any suitable noise reduction method, including Wiener filtering, spectral subtraction, etc.
  • The techniques described above have been tested via simulation using complex non-stationary subway scenario recordings and three microphones. The recording length was 130 seconds. The recording was processed using the adaptive cut-off frequency technique described above and a technique in which the cut-off frequency is fixed. The results are shown in Figure 5. The lower plot 502 illustrates the technique described herein, and it can clearly be seen that it has been more effective in addressing the non-stationary noise issues than the fixed cut-off frequency technique shown in upper plot 501. The processing was also more efficient: the processing time using the non-adaptive technique was 62 seconds, compared with 35 seconds for the adaptive technique.
  • It should be understood that where this explanation and the accompanying claims refer to the noise estimator doing something by performing certain steps or procedures or by implementing particular techniques, that does not preclude the noise estimator from performing other steps or procedures or implementing other techniques as part of the same process. In other words, where the noise estimator is described as doing something "by" certain specified means, the word "by" is meant in the sense of the noise estimator performing a process "comprising" the specified means rather than "consisting of" them.
  • The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims (10)

  1. A noise estimator for generating an overall noise estimate for an audio signal, wherein the noise estimator comprises microphones for capturing sounds, the sounds are represented by a plurality of audio signals comprising the audio signal, each of the plurality of audio signals is formed, at least partly, by a noise signal and comprises a plurality of spectral components, and wherein the overall noise estimate comprises, for each spectral component in the audio signal, a respective spectral noise estimate, the noise estimator comprising:
    an estimator (304) configured to generate the overall noise estimate by:
    applying a first estimation technique to the audio signal to generate spectral noise estimates for spectral components of the audio signal that are below a cut-off frequency;
    applying a different second estimation technique to the audio signal to generate, based on the plurality of audio signals, spectral noise estimates for spectral components of the audio signal that are above the cut-off frequency; and
    forming the overall noise estimate to comprise, for spectral components below the cut-off frequency, the spectral noise estimates generated using the first estimation technique and, for spectral components above the cut-off frequency, the spectral noise estimates generated using the second estimation technique; characterized by
    an adaptation unit (306) configured to adjust the cut-off frequency so as to account for changes in coherence of the noise signal that are reflected in the audio signal, wherein the adaptation unit is configured to select the cut-off frequency to be the lowest frequency at which one of the plurality of audio signals shows a predetermined degree of coherence with another of the plurality of audio signals.
  2. A noise estimator of claim 1, wherein the estimator (308) is configured to apply:
    as the first estimation technique, a technique that is adapted to a coherence of the noise signal that is expected to predominate in the audio signal below the cut-off frequency; and
    as the second estimation technique, a technique that is adapted to a coherence of the noise signal that is expected to predominate in the audio signal above the cut-off frequency.
  3. A noise estimator as claimed in claim 1 or 2, wherein the estimator is configured to generate the spectral noise estimates for spectral components above the cut-off frequency using an optimisation function that takes the plurality of audio signals as inputs.
  4. A noise estimator as claimed in any of claims 1 to 3, wherein the estimator is configured to generate the spectral noise estimates for spectral components above the cut-off frequency by comparing each of the plurality of audio signals with every other of the plurality of audio signals.
  5. A noise estimator as claimed in any of claims 1 to 4, wherein the estimator is configured to generate the spectral noise estimates for spectral components above the cut-off frequency in dependence on the coherence between each of the plurality of audio signals and every other of the plurality of audio signals.
  6. A noise estimator as claimed in any of claims 1 to 4, wherein the estimator is configured to generate the spectral noise estimates for spectral components above the cut-off frequency in dependence on a covariance between each of the plurality of audio signals and every other of the plurality of audio signals.
  7. A noise estimator as claimed in any preceding claim, wherein the estimator (304) is configured to generate the spectral noise estimates for spectral components below the cut-off frequency in dependence on a single audio signal that is representative of the noise signal.
  8. A noise estimator as claimed in any preceding claim, wherein the estimator (304) is configured to generate the spectral noise estimates for spectral components below the cut-off frequency and/or the spectral noise estimates for spectral components above the cut-off frequency by applying the respective first or second estimation technique only to parts of the audio signal that are determined not to comprise speech.
  9. A method for generating an overall noise estimate of an audio signal using a noise estimator which comprises microphones for capturing sounds, the sounds being represented by a plurality of audio signals comprising the audio signal, wherein each of the plurality of audio signals is formed, at least partly, by a noise signal and comprises a plurality of spectral components, and wherein the overall noise estimate comprises, for each spectral component in the audio signal, a respective spectral noise estimate, the method comprising:
    applying (S202) a first estimation technique to the audio signal to generate spectral noise estimates for spectral components of the audio signal that are below a cut-off frequency;
    applying (S203) a different second estimation technique to the audio signal to generate spectral noise estimates for spectral components of the audio signal that are above the cut-off frequency;
    forming (S204) the overall noise estimate to comprise, for spectral components below the cut-off frequency, the spectral noise estimates generated using the first estimation technique and, for spectral components above the cut-off frequency, the spectral noise estimates generated, based on the plurality of audio signals, using the second estimation technique; and
    adjusting (S205) the cut-off frequency so as to account for changes in coherence of the noise signal that are reflected in the audio signal, wherein the cut-off frequency is selected to be the lowest frequency at which one of the plurality of audio signals shows a predetermined degree of coherence with another of the plurality of audio signals.
  10. A non-transitory machine-readable storage medium having stored thereon processor-executable instructions implementing a method for generating an overall noise estimate of an audio signal using a noise estimator which comprises microphones for capturing sounds, the sounds being represented by a plurality of audio signals comprising the audio signal, wherein each of the plurality of audio signals is formed, at least partly, by a noise signal and comprises a plurality of spectral components, and wherein the overall noise estimate comprises, for each spectral component in the audio signal, a respective spectral noise estimate, the method comprising:
    applying (S202) a first estimation technique to the audio signal to generate spectral noise estimates for spectral components of the audio signal that are below a cut-off frequency;
    applying (S203) a different second estimation technique to the audio signal to generate spectral noise estimates for spectral components of the audio signal that are above the cut-off frequency;
    forming (S204) the overall noise estimate to comprise, for spectral components below the cut-off frequency, the spectral noise estimates generated using the first estimation technique and, for spectral components above the cut-off frequency, the spectral noise estimates generated, based on the plurality of audio signals, using the second estimation technique; and
    adjusting (S407) the cut-off frequency so as to account for changes in coherence of the noise signal that are reflected in the audio signal, wherein the cut-off frequency is selected to be the lowest frequency at which one of the plurality of audio signals shows a predetermined degree of coherence with another of the plurality of audio signals.
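
Editor's illustration (not part of the granted claims): a minimal sketch, in Python with NumPy/SciPy, of how the adaptation unit of claims 1 and 9 might select the cut-off frequency as the lowest frequency at which any pair of captured signals reaches a predetermined degree of coherence. The threshold gamma and the segment length nperseg are assumed values, not figures from the patent.

    import numpy as np
    from scipy.signal import coherence

    def select_cutoff(x, fs, gamma=0.7, nperseg=512):
        # x: (n_channels, n_samples) array of time-aligned microphone signals.
        # Returns the lowest frequency at which one signal shows coherence of
        # at least gamma (assumed threshold) with another signal, or None if
        # no channel pair ever reaches the threshold.
        n_ch = x.shape[0]
        best = None
        for i in range(n_ch):
            for j in range(i + 1, n_ch):
                f, cxy = coherence(x[i], x[j], fs=fs, nperseg=nperseg)
                hits = np.flatnonzero(cxy >= gamma)
                if hits.size:
                    best = f[hits[0]] if best is None else min(best, f[hits[0]])
        return best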
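On the same hedged basis, a sketch of the estimator's dual-band operation. The claims leave the two estimation techniques open; here minimum statistics stands in for the single-channel first technique applied to one representative signal (cf. claim 7), and a Zelinski-style incoherent-noise estimate built from the pairwise cross-spectra of claims 4 to 6 stands in for the multi-channel second technique. The constants alpha, win and nperseg are illustrative choices.

    import numpy as np
    from scipy.signal import stft

    def dual_band_noise_estimate(x, fs, f_cut, nperseg=512, alpha=0.9, win=50):
        # Overall noise estimate: one spectral noise estimate per STFT bin,
        # generated with the first technique below f_cut and the second
        # technique above it, then merged (claims 1 and 9).
        f, _, X = stft(x, fs=fs, nperseg=nperseg)   # X: (n_ch, n_freq, n_frames)
        psd = np.abs(X) ** 2

        # First technique (assumed: minimum statistics) on a single channel
        # that is representative of the noise signal: rolling minimum of a
        # recursively smoothed periodogram.
        smoothed = np.empty_like(psd[0])
        smoothed[:, 0] = psd[0, :, 0]
        for t in range(1, psd.shape[2]):
            smoothed[:, t] = alpha * smoothed[:, t - 1] + (1 - alpha) * psd[0, :, t]
        low = np.empty_like(smoothed)
        for t in range(smoothed.shape[1]):
            low[:, t] = smoothed[:, max(0, t - win + 1):t + 1].min(axis=1)

        # Second technique (assumed: Zelinski-style): compare each signal
        # with every other via pairwise cross-spectra. Speech is coherent
        # across the microphones while the noise is taken as incoherent, so
        # mean auto-PSD minus mean pairwise cross-PSD leaves the noise.
        n_ch = X.shape[0]
        auto = psd.mean(axis=0)
        cross = np.zeros_like(auto)
        for i in range(n_ch):
            for j in range(i + 1, n_ch):
                cross += np.real(X[i] * np.conj(X[j]))
        high = np.maximum(auto - cross / (n_ch * (n_ch - 1) // 2), 0.0)

        # Merge at the adaptively chosen cut-off frequency.
        return f, np.where(f[:, None] < f_cut, low, high)

A caller would typically chain the two sketches: f_cut = select_cutoff(x, fs), with a fall-back value if no pair reaches the threshold, followed by dual_band_noise_estimate(x, fs, f_cut). Claim 8's optional gating would further restrict both updates to frames that a voice-activity detector marks as noise-only.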
EP16784821.7A 2016-10-12 2016-10-12 Apparatus and method for generating noise estimates Active EP3516653B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2016/074462 WO2018068846A1 (en) 2016-10-12 2016-10-12 Apparatus and method for generating noise estimates

Publications (2)

Publication Number Publication Date
EP3516653A1 (en) 2019-07-31
EP3516653B1 (en) 2021-08-11

Family

ID=57184415

Family Applications (1)

Application Number Title Priority Date Filing Date
EP16784821.7A Active EP3516653B1 (en) 2016-10-12 2016-10-12 Apparatus and method for generating noise estimates

Country Status (2)

Country Link
EP (1) EP3516653B1 (en)
WO (1) WO2018068846A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007026827A1 (en) * 2005-09-02 2007-03-08 Japan Advanced Institute Of Science And Technology Post filter for microphone array
US8428275B2 (en) * 2007-06-22 2013-04-23 Sanyo Electric Co., Ltd. Wind noise reduction device
US9131307B2 (en) * 2012-12-11 2015-09-08 JVC Kenwood Corporation Noise eliminating device, noise eliminating method, and noise eliminating program
KR101630155B1 (en) * 2014-09-11 2016-06-15 현대자동차주식회사 An apparatus to eliminate a noise of sound, a method for eliminating a noise of a sound, a sound recognition apparatus using the same and a vehicle equipped with the sound recognition apparatus

Also Published As

Publication number Publication date
WO2018068846A1 (en) 2018-04-19
EP3516653A1 (en) 2019-07-31

Similar Documents

Publication Title
Thiergart et al. An informed parametric spatial filter based on instantaneous direction-of-arrival estimates
US9768829B2 (en) Methods for processing audio signals and circuit arrangements therefor
US8958572B1 (en) Adaptive noise cancellation for multi-microphone systems
US9185487B2 (en) System and method for providing noise suppression utilizing null processing noise subtraction
KR101726737B1 (en) Apparatus for separating multi-channel sound source and method the same
US20160066087A1 (en) Joint noise suppression and acoustic echo cancellation
US11631421B2 (en) Apparatuses and methods for enhanced speech recognition in variable environments
US20100217590A1 (en) Speaker localization system and method
US8682006B1 (en) Noise suppression based on null coherence
Braun et al. Dereverberation in noisy environments using reference signals and a maximum likelihood estimator
KR20130108063A (en) Multi-microphone robust noise suppression
Kodrasi et al. Joint dereverberation and noise reduction based on acoustic multi-channel equalization
US8761410B1 (en) Systems and methods for multi-channel dereverberation
WO2009130513A1 (en) Two microphone noise reduction system
CN110211602B (en) Intelligent voice enhanced communication method and device
US20200286501A1 (en) Apparatus and a method for signal enhancement
Nelke et al. Dual microphone noise PSD estimation for mobile phones in hands-free position exploiting the coherence and speech presence probability
Thiergart et al. An informed MMSE filter based on multiple instantaneous direction-of-arrival estimates
US11380312B1 (en) Residual echo suppression for keyword detection
US9875748B2 (en) Audio signal noise attenuation
EP3516653B1 (en) Apparatus and method for generating noise estimates
Nordholm et al. Assistive listening headsets for high noise environments: Protection and communication
Lee et al. Channel prediction-based noise reduction algorithm for dual-microphone mobile phones
Esch et al. Combined reduction of time varying harmonic and stationary noise using frequency warping
US11462231B1 (en) Spectral smoothing method for noise reduction

Legal Events

Code  Description

STAA  Status of the EP application/patent: UNKNOWN
STAA  Status of the EP application/patent: THE INTERNATIONAL PUBLICATION HAS BEEN MADE
PUAI  Public reference made under Article 153(3) EPC to a published international application that has entered the European phase (ORIGINAL CODE: 0009012)
STAA  Status of the EP application/patent: REQUEST FOR EXAMINATION WAS MADE
17P   Request for examination filed (effective date: 20190424)
AK    Designated contracting states (kind code of ref document: A1): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
AX    Request for extension of the European patent (extension states: BA ME)
DAV   Request for validation of the European patent (deleted)
DAX   Request for extension of the European patent (deleted)
GRAP  Despatch of communication of intention to grant a patent (ORIGINAL CODE: EPIDOSNIGR1)
STAA  Status of the EP application/patent: GRANT OF PATENT IS INTENDED
INTG  Intention to grant announced (effective date: 20210219)
GRAS  Grant fee paid (ORIGINAL CODE: EPIDOSNIGR3)
GRAA  (Expected) grant (ORIGINAL CODE: 0009210)
STAA  Status of the EP application/patent: THE PATENT HAS BEEN GRANTED
AK    Designated contracting states (kind code of ref document: B1): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
REG   References to national codes: CH EP; DE R096 (ref document 602016062070); IE FG4D; AT REF (ref document 1420209, kind code T, effective 20210915); LT MG9D; NL MP (effective 20210811); AT MK05 (ref document 1420209, kind code T, effective 20210811); DE R097 (ref document 602016062070); CH PL; BE MM (effective 20211031)
PG25  Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Ground (a) is failure to submit a translation of the description or to pay the fee within the prescribed time limit; ground (b) is non-payment of due fees:
      Ground (a), effective 20210811: AL, AT, CY, CZ, DK, EE, ES, FI, HR, IT, LT, LV, MC, MK, NL, PL, RO, RS, SE, SI, SK, SM
      Ground (a), effective 20211111: BG, NO
      Ground (a), effective 20211112: GR
      Ground (a), effective 20211213: PT
      Ground (a), effective 20161012, invalid ab initio: HU
      Ground (b), effective 20211012: IE, LU
      Ground (b), effective 20211031: BE, CH, FR, LI
PLBE  No opposition filed within time limit (ORIGINAL CODE: 0009261)
STAA  Status of the EP application/patent: NO OPPOSITION FILED WITHIN TIME LIMIT
26N   No opposition filed (effective date: 20220512)
P01   Opt-out of the competence of the Unified Patent Court (UPC) registered (effective date: 20230524)
PGFP  Annual fee paid to national office: GB (payment date 20230831, year of fee payment 8); DE (payment date 20230830, year of fee payment 8)