Technical field of the invention
Background of the invention
This invention relates to enhancing of desired sound signals such as speech signals and to suppression of noise and interfering sound sources by using an array of microphones and adaptive beam forming.
In traditional verbal communication such as telephone communication a single microphone is located near the speaker's mouth. The microphone can be part of a handset or of a headset. In hands free verbal communication the microphone or microphones will usually be in a position more remote to the speaker's mouth than in handheld communication. Hands free communication includes the situations where eg a telephone handset or a microphone is placed on a conference table for picking up the speech of several speakers, and where a microphone is mounted in a car for picking up the speech of a person in the car, typically of the driver. In hands free communication the microphone will usually be placed in a position more remote from the speaker's mouth than in handheld communications, whereby the received speech signal from the speaker will be weaker, which in turn reduces the signal-to-noise ratio, and the intelligibility of the speech will be reduced.
With pure frequency or narrow frequency-band signals from a fixed point source beam forming can be performed with a set of microphones so arranged in space that the sum of the microphone signals will have a maximum response at the source. Furthermore, by individually delaying the microphone signals the point of maximum response can be moved electronically.
If the point source emits a wide frequency-band signal such as a speech signal, the beam former must be able to delay a plurality of frequencies individually. This is called wide frequency-band beam forming, and signals from a certain location are allowed to pass the system, whereas signals outside that location are attenuated or even cancelled. A common way of achieving this is to use digital linear filters at each microphone signal.
In general, wide frequency-band beam forming methods make use of fundamental properties of the spatial and/or the temporal distribution of both the speech source and the noise sources in order to improve the speech signal quality. Roughly speaking, beam-forming methods are either fixed or adaptive. Fixed beam formers are fundamentally based on modelled assumptions on the speech signal and the noise field. Based on the assumed model optimal beam formers can be constructed. Optimal function is only guaranteed for perfect model matching. Adaptive beam formers are used to track variations and to compensate for model mismatch. Generally, adaptive beam formers are based on continuous estimates of spatial and statistical information contained in the received speech and noise signal. In general, they are more complex to implement and require complex computations.
It is known to use an array of microphones for speech enhancement by exploiting fundamental properties about spatial and temporal distribution of both the speech and the noise sources. Existing methods for broadband adaptive processing of signals from an array of microphones involve highly complex computing routines, and distortion is introduced in the speech signal. Furthermore they are sensitive to model mismatch. The most common methods also include a voice activity detector (VAD) or a double talk detector (DTD), which will substantially degrade performance due to difficulties in their exact implementation.
Optimal beam formers exist such as the Signal-to-Noise plus Interference Beam former (SNIB) , and the Minimum Mean Squares Error Beam former (MMSEB).
For the Signal-to-Noise plus Interference Beam former (SNIB) the output signal-to-noise plus interference ratio (SNIR) is defined as
and the beam former that maximises the ratio Q, is the optimal Signal-to-Noise plus Interference Beam former (SNIB). The mean signal output power is expressed as a function of the filter weights in the beam former, and the optimal weights, which maximises the output signal-to-noise plus interference ratio Q will have to be defined.
The optimal Minimum Mean Squares Error Beam former (MMSEB) is defined as the beam former that minimizes the mean squares difference between the beam former output when all sources are active, and a single microphone observation, when only the signal of interest is present.
Summary of the invention
Beam formers can be adequately described both in the time-domain and in the frequency-domain, since the measuring unit of frequency (s-1) is the inverse of the measuring unit of time (s).
The invention uses an array of two or more microphones, and the broadband signal from each microphone is passed through a bank of band pass filters separating each broadband signal into several frequency sub-bands. The band pass filtered signals pertaining to the same frequency sub-band are beam formed in adaptive beam formers. Finally, from the adaptively beam formed frequency sub-band signals a single beam formed broadband signal is reconstructed. The sensitivity of the microphone array is thereby focused in a region including the desired sound source, whereby sound signals originating outside that region are suppressed. The beam forming is performed in frequency sub-bands, which requires less computational power than broadband beam forming. The method thus lends itself to the use in mobile communications devices such as mobile telephones.
With the invention the spatial selectivity of the microphone array is beam formed to the speaker, whereby disturbing sound sources are suppressed. Point sources that can be suppressed include speaking persons other than the user, and loudspeakers such as a loudspeaker of a hands free communications device and the loudspeakers of a car stereo system and even moving sources. Diffuse fields that can be suppressed include ambient noise and reverberation in the room.
Brief description of the drawing
The invention requires information on the desired signal source. Such information is acquired at times when there is reason to believe that only insignificant noise is present, and the system then records sound signals from the desired sound source alone, eg speech from a speaker, and calculates and stores an estimate of the source correlation matrix and an estimate of the source signal cross correlation vector. The acquisition of the source signal alone can be done in an initial phase where the user selects a mode of operation therefor, or the device using the invention can use eg a voice activity detector to detect voice activity of a speaker, and at instances with only insignificant noise the speaker's voice is recorded and analysed.
Detailed description of the invention
- Figure 1 shows a schematic block diagram of an apparatus using the method of the invention.
In figure 1 a first plurality of I microphones M1 - Ml are arranged for receiving sound signals from a speaker. The plurality of microphones have fixed positions relative to each other thus forming a fixed array, and their number I will be adapted to the actual use. Thus, eg in a handset there will be relatively few microphones such as two, and in a more stationary installation such as in a car a higher number of microphones will typically be used such as four, six or more. In principle any type of microphone can be used. Each microphone outputs an electrical signal representing the sound signal received by the microphone. The electrical output signal from the microphones can be an analogue signal or a digital signal, and in the latter case the microphones will have an analogue-to-digital (A/D) converter.
The output signal of each microphone is input to an individual one of a first plurality of I filter banks, which are capable of receiving and processing the signals in analogue or digital form as output by the microphones. In principle, there is one filter bank for each microphone, but by analogue or digital multiplexing methods one filter bank can be used for several microphones.
In principle, all filter banks are identical, and each filter bank has a second plurality of K band pass filters where each band pass filter lets a predetermined band of frequencies pass through the filter, so that each filter covers a distinct band of frequencies derived from the microphone output signal. Together the plurality of band pass filters in each bank covers the whole frequency range of interest. Each bank of band pass filters thus outputs a second plurality of K band pass filtered signals covering distinct frequency bands.
The construction with I filter banks each having K band pass filters lends it self to implementation in digital circuits, and in practice the illustrated banks of parallel band pass filters will be implemented using digital signal processing.
A second plurality of K beam formers each receive a first plurality of I inputs of band pass filtered signals, ie one from each filter bank, where the I band pass filtered signals all cover the same frequency band. In each frequency-band the corresponding beam former focuses the sensitivity of the microphone array to a beam or region including the desired sound source, which in this case is a speaking person.
The outputs of the K beam formers are fed as input signals to a unit in which a single channel wide frequency-band output signal is reconstructed by combining the K beam formed signals.
In the above description the block diagram in figure 1 is used as an illustration for explaining the invention. The invention is preferably implemented in software controlled digital circuits, in which case the illustrated individual blocks in figure 1 will not be distinct from each other. Rather, the software controlled digital circuits will perform the described operations.
In case of stationary signal conditions and when microphone positions and the speaker's position relative to the microphone array are known and accurate, prior art methods of beam forming may be sufficient, such as the methods described in ,  and . However, when surrounding noise and/or additional disturbing noise sources are changing with time, an adaptive structure or method will perform better and is preferred in order to make use of changing spatial and temporal signal properties.
The invention uses a new algorithm called the Calibrated Weighted Recursive Least Squares (CW-RLS) algorithm. Characteristic features of this algorithm are the introduction of a weighting factor that is used on the observed signal correlation matrix estimates and the use of pre-calculated source correlation estimates. The invention proposes an efficient method of recursively updating estimates.
The MMSE optimal beam former weighting factors in frequency sub-band k
can be expressed as follows
is the frequency sub-band number, N
is the sample number, y (k) (n)
is the output of the beam former, and s (n)
is reference information on the signal source alone. This information is not directly available, or it cannot be expected to be available, and it therefore has to be measured and estimated separately. This is done in a calibration sequence with no or only insignificant background noise, ie the desired signal from the desired signal source such as a speaking person alone. This calibration signal will represent the temporal and spatial information about the desired source. Since the source signal information S (n)
is independent on the actually received sound signals, the Least Squares problem can be separated into two parts:
where the first part can be calculated in advance. Calculating the sum yields
where the estimated source correlation matrix for frequency sub-band k
can be pre-calculated in the calibration phase as
and the estimated source signal spatial cross correlation vector for frequency sub-band k
where s (k) (n)
are actually received microphone signals when the desired signal source is active alone, ie in the calibration phase. The least squares minimization of equation (3) is found by
is the observed/measured signal correlation matrix for frequency sub-band k
. This means that an estimate of the calibration data can be used in the algorithm.
In the following the Calibrated Weighted Recursive Least Squares (CW-RLS) algorithm is derived. It is a characteristic feature of the CW-RLS algorithm that it introduces a weighting factor on the observed signal correlation matrix estimates, and also uses pre-calculated source correlation estimates in equation (4). The algorithm also achieves this update recursively.
An exponential weighting factor λ, 0 < λ < 1, which may also be referred to as a "forgetting factor", is introduced in the second part of equation (2) according to
where 1 ≤ r
. An algorithm is thereby obtained that follows the statistical variations in the observed sound signals or data, when the beam former operates in a non-stationary environment. Calculation of the sum gives the same relationship as in equation (3), but the correlation matrix estimates of the observed data will be in accordance with the following equation:
where the least squares solution is given by equation (5). Since the correlation estimates from the calibration sequence, equation (4), are gathered in advance, a recursive update formula for each sample of the new data observation vector x (k) (n)
can be formulated.
First, the total correlation matrix is introduced:
where it is desired to recursively update the inverse of this matrix. This is done using the Matrix-Inversion Lemma . The total correlation matrix is updated at sample instance n
The effect of this updating is that the total correlation matrix is weighted, and that both the rank one "correction term" x (k) (n)x (n)
and the fraction (1-λ) of the estimated source correlation matrix, or calibration correlation matrix, R̂
) multiplied by the weighting factor, are added. This updating can be implemented directly in a two-step procedure using the Matrix-Inversion Lemma , however, this will require complex power and time-consuming calculations of the inverse of the estimated source correlation matrix R̂
) for each step of updating. Such complex power and time-consuming calculations are undesirable.
One way to circumvent the matrix inversion and thus substantially reduce the complexity of the calculations is to update the total correlation matrix by adding scaled eigenvectors of the estimated source correlation matrix, which will result in several, and simpler, rank one updates as
is the p:th eigenvalue, and q
is the p:th eigenvector of the |-by-| estimated source correlation matrix R̂
) The weighted optimal recursive least squares solution at sample instant n
is then given by
where the calibration correlation vector r̂
) is gathered and estimated in advance and is assumed to be uncorrelated with the observed data.
One simple way to further reduce the complexity is sequentially adding one scaled eigenvector at each sample instance. This is easily achieved by replacing the index ρ
in equation (10) by the single index ρ
= (n mod I
) + 1. When the statistical properties of the environmental noise change abruptly, eg when a new source of disturbance suddenly appears, a smoothing of the weights may be appropriate. A first order auto regressive (AR) model is preferred for the smoothing, and the weight update then becomes
where α is the AR-parameter, which corresponds to the real-valued pole of the AR-model. By applying the Matrix-Inversion Lemma in order to update the inverse of the total correlation matrix an efficient implementation of the algorithm of the invention is obtained.
The method of beam forming according to the invention has two main phases. In the first phase, information on the desired signal source alone is gathered. The first phase will be referred to as the calibration phase. The second phase is referred to as the operation phase.
In the calibration phase, in principle, the desired signal source such as a speaking person is active alone. With the arrangement in figure 1 sound signals from a speaking person are captured by the array of microphones. Output signals from the first plurality l
microphones are filtered in the filter banks, whereby for each microphone a second plurality K
of band pass filtered signals covering distinct frequency bands are derived from the microphone signals. For each of the distinct frequency sub-bands the estimated source correlation matrix R̂
is calculated and stored in a memory. This is done simultaneously for all frequency sub-bands. When there are other known disturbing or undesired sound sources, eg a loudspeaker for hands-free operation of the telephone, with a fixed position relative to the microphone array, correlation estimates from these signals are added to the stored estimated source correlation matrix R̂
An estimated source signal spatial cross correlation vector r̂
for frequency sub-band k
is also calculated and stored.
The calibration may be performed in an initial phase and in quiet environments, but the system will preferably perform the calibration also during the operation phase at times where there are reasons to believe that undesired sounds such as background noise are of minor importance and have a negligible influence on the estimation of the source correlation matrix. Thereby the estimated source correlation matrix R̂
is updated currently. Methods of detecting such times for updating the estimated source correlation matrix are known and involve eg a voice activity detector (VAD).
The calibration is performed as follows. With the desired signal source, eg a speaking person, as the only or at least the dominating sound source, the microphone signal vector
is calculated for each of the K
distinct frequency sub-bands. The estimated source signal spatial cross correlation vector for each frequency sub-band k
and for each sample n
is calculated as follows:
and the estimated source correlation matrix is calculated as follows:
With all known undesired disturbing sound sources being active the corresponding microphone signal vector
is calculated for each of the K
distinct frequency sub-bands, and correspondingly the estimated undesired disturbance correlation matrix is calculated as follows:
The correlation matrices are stored in diagonalized form:
The eigenvectors are denoted
and the eigenvalues are denoted
For each frequency sub-band k,
the eigenvectors q
, the eigenvalues γ
, and the cross correlation vector r̂
(N), where 0≤ k
-1, are stored in memory for subsequent use in the operation phase.
In the operation phase it is thus assumed that, for each of the K
distinct frequency sub-bands, the estimated source correlation matrix R̂
and the estimated source signal spatial cross correlation vector r̂
are stored and available, either from a separate initial calibration phase or from a more recent updating.
In each of the K frequency sub-bands, the following variables are used:
- beam former weighting vectors:
- the inverse of the total correlation matrix variable at time instant n, for frequency sub-band number k: P , Initialize as:
- the forgetting factor λ for the WRLS algorithm and a weight smoothing factor α for the weighting factor update, which are preferably both chosen as constants for all frequency sub-bands:
The algorithm is as follows. When any two or more of the sound sources (speech and noise) are active simultaneously, the following quantities are computed for each sample n
and for each frequency sub-band k
where index p
For each frequency sub-band k
the output from the corresponding beam former then is:
The algorithm is then repeated for the next sample n+1, etc.
In the operation phase the microphone output signals are continuously decomposed into discrete frequency sub-bands. The sub-band weighting factors w (k) are updated by making use of both the stored correlation estimates and of the actual microphone observations. The totality of beam former output signals, which each is a beam formed frequency sub-band time-domain signal, are input to a reconstruction filter bank. The output of the reconstruction filter bank is taken as the estimate of the sound signal from the desired sound source. Once the correlation estimates are stored in the memory, the algorithm is continuously adapting.
The algorithm contains a step in which a rank one update of the correlation matrix is performed using scaled eigenvectors, one eigenvector for each new input data vector. This step adds correlation estimates from the source signal, whereby information gathered in the acquisition or calibration phase will remain a constant part of the correlation matrix, while the contributions form the undesired environmental noise will be subject to the forgetting factor λ in the estimates.
Good performance of the algorithm of the invention requires that the number K of frequency sub-bands is large enough for the frequency-domain representation to be accurate. In other words, the number of frequency sub-bands is proportional to the length of the equivalent time-domain filters (filter length = the number of parameters used in a digital filter), and the number of degrees of freedom in the beam formers increases with the number of frequency sub-bands. Also, the delay caused by the frequency transformations is related to the number of frequency sub-bands.
Time-domain and frequency-domain representations are closely related, and the algorithm of the invention can easily be extended to a combination of time-domain and frequency-domain representations. Each frequency sub-band signal can also be regarded as a time-domain signal samples at a reduced sampling rate, ie proportional to the frequency sub-band bandwidth, and having substantially only the frequencies in the sub-band. By applying the time-domain algorithm in each frequency sub-band, the degrees of freedom for the band-pass filters are increased, while the number of sub-bands may be kept constant. The lengths of the sub-band filters may differ between sub-bands, and the consequence is that a multi-resolution sub-band identification is obtained.
The extension of the algorithm is achieved simply by using more lags from the observed microphone signals x (n), 1 ≤i ≤I,
when defining the input vector
where each element consists of L (k)
and the weight factors for each sub-band are similarly extended as
where each element consists of L(k)
The definitions of the correlation matrix, the eigenvalue and eigenvector matrices follow directly, and the size of the matrices will be increased by the factor L (k) · L (k) for frequency sub-band number k. The amount of memory needed to store the eigenvectors and the eigenvalues increases with increasing number of sub-band used.
The band-pass filtering or decomposition of the broadband microphone signals into frequency sub-bands is preferably done using a uniform Discrete Fourier Transform (DFT), eg as described in , to decompose the full-rate sampled signals x i (n) into K sub-band signals. The sub-bands are preferably created in such a way that a prototype filter with a low-pass characteristic is used to ensure that the response from the k-th sub-band is the same as that of the prototype filter, although centred at a normalized frequency 2πk/K, whereby the set of K sub-bands will cover the whole frequency range. In a modulated filter bank the prototype filter equals one of the filters in the bank (usually the first filter), and the other filters are modulated versions of the prototype filter. Thus only the prototype filter needs to be created. Each sub-band signal is thereby represented in the base band. The filter bank should cover the entire frequency range, and a redundant, ie over-complete, frequency sub-band decomposition and reconstruction should be used. The invention is not limited to using uniformly distributed frequency sub-bands or a modulated filter bank.
In order to reduce aliasing between sub-bands, the sub-band decomposition is made over-sampled.
In the reconstruction filter bank a time domain signal is reconstructed or synthesized from the beam formed frequency sub-band signals from the beam formers. The beam formed frequency sub-band signals are up-converted from the base band to the actual frequency band, and summed in the reconstruction filter.
List of terms used
In practical use, eg for hands-free operation of a mobile telephone in a car, the microphone array can eg have six microphones in a linear configuration with a spacing of 50 mm between microphones, and the microphone array can be placed at a nominal distance of eg 350 mm from the speaker's mouth. The number K of band pass filters in each filter bank can be in the range from 32 or less to 256 or more, typically 64.
= total number of frequency sub-bands
= index number of frequency sub-band, 0 ≤ k
= number of microphone channels, ie number of microphones in the array
= microphone index in the array, 1 ≤ i
= number of samples
= time signal sample number, sample instant
( (n) =
observed signal from microphone i
in frequency sub-band k
at sample n
= output signal from beam formers
z e j2πf
= estimated source correlation matrix for frequency sub-band k
estimated source signal spatial cross correlation vector for frequency sub-band k
= observed/measured signal correlation matrix for frequency subband k
= estimated noise (disturbance) correlation matrix for frequency subband k
= beam former weighting factors for frequency sub-band k
λ = forgetting factor, 0 < λ< 1, typical value: λ= 0.99
α = weight smoothing factor, typical value: α = 0.01
-  J. E. Hudson, "Adaptive Array Principles", Peter Peregrinus Ltd., 1991, ISBN 0-86341-247-5.
-  S. Haykin, "Adaptive Filter Theory", Prentice Hall International, Inc., 1996, ISBN 0-13-397985-7.
-  M. M. Goulding, J. S. Bird, "Speech Enhancement for Mobile Telephony", IEEE Transactions on Vehicular Technology, vol. 39, no. 4, pp. 316-326, November 1990.
-  S. Nordholm, V. Rehbock, K. L. Toe, S. Nordebo, "Chebyshev Optimization for the Design of Broadband Beam formers in the Nearfield", IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, pp. 141-143, 1998.
-  S. Nordholm, Y. H. Leung, "Performance Limits of the Generalized Sidelobe Cancelling Structure in an Isotropic Noise Field", Journal of the Acoustical Society of America, vol. 107, no. 2, pp. 1057-1060, February 2000.
-  Mitra, Sanjit Kumar, "Digital Signal Processing", McGraw-Hill Companies Inc., 1998, ISBN 0-07-042953-7.