EP1104925A1

EP1104925A1 - Method for processing speech signals by substracting a noise function

Info

Publication number: EP1104925A1
Application number: EP99124195A
Authority: EP
Inventors: Jens Erik Pedersen
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 1999-12-03
Filing date: 1999-12-03
Publication date: 2001-06-06

Abstract

The invention relates to a method for processing speech signals by subtracting a noise function. The method comprises the transformation of noise from a time domain into another domain, preferably a frequency domain, using FFT, and the weighting of the noise spectrum to generate a noise signal using the signal-to-noise ratio of the speech signal, or an upper and lower limit for a noise amplitude in the frequency domain, or an analysis of a noise type, or an envelope of the speech signal in the frequency domain. The weighted noise signal is then subtracted from the speech signal spectrum to generate a speech signal with reduced noise. After transforming the speech signal from the frequency domain to the time domain, speech coding is performed on the speech signal with reduced noise. The invention improves the speech quality by avoiding the removal of too much noise from the speech signal.

Description

Prior Art

The invention relates to methods for processing noisy speech signals by subtracting a noise function in accordance with the generic class of the independent claims.
In chapter 8, titled "Speech Enhancement", of J.R. Deller, J.G. Proakis and J.H.L. Hansen: "Discrete Time Processing of Speech Signals", Macmillan Publishing Company, 1993, a method for weighting a noise signal in the frequency domain for a speech signal is presented. The weighting is performed using predefined values that are independent of time, of frequency, of properties of the speech signal and of properties of the noise signal. This weighting leads to a reduction of musical tones.

Advantages of the Invention

The methods for processing speech signals by subtracting a noise function having the characterising features of the independent claims have the advantage that the weighting of the noise signal depends either on amplitudes of an acquired signal spectrum or on a noise type. Because the weighting depends on the acquired signal spectrum, a dependence on both frequency and time is achieved simultaneously. An improved speech quality is therefore realised by minimising artefacts such as musical tones. Thus, if the invention is used in mobile phones, better-quality phone calls will be the result. Other applications such as speech recognition and audio recording also benefit from the invention, since the reduction of artefacts improves the speech recognition or the audio recording.
The invention is improved by detecting a noise type of the detected noise spectrum, for example noise in a car, noise in an office or noise in the street. This knowledge is used to adapt the noise signal to its noise type. In this way, an improved weighting of the noise signal is achieved, and it will lead to an improved noise reduction of the speech signal.
The features of the dependent claims enable further improvements of the invention.
It is an advantage to use an upper and a lower limit for the weighting of amplitudes of the noise signal. Amplitudes of the spectral noise signal exceeding the upper limit are set to the upper limit, so that this method avoids musical tones. This leaves a small part of the original noise in the final signal, so that the listener can still hear what environment the speaker is in. Amplitudes of the spectral noise signal below the lower limit are set to the lower limit. This is an easy and effective method for improving the noise signal.
Furthermore, it is an advantage of the invention to use a signal-to-noise ratio of the speech signal to determine a weighting of the noise signal. For a very low signal-to-noise ratio, it is useless to perform a noise reduction because the signal is so weak as compared to the noise, so that a noise reduction would introduce unwanted audible effects, musical tones, because the noise signal is mixed up with the speech signal. If the signal-to-noise ratio is very high, then noise reduction will not be necessary, since the signal quality is already very high, so that a further reduction of the noise would not improve the quality of reproduced speech signals. It is an advantage to use a predefined weighting function for the noise signal for those speech signals having a signal-to-noise ratio between the lower and the upper limit. In this way, a signal quality dependent noise reduction is performed.
It is an advantage to use an envelope of the signal spectrum containing speech and noise as a lower limit for weighting factors for the spectral noise signal, while the envelope itself is weighted by predefined factors which stem from listening tests with test persons. The envelope of the speech signal is used to avoid a too large noise reduction where the speech signal has its energy. A frequency in the speech signal with low energy will have lower weighting than a frequency with high energy. Thus, the method leads to a signal strength dependent reduction of noise over frequency and time.
It is an advantage that a method of acquiring a spectral noise signal is used either during speech or when no speech is present. If the spectral noise signal is acquired during speech, it will be an excellent spectral noise signal because it is very close in time to the speech signal for which it is used for noise reduction. If the spectral noise signal is estimated during a time interval without speech, the processing of the spectral noise signal will be straightforward and very easy, since the noise is not masked by the speech signal.

Drawing

Exemplary embodiments of the invention are shown in the figures and elucidated in detail in the description below.
Figure 1 shows a block diagram of a speech signal processing, figure 2 shows a flow chart of a noise reduction in speech signals, figure 3 shows a flow chart of weighting a noise signal depending on the signal-to-noise ratio of a speech signal, figure 4 shows a flow chart of a method for reducing noise in a speech signal using upper and lower limits for the spectral power of the noise, figure 5 shows an amplitude/frequency diagram with a noise signal and two limits, figure 6 shows a flow chart of a noise reduction method using stored noise types and figure 7 shows a flow chart of a noise reduction method in speech signals using an envelope of the signal spectrum.

Description

When speaking into a phone, the speech is degraded by background noise. The speech and the background noise are converted by a microphone of the phone set into electrical signals. To improve the speech quality for the listener at the receiver, this background noise has to be removed or at least considerably reduced. If too large an amplitude is subtracted from the amplitude of a speech signal, the speech signal will exhibit unwanted audible effects, so-called musical tones, when reconverted to acoustic signals by a loudspeaker.
The musical tones are more disturbing for listeners than noise, since the auditory recognition system of a human being tries to find an interpretation of those musical tones, whereas noise is easily regarded by a listener as noise. Consequently, the noise does not interfere with the recognition of the speech as long as the signal strength of the noise is not too high.
In mobile communications, frequency bandwidth for transmission and reception of radio signals is precious and service providers who operate a mobile communication network want to put more and more users in their allocated bandwidths. Therefore, an effective speech coding which reduces the necessary bandwidths for transmission considerably but retains an excellent speech quality is of high importance to design a successful transmitter for radio signals.
Speech coding therefore removes redundancy from the speech signal, so that the information per bit is considerably increased. Noise can influence the speech coding procedure and therefore what the bits containing the speech information will look like. This can lead to poor audio quality. The reduction of noise before speech coding sets in is a precondition for excellent speech quality at the receiver.
Noise is a stochastic signal and therefore cannot be predicted. There are peaks and drops in the noise signal which one has to cope with if noise reduction and avoiding musical tones are the goals. By transforming noise and a speech signal from a time domain to a frequency domain, it is possible to identify whether a present signal is speech with noise or only noise. Alternatively, this can be done in the time domain.
In Fig. 1, a block diagram of a speech signal processing arrangement for reducing noise is shown. A microphone 50 with attached electronics is connected to a transforming unit 51, which transforms signals from the time domain to a frequency domain and thereby generates a signal spectrum. The microphone transduces acoustical waves into electrical signals, and the attached electronics amplify and digitise the electrical signals. The digitised signals are then fed into the transforming unit 51.
The signal spectrum is then fed from the transforming unit 51 into a decision unit 52 deciding whether the signal spectrum is noise or noise and speech signals. If it is a noise signal, the noise signal is then transferred over a first output of the decision unit 52 to a noise processor 53 weighting the noise signal generating a noise function. If the signal spectrum consists of noise and speech signals then the signal spectrum is transferred from the decision unit 52 to an adder 54 subtracting the noise function coming from the noise processor 53 from the signal spectrum coming from the decision unit 52 in order to generate speech signals with reduced noise. The signal spectrum is always transferred to the adder 54, whereas the noise function is only updated when only a noise signal is present in the signal spectrum.
These speech signals are then transformed from the frequency domain to the time domain by a retransforming unit 55. At the output of the retransforming unit 55 the speech signals with reduced noise in the time domain are ready for further processing.
The transforming unit 51, the decision unit 52, the noise processor 53, the adder 54 and the retransforming unit 55 are implemented on one processor as different software programs or functions. Alternatively, more than one processor can be implemented to perform the above mentioned tasks.
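The interplay of decision unit 52, noise processor 53 and adder 54 can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the class name `NoiseReducer`, the identity weighting in `weight_noise`, and the flooring of negative differences at zero are assumptions that do not appear in the text.

```python
from typing import List

Spectrum = List[float]  # one magnitude per frequency bin (assumed representation)


class NoiseReducer:
    """Hypothetical sketch of units 52 to 54 operating on magnitude spectra."""

    def __init__(self, n_bins: int) -> None:
        # The noise function is zero until the first noise-only frame arrives.
        self.noise_function: Spectrum = [0.0] * n_bins

    def weight_noise(self, noise: Spectrum) -> Spectrum:
        # Stand-in for noise processor 53; identity weighting is an assumption.
        return list(noise)

    def process(self, spectrum: Spectrum, speech_present: bool) -> Spectrum:
        if not speech_present:
            # Decision unit 52: only a noise-only frame updates the noise function.
            self.noise_function = self.weight_noise(spectrum)
        # Adder 54: the signal spectrum is always passed on, minus the noise function.
        # Flooring at zero is an assumption; the text does not specify it.
        return [max(s - n, 0.0) for s, n in zip(spectrum, self.noise_function)]
```

A noise-only frame such as `[1.0, 2.0, 1.0]` would first set the noise function, after which a speech frame `[5.0, 2.5, 0.5]` would be reduced to `[4.0, 0.5, 0.0]`.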
The noise processor 53 performs any one of the following algorithms for processing the noise signal considering amplitudes of the signal spectrum according to the invention:
a) Amplitudes of the noise signal above a first predefined limit are set equal to that limit, and amplitudes of the noise signal below a second predefined limit are set equal to that second limit.
b) Amplitudes of the noise signal are multiplied with an envelope of the signal spectrum if the signal spectrum contains speech signals.
c) Amplitudes of the noise signal are weighted according to a signal-to-noise ratio of the signal spectrum if the signal spectrum contains speech signals.
In addition, a noise processing is performed using stored noise spectra by comparing those stored noise spectra with the noise signal.
In Fig. 2, a flow chart of a method according to the invention for noise reduction and reduction of the musical tones of speech signals is shown. This noise reduction method and all other noise reduction methods are implemented in software running on a processor, preferably on a processor already present in a mobile phone.
In step 1 of the noise reduction method, acoustical waves emitted by a speaker are converted by a mobile phone into electrical signals using a microphone as a transducer. The electrical signals are then amplified, filtered and digitised using a signal processing unit connected to the microphone.
In step 2 of the noise reduction method, the electrical signals are transformed from a time domain to a frequency domain in order to generate the signal spectrum.
In a mobile phone, a processor performs the transformation of the digitised electrical signals from the time domain into the frequency domain using a Fast Fourier Transform (FFT). The FFT is a well-known algorithm for transforming signals from the time domain into the frequency domain. The transformation takes the samples of the sampled time-domain signal and inserts them into a well-known equation.
Alternatively, other transform techniques can be used. One widely known technique is the use of wavelets. Wavelets are mathematical functions cutting up data into different frequency components and then studying each component with a resolution matched to its scale. Wavelets perform especially well on discontinuities and spikes in the signals to be analysed.
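To illustrate the time-to-frequency transformation of step 2 and its inverse, a minimal sketch follows. It evaluates the discrete Fourier transform equation directly rather than using a fast algorithm, which is an assumption made for brevity; an FFT yields the same values with fewer operations. The function names `dft` and `inverse_dft` are hypothetical.

```python
import cmath
from typing import List, Sequence


def dft(samples: Sequence[float]) -> List[complex]:
    """Direct evaluation of X[k] = sum_t x[t] * exp(-2j*pi*k*t/N)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse transform; returns the real part, assuming a real input signal."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]
```

For the four-sample frame `[0.0, 1.0, 0.0, -1.0]`, the energy appears in bins 1 and 3, and applying `inverse_dft` to the result recovers the original samples up to rounding error.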
In step 3, the processor checks whether the signal spectrum represents speech signals with noise or a noise signal. This is done using a Voice Activity Detector (VAD) algorithm. The VAD algorithm is part of the GSM (Global System for Mobile Communications) speech coders, and it detects whether there is speech activity or not by comparing a spectral power density of the signal spectrum with predefined values. If the spectral power density exceeds those predefined values, the VAD decides that speech is present. Alternatively, a similar algorithm can be implemented on a separate processor, which is useful for distinguishing between speech and background noise because that processor is fully dedicated to voice activity detection.
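The threshold comparison described for step 3 might be sketched as below. This is not the GSM VAD algorithm, only a simplified illustration of comparing a spectral power density with predefined values. The function name `detect_speech` and the rule that a single bin above its threshold is sufficient are assumptions.

```python
from typing import Sequence


def detect_speech(power_density: Sequence[float], thresholds: Sequence[float]) -> bool:
    """Return True if the power in any bin exceeds its predefined threshold.

    Using "any bin" as the decision rule is an assumption; the text only
    says that the power density is compared with predefined values.
    """
    if len(power_density) != len(thresholds):
        raise ValueError("power_density and thresholds must have the same length")
    return any(p > t for p, t in zip(power_density, thresholds))
```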
Apart from detecting noise in speech interruptions, it is possible to detect noise during speech. This is explained below.
If there is no speech activity, then a background noise signal is updated in predefined time intervals, for example 480 ms. This saves bandwidth for other mobile phones to communicate. GSM is a widely used standard for digital cellular mobile communications.
If the signal spectrum is a noise signal, then in step 10 the noise signal will be processed using predefined factors stored in the mobile phone. These factors have been determined using the knowledge on a human auditory reception system. Alternatively, this step can be omitted.
If the signal spectrum is a speech signal, then, in step 5, a signal-to-noise ratio of the speech signal depending on the frequency is calculated. For this, the noise signal is subtracted from the speech signal and then a resulting difference is divided by the noise signal. This is done for certain frequencies in the acquired spectrum of the speech signal. The number of those frequencies determines the accuracy and the complexity of the noise reduction method.
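The calculation of step 5 can be sketched per frequency bin as (signal minus noise) divided by noise. The function name `snr_per_bin` and the small constant `eps`, which guards against division by zero, are assumptions not found in the text.

```python
from typing import List, Sequence


def snr_per_bin(signal: Sequence[float], noise: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Frequency-dependent signal-to-noise ratio: (signal - noise) / noise.

    eps is an assumed safeguard for bins in which the noise estimate is zero.
    """
    return [(s - n) / max(n, eps) for s, n in zip(signal, noise)]
```

With a signal spectrum of `[3.0, 2.0]` and a noise spectrum of `[1.0, 2.0]`, the result is `[2.0, 0.0]`. Evaluating fewer bins lowers the complexity at the cost of accuracy, as noted above.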
In step 4 after having performed step 10, the noise signal is weighted to generate a noise function by a function depending on the signal-to-noise ratio of the speech signal using thereby the result of step 5. Thus, at a frequency for which the signal-to-noise ratio is determined the noise signal is weighted. Therefore, according to this solution, the weighting of the noise signal is only possible if speech signals are present.
In Fig. 3, this method of weighting the noise signal is shown. In step 13, the method is started. In step 14, the signal-to-noise ratio of the speech signals is compared with a first predefined limit. If the signal-to-noise ratio is higher than the first predefined limit, then, in step 15, the spectral noise signal is set to zero, since, owing to the very high signal-to-noise ratio, the speech signals are already of excellent quality, so that reducing noise will not lead to an audible improvement for the listener.
If the signal-to-noise ratio of the speech signals is below the first predefined limit, then, in step 16, the signal-to-noise ratio of the speech signals is compared to a second predefined limit. If the signal-to-noise ratio of the speech signals is above the second predefined limit, then in step 17 the weighting of the noise signal is performed according to a predefined function depending on the signal-to-noise ratio.
If the signal-to-noise ratio of the speech signals is below the second predefined limit, then, in step 18, the signal-to-noise ratio of the speech signals is compared to a third predefined limit. If the signal-to-noise ratio of the speech signals is above the third predefined limit, then, in step 19, the noise signal is weighted by a constant weighting factor. The predefined function of step 17 connects the zero-value weighting appearing in step 15 with this constant weighting factor in step 19.
If the signal-to-noise ratio of the speech signal is below or equal to the third predefined limit, then, in step 20, the signal-to-noise ratio of the speech signals is compared to a fourth predefined limit. If the signal-to-noise ratio of the speech signals is above this fourth predefined limit, then, in step 21, the noise signal is weighted with a predefined function.
If the signal-to-noise ratio of the speech signals is equal to or below the fourth predefined limit, then, in step 22, the weighting of the noise signal for the frequencies where the signal-to-noise ratio is equal to or below the fourth predefined limit is set to zero. This is done because, if the signal-to-noise ratio is that low, the signal is already so noisy that a reduction of the noise would not lead to an improvement but would introduce unwanted audible effects. After step 15, 17, 19, 21 or 22, this method ends in step 23. The predefined function applied in step 21 for weighting the noise signal connects linearly the constant weighting factor and zero. Alternatively, parabolic or exponential functions can be implemented for these predefined functions. The predefined limits are set according to listening tests.
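The decision chain of Fig. 3 amounts to a piecewise weighting of the noise signal as a function of the signal-to-noise ratio. The sketch below is one possible reading, with several assumptions: the function name `noise_weight`, the parameter names, and the use of a linear ramp for the function of step 17 (the text states linearity only for step 21). The numeric limits in the usage note are purely illustrative, since the text leaves them to listening tests.

```python
def noise_weight(snr: float, l1: float, l2: float, l3: float, l4: float, c: float) -> float:
    """Weighting factor for the noise signal at one frequency.

    l1 > l2 > l3 > l4 are the four predefined limits and c is the constant
    weighting factor of step 19. Both ramps are linear by assumption.
    """
    if not (l1 > l2 > l3 > l4):
        raise ValueError("limits must satisfy l1 > l2 > l3 > l4")
    if snr > l1:
        return 0.0                           # step 15: quality already excellent
    if snr > l2:
        return c * (l1 - snr) / (l1 - l2)    # step 17: ramp from 0 at l1 up to c at l2
    if snr > l3:
        return c                             # step 19: constant weighting factor
    if snr > l4:
        return c * (snr - l4) / (l3 - l4)    # step 21: ramp from c at l3 down to 0 at l4
    return 0.0                               # step 22: too noisy to improve
```

With the illustrative limits 20, 10, 5 and 0 and a constant factor of 1.0, a signal-to-noise ratio of 15 gives a weight of 0.5, a ratio of 7 gives 1.0, and a ratio of 2.5 gives 0.5.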
In step 6, the weighted noise signal, that is the noise function, is stored in a processor. In step 7, the noise function is subtracted from the signal spectrum of the original speech signals. These are the speech signals before the unweighted noise signal was subtracted for calculating the signal-to-noise ratio.
In step 8, the spectrum of the speech signal with reduced noise is transformed from the frequency domain into the time domain using inverse FFT. In step 9, speech coding on the speech signal with reduced noise in the time domain is performed.
In Fig. 4, another method of weighting the noise signal is presented. In step 24, a speech signal or noise is converted to an electrical signal using a microphone. The electronics attached to the microphone amplify and digitise the electrical signal.
In step 25, the electrical signal is transformed from the time domain to the frequency domain using FFT, generating the signal spectrum. In step 26, it is detected using VAD whether speech signals or a noise signal are present. If a noise signal is present, in step 28, the noise signal is weighted.
In Fig. 5, the weighting of the spectral noise signal with an amplitude s as a function of the frequency f is shown. To avoid very high amplitudes of the noise signal, a limit 11 for the high amplitudes is set. If an amplitude is equal to or above the limit 11, it is set to this limit 11. This prevents too much noise from being removed from the speech signal and thereby avoids the appearance of musical tones. An exemplary signal is added to Fig. 5 to show amplitudes of that signal exceeding the limits in both directions.
In addition, an optional lower limit 12 is also added in this method, so that very low noise amplitudes in the spectrum of the noise which are equal to or below the limit 12 are set to the limit 12. As stated previously, background noise does not necessarily disturb a listener, and it gives the listener the impression that the connection to the speaker still exists if only noise is transmitted. Here, the noise processed with the upper and the lower limit is transmitted. For this method, the upper limit 11 must be included, whereas the lower limit 12 is optional.
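The limiting of the noise amplitudes with the mandatory upper limit 11 and the optional lower limit 12 can be sketched as a simple clamp. The function name `limit_noise` and its parameter names are hypothetical.

```python
from typing import List, Optional, Sequence


def limit_noise(noise: Sequence[float], upper: float, lower: Optional[float] = None) -> List[float]:
    """Clamp noise amplitudes to the upper limit and, if given, the lower limit."""
    if lower is not None and lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    limited = []
    for amplitude in noise:
        amplitude = min(amplitude, upper)          # limit 11 is always applied
        if lower is not None:
            amplitude = max(amplitude, lower)      # limit 12 is optional
        limited.append(amplitude)
    return limited
```

For example, `limit_noise([0.1, 0.5, 2.0], upper=1.0, lower=0.2)` yields `[0.2, 0.5, 1.0]`, while omitting `lower` leaves the small amplitude at 0.1.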
In step 29, the weighted noise signal is stored, so that when a speech signal is present, the weighted noise signal is subtracted from the speech signal to generate a speech signal with reduced noise in step 30. In step 31, the speech signal with reduced noise is transformed from the frequency domain to the time domain using inverse FFT. In step 32, speech coding on the speech signal with reduced noise in the time domain is performed.
In Fig. 6, another method for reducing noise in a speech signal is presented in a flow chart. In step 33, a speech signal or noise is converted to an electrical signal using a microphone. The electronics attached to the microphone amplify and digitise the electrical signal.
In step 34, the electrical signal is transformed from the time domain to the frequency domain using FFT. In step 35, it is detected using VAD whether a speech signal or a noise signal is present. If a spectral noise signal is present, that is, no speech signal, then in step 36 the noise signal is weighted.
The noise signal is weighted by analysing the noise spectrum using stored noise spectra. If the noise type is detected, for example noise in a car, noise in an environment where many people are speaking or noise in a street, then the measured noise can be weighted according to its noise type, optimising both the noise reduction and the reduction of artefacts (musical tones).
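One conceivable way to compare a measured noise spectrum with stored noise spectra is a nearest-match search. The text does not state how the comparison is carried out, so the squared Euclidean distance, the function names `classify_noise` and `weight_by_type`, and the per-type weighting factors are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def classify_noise(noise: Sequence[float], stored: Dict[str, Sequence[float]]) -> str:
    """Return the name of the stored spectrum closest to the measured noise."""

    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(stored, key=lambda name: squared_distance(noise, stored[name]))


def weight_by_type(
    noise: Sequence[float],
    stored: Dict[str, Sequence[float]],
    factors: Dict[str, float],
) -> List[float]:
    """Weight the measured noise with a factor chosen by its detected type."""
    noise_type = classify_noise(noise, stored)
    return [factors[noise_type] * amplitude for amplitude in noise]
```

With hypothetical stored spectra `{"car": [1.0, 0.2, 0.1], "street": [0.3, 0.6, 0.6]}`, a measured spectrum of `[1.0, 0.5, 0.0]` would be classified as "car" and weighted with the factor stored for that type.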
In step 37, the weighted noise signal is stored, so that when a speech signal is present, the weighted noise signal is subtracted from the speech signal to generate a speech signal with reduced noise in step 38. In step 39, the speech signal with reduced noise is transformed from the frequency domain to the time domain using inverse FFT. In step 40, speech coding on the speech signal with reduced noise in the time domain is performed.
In Fig. 7, another method for reducing noise in a speech signal is presented in a flow chart. In step 41, a speech signal or noise is converted to an electrical signal using a microphone. The electronics attached to the microphone amplify and digitise the electrical signal.
In step 42, the electrical signal is transformed from the time domain to the frequency domain using FFT. In step 43, it is detected using VAD whether a speech signal or a noise signal is present. If a spectral noise signal is present, that is, no speech signal, then in step 44 the noise signal is weighted with an envelope of the speech signal. Thus, the weighting occurs only when a speech signal is present. The envelope of the speech signal is calculated using FFT, also in step 42. It is practically the same as the speech signal, so the weighting is performed using the speech signal.
The envelope is used to avoid a noise reduction of the speech signal that would lead to unwanted audible effects; that is, too much noise reduction is avoided, especially at those frequencies where most of the speech signal energy is located. In addition, the envelope of the speech signal is weighted by factors that are already stored in the processor. These factors have been determined by listening tests.
In step 45, the weighted noise signal is stored, so that the weighted noise signal is subtracted from the present speech signal to generate a speech signal with reduced noise in step 46. In step 47, the speech signal with reduced noise is transformed from the frequency domain to the time domain using inverse FFT. In step 48, speech coding on the speech signal with reduced noise in the time domain is performed.
Apart from speech coding, the invention is usable for other applications. Speech recognition demands a signal with good signal-to-noise ratio for a proper recognition of the speech. Thus, a speech recognition system would considerably benefit from using the invention.
Another application is audio recording either for only audio reproduction or in combination with video recording.
Especially for live recordings suffering heavily from background noise, an improved noise reduction algorithm focussing on improving speech and/or music quality benefits from the invention.
The noise spectrum is acquired either when no speech signal is present or during a speech signal. The first solution is straightforward, since the background noise is transformed from the time domain to the frequency domain to generate the noise spectrum.
To acquire the noise spectrum during the speech signal, the speech signal with the noise is transformed from the time domain to the frequency domain. By analysing the transformed signal, the processor models the speech signal itself, and this model is then subtracted from the transformed speech signal, so that the noise spectrum is obtained as the difference between the measured speech signal and the model. To generate this model of the speech signal, the processor uses stored knowledge of the spoken language in order to estimate what was said.

Claims

Method for processing speech signals by subtracting a noise function, whereby the noise function is determined by measuring a signal containing noise or noise and speech, transforming the measured signal from a time domain to a frequency domain to generate a signal spectrum, deriving a noise signal from the signal spectrum, characterised in that the measured noise signal is weighted by multiplying a function considering amplitudes of the signal spectrum in order to generate the noise function.
Method according to claim 1 wherein the signal spectrum is acquired in speech interruptions.
Method according to claim 1 wherein the signal spectrum is acquired during speech by modelling the speech signals and thereby identifying the noise signal.
Method according to claim 2 or 3 wherein amplitudes of the measured noise signal being above a predefined upper limit are set equal to the upper limit and that amplitudes of the measured noise signal being below a lower limit are set equal to the lower limit.
Method according to claim 2 or 3 wherein the measured noise signal is weighted by multiplying the amplitudes of the measured noise signal with an envelope of the signal spectrum.
Method according to claim 2 or 3 wherein the measured noise signal is weighted by multiplying a function considering amplitudes of a signal-to-noise ratio of the signal spectrum.
Method according to claim 4 wherein the measured noise signal is multiplied with zero if the signal-to-noise ratio of the signal spectrum is above a first predefined limit, that the measured noise signal is multiplied with a first weighting function if the signal-to-noise ratio of the signal spectrum is between the first predefined limit and a second predefined limit, that the measured noise function is multiplied with a constant factor if the signal-to-noise ratio of the signal spectrum is between the second predefined limit and a third predefined limit, that the measured noise signal is multiplied with a second weighting function if the signal-to-noise ratio of the signal spectrum is between the third predefined limit and a fourth predefined limit and that the measured noise function is multiplied with zero if the signal-to-noise ratio of the signal spectrum is below the fourth predefined limit.
Method for processing speech signals by subtracting a noise function, whereby the noise function is determined by measuring a signal containing noise or noise and speech, transforming the measured signal from a time domain to a frequency domain generating a signal spectrum, deriving a noise signal from the signal spectrum, characterised in that the noise signal is compared with stored noise spectra in order to determine a fitting noise spectrum being used as the noise function.
Method according to claim 1 wherein the signal spectrum containing noise is acquired in speech interruptions.
Method according to claim 1 wherein the signal spectrum containing speech and noise is acquired during speech by modelling the speech and thereby identifying the noise.