EP1208561B1

EP1208561B1 - A method and apparatus for noise reduction in speech signals

Info

Publication number: EP1208561B1
Application number: EP00925105A
Authority: EP
Inventors: Kjeld Hermansen
Original assignee: Noisecom APS
Current assignee: Noisecom APS
Priority date: 1999-05-19
Filing date: 2000-05-16
Publication date: 2005-01-26
Anticipated expiration: 2020-05-16
Also published as: DE60017758D1; WO2000072305A3; EP1208561A2; AU4394300A; WO2000072305A2; ATE288121T1

Abstract

A method and apparatus for reducing noise in speech signals employing a priori knowledge regarding human speech production and knowledge of the speech dynamics, such as knowledge regarding the production of sounds in the mouth and throat region. Further employing a new model based spectral subtraction algorithm. Employs replacing part of the original signal containing voiced sounds by synthetically produced pulses. Improves the noise reduction without affecting the wanted (speech) components of the signal. The method may be used for low SNR (SNR<10dB). Advantageous when used in connection with communication using cellular phones.

Description

The present invention relates to noise reduction in speech signals, in particular noise reduction in speech signals employed in telecommunication, most particularly in telecommunication employing cellular phones.
Noise when added to a speech signal can impair the quality of the signal, reduce intelligibility, and increase listener fatigue. It is therefore of great importance to reduce noise in a speech signal, e.g. in relation to telecommunication, especially when employing cellular phones, or in relation to hearing aids.
Various methods of noise reduction in speech signals are known. These methods include spectral subtraction and other filtering methods. The noise reduction may e.g. be based on an estimate of the noise spectrum. Such methods depend on stationarity in the noise signal to perform optimally. As the noise in a speech signal is often non-stationary, the estimated noise spectrum used for spectral subtraction will be different from the actual noise spectrum during speech activity. This results in short duration random tones in the noise reduced signal, and such random tone noise tends to be very irritating to listen to due to psycho-acoustic effects.
WO 99/01942 disdoses a method of reducing the noise in a speech signal using spectral subtraction. According to this method a model based representation describing the quasi-stationary part of the speech signal is generated and manipulated, and the resulting speech signal is generated using the manipulated model and a second signal derived from the speech signal.
In 'Speech Synthesis Based on Feature Extraction to enhance noise-corrupted speech', L. I. Witzke et al., IECON' 1994, pages 1946-1951, there is disclosed a method for improving the intelligibility of noise-corrupted speech. This is done by extracting speech features (fundamental frequency, voiced/unvoiced decision, and formant frequencies) from the noise-corrupted speech signal. The speech signal can then be synthesized from the extracted features, whereby the intelligibility of the noise-corrupted speech signal is improved. The formant bandwidth is estimated from the extracted features, resulting in inaccuracy when regenerating the speech signal.
However, when employing known methods for noise reduction in speech signals, the actual speech signal as well as the noise is reduced. It is desirable to be able to eliminate noise in a signal without eliminating the actual signal.
The object of the present invention is to provide a method of noise reduction in speech signals which reduces the noise even more than known methods. It is further an object to provide a method of noise reduction in speech signals which reduces the noise without affecting the actual speech signal, i.e. a method which eliminates, or at least considerably reduces, unwanted components of a signal without, or at least only to a very limited extend, reducing wanted components.
The method according to the present invention employs a new model based spectral subtraction algorithm for noise suppression/noise reduction in speech. This new algorithm benefits from available knowledge of the speech dynamics. The method yields better results - especially for low Signal to Noise Ratios (SNR) - with less distortions and artefacts, like musical tones, than any other methods, e.g. the usual spectral subtraction.
The role of noise suppression in speech processing such as speech coding has gained an increased importance due to the advent of digital cellular telephones. With the low data rate speech coding algorithms the speech quality tends to degrade drastically in high noise. To prevent such quality loss, noise suppression must be achieved without introducing artefacts, speech distortion or significant loss of speech intelligibility.
Using an additive noise model the noisy signal is modelled as a sum of the speech signal and the noise assuming statistical independence. The spectral subtraction provides an estimate of the signal spectrum as the difference between the noisy spectrum and an estimate of the noise/background spectrum, the latter is obtained during periods of silence.
The main problem in spectral subtraction is artefacts caused by errors in the estimation of the difference spectrum. Using smoothed versions of the difference spectrum in order to avoid musical tone leads to audible signal distortions.
Transformation of the estimated speech signal to time domain requires knowledge of the phase of the signal. In most situations one uses the phase of the noisy signal. This works well for high Signal to Noise Ratios (SNR >10 dB). The problem is to use this phase for low SNR, which is a serious drawback of the classical spectral subtraction. A possibility to handle this problem is to use an alternative description of the signal.
One way of describing a signal in the frequency domain is by magnitude and phase. For speech this description does not relate directly to the articulation parameters. By using a model-based description of the signal it is possible to benefit from a priori knowledge of the speech and noise. These model parameters relate closely to the speech production. The speech signal is decomposed into two components: a generator signal (the residual signal) and a filter modelling of the vocal tract. This results in a separation of speech into a transient and a quasi-stationary part.
Determination of the noise free/reduced synthesis filter is done by a combination of classical spectral subtraction and model based characterisation of the difference spectrum of noisy speech and background noise.
The auto correlation function of the quasi stationary part of the speech is mapped into an LPC model spectrum of order 10 and the so-called f, b and g parameters (f formant frequency, b bandwidth and g gain) are determined from this spectrum. This is a pseudo decomposition of the spectrum into second order sections.
One of the main problems in spectral subtraction is the "musical tones". Smoothing in time of concatenating spectra is often used with the result of reduced artefacts but also influencing the speech. The present method opens up for an efficient noise elimination in the quasi stationary component as well as in the transient part. Smoothing through lowpass filtering of the f, b and g parameters reduces noise in the quasi stationary part of the signal. The f, b and g parameters are organised in "fbg packets" called "fbg processes", and to each process is allocated a Kalman filter whose parameters correspond to the signal/speech model and noise. It is possible to benefit from the fact that the upper frequency of the f, b and g parameters is about 10 Hz.
Using the present method it is possible to further reduce noise and artefacts without influencing the speech signal. A noise robust pitch detector combined with a synthetic glottal pulse generator produces the new residual signal for voiced sounds). For unvoiced speech the residual signal is used as is. This residual signal is input to the noise free/reduced synthesis filter with noise free/reduced speech as output.
The dynamics of the synthesis filter are now constrained via the f, b and g parameters to the range 1 Hz to 10 Hz eliminating the main part of the usual musical tones and leaving the signal/speech component almost unchanged. The input to the synthesis filter depends on the SNR. For heavy noise a robust pitch detector determines the period of the synthetic glottal pulses used as input to the synthesis filter.
Thus, the present invention relates to a method of noise reduction in a speech signal as set forth in claim 1.
The frequency is preferably the formant frequency of the speech signal. The speech signal is preferably transmitted via a telecommunications means, most preferably via a cellular phone, but it may alternatively or additionally be transmitted via other means such as a hearing aid or other suitable microphone/speaker arrangements. Such microphone/speaker arrangements may be connected to telephones and/or video conference arrangements, thus allowing the person or persons using such arrangements to move freely within a certain distance from the telephone/video conference arrangement in the room in which the telephone/video conference arrangement is positioned. When using existing similar arrangements this is not possible due to the noise generated in the signals. Other suitable microphone/speaker arrangements may alternatively or additionally be employed. The parameters are preferably smoothed electronically.
The dynamic information regarding the f, b and g parameters may alternatively or additionally comprise information regarding the difference in frequency between the present speech signal and a previously measured speech signal. If such information is compared to knowledge concerning the human voice regarding the capability to change frequency within a certain time interval, it may be determined whether the present speech signal is in fact the speech signal that was previously measured, i.e. whether the present speech signal and the previously measured speech signal are in fact one and the same. If the difference in frequency exceeds a certain limit, the limit being determined on the basis of knowledge of the human voice and its capability of changing the frequency within a certain time interval, the two signals can not be the same. If the difference in frequency does not exceed such a limit, the two signal may be the same.
The dynamic information regarding the f, b and g parameters most preferably contain knowledge regarding the development of said parameters in time.
The a priori knowledge regarding the dynamics of the human voice may alternatively or additionally comprise knowledge regarding the maximum frequency span of a speech signal as described above.
The a priori knowledge regarding human speech production may be compared to measured parameters of the present speech signal as described above. The a priori knowledge may be obtained e.g. by knowledge regarding the anatomy of the mouth and throat region and/or of the vocal chord. The a priori knowledge may alternatively or additionally be based on a number of previous measurements of relevant parameters as described above. Such previous measurements, or alternatively or additionally a representative extract of such measurements, may be stored in look up tables. Such look up tables are preferably stored electronically in a computer or the like, but may alternatively or additionally be stored in a printed medium such as a book or a sheet of paper.
Preferably, the method comprises a step in which the speech signal is deemed to belong to a process, the process being a signal which may extend over one or more measurement frames. The process is preferably a formant process. It may, e.g., correspond to the pronunciation of a word.
The process is an active process at a certain time if it extends over one or more preceding measurement frames. Thus, the process is active if there is a detectable signal. A process may also be regarded as active if there is presently no detectable signal, but such a signal has been present for a predefined number of measurement frames preceding the present measurement frame. Thereby a process may be kept artificially alive even though the signal disappears for a short time interval. This is very useful in telecommunication employing cellular phones, since the signal may temporarily disappear during such communication, e.g. due to noise or fall out due to an uneven geographical distribution of sending masts resulting in an uneven network coverage. It is not desirable to deem a process "inactive" in such a case.
The smoothing step may comprise the step of determining whether a new formant frequency belongs to an active process. This may be based on a comparison between the a priori knowledge regarding human speech production and the obtained dynamic information.
The method may in this case further comprise the step of defining a new process in case the new formant frequency does not belong to an active process, and the new formant frequency is then deemed to belong to said new process.
The process may be deemed to be inactive in case no new formant frequency is deemed to belong to said process. Thus, in case the signal is permanently terminated, the process is deemed to be inactive.
The method may further comprise the step of artificially maintaining the speech signal for a predetermined number of measurement frames in case the corresponding process is abruptly deemed to be inactive. This makes it possible to keep a process alive in case the signal is temporarily interrupted as described above.
The predetermined number of measurement frames may correspond to the maximum duration of the speech signal. Thus, a process may be artificially maintained for a time interval corresponding to the time interval it normally takes to produce such a sound.
The maximum duration of the speech signal is preferably between 40 ms and 80 ms, such as between 50 ms and 70 ms, such as approximately 60 ms.
The new formant may be deemed to belong to an active process if the difference in frequency between said formant and said process does not exceed a predetermined level as described above.
The predetermined level is preferably between 200 Hz and 600 Hz, such as between 300 Hz and 500 Hz, such as approximately 400 Hz.
The smoothing step preferably comprises the step of filtering the f, b and g parameters.
The filtering step is most preferably performed using a first order Infinite Impulse Response (IIR) filter, but it may alternatively or additionally be performed using any other suitable kind of filter.
The first order IIR filter is preferably a feedback filter of the form: y[n]=b·x[n]+a·y[n-1], wherein x designates the speech signal, y designates the filter output, and wherein a and b are parameters to be determined, and wherein the parameters a and b are preferably determined by using model knowledge of the speech process.
The method may further comprise the steps of

determining whether voiced speech is present,
using a noise eliminated pitch period to create a synthetic glottal pulse in case voiced speech is present, and
replacing at least part of the original speech signal by said synthetic glottal pulse in case voiced speech is present.

The noise eliminated pitch period may be noise eliminated by using known methods or in the manner described above. Most preferably, it is noise eliminated using known methods as well as in the manner described above.
The determining step may comprise the steps of comparing the variance of the speech signal to an upper threshold value and to a lower threshold value. Voiced speech is considered to be present in case the variance of the speech signal exceeds the lower threshold value.
In case the variance of the speech signal also exceeds the upper threshold value, the speech signal is considered to contain purely voiced speech, i.e. no unvoiced component is present. In this case the original speech signal is completely replaced by the synthetic glottal pulse.
If the variance of the speech signal is between the two threshold values voiced as well as unvoiced components are considered to be present. In this case the original speech signal is replaced by a new pulse which is an appropriate combination of the synthetic glottal pulse and the original speech signal.
The determining step may comprise the steps of comparing the first formant gain of the speech signal to an upper threshold value and to a lower threshold value. Voiced speech is present in case the first formant gain of the speech signal exceeds the lower threshold value.
The remarks above apply equally in this case.
The noise eliminated pitch period is preferably found from a residual signal of the speech signal.
The replacing step is preferably performed by fading out a residual signal and fading in the synthetic glottal pulse. Here "the synthetic glottal pulse" is to be understood as either a completely synthetic signal or an appropriate combination of the created synthetic pulse and the original speech signal as described above.
Preferably, at least the smoothing step is performed by a computer system, and the speech signal is most preferably generated in a cellular phone.
The present invention further relates to an apparatus for performing noise reduction in a speech signal as set forth in claim 28.
The means for obtaining dynamic information regarding f, b and g parameters preferably comprises one or more suitable detectors, such as microphones, and/or one or more computers.
The smoothing means preferably comprises one or more computers and/or one or more suitable electronically filters, such as low pass filters and/or high pass filters and/or Infinite Impulse Response (IIR) filters and/or any other suitable kind of fillers.
The means for obtaining and storing a priori knowledge regarding human speech production preferably comprises one or more computers, most preferably comprising electronically storage means. It may further comprise one or more look up tables, the tables being created by using empirically obtained data (i.e. previous measurements of e.g. relevant parameters such as the dynamic information mentioned above) and/or by using theoretical calculation of relevant parameters. Such calculations may be based on knowledge regarding the anatomy of the mouth and throat region of humans as described above.
In an apparatus as described above, wherein the speech signal is deemed to belong to a process, the process being a signal which may extend over one or more measurement frames, and wherein the process is an active process at a certain time if it extends over one or more preceding measurement frames, the smoothing means may comprise means for determining whether a new formant frequency belongs to an active process. Such means may comprise means for comparing obtained relevant parameters of the speech signal (as described above) and theoretical and/or empirical data. It is then determined from a predefined set of criteria whether the new formant frequency belongs to the active process or not. This is further described above.
The smoothing means preferably comprises means for filtering the f, b and g parameters. It may comprise one or more electronically filters, such as low pass filters and/or high pass filters and/or IIR filters and/or any other suitable filters.
The apparatus may further comprise

determining means for determining whether voiced speech is present,
creating means for creating a synthetic glottal pulse by using a noise eliminated pitch period, and
replacing means for replacing at least part of the original speech signal by said synthetic glottal pulse in case voiced speech is present.

The creating means may comprise one or more tone generators and/or it may comprise a computer.
The replacing means may comprise one or more faders, so that the synthetic pulse may be faded in as the original signal is faded out.
The determining means may comprise comparing means for comparing the variance of the speech signal to an upper threshold value and to a lower threshold value. Voiced speech is present in case the variance of the speech signal exceeds the lower threshold value. The comparison is most preferably performed by one or more computers having computer storage means for storing information regarding the threshold values. The threshold values may be obtained by theoretical calculations and/or by previous measurements, i.e. it may be empirically obtained.
The determining means may comprise comparing means for comparing the first formant gain of the speech signal to an upper threshold value and to a lower threshold value. Voiced speech is present in case the first formant gain of the speech signal exceeds the lower threshold value. The above is equally applicable here.
The apparatus may further comprise means for producing a noise eliminated pitch period from a residual signal of the speech signal.
A new model based spectral subtraction method has been presented, the method being applicable for noise elimination for high as well as low SNR avoiding linear and non-linear artefacts like musical tones. The method is attractive because of its flexibility, modularity and numerical robustness. Furthermore it is very suitable for real time implementation.
The invention will now be further described with reference to the accompanying drawings in which
Fig. 1 shows the frequency of a speech signal as a function of time indicating two different processes,
Fig. 2 is a flow diagram illustrating part of the smoothing step of the method according to the present invention,
Fig. 3 is a flow diagram illustrating the control of the maximum duration of a speech signal,
Fig. 4 is a flow diagram illustrating the allocation of a sign to a process,
Fig. 5 is a flow diagram illustrating the adjustment of the pitch of a speech signal,
Fig. 6 is a flow diagram illustrating the step of determining whether voiced speech is present and the step of replacing a noise eliminated pitch period by a synthetic glottal pulse,
Fig. 7a shows the intonation of a speech signal as a function of time,
Fig. 7b shows a synthetic internal glottal pulse,
Fig. 7c shows two pitch periods having different lengths,
Fig. 8 shows non-filtered f, b and g parameters of a speech signal ('three', 'four', 'five'),
Fig. 9 shows filtered f, b and g parameters corresponding to Fig. 8,
Fig. 10 shows an original noisy speech signal ('three', 'four', 'five'),
Fig. 11 shows the signal of Fig. 10, but with synthetic glottal pulses replacing the pitch periods where voiced speech is present.
Fig. 12 shows the speech signal of Figs. 10 and 11, but with the noise eliminated,
Fig. 13 shows a frequency spectrum corresponding to Fig. 10, and
Fig. 14 shows a frequency spectrum corresponding to Fig. 12.
Fig. 1 shows the frequency of a speech signal as a function of time. Two processes ('process 1' and 'process 2') are present.
When the signal of a process disappears it may be due to the signal ''falling out", e.g. due to noise or a bad connection, or it may be due to the fact that the person stops speaking, i.e. the signal actually disappears. If the process has existed for a certain amount of time it is artificially kept alive for some time after the disappearance of the signal as indicated by "Δd" and the dotted lines of the figure.
At the time t₁ the signal of 'process 1' disappears and is artificially kept alive as described above. At time t₂ a new signal is detected, and this new signal is deemed to belong to 'process 1'. This is because the time elapsed (t₂-t₁) is smaller than Δd, and because the difference in frequency between the new signal and the signal of 'process 1' is below a predefined limit indicated by "Δf". The signal of'process 1', and the new signal are connected as indicated by the grey line.
The signal of 'process 2' also appears approximately at time t,. However, it is deemed not to belong to 'process 1', since the difference in frequency between the two signals exceeds the predefined limit (Δf). Since no other active processes are present a new process ('process 2') is defined, and the signal is deemed to belong to this process.
Eventually 'process 1' as well as 'process 2' expires since the signals disappear and no new signals appear within the predefined time limit, Δd.
Referring to Figs. 2-4 a filtering algorithm is described, the algorithm having the following four steps:
1. Determining whether new formant frequencies belong to active processes.
2. Elimination of non-updated processes.
3. Allocation of a sign to a certain process.
4. Filtering the processes and outputting the f, b and g parameters and the sign.
Fig. 2 is a flow diagram illustrating step one as defined above. New f, b and g parameters are used as input (FBG new), and the frequency nearest to a process is found. It is not possible to investigate the new formants one by one, since two or more of the new formants may fulfil the criteria for belonging to a certain process. Therefore, the formant having the frequency being nearest to the process is deemed to belong to said process.
In case the speech signal is found to belong to a process which already exists, that process is updated in accordance with the new signal. In case it is found that the speech signal does not belong to any of the existing processes, a new process is created, and the speech signal is deemed to belong to the new process. At this stage of the algorithm it is also tested whether the difference in frequency between the new formant and the existing process exceeds a certain predetermined level. The level may be set in the control system.
If additional new f, b and g parameters are present the formant which has thus been allocated to a process is "removed" from the "new parameters" and the steps described above are repeated until there are no more new f, b and g parameters left.
Fig. 3 illustrates step 2 as defined above. It is investigated whether an active process has been updated. In case it has been updated nothing further is done during the present measurement frame.
In case the process has not been updated during the present measurement frame it is investigated how long the process in question has been present. In case it has only been present during one measurement frame the process is eliminated. In case it has been present for more than one measurement frame the age of the process is "counted down". Thus, the process will be regarded as having been present for one measurement frame less the next time the algorithm is performed. This is to ensure that an otherwise active process is not eliminated just because it is not updated during one measurement frame. Such processes will be kept artificially alive for a number of measurement frames depending on the maximum duration of the processes. This may be performed by maintaining the intemal values of the process during the measurement frame wherein the process was last updated, or it may be performed in a more sophisticated manner based on predictions made according to earlier measurements.
Fig. 4 is a flow diagram illustrating step 3 of the above algorithm. It is first investigated whether the sign of a given process (Process number X) is locked. If this is the case the energy in the process is compared to a lower threshold value ("energi_lav"). If the energy in the process is below said lower threshold value the sign is unlocked. Otherwise the sign is maintained in a locked mode. The energy in the process is then compared to an upper threshold value ("energi_hoej"). If the energy in the process is above said upper threshold value the process is allocated a sign in relation to locked processes. Otherwise it is investigated whether the process has a sign allocated. If this is not the case, the process is allocated a sign in relation to locked processes. Otherwise the mark is not updated, i.e. the previously allocated sign is maintained.
The formants are allocated mutually alternating signs in order to improve the quality of the sound. Processes having frequencies which are between the frequencies of two different other processes (the two different processes having different signs) are allocated a sign according to the process being closest in frequency to the formant, i.e. the sign allocated to this process is opposite to the sign of the "closest" process.
The allocation of signs to the processes is very important, since it makes it possible to "hold on" to a given process and thereby to maintain the control of the processes. This is very important for the execution of step 4 of the algorithm above. However; the allocation of signs per se is known and well described in the prior art.
In step 4 of the algorithm above the f, b and g parameters are individually filtered. Preferably, a first order Infinite Impulse Response (IIR) filter is used, the filter most preferably having the form: y[n]=b·x[n]+a·y[n-1].
The coefficients a and b are calculated using a time constant in order to ensure that the filtering is independent of the frame shift. a is chosen in such a way that said time constant relates to the time constant of the speech signal, and b is chosen in such a way that the DC amplification is 1 (corresponding to the integral of the impulse response).
Fig. 5 is a flow diagram illustrating the adjustment of the pitch of a speech signal. This function determines the pitch period of a source sequence. The general idea of the function is based upon the assumption that the pitch is the most dominant periodical component in the speech signal. It is furthermore assumed that the pitch frequency, for physiological reasons, is limited to a certain frequency span. Thus, the main issue of determination of the pitch period is the calculation of the auto correlation and the determination of the pitch as the index of the maximum value of the auto correlation in a limited time interval.
The pitch sequence, which is used as an input to the function, is squared in order to avoid negative values and in order to enhance dynamical differences. The pitch sequence should be a frame of the speech signal or of the residual signal.
The squared pitch sequence is then rectified. This step emphasises the periodicity of the pitch by using knowledge regarding the structure of the pitch sequences. This is due to the fact that the pitch is much more powerful than other potential periodical components of the speech signal or the residual signal, and due to the fact that said other components are hidden by the rectification.
In case there are linear trends in the squared and rectified pitch sequence, such trends may now be removed. This may be done by creating a linear least squares approximation to the input sequence and then to subtract this approximation from the sequence. The linear least squares approximation may be done using the following equations: y = a · x + b
b = y - a · x wherein x and y designate the mean values of x and y, respectively.
Then the auto correlation is calculated for "allowed" pitch periods, i.e. for pitch periods having a duration which is between a lower threshold value and an upper threshold value, where said threshold values may be set initially. The calculated auto correlation is subsequently scaled using a linear weighting function. This is done in order to obtain a robust pitch detection.
In the next step the index of the maximum value of the weighted auto correlation function is used as an initial guess for the pitch period. In case the pitch period thus determined is large, and in case a shorter pitch period is more likely the pitch period is adjusted accordingly. A shorter pitch period may be more likely if e.g. the half pitch period is also an "allowed" pitch period. In this case it is possible that the sub-harmonic period of the pitch has been detected in stead of the actual pitch.
If the initial pitch period is not large or if a shorter pitch period is not more likely, the initial guess is maintained as the pitch period. Finally, the pitch period is used as an output.
Fig. 6 is a flow diagram illustrating the steps of the method according to the present invention in which it is determined whether voiced speech is present, and in which the noise eliminated pitch period is replaced by a synthetic glottal pulse.
The input parameter "x", which may be the variance or the gain of a signal, is compared to a lower threshold value ("taerskelnedre") as well as to an upper threshold value ("taerskeloevre"). If "x" is smaller than "taerskelnedre" the speech signal is considered to be completely unvoiced, and the original residual signal is maintained ("alfa1"=1 and "alfa2"=0). If "x" exceeds "taerskeloevre" the speech signal is considered to be completely voiced, and the original signal is replaced by a purely synthetic pitch pulse ("alfa1"=0 and "alfa2"=1). If "taerskelnedre" < "x" < "taerskeloevre" the speech signal is considered to be a mixture of voiced and unvoiced signals, and the original signal is replaced by a synthetic signal which is a suitable mixture of the original residual signal and the synthetic pitch pulse as described below. Synthetic residual = alfa1 · residual + alfa2 · synthetic pitch pulse wherein "Synthetic residual" is the resulting synthetic residual signal which replaces the original residual signal, "residual" is the original residual signal and "synthetic pitch pulse" is the purely synthetic pitch pulse which is created. "alfa1" and "alfa2" are determined as: alfa2 = x - taerskelnedre taerskeloevre - taerskelnedre alfa1 = 1 - alfa2
Fig. 7a-c illustrate how the speech signal may be replaced by a synthetic glottal pulse in case voiced speech is present in the signal.
Fig. 7a shows the intonation of a speech signal as a function of time. The intonation at time t2 is slightly larger than the intonation at time t1. The period where no signal is detected may be a period of silence or it may be a period where only completely unvoiced speech is present, i.e. a period during which no intonation may be detected, since an intonation may only be detected if the vocal chord is active, i.e. when voiced speech is present.
Fig. 7b shows a synthetic internal glottal pulse. In reality glottal pulses are very dependent upon the person speaking. The glottal pulse shown in Fig. 7b, however, is an "average" pulse which is constructed in such a way that it has a wide spectrum and at the same time has the maximum length.
Fig. 7c shows a signal with the synthetic glottal pulse of Fig. 7b phased in instead of a noisy signal. The synthetic glottal pulse has a certain length. The synthetic signal is artificially "extended" by a "zero-signal", so as to match the length of the original pitch period, 'ipitch'. In Fig. 7c the length of the second pulse is slightly larger than the length of the first pulse ('ipitch(t2)'>'ipitch(t1)'). This is due to the fact that the intonation at time t2 is slightly larger than the intonation at time t1 as indicated in Fig. 7a. It is clear that the only difference between the two pulses is the length of the "zero-signal" following the synthetic glottal pulse.
Figs. 8 and 9 show non-filtered and filtered formant tracks (f, b and g parameters), respectively, for the words 'three', 'four' and 'five'. The parameters have been filtered using a so-called Kalman filter. When comparing Fig. 8 and Fig. 9, it is clear that the noise of the signal is considerably reduced during the filtering process, i.e. the fluctuations are reduced. It is also clear that the dynamics of the speech are at the same time left nearly unchanged. What is achieved is thus a speech signal, wherein the noise, which was initially present, is removed or at least considerably reduced in such a way that the original signal, i.e. the actual speech, is left nearly unchanged. It is thus possible to remove unwanted components of a signal (i.e. noise) without removing or changing wanted components of the signal (i.e. actual speech).
Figs. 10-12 also show speech signals representing the words 'three', 'four' and 'five'. Fig. 10 shows the original speech signal, including noise components. It is clear that this signal is very noisy, i.e. the signal to noise ratio (SNR) is very small.
In Fig. 11 part of the signal has been replaced by synthetic glottal pulses, and the figure thus represents the output of the flow diagram of Fig. 6. As is readily seen, the shifts between the regions in which the original signal has been replaced and the regions in which the original signal has been maintained are very abrupt. This is because the difference between the lower threshold value ("taerskelnedre") and the upper threshold value ("taerskeloevre") is relatively small. It is therefore very likely that the variance or gain (whichever parameter is chosen) is either below the lower threshold value or above the upper threshold value, rather than being between the two values. That is, the sound is most often considered to be either completely voiced or completely unvoiced, rather than being considered to contain voiced as well as unvoiced components.
Fig. 12 shows a signal in which the noise has been reduced. Conventional methods as well as the method according to the invention has been employed. It is very clear that the SNR has improved considerably as compared to Fig. 10.
Fig. 13 and Fig. 14 show frequency spectra corresponding to Fig. 10 and Fig. 12, respectively. When comparing Figs. 13 and 14 it is also clear that the SNR has improved considerably during the filtering process.
Thus, a method and apparatus for noise reduction in speech signals has been provided, the method and apparatus reducing the noise in the signal without, or at least only to a very limited extend, reducing the actual signal.

Claims

A method of reducing the amount of noise in a noisy speech signal, comprising the steps of

obtaining from a speech signal model based representations describing the quasi-stationary part of the speech,

obtaining, from said model based representation, dynamic information regarding frequency (f), bandwidth (b), and gain (g) parameters of said speech signal in relation to time,

defining processes as a function of time by letting said f, b, and g parameters belong to a process according to a priori knowledge regarding the dynamics of the human voice,

smoothing the f, b and g parameters with respect to time, the smoothing step being performed on said processes.
A method according to claim 1, wherein the a priori knowledge regarding the dynamics of the human voice comprises knowledge regarding the maximum frequency span of a speech signal.
A method according to claim 1 or 2, wherein the speech signal is deemed to belong to a process, the process being a signal which may extend over one or more measurement frames.
A method according to claim 3, wherein the process is an active process at a certain time if it extends over one or more preceding measurement frames.
A method according to claim 3 or 4, wherein the smoothing step comprises the step of determining whether a new formant frequency belongs to an active process.
A method according to claim 5, further comprising the step of defining a new process in case the new formant frequency does not belong to an active process, and wherein the new formant frequency is then deemed to belong to said new process.
A method according to any of claims 4-6, wherein a process is deemed to be inactive in case no new formant frequency is deemed to belong to said process.
A method according to claim 7, further comprising the step of artificially maintaining the speech signal for a predetermined number of measurement frames in case the corresponding process is abruptly deemed to be inactive.
A method according to claim 8, wherein the predetermined number of measurement frames corresponds to the maximum duration of the speech signal.
A method according to claim 9, wherein the maximum duration of the speech signal is between 40 ms and 80 ms.
A method according to claim 10, wherein the maximum duration of the speech signal is between 50 ms and 70 ms.
A method according to claim 11, wherein the maximum duration of the speech signal is approximately 60 ms.
A method according to any of claims 5-12, wherein the new formant is deemed to belong to an active process if the difference in frequency between said formant and said process does not exceed a predetermined level.
A method according to claim 13, wherein the predetermined level is between 200 Hz and 600 Hz.
A method according to claim 14, wherein the predetermined level is between 300 Hz and 500 Hz.
A method according to claim 15, wherein the predetermined level is approximately 400 Hz.
A method according to any of the preceding claims, wherein the smoothing step comprises the step of filtering the f, b and g parameters.
A method according to claim 17, wherein the filtering step is performed using a first order Infinite Impulse Response (IIR) filter.
A method according to claim 18, wherein the first order IIR filter is a feedback filter of the form: y[n]=b·x[n]+a·y[n-1], wherein x designates the speech signal, y designates the filter output, and wherein a and b are parameters to be determined.
A method according to claim 19, wherein the parameters a and b are determined by using model knowledge of the speech process.
A method according to any of the preceding claims, further comprising the steps of

determining whether voiced speech is present,

using a noise eliminated pitch period to create a synthetic glottal pulse in case voiced speech is present, and

replacing at least part of the original speech signal by said synthetic glottal pulse in case voiced speech is present.
A method according to claim 21, wherein the determining step comprises the steps of comparing the variance of the speech signal to an upper threshold value and to a lower threshold value, and wherein voiced speech is present in case the variance of the speech signal exceeds the lower threshold value.
A method according to claim 21 or 22, wherein the determining step comprises the steps of comparing the first formant gain of the speech signal to an upper threshold value and to a lower threshold value, and wherein voiced speech is present in case the first formant gain of the speech signal exceeds the lower threshold value.
A method according to any of claims 21-23, wherein the noise eliminated pitch period is found from a residual signal of the speech signal.
A method according to any of claims 21-24, wherein the replacing step is performed by fading out a residual signal and fading in the synthetic glottal pulse.
A method according to any of the preceding claims, wherein at least the smoothing step is performed by a computer system.
A method according to any of the preceding claims, wherein the speech signal is generated in a cellular phone.
An apparatus for performing noise reduction in a speech signal, the apparatus comprising

means for obtaining from a speech signal model based representations describing the quasi-stationary part of the speech,

means for obtaining dynamic information regarding frequency (f), bandwidth (b) and gain (g) parameters of said speech signal in relation to time,

means for defining processes as a function of time by letting said f, b, and g parameters belong to a process according to a priori knowledge regarding the dynamics of the human voice,

smoothing means for smoothing the processes with respect to time.
An apparatus according to claim 28, wherein the a priori knowledge comprises the maximum frequency span of a speech signal.
An apparatus according to claim 28 or 29, wherein the speech signal is deemed to belong to a process, the process being a signal which may extend over one or more measurement frames, and wherein the process is an active process at a certain time if it extends over one or more preceding measurement frames, and wherein the smoothing means comprises means for determining whether a new formant frequency belongs to an active process.
An apparatus according to any of claims 28-30, wherein the smoothing means comprises means for filtering the f, b and g parameters.
An apparatus according to any of claims 28-30, further comprising

determining means for determining whether voiced speech is present,

creating means for creating a synthetic glottal pulse by using a noise eliminated pitch period, and

replacing means for replacing at least part of the original speech signal by said synthetic glottal pulse in case voiced speech is present.
An apparatus according to claim 32, wherein the determining means comprises comparing means for comparing the variance of the speech signal to an upper threshold value and to a lower threshold value, and wherein voiced speech is present in case the variance of the speech signal exceeds the lower threshold value.
An apparatus according to claim 32 or 33, wherein the determining means comprises comparing means for comparing the first formant gain of the speech signal to an upper threshold value and to a lower threshold value, and wherein voiced speech is present in case the first formant gain of the speech signal exceeds the lower threshold value.
An apparatus according to any of claims 32-34, further comprising means for producing a noise eliminated pitch period from a residual signal of the speech signal.