EP1141949A1 - Elimination of noise from a speech signal - Google Patents
- Publication number
- EP1141949A1 (application EP00979526A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- noise
- input signal
- spectral components
- correlation
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
Definitions
- Equation (8) has two possible solutions. The positive solution will be chosen, since the direction of NSR decrement is preferred.
- a preferred iterative algorithm for estimating |S(k)| with a specified correlation coefficient γsn is as follows: LOOP k (0 : N-1)
- the outer loop k deals with all individual spectral components.
- the inner loop is performed until the iteration has converged (no significant change occurs anymore in the estimated speech).
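The loop described above can be sketched as a per-component fixed-point iteration. The update rule below is one plausible realisation for a = 2 (an assumption made here, since the text gives the loop only in outline); all names are illustrative:

```python
import numpy as np

def css_iterative(y_mag, n_mag, gamma, max_iter=100, tol=1e-8):
    """Iterative estimate of |S(k)| from the correlation equation (a = 2):
        |Y|^2 = |S|^2 + |N|^2 + gamma*|S||N|
    The fixed-point update assumed here,
        |S| <- sqrt(max(|Y|^2 - |N|^2 - gamma*|S||N|, 0)),
    is repeated per component until no significant change occurs."""
    s = y_mag.astype(float).copy()   # initial estimate: the noisy spectrum
    for _ in range(max_iter):
        s_new = np.sqrt(np.maximum(y_mag ** 2 - n_mag ** 2 - gamma * s * n_mag, 0.0))
        done = np.max(np.abs(s_new - s)) < tol
        s = s_new
        if done:
            break
    return s

s_hat = css_iterative(np.array([3.0]), np.array([1.0]), gamma=-0.5)
```

At convergence the estimate satisfies the correlation equation to within the tolerance, which matches the "no significant change" stopping criterion.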
- the correlation coefficient γsn is estimated based on the actual input signal y.
- the function of negative spectrum ratio (NSR) for the correlated spectral subtraction algorithm according to the invention is defined as follows:
- the f_NS function shown in equation (5) is a zero-one function.
- a smoothed zero-one function from the sigmoid family is preferably used.
- the following sigmoid function is advantageously used for the further derivation due to its differentiability.
- the correlation coefficient is preferably obtained by the following gradient operation:
- the correlation coefficient can be learned along the direction of decrease in NSR. This implies reducing the residual noise in the estimated spectrum using the proposed correlated spectral subtraction (CSS) algorithm.
- the block indicated as block 1 is the same as used for the iterative algorithm assuming a fixed correlation coefficient γsn.
- instead of the iterative solution in block 1, also the one-step solution of equations (7) or (8) may be used.
- the resulting estimated spectral components of the noise-eliminated signal may be converted back to the time-domain.
- the spectral components may be used directly for the subsequent further processing, like coding or automatically recognizing the signal.
Abstract
A method for reducing noise in a noisy time-varying speech input signal y includes receiving the input signal y and deriving a plurality of spectral component signals representing respective magnitudes |Y(k)| of spectral components of the input signal y. A correlation coefficient γsn is obtained which indicates a correlation in the spectral domain between a clean speech signal component s and a noise signal component n present in the input signal y (y = s + n). Magnitudes of respective noise-suppressed spectral components |Ŝ(k)| are estimated by solving a correlation equation which gives a relationship between the magnitudes of the respective spectral components |Y(k)| of the noisy input signal y, the spectral components |S(k)| of the clean speech signal s, and the spectral components |N(k)| of the noise signal n, where the equation includes the correlation based on the obtained correlation coefficient γsn. Preferably, the correlation equation is given by |Y(k)|^a = |S(k)|^a + |N(k)|^a + γsn|S(k)||N(k)|.
Description
Elimination of noise from a speech signal.
The invention relates to a method for reducing noise in a noisy time-varying input signal, such as a speech signal. The invention further relates to an apparatus for reducing noise in a noisy time-varying input signal.
The presence of noise in a time-varying input signal hinders the accuracy and quality of processing the signal. This is particularly the case for processing a speech signal, such as for instance occurs when a speech signal is encoded. The presence of noise is even more destructive if the signal is ultimately not presented to a user, who can cope relatively well with the presence of noise, but is instead processed automatically, as for instance is the case with a speech signal that is recognized automatically. Automatic speech recognition and coding systems are used increasingly. Although the performance of such systems is continuously improving, it is desired that the accuracy be increased further, particularly in adverse environments, such as environments with a low signal-to-noise ratio (SNR) or a low-bandwidth signal. Normally, speech recognition systems compare a representation of an input speech signal against a model Λx of reference signals, such as hidden Markov models (HMMs) built from representations of a training speech signal. The representations are usually observation vectors with LPC or cepstral components.
In practice a mismatch exists between the conditions under which the reference signals (and thus the models) were obtained and the input signal conditions. The reference signals are usually relatively clean (high SNR, high bandwidth), whereas the input signal during actual use is distorted (lower SNR, and/or lower bandwidth). It is, therefore, desired to eliminate at least part of the noise present in the input signal in order to obtain a noise- suppressed signal.
A conventional way of estimating a noise-suppressed speech signal ('clean' speech) is to use a spectral subtraction technique. In the discrete-time domain, noisy speech y can be represented as:

y(i) = s(i) + n(i), 0 ≤ i ≤ T-1, (1)

where s, n, y denote clean speech, noise and noisy speech respectively, and where T denotes the length of the speech and i is the time index. The conventional spectral subtraction technique involves determining the spectral components of the noisy speech and estimating the spectral components of the noise. The spectral components may, for instance, be calculated using a Fast Fourier transform (FFT). The noise spectral components may be estimated once from a part of a signal with predominantly representative noise. Preferably, the noise is estimated 'on-the-fly', for instance each time a 'silent' part is detected in the input signal with no significant amount of speech signal. In the general spectral subtraction technique, the noise-suppressed speech is estimated by subtracting an average noise spectrum from the noisy speech spectrum:

|Ŝ(w;m)|^a = |Y(w;m)|^a - |N(w;m)|^a (2)

where Ŝ(w;m), Y(w;m), and N(w;m) are the magnitude spectra of the estimated speech s, the noisy speech y and the (averaged) noise n, and w and m are the frequency and time indices, respectively. The case of a = 2 is referred to as power spectral subtraction. The subtraction is usually called magnitude spectral subtraction if a = 1.
Due to the subtraction, the estimated spectrum is not guaranteed to be positive in the conventional spectral subtraction techniques. US 5,749,068 describes setting those spectral components to zero for which the subtraction yields a negative outcome:

S(w) = max{Y(w) - a·N(w), 0} (3)

Setting the spectral components to zero (or a low default value) is referred to as 'taking floor' for the negative spectral components. The parameter a, with a positive value, designates the degree of eliminating noise components. US 5,749,068 describes an advanced way of estimating the spectral components of the noise, but still the conventional spectral subtraction of equation (3) is used.
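The conventional subtraction with flooring per equation (3) can be sketched as follows (a minimal NumPy illustration, not the patent's implementation; the function and parameter names are chosen here for clarity):

```python
import numpy as np

def spectral_subtraction(y_mag, n_mag, a=2.0, oversub=1.0):
    """Conventional spectral subtraction with flooring.

    y_mag   : magnitude spectrum |Y(k)| of the noisy speech frame
    n_mag   : average noise magnitude spectrum estimated from 'silent' parts
    a       : 2 for power, 1 for magnitude spectral subtraction
    oversub : degree of noise elimination (the positive parameter of eq. (3))
    """
    diff = y_mag ** a - oversub * n_mag ** a
    # 'Taking floor': components with a negative outcome are set to zero.
    return np.maximum(diff, 0.0) ** (1.0 / a)

y = np.array([3.0, 1.0, 2.0])   # toy noisy magnitudes
n = np.array([1.0, 2.0, 1.5])   # toy noise estimate
s_hat = spectral_subtraction(y, n)   # middle component is floored to 0
```

The flooring in the last step is exactly the operation that introduces the residual noise discussed below.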
Taking floor for negative spectral components is a major limitation of spectral subtraction techniques, as it introduces residual noise with musical-tone artifacts into the estimated speech.
In order to investigate the limitation of the conventional spectral subtraction techniques, the inventor has carried out an experiment calculating the ratio of negative spectrum (i.e. the relative number of spectral components which would have a negative value). The negative spectrum ratio NSRcon for the conventional spectral subtraction technique is defined as follows:

NSRcon = (1/M) Σk f_NS( |Y(k)|^a - |N(k)|^a ) (4)

f_NS(x) = 1 if x < 0, 0 otherwise. (5)

where |Y(k)| is the corresponding magnitude spectrum of the testing speech y, |N(k)| is the noise spectrum estimated from a pause (non-speech segment), k denotes the k-th spectral component and M represents the total number of spectral components over which the ratio is determined, for instance the number of spectral components in one frame or in the whole testing utterance.
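As a sketch, the negative spectrum ratio NSRcon and its zero-one function (equation (5)) can be computed from the magnitude spectra as follows (illustrative names; the frame/utterance handling of the actual experiment is not reproduced):

```python
import numpy as np

def negative_spectrum_ratio(y_mag, n_mag, a=2.0):
    """Fraction of spectral components for which the conventional
    subtraction |Y(k)|^a - |N(k)|^a would go negative."""
    diff = y_mag ** a - n_mag ** a
    f_ns = (diff < 0).astype(float)   # zero-one function of equation (5)
    return f_ns.mean()

y = np.array([3.0, 1.0, 2.0, 0.5])   # toy testing-speech magnitudes
n = np.array([1.0, 2.0, 1.5, 1.0])   # toy pause-estimated noise magnitudes
nsr = negative_spectrum_ratio(y, n)  # 2 of 4 components negative -> 0.5
```

A high ratio means many components would be floored, i.e. much residual noise.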
The following table gives the negative spectrum ratio NSRcon for various signal-to-noise ratios (SNRs) with a = 2. It has been found that the negative spectrum ratio NSRcon reaches as much as 34.6% even under clean conditions. This illustrates that particularly at higher SNR levels the conventional spectral subtraction technique introduces some residual noise, limiting the use of the technique.
It is an object of the invention to overcome the limitation of the conventional spectral subtraction technique.
To meet the object of the invention, the method for reducing noise in a noisy time-varying input signal y, such as a speech signal, includes: receiving the noisy time-varying input signal;
deriving from the signal a plurality of spectral component signals representing respective magnitudes of spectral components of the input signal; obtaining a correlation coefficient γsn indicative of a correlation in the spectral domain between a clean speech signal component s and a noise signal component n present in the input signal (y = s + n); and estimating magnitudes of respective noise-suppressed spectral components |Ŝ(k)| by solving an equation giving a relationship between the magnitudes of the respective spectral components |Y(k)| of the noisy input signal y, the spectral components |S(k)| of the clean speech signal s, and the spectral components |N(k)| of the noise signal n, where the equation includes the correlation based on the obtained correlation coefficient γsn. Preferably, the correlation equation is given by:

|Y(k)|^a = |S(k)|^a + |N(k)|^a + γsn|S(k)||N(k)|

where a can be 1 or 2 for the magnitude and power spectrum, respectively. Instead of a conventional spectral subtraction, this equation, which is based on a correlation coefficient γsn between the clean speech s and the noise n in the spectral domain, is solved. Solving the equation can be seen as 'correlated spectral subtraction' (CSS).
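For a = 2 the correlation equation is a quadratic in |S(k)|, so a one-step solution per component is possible. The following sketch illustrates this; the flooring of components without a real non-negative root is an assumption made here, analogous to 'taking floor':

```python
import numpy as np

def css_solve(y_mag, n_mag, gamma):
    """One-step correlated spectral subtraction for a = 2.

    Solves  |Y|^2 = |S|^2 + |N|^2 + gamma*|S||N|  per component, i.e. the
    quadratic  |S|^2 + gamma*|N||S| + (|N|^2 - |Y|^2) = 0,  taking the
    positive root.  Components without a real non-negative root are
    floored to zero (assumed fallback)."""
    disc = (gamma * n_mag) ** 2 - 4.0 * (n_mag ** 2 - y_mag ** 2)
    s = 0.5 * (-gamma * n_mag + np.sqrt(np.maximum(disc, 0.0)))
    return np.maximum(s, 0.0)

y = np.array([3.0])
n = np.array([1.0])
s_hat = css_solve(y, n, gamma=-0.5)   # satisfies s^2 + 1 - 0.5*s = 9
```

With gamma = 0 the solution reduces to conventional power spectral subtraction with flooring.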
The correlation coefficient γsn may be fixed, for instance based on analyzing representative input signals. Preferably, the correlation coefficient γsn is estimated based on the actual input signal. Advantageously, the estimation is based on minimizing a negative spectrum ratio. Preferably, the expected negative spectrum ratio R is defined as:

where advantageously the 'zero-one' function f_NS is replaced by the differentiable sigmoid function:

f_NS(x) = 1 / (1 + exp(-α·x + β))
By applying the theory of adaptive learning algorithms, the correlation coefficient is advantageously obtained by the following gradient operation.
The correlation coefficient can be learned along the direction of NSR decrement. Preferably, this is done in an iterative algorithm.
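Since the exact expression for R and its gradient are not reproduced in this text, the following sketch only illustrates the idea of learning γsn along the direction of NSR decrement. The discriminant-based definition of a 'negative spectrum' event, the sign convention of the sigmoid, and the use of a numerical gradient are all assumptions made here for illustration:

```python
import numpy as np

def smoothed_f_ns(x, alpha=0.5, beta=0.0):
    # Smoothed stand-in for the zero-one function of equation (5):
    # close to 1 for x << 0 and to 0 for x >> 0 (sign convention assumed).
    return 1.0 / (1.0 + np.exp(alpha * x + beta))

def expected_nsr(gamma, y_mag, n_mag):
    # Assumed stand-in for the expected NSR R: a 'negative spectrum' event
    # is taken to be a negative discriminant of the CSS quadratic
    #   |S|^2 + gamma*|N||S| + (|N|^2 - |Y|^2) = 0,
    # smoothed so that R is differentiable in gamma.
    disc = (gamma * n_mag) ** 2 - 4.0 * (n_mag ** 2 - y_mag ** 2)
    return smoothed_f_ns(disc).mean()

def learn_gamma(y_mag, n_mag, gamma0=-0.1, lr=0.05, steps=200, eps=1e-4):
    # Move gamma along the direction of NSR decrement (numerical gradient).
    g = gamma0
    for _ in range(steps):
        grad = (expected_nsr(g + eps, y_mag, n_mag)
                - expected_nsr(g - eps, y_mag, n_mag)) / (2.0 * eps)
        g -= lr * grad
    return g

y = np.array([1.0, 0.5, 2.0])   # toy magnitudes with some |N| > |Y| components
n = np.array([1.5, 1.0, 1.0])
g = learn_gamma(y, n)           # R(g) is no larger than R(gamma0)
```

The iterative descent mirrors the preferred iterative learning of the coefficient described above.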
The equation representing the correlated spectral subtraction may be solved directly. Preferably, the equation is solved in an iterative manner, improving the estimate of the clean speech.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments shown in the drawings.
The figure shows a block diagram of a conventional speech processing system wherein the invention can be used.
General description of a speech recognition system
The noise reduction according to the invention is particularly useful for processing noisy speech signals, such as coding such a signal or automatically recognizing such a signal. Here a general description of a speech recognition system is given. A person skilled in the art can equally well apply the noise elimination technique in a speech coding system.
Speech recognition systems, such as large vocabulary continuous speech recognition systems, typically use a collection of recognition models to recognize an input pattern. For instance, an acoustic model and a vocabulary may be used to recognize words and a language model may be used to improve the basic recognition result. The figure illustrates a typical structure of a large vocabulary continuous speech recognition system 100. The following definitions are used for describing the system and recognition method: Λx: a set of trained speech models
X: the original speech which matches the model Λx
Y: the testing speech
ΛY: the matched models for the testing environment
W: the word sequence
S: the decoded sequences that can be words, syllables, sub-word units, states or mixture components, or other suitable representations. The system 100 comprises a spectral analysis subsystem 110 and a unit matching subsystem 120. In the spectral analysis subsystem 110 the speech input signal (SIS) is spectrally and/or temporally analyzed to calculate a representative vector of features (observation vector, OV). Typically, the speech signal is digitized (e.g. sampled at a rate of 6.67 kHz) and pre-processed, for instance by applying pre-emphasis. Consecutive samples are
grouped (blocked) into frames, corresponding to, for instance, 32 msec of speech signal. Successive frames partially overlap by, for instance, 16 msec. Often the Linear Predictive Coding (LPC) spectral analysis method is used to calculate for each frame a representative vector of features (observation vector). The feature vector may, for instance, have 24, 32 or 63 components. The standard approach to large vocabulary continuous speech recognition is to assume a probabilistic model of speech production, whereby a specified word sequence W = w1w2w3...wq produces a sequence of acoustic observation vectors Y = y1y2y3...yT. The recognition error can be statistically minimized by determining the sequence of words w1w2w3...wq which most probably caused the observed sequence of observation vectors y1y2y3...yT (over time t = 1,...,T), where the observation vectors are the outcome of the spectral analysis subsystem 110. This results in determining the maximum a posteriori probability:

max P(W|Y, Λx), for all possible word sequences W

By applying Bayes' theorem on conditional probabilities, P(W|Y, Λx) is given by:

P(W|Y, Λx) = P(Y|W, Λx)·P(W) / P(Y)

Since P(Y) is independent of W, the most probable word sequence is given by:

Ŵ = arg max P(Y|W, Λx)·P(W) (a)
In the unit matching subsystem 120, an acoustic model provides the first term of equation (a). The acoustic model is used to estimate the probability P(Y|W) of a sequence of observation vectors Y for a given word string W. For a large vocabulary system, this is usually performed by matching the observation vectors against an inventory of speech recognition units. A speech recognition unit is represented by a sequence of acoustic references. Various forms of speech recognition units may be used. As an example, a whole word or even a group of words may be represented by one speech recognition unit. A word model (WM) provides for each word of a given vocabulary a transcription in a sequence of acoustic references. In most small vocabulary speech recognition systems, a whole word is represented by a speech recognition unit, in which case a direct relationship exists between the word model and the speech recognition unit. In other small vocabulary systems, for instance used for recognizing a relatively large number of words (e.g. several hundreds), or in large vocabulary systems, use can be made of linguistically based sub-word units, such as phones, diphones or syllables, as well as derivative units, such as fenenes and fenones. For such systems, a word model is given by a lexicon 134, describing the sequence of sub-word units relating to a word of the vocabulary, and the sub-word models 132, describing sequences of
acoustic references of the involved speech recognition unit. A word model composer 136 composes the word model based on the sub-word models 132 and the lexicon 134. The (sub-)word models are typically based on Hidden Markov Models (HMMs), which are widely used to stochastically model speech signals. Using such an approach, each recognition unit (word model or sub-word model) is typically characterized by an HMM, whose parameters are estimated from a training set of data. For large vocabulary speech recognition systems usually a limited set of, for instance 40, sub-word units is used, since it would require a lot of training data to adequately train an HMM for larger units. An HMM state corresponds to an acoustic reference. Various techniques are known for modeling a reference, including discrete or continuous probability densities. Each sequence of acoustic references which relates to one specific utterance is also referred to as an acoustic transcription of the utterance. It will be appreciated that if other recognition techniques than HMMs are used, details of the acoustic transcription will be different.
A word level matching system 130 of the figure matches the observation vectors against all sequences of speech recognition units and provides the likelihoods of a match between the vectors and a sequence. If sub-word units are used, constraints can be placed on the matching by using the lexicon 134 to limit the possible sequences of sub-word units to sequences in the lexicon 134. This reduces the outcome to possible sequences of words.
Furthermore a sentence level matching system 140 may be used which, based on a language model (LM), places further constraints on the matching so that the paths investigated are those corresponding to word sequences which are proper sequences as specified by the language model. As such the language model provides the second term P(W) of equation (a). Combining the results of the acoustic model with those of the language model results in an outcome of the unit matching subsystem 120 which is a recognized sentence (RS) 152. The language model used in pattern recognition may include syntactical and/or semantical constraints 142 of the language and the recognition task. A language model based on syntactical constraints is usually referred to as a grammar 144. The grammar 144 used by the language model provides the probability of a word sequence W = w1w2w3...wq, which in principle is given by:

P(W) = P(w1)·P(w2|w1)·P(w3|w1w2)...P(wq|w1w2w3...wq-1).
Since in practice it is infeasible to reliably estimate the conditional word probabilities for all words and all sequence lengths in a given language, N-gram word models are widely used. In an N-gram model, the term P(wj | w1w2w3...wj-1) is approximated by P(wj | wj-N+1...wj-1). In practice, bigrams or trigrams are used. In a trigram, the term P(wj | w1w2w3...wj-1) is approximated by P(wj | wj-2wj-1).
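The trigram approximation can be illustrated with unsmoothed relative-count estimates on a toy corpus (real recognizers use smoothed, log-domain N-gram models; this sketch is not from the patent):

```python
from collections import Counter

# Toy corpus; trigram probabilities are estimated by relative counts.
corpus = "the cat sat on the mat the cat ate".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(w, w1, w2):
    # P(w | w1 w2) ~ count(w1 w2 w) / count(w1 w2)
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w)] / bigrams[(w1, w2)]

p = p_trigram("sat", "the", "cat")   # "the cat" occurs twice, once before "sat" -> 0.5
```

The truncated history is what makes the estimation feasible: only counts of length-2 and length-3 word windows are needed.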
The speech processing system according to the invention may be implemented using conventional hardware. For instance, a speech recognition system may be implemented on a computer, such as a PC, where the speech input is received via a microphone and digitized by a conventional audio interface card. All additional processing takes place in the form of software procedures executed by the CPU. In particular, the speech may be received via a telephone connection, e.g. using a conventional modem in the computer. The speech processing may also be performed using dedicated hardware, e.g. built around a DSP.
The noise elimination according to the invention may be performed in a preprocessing step before the spectral analysis subsystem 100. Preferably, the noise elimination is integrated in the spectral analysis subsystem 100, for instance to avoid several conversions from the time domain to the spectral domain and vice versa. All hardware and processing capabilities for performing the invention are normally present in a speech recognition or speech coding system. The noise elimination technique according to the invention is normally executed on a processor, such as a DSP or the microprocessor of a personal computer, under control of a suitable program. Programming the elementary functions of the noise elimination technique, such as performing a conversion from the time domain to the spectral domain, falls well within the range of a skilled person.
Detailed description of the invention
Details are given for speech signals. Other signals can be processed in a corresponding way. As described above, in the discrete-time domain noisy speech y can be represented as: y(i) = s(i) + n(i), 0 ≤ i ≤ T-1, (1) where s, n, y denote clean speech, noise and noisy speech respectively, and where T denotes the length of the speech and i is the time index. Using conventional techniques, such as a Fast Fourier Transform, the speech signal y can be transformed into a set of spectral components Y(k). It will be appreciated that if a suitable conversion to the spectral domain has already taken place, it is sufficient to retrieve the spectral components resulting from such a conversion.
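The decomposition of equation (1) and the transform to spectral components Y(k) can be sketched as follows. The synthetic sine-plus-noise signal, the frame length T, and the use of NumPy's real FFT are illustrative assumptions.

```python
import numpy as np

# Equation (1): y(i) = s(i) + n(i), 0 <= i <= T-1, on a synthetic frame.
T = 256
i = np.arange(T)
s = np.sin(2 * np.pi * 8 * i / T)      # stand-in for clean speech (8 cycles/frame)
rng = np.random.default_rng(0)
n = 0.1 * rng.standard_normal(T)       # additive noise
y = s + n                              # noisy speech

Y = np.fft.rfft(y)                     # spectral components Y(k)
mag_Y = np.abs(Y)                      # magnitude spectrum |Y(k)| (alpha = 1)
power_Y = mag_Y ** 2                   # power spectrum (alpha = 2)
```

The sinusoidal component concentrates its energy in a single bin, which dominates the magnitude spectrum despite the added noise.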
Let |S(k)|, |N(k)|, and |Y(k)| be the corresponding magnitude spectra of the time-domain signals s, n, and y, respectively. Using the conventional spectral subtraction techniques, individual spectral components are forced to be positive. This does not allow the situation wherein an individual spectral component |Y(k)| of the noisy speech y is less than the corresponding spectral component |N(k)| of the noise signal n.
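The flooring behaviour of conventional spectral subtraction described above can be sketched as follows. This is a minimal illustration; the function name and array-based interface are assumptions.

```python
import numpy as np

def spectral_subtraction(mag_Y, mag_N, alpha=2):
    """Conventional spectral subtraction: estimate |S(k)| from |Y(k)| and |N(k)|.

    The per-bin difference is floored at zero, so the situation
    |Y(k)| < |N(k)| is simply clipped away rather than modelled.
    """
    diff = mag_Y ** alpha - mag_N ** alpha
    return np.maximum(diff, 0.0) ** (1.0 / alpha)
```

A bin whose noisy magnitude falls below the noise estimate yields exactly zero, which is the limitation the correlated subtraction of equation (6) addresses.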
The following correlation is assumed to exist between the speech signal and the noise signal: |Y(k)|^α = |S(k)|^α + |N(k)|^α + γsn|S(k)||N(k)| (6) where γsn denotes the correlation coefficient of speech and noise in the spectral domain and α can be 1 or 2 for the magnitude and power spectrum, respectively. Using this correlation as the basis for estimating the clean speech spectrum (and as such using a correlated spectral subtraction) makes it possible to have the situation wherein |Y(k)|^α < |N(k)|^α if γsn < 0. Let |Ŝ(k)| and |N̂(k)| be the estimates of the magnitude spectra of the clean speech signal s and the noise signal n, respectively. Preferably, |N̂(k)| is estimated from pauses (non-speech segments). Based on equation (6), |S(k)| can be calculated by solving the equation in one step or by using an iterative algorithm. The one-step solutions are given in the following equations (7) and (8) for the cases wherein α=1 or α=2, respectively:
Equation (8) has two possible solutions. The positive solution will be chosen, since the direction of NSR decrement is preferred. A preferred iterative algorithm for estimating |S(k)| with a specified correlation coefficient γsn is as follows:
LOOP k (0 : N-1)
  Initialization: |Ŝ(0)(k)| = |Y(k)| - |N̂(k)| (9)
  LOOP ℓ
    estimate a new |Ŝ(ℓ+1)(k)| from equation (6), using the current estimate |Ŝ(ℓ)(k)| in the correlation term
    IF the change in the estimate < Threshold THEN STOP
    ELSE ℓ = ℓ + 1
  END LOOP ℓ
END LOOP k
The outer loop k deals with all individual spectral components. The inner loop is performed until the iteration has converged (no significant change occurs anymore in the estimated speech).
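The loop structure described above, with the subtraction-based initial estimate and the per-iteration update driven by equation (6), can be sketched as follows. The exact fixed-point update is one plausible reading of the reconstructed equation (6) for α=2, not the patent's literal iteration; the threshold and iteration cap are assumptions.

```python
import numpy as np

def css_iterative(mag_Y, mag_N, gamma_sn, threshold=1e-6, max_iter=100):
    """Iterative per-bin estimate of |S(k)| for a fixed gamma_sn (alpha = 2)."""
    est = np.maximum(mag_Y - mag_N, 0.0)       # initial estimate |S^(0)(k)|
    for k in range(len(mag_Y)):                # outer LOOP k over spectral bins
        s = est[k]
        for _ in range(max_iter):              # inner LOOP l until convergence
            # auxiliary estimate from eq. (6), correlation term uses current s
            aux = mag_Y[k] ** 2 - mag_N[k] ** 2 - gamma_sn * s * mag_N[k]
            new_s = np.sqrt(max(aux, 0.0))
            if abs(new_s - s) < threshold:     # no significant change: STOP
                break
            s = new_s
        est[k] = s
    return est
```

At convergence each bin satisfies the reconstructed equation (6) (up to the threshold), so the returned spectrum is consistent with the assumed speech/noise correlation.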
The above described algorithm can be used for a fixed correlation coefficient γsn. In a further embodiment according to the invention, the correlation coefficient γsn is estimated based on the actual input signal y. To this end, the function of the negative spectrum ratio (NSR) for the correlated spectral subtraction algorithm according to the invention is defined as follows:
The f_NS function shown in equation (5) is a zero-one function. In order to derive the relation between the correlation coefficient γsn and the NSR, a smoothed zero-one (sigmoid) function family is preferably used. For example, the following function is advantageously used for further derivation due to its differentiability: f(x) = 1 / (1 + exp(-c·x + β)) (13)
Exemplary values for c and β are 1.0 and 0.0, respectively. Then, the expected negative spectrum ratio R is defined as follows:
By applying the theory of adaptive learning algorithms, the correlation coefficient is preferably obtained by the following gradient operation:
The correlation coefficient can be learned along the direction of decrease in NSR. This implies reducing the residual noise in the estimated spectrum using the proposed correlated spectral subtraction (CSS) algorithm.
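The gradient operation described above can be sketched as follows, using the sigmoid of equation (13) with the exemplary values c=1.0 and β=0.0. Since the patent's closed-form gradient and the exact NSR expression are not legible in this text, a numerical gradient and a plain-subtraction stand-in for the speech estimate are used; the learning rate and stopping threshold are likewise assumptions.

```python
import numpy as np

def sigmoid(x, c=1.0, beta=0.0):
    """Smoothed zero-one function of equation (13)."""
    return 1.0 / (1.0 + np.exp(-c * x + beta))

def smoothed_nsr(mag_Y, mag_N, gamma_sn):
    """Smoothed proportion of bins whose eq.-(6) solution would go negative.

    A bin goes negative when |Y|^2 - |N|^2 - gamma_sn*|S~||N| < 0; the plain
    subtraction estimate |S~| = max(|Y|-|N|, 0) stands in for the speech spectrum.
    """
    s_est = np.maximum(mag_Y - mag_N, 0.0)
    residual = mag_N ** 2 + gamma_sn * s_est * mag_N - mag_Y ** 2
    return float(np.mean(sigmoid(residual)))

def learn_gamma(mag_Y, mag_N, gamma0=0.1, lr=0.05, steps=50, eps=1e-4):
    """Gradient descent of the smoothed NSR with respect to gamma_sn."""
    gamma = gamma0
    for _ in range(steps):
        grad = (smoothed_nsr(mag_Y, mag_N, gamma + eps)
                - smoothed_nsr(mag_Y, mag_N, gamma - eps)) / (2 * eps)
        new_gamma = gamma - lr * grad          # move along decreasing NSR
        if abs(new_gamma - gamma) < 1e-6:      # stopping rule (Threshold 2)
            break
        gamma = new_gamma
    return gamma
```

The learned coefficient can then be fed back into the fixed-γsn estimation step, which is the coupling the minimum-NSR algorithm below exploits.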
The algorithm for estimating |Ŝ(k)| with a minimum-NSR based correlation coefficient γsn is as follows:
Initialization: m = 0; γsn(0) = non-zero initial guess
LOOP m
  block 1: estimate |Ŝ(k)| using γsn(m)
  update γsn(m+1) from γsn(m) by the gradient operation
  IF |γsn(m+1) - γsn(m)| < Threshold 2 THEN STOP
END LOOP m
The block indicated as block 1 is the same as used for the iterative algorithm assuming a fixed correlation coefficient γsn. Instead of using the iterative solution in block 1, the one-step solution of equations (7) or (8) may also be used.
It will be appreciated that after the noise has been eliminated as described above, the resulting estimated spectral components of the noise-eliminated signal may be converted back to the time-domain. Where possible the spectral components may be used directly for the subsequent further processing, like coding or automatically recognizing the signal.
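The conversion back to the time domain mentioned above can be sketched as follows. Combining the estimated magnitude with the noisy signal's phase is a common convention assumed here; the patent does not specify how the phase is obtained.

```python
import numpy as np

def to_time_domain(mag_S_est, Y):
    """Reconstruct a time-domain signal from an estimated magnitude spectrum.

    mag_S_est: estimated noise-suppressed magnitudes |S(k)|.
    Y: complex spectral components of the noisy input (its phase is reused).
    """
    phase = np.angle(Y)                       # keep the noisy signal's phase
    S_est = mag_S_est * np.exp(1j * phase)    # complex spectrum estimate
    return np.fft.irfft(S_est)                # inverse real FFT to time domain
```

If the magnitudes are passed through unchanged, the reconstruction is exact, so any difference from the input is attributable to the noise suppression itself.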
Claims
1. A method for reducing noise in a noisy time-varying input signal y, such as a speech signal; the method including: receiving the noisy time-varying input signal y; deriving from the input signal y a plurality of spectral component signals representing respective magnitudes |Y(k)| of spectral components of the input signal y; obtaining a correlation coefficient γsn indicative of a correlation in the spectral domain between a clean speech signal component s and a noise signal component n present in the input signal y (y = s + n); and estimating magnitudes of respective noise-suppressed spectral components S(k) by solving a correlation equation giving a relationship between the magnitudes of the respective spectral components |Y(k)| of the noisy input signal y, the spectral components |S(k)| of the clean speech signal s, and the spectral components |N(k)| of the noise signal n, where the equation includes the correlation based on the obtained correlation coefficient γsn.
2. The method as claimed in claim 1, wherein the correlation coefficient γsn is predetermined.
3. The method as claimed in claim 1, wherein the step of obtaining the correlation coefficient γsn includes estimating the correlation coefficient γsn.
4. The method as claimed in claim 3, wherein the step of estimating the correlation coefficient γsn includes determining a minimum negative spectrum ratio.
5. The method as claimed in claim 4, wherein the negative spectrum ratio NSR represents a proportion of spectral components S(k) which would be negative based on the solution of the correlation equation.
6. The method as claimed in claim 5, wherein the method includes: initializing the correlation coefficient γsn with a non-zero value; and
iteratively: performing the step of solving the correlation equation to obtain S(k), and estimating a new correlation coefficient based on a gradient descent of the negative spectrum ratio NSR for S(k).
7. The method as claimed in claim 1, wherein the step of solving the correlation equation includes iteratively estimating the noise-suppressed spectrum S(k).
8. The method as claimed in claim 7, wherein the method includes calculating an initial estimate of a magnitude of the noise-suppressed spectrum S(0)(k) by subtracting a magnitude of an estimate of the respective spectral components N(k) of the noise signal n from a magnitude of the respective spectral components Y(k) of the noisy input signal y.
9. The method as claimed in claim 7, wherein the step of performing the iterative spectrum estimation includes in each iteration: estimating a magnitude of an auxiliary noise-suppressed spectrum based on the correlation equation, where a term with the correlation coefficient γsn is based on a current estimate of a magnitude of the noise-suppressed spectrum S(ℓ)(k); and estimating a new magnitude of the noise-suppressed spectrum S(ℓ+1)(k) based on the estimated magnitude of the auxiliary noise-suppressed spectrum and on the current estimate of a magnitude of the noise-suppressed spectrum S(ℓ)(k).
10. An apparatus for reducing noise in a noisy time-varying input signal y, such as a speech signal; the apparatus including: an input for receiving the noisy time-varying input signal y; means for deriving from the input signal y a plurality of spectral component signals representing respective magnitudes |Y(k)| of spectral components of the input signal y; means for obtaining a correlation coefficient γsn indicative of a correlation in the spectral domain between a clean speech signal component s and a noise signal component n present in the input signal y (y = s + n); and
means for estimating magnitudes of respective noise-suppressed spectral components S(k) by solving a correlation equation giving a relationship between the magnitudes of the respective spectral components |Y(k)| of the noisy input signal y, the spectral components |S(k)| of the clean speech signal s, and the spectral components |N(k)| of the noise signal n, where the equation includes the correlation based on the obtained correlation coefficient γsn.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP00979526A EP1141949A1 (en) | 1999-10-29 | 2000-10-27 | Elimination of noise from a speech signal |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP99203565 | 1999-10-29 | ||
EP99203565 | 1999-10-29 | ||
PCT/EP2000/010713 WO2001031640A1 (en) | 1999-10-29 | 2000-10-27 | Elimination of noise from a speech signal |
EP00979526A EP1141949A1 (en) | 1999-10-29 | 2000-10-27 | Elimination of noise from a speech signal |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1141949A1 true EP1141949A1 (en) | 2001-10-10 |
Family
ID=8240796
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP00979526A Withdrawn EP1141949A1 (en) | 1999-10-29 | 2000-10-27 | Elimination of noise from a speech signal |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP1141949A1 (en) |
JP (1) | JP2003513320A (en) |
WO (1) | WO2001031640A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4434813B2 (en) * | 2004-03-30 | 2010-03-17 | 学校法人早稲田大学 | Noise spectrum estimation method, noise suppression method, and noise suppression device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3452443B2 (en) * | 1996-03-25 | 2003-09-29 | 三菱電機株式会社 | Speech recognition device under noise and speech recognition method under noise |
2000
- 2000-10-27 JP JP2001534144A patent/JP2003513320A/en active Pending
- 2000-10-27 EP EP00979526A patent/EP1141949A1/en not_active Withdrawn
- 2000-10-27 WO PCT/EP2000/010713 patent/WO2001031640A1/en not_active Application Discontinuation
Non-Patent Citations (1)
Title |
---|
See references of WO0131640A1 * |
Also Published As
Publication number | Publication date |
---|---|
JP2003513320A (en) | 2003-04-08 |
WO2001031640A1 (en) | 2001-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2216775B1 (en) | Speaker recognition | |
US6950796B2 (en) | Speech recognition by dynamical noise model adaptation | |
US6125345A (en) | Method and apparatus for discriminative utterance verification using multiple confidence measures | |
KR100766761B1 (en) | Method and apparatus for constructing voice templates for a speaker-independent voice recognition system | |
US20060165202A1 (en) | Signal processor for robust pattern recognition | |
JP4531166B2 (en) | Speech recognition method using reliability measure evaluation | |
US20060206321A1 (en) | Noise reduction using correction vectors based on dynamic aspects of speech and noise normalization | |
US8615393B2 (en) | Noise suppressor for speech recognition | |
EP1465154B1 (en) | Method of speech recognition using variational inference with switching state space models | |
JPH0850499A (en) | Signal identification method | |
EP0470245A1 (en) | Method for spectral estimation to improve noise robustness for speech recognition. | |
JP3451146B2 (en) | Denoising system and method using spectral subtraction | |
Chowdhury et al. | Bayesian on-line spectral change point detection: a soft computing approach for on-line ASR | |
JPH075892A (en) | Voice recognition method | |
EP1116219B1 (en) | Robust speech processing from noisy speech models | |
US20040064315A1 (en) | Acoustic confidence driven front-end preprocessing for speech recognition in adverse environments | |
AU776919B2 (en) | Robust parameters for noisy speech recognition | |
JP4705414B2 (en) | Speech recognition apparatus, speech recognition method, speech recognition program, and recording medium | |
GB2385697A (en) | Speech recognition | |
Liao et al. | Joint uncertainty decoding for robust large vocabulary speech recognition | |
Haton | Automatic speech recognition: A Review | |
FI111572B (en) | Procedure for processing speech in the presence of acoustic interference | |
Kotnik et al. | Efficient noise robust feature extraction algorithms for distributed speech recognition (DSR) systems | |
WO2001031640A1 (en) | Elimination of noise from a speech signal | |
JP2007508577A (en) | A method for adapting speech recognition systems to environmental inconsistencies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE |
17P | Request for examination filed |
Effective date: 20011105 |
17Q | First examination report despatched |
Effective date: 20030918 |
RBV | Designated contracting states (corrected) |
Designated state(s): DE FR GB |
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
18D | Application deemed to be withdrawn |
Effective date: 20050322 |