WO2001031640A1 - Elimination of noise from a speech signal - Google Patents


Info

Publication number
WO2001031640A1
Authority
WO
WIPO (PCT)
Prior art keywords
noise
input signal
spectral components
correlation
signal
Prior art date
Application number
PCT/EP2000/010713
Other languages
English (en)
French (fr)
Inventor
Chao-Shih J. Huang
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Priority to JP2001534144A priority Critical patent/JP2003513320A/ja
Priority to EP00979526A priority patent/EP1141949A1/en
Publication of WO2001031640A1 publication Critical patent/WO2001031640A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise

Definitions

  • the invention relates to a method for reducing noise in a noisy time-varying input signal, such as a speech signal.
  • the invention further relates to an apparatus for reducing noise in a noisy time-varying input signal.
  • the presence of noise in a time-varying input signal hinders the accuracy and quality of processing the signal. This is particularly the case for processing a speech signal, such as for instance occurs when a speech signal is encoded.
  • the presence of noise is even more destructive if the signal is not ultimately presented to a user, who can cope relatively well with noise, but is instead processed automatically, as is for instance the case with a speech signal that is recognized automatically.
  • automatic speech recognition and coding systems are used increasingly. Although the performance of such systems is continuously improving, it is desired that the accuracy be increased further, particularly in adverse environments, such as those with a low signal-to-noise ratio (SNR) or a low-bandwidth signal.
  • speech recognition systems compare a representation of an input speech signal against a model Λx of reference signals, such as hidden Markov models (HMMs) built from representations of a training speech signal.
  • the representations are usually observation vectors with LPC or cepstral components.
  • the reference signals are usually relatively clean (high SNR, high bandwidth), whereas the input signal during actual use is distorted (lower SNR and/or lower bandwidth). It is, therefore, desired to eliminate at least part of the noise present in the input signal in order to obtain a noise-suppressed signal.
  • the conventional spectral subtraction technique involves determining the spectral components of the noisy speech and estimating the spectral components of the noise.
  • the spectral components may, for instance, be calculated using a Fast Fourier transform (FFT).
  • the noise spectral components may be estimated once from a part of a signal with predominantly representative noise.
  • alternatively, the noise is estimated 'on-the-fly', for instance each time a 'silent' part with no significant amount of speech signal is detected in the input signal.
  • the noise-suppressed speech is estimated by subtracting an average noise spectrum from the noisy speech spectrum: S(w;m) = Y(w;m) − N̄(w;m), where S(w;m), Y(w;m), and N(w;m) are the magnitude spectrums of the estimated speech s, the noisy speech y, and the noise n, and w and m are the frequency and time indices, respectively.
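The subtraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented method itself: the single noise-only frame, the zero floor on the magnitudes, and the reuse of the noisy phase are assumptions.

```python
import numpy as np

def spectral_subtraction(noisy_frames, noise_frame):
    """Subtract an estimated noise magnitude spectrum from each noisy
    frame, keep the noisy phase, and return time-domain frames."""
    Y = np.fft.rfft(noisy_frames, axis=1)        # noisy-speech spectra Y(w; m)
    N_bar = np.abs(np.fft.rfft(noise_frame))     # noise magnitude estimate |N(w)|
    S_mag = np.maximum(np.abs(Y) - N_bar, 0.0)   # |S| = |Y| - |N|, floored at zero
    S = S_mag * np.exp(1j * np.angle(Y))         # re-attach the noisy phase
    return np.fft.irfft(S, n=noisy_frames.shape[1], axis=1)
```

A real front-end would average the noise spectrum over several 'silent' frames and update it on-the-fly, as described above.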
  • taking the correlation between speech and noise into account, the correlation equation is given by: |Y(k)|² = |S(k)|² + 2ρsn|S(k)||N(k)| + |N(k)|², where ρsn is the correlation coefficient between the speech s and the noise n.
  • the correlation coefficient ρsn may be fixed, for instance based on analyzing representative input signals.
  • alternatively, the correlation coefficient ρsn is estimated based on the actual input signal.
  • the estimation is based on minimizing a negative spectrum ratio.
  • the expected negative spectrum ratio R is defined as:
  • the correlation coefficient is advantageously obtained by the following gradient operation.
  • the correlation coefficient can be learned along the direction of NSR decrement. Preferably, this is done in an iterative algorithm.
  • the equation representing the correlated spectral subtraction may be solved directly. Preferably, the equation is solved in an iterative manner, improving the estimate of the clean speech.
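Assuming the correlated model takes the quadratic form |Y(k)|² = |S(k)|² + 2ρ·|S(k)||N(k)| + |N(k)|² (an assumption about the patent's equation, which is not reproduced in this extraction), the direct one-step solution picks the positive root of the quadratic in |S(k)|:

```python
import math

def css_direct(y_mag, n_mag, rho):
    """One-step correlated spectral subtraction for a single component:
    solve |Y|^2 = |S|^2 + 2*rho*|S|*|N| + |N|^2 for |S| and take the
    positive root (floored at zero when no real positive root exists)."""
    disc = (rho * n_mag) ** 2 + y_mag ** 2 - n_mag ** 2
    if disc <= 0.0:
        return 0.0
    return max(-rho * n_mag + math.sqrt(disc), 0.0)
```

With rho = 0 this reduces to conventional power spectral subtraction, sqrt(|Y|² − |N|²).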
  • the figure shows a block diagram of a conventional speech processing system wherein the invention can be used.
  • the noise reduction according to the invention is particularly useful for processing noisy speech signals, such as coding such a signal or automatically recognizing such a signal.
  • a person skilled in the art can equally well apply the noise elimination technique in a speech coding system.
  • Speech recognition systems such as large vocabulary continuous speech recognition systems, typically use a collection of recognition models to recognize an input pattern. For instance, an acoustic model and a vocabulary may be used to recognize words and a language model may be used to improve the basic recognition result.
  • the figure illustrates a typical structure of a large vocabulary continuous speech recognition system 100. The following definitions are used for describing the system and recognition method: Λx: a set of trained speech models
  • the system 100 comprises a spectral analysis subsystem 110 and a unit matching subsystem 120. The spectral analysis subsystem 110 analyzes the speech input signal (SIS) and produces a sequence of observation vectors (OV).
  • the speech signal is digitized (e.g. sampled at a rate of 6.67 kHz) and pre-processed, for instance by applying pre-emphasis.
  • Consecutive samples are grouped (blocked) into frames corresponding to, for instance, 32 msec of speech signal.
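A sketch of this pre-processing stage; the pre-emphasis factor 0.98, the non-overlapping frames, and the 213-sample frame length (roughly 32 ms at 6.67 kHz) are illustrative assumptions.

```python
def preprocess(samples, alpha=0.98, frame_len=213):
    """Pre-emphasise the signal (y[t] = x[t] - alpha * x[t-1]) and
    block consecutive samples into non-overlapping frames; real
    front-ends usually use overlapping frames."""
    emphasised = [samples[0]] + [samples[t] - alpha * samples[t - 1]
                                 for t in range(1, len(samples))]
    n_frames = len(emphasised) // frame_len
    return [emphasised[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]
```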
  • for each frame an observation vector is computed, for instance with Linear Predictive Coding (LPC) or cepstral components.
  • an acoustic model provides the first term of equation (a): W' = arg maxW P(Y|W)·P(W), which selects the word sequence W that maximizes the product of the acoustic and language model scores for the observed vectors Y.
  • the acoustic model is used to estimate the probability P(Y|W) of a sequence of observation vectors Y for a given word string W.
  • a speech recognition unit is represented by a sequence of acoustic references. Various forms of speech recognition units may be used. As an example, a whole word or even a group of words may be represented by one speech recognition unit.
  • a word model (WM) provides for each word of a given vocabulary a transcription in a sequence of acoustic references.
  • a whole word is represented by a speech recognition unit, in which case a direct relationship exists between the word model and the speech recognition unit.
  • a word model is given by a lexicon 134, describing the sequence of sub-word units relating to a word of the vocabulary, and the sub-word models 132, describing sequences of acoustic references of the involved speech recognition unit.
  • a word model composer 136 composes the word model based on the sub-word model 132 and the lexicon 134.
  • the (sub-)word models are typically based on Hidden Markov Models (HMMs), which are widely used to stochastically model speech signals.
  • each recognition unit (word model or sub-word model) is typically characterized by an HMM whose parameters are estimated from a training set of data.
  • An HMM state corresponds to an acoustic reference.
  • Various techniques are known for modeling a reference, including discrete or continuous probability densities.
  • Each sequence of acoustic references which relates to one specific utterance is also referred to as an acoustic transcription of the utterance. It will be appreciated that if recognition techniques other than HMMs are used, details of the acoustic transcription will be different.
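The way an HMM scores a sequence of observations against a model can be illustrated with the forward algorithm over a toy discrete-emission HMM. Real systems use continuous densities over observation vectors; the deterministic start in state 0 is an illustrative convention.

```python
def forward_likelihood(trans, emit, obs):
    """Forward algorithm: P(obs | model) for a discrete-emission HMM.
    trans[i][j] is the i -> j transition probability, emit[i][o] the
    probability of emitting symbol o in state i; decoding starts in
    state 0 with probability 1."""
    n_states = len(trans)
    alpha = [emit[i][obs[0]] if i == 0 else 0.0 for i in range(n_states)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][o]
                 for j in range(n_states)]
    return sum(alpha)
```

In a recognizer, such likelihoods are computed for every candidate word model and the best-scoring sequence is kept.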
  • a word level matching system 130 of the figure matches the observation vectors against all sequences of speech recognition units and provides the likelihoods of a match between the vector and a sequence. If sub-word units are used, constraints can be placed on the matching by using the lexicon 134 to limit the possible sequences of sub-word units to sequences in the lexicon 134. This reduces the outcome to possible sequences of words.
  • a sentence level matching system 140 may be used which, based on a language model (LM), places further constraints on the matching so that the paths investigated are those corresponding to word sequences which are proper sequences as specified by the language model.
  • the language model provides the second term P(W) of equation (a).
  • combining the results of the acoustic model with those of the language model yields the outcome of the unit matching subsystem 120: a recognized sentence (RS) 152.
  • the language model used in pattern recognition may include syntactical and/or semantical constraints 142 of the language and the recognition task.
  • a language model based on syntactical constraints is usually referred to as a grammar 144.
  • N-gram word models are widely used, in which P(wj | w1w2w3...wj−1) is approximated by P(wj | wj−N+1...wj−1).
  • in practice, bigrams or trigrams are used; for a trigram, P(wj | w1w2w3...wj−1) is approximated by P(wj | wj−2wj−1).
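A maximum-likelihood bigram model, the simplest practical instance of the N-gram approximation above, can be estimated from counts. The sentence-start token `<s>` is an illustrative convention, not from the patent.

```python
from collections import Counter

def bigram_model(sentences):
    """Maximum-likelihood bigram estimates P(w_j | w_{j-1}) from a toy
    corpus, using <s> as an explicit sentence-start token."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words)
        unigrams.update(padded[:-1])               # count each history word
        bigrams.update(zip(padded[:-1], padded[1:]))  # count word pairs
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}
```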
  • the speech processing system may be implemented using conventional hardware.
  • a speech recognition system may be implemented on a computer, such as a PC, where the speech input is received via a microphone and digitized by a conventional audio interface card. All additional processing takes place in the form of software procedures executed by the CPU.
  • the speech may be received via a telephone connection, e.g. using a conventional modem in the computer.
  • the speech processing may also be performed using dedicated hardware, e.g. built around a DSP.
  • the noise elimination according to the invention may be performed in a preprocessing step before the spectral analysis subsystem 110.
  • alternatively, the noise elimination is integrated in the spectral analysis subsystem 110, for instance to avoid requiring several conversions from the time domain to the spectral domain and vice versa.
  • All hardware and processing capabilities for performing the invention are normally present in a speech recognition or speech coding system.
  • the noise elimination technique according to the invention is normally executed on a processor, such as a DSP or microprocessor of a personal computer, under control of a suitable program. Programming the elementary functions of the noise elimination technique, such as performing a conversion from the time domain to the spectral domain, falls well within the range of a skilled person.
  • the speech signal y can be transformed into a set of spectral components Y(k). It will be appreciated that if a suitable conversion to the spectral domain has already taken place, it is sufficient to retrieve the spectral components resulting from that conversion.
  • let |S(k)|, |N(k)|, and |Y(k)| be the corresponding magnitude spectrums of the time-domain signals s, n, and y, respectively.
  • individual spectral components are forced to be positive. It does not allow the situation wherein an individual spectral component Y(k) of the noisy speech y is less than the corresponding spectral component N(k) of the noise signal n.
  • Equation (8) has two possible solutions. The positive solution will be chosen, since the direction of NSR decrement is preferred.
  • a preferred iterative algorithm for estimating |S(k)| with a specified correlation coefficient ρsn is as follows: LOOP k (0 : N−1)
  • the outer loop k deals with all individual spectral components.
  • the inner loop is performed until the iteration has converged (no significant change occurs anymore in the estimated speech).
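The two-loop structure can be sketched as a per-component fixed-point iteration. The specific update s ← sqrt(|Y|² − |N|² − 2ρ·s·|N|), the initialisation with the noisy magnitude, and the convergence tolerance are assumptions, since the patent's equations are not reproduced in this extraction.

```python
import math

def iterative_css(Y_mag, N_mag, rho, tol=1e-9, max_iter=200):
    """Outer loop over spectral components k; inner loop repeats a
    fixed-point update until the estimate stops changing significantly."""
    S = []
    for k in range(len(Y_mag)):               # LOOP k (0 : N-1)
        y, n = Y_mag[k], N_mag[k]
        s = y                                 # start from the noisy magnitude
        for _ in range(max_iter):
            s_new = math.sqrt(max(y * y - n * n - 2.0 * rho * s * n, 0.0))
            if abs(s_new - s) < tol:          # converged: no significant change
                s = s_new
                break
            s = s_new
        S.append(s)
    return S
```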
  • the correlation coefficient ρsn is estimated based on the actual input signal y.
  • the function of negative spectrum ratio (NSR) for the correlated spectral subtraction algorithm according to the invention is defined as follows:
  • the fNS function shown in equation (5) is a zero-one function.
  • a smoothed zero-one (sigmoid) function family is preferably used.
  • the following sigmoid function is advantageously used for further derivation due to its differentiability.
  • the correlation coefficient is preferably obtained by the following gradient operation:
  • the correlation coefficient can be learned along the direction of decrease in NSR. This implies reducing the residual noise in the estimated spectrum using the proposed correlated spectral subtraction (CSS) algorithm.
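A numerical sketch of learning the correlation coefficient along the direction of NSR decrement. The sigmoid smoothing, the sharpness `beta`, the |S| ≈ |Y| proxy inside the score, and the [−1, 1] clamp are all assumptions, not the patent's actual derivation.

```python
import math

def smoothed_nsr(Y_mag, N_mag, rho, beta=5.0):
    """Smoothed negative-spectrum ratio: a sigmoid scores how strongly
    |Y|^2 - |N|^2 - 2*rho*|Y|*|N| dips below zero for each component
    (|S| is crudely approximated by |Y| inside the score)."""
    total = 0.0
    for y, n in zip(Y_mag, N_mag):
        d = y * y - n * n - 2.0 * rho * y * n
        total += 1.0 / (1.0 + math.exp(beta * d))   # ~1 when d << 0
    return total / len(Y_mag)

def learn_rho(Y_mag, N_mag, rho=0.0, lr=0.05, steps=100, eps=1e-4):
    """Move rho along the direction of NSR decrement using a numerical
    gradient, keeping it inside [-1, 1]."""
    for _ in range(steps):
        grad = (smoothed_nsr(Y_mag, N_mag, rho + eps)
                - smoothed_nsr(Y_mag, N_mag, rho - eps)) / (2.0 * eps)
        rho = max(-1.0, min(1.0, rho - lr * grad))
    return rho
```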
  • the block indicated as block 1 is the same as used for the iterative algorithm assuming a fixed correlation coefficient ρsn.
  • instead of the iterative solution in block 1, also the one-step solution of equations (7) or (8) may be used.
  • the resulting estimated spectral components of the noise-eliminated signal may be converted back to the time-domain.
  • the spectral components may be used directly for the subsequent further processing, like coding or automatically recognizing the signal.

PCT/EP2000/010713 1999-10-29 2000-10-27 Elimination of noise from a speech signal WO2001031640A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2001534144A JP2003513320A (ja) 1999-10-29 2000-10-27 音声信号からの雑音の消去 (Elimination of noise from a speech signal)
EP00979526A EP1141949A1 (en) 1999-10-29 2000-10-27 Elimination of noise from a speech signal

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP99203565.9 1999-10-29
EP99203565 1999-10-29

Publications (1)

Publication Number Publication Date
WO2001031640A1 true WO2001031640A1 (en) 2001-05-03

Family

ID=8240796

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2000/010713 WO2001031640A1 (en) 1999-10-29 2000-10-27 Elimination of noise from a speech signal

Country Status (3)

Country Link
EP (1) EP1141949A1 (ja)
JP (1) JP2003513320A (ja)
WO (1) WO2001031640A1 (ja)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7596495B2 (en) * 2004-03-30 2009-09-29 Yamaha Corporation Current noise spectrum estimation method and apparatus with correlation between previous noise and current noise signal

Citations (1)

Publication number Priority date Publication date Assignee Title
US5749068A (en) * 1996-03-25 1998-05-05 Mitsubishi Denki Kabushiki Kaisha Speech recognition apparatus and method in noisy circumstances

Non-Patent Citations (1)

Title
HUANG J. ET AL.: "An energy-constrained signal subspace method for speech enhancement and recognition in white and colored noises", Speech Communication, Elsevier, Amsterdam, NL, vol. 26, no. 3, November 1998, pages 165-181, XP004152155, ISSN 0167-6393

Also Published As

Publication number Publication date
JP2003513320A (ja) 2003-04-08
EP1141949A1 (en) 2001-10-10


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

WWE Wipo information: entry into national phase

Ref document number: 2000979526

Country of ref document: EP

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 09869386

Country of ref document: US

ENP Entry into the national phase

Ref country code: JP

Ref document number: 2001 534144

Kind code of ref document: A

Format of ref document f/p: F

WWP Wipo information: published in national office

Ref document number: 2000979526

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2000979526

Country of ref document: EP