WO2000072305A2 - Procede et dispositif de reduction du bruit dans des signaux vocaux - Google Patents

Procede et dispositif de reduction du bruit dans des signaux vocaux

Info

Publication number
WO2000072305A2
Authority
WO
WIPO (PCT)
Prior art keywords
speech signal
speech
signal
parameters
threshold value
Prior art date
Application number
PCT/DK2000/000263
Other languages
English (en)
Other versions
WO2000072305A3 (fr)
Inventor
Kjeld Hermansen
Original Assignee
Noisecom Aps
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Noisecom Aps filed Critical Noisecom Aps
Priority to DE60017758T priority Critical patent/DE60017758D1/de
Priority to EP00925105A priority patent/EP1208561B1/fr
Priority to PCT/DK2000/000263 priority patent/WO2000072305A2/fr
Priority to AT00925105T priority patent/ATE288121T1/de
Priority to AU43943/00A priority patent/AU4394300A/en
Publication of WO2000072305A2 publication Critical patent/WO2000072305A2/fr
Publication of WO2000072305A3 publication Critical patent/WO2000072305A3/fr

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering

Definitions

  • the present invention relates to noise reduction in speech signals, in particular noise reduction in speech signals employed in telecommunication, most particularly in telecommunication employing cellular phones.
  • Noise when added to a speech signal can impair the quality of the signal, reduce intelligibility, and increase listener fatigue. It is therefore of great importance to reduce noise in a speech signal, e.g. in relation to telecommunication, especially when employing cellular phones, or in relation to hearing aids.
  • Known methods of noise reduction in speech signals include spectral subtraction and other filtering methods.
  • the noise reduction may e.g. be based on an estimate of the noise spectrum. Such methods depend on stationarity in the noise signal to perform optimally. As the noise in a speech signal is often non-stationary, the estimated noise spectrum used for spectral subtraction will be different from the actual noise spectrum during speech activity. This results in short duration random tones in the noise reduced signal, and such random tone noise tends to be very irritating to listen to due to psycho-acoustic effects.
  • WO 99/01942 discloses a method of reducing the noise in a speech signal using spectral subtraction. According to this method a model based representation describing the quasi- stationary part of the speech signal is generated and manipulated, and the resulting speech signal is generated using the manipulated model and a second signal derived from the speech signal.
  • the object of the present invention is to provide a method of noise reduction in speech signals which reduces the noise even more than known methods. It is a further object to provide a method of noise reduction in speech signals which reduces the noise without affecting the actual speech signal, i.e. a method which eliminates, or at least considerably reduces, unwanted components of a signal without, or at least only to a very limited extent, reducing wanted components.
  • the method according to the present invention employs a new model based spectral subtraction algorithm for noise suppression/noise reduction in speech.
  • This new algorithm benefits from available knowledge of the speech dynamics.
  • the method yields better results - especially for low Signal to Noise Ratios (SNR) - with fewer distortions and artefacts, such as musical tones, than other methods, e.g. the usual spectral subtraction.
  • noise suppression in speech processing such as speech coding has gained an increased importance due to the advent of digital cellular telephones. With the low data rate speech coding algorithms the speech quality tends to degrade drastically in high noise. To prevent such quality loss, noise suppression must be achieved without introducing artefacts, speech distortion or significant loss of speech intelligibility.
  • the noisy signal is modelled as a sum of the speech signal and the noise, assuming statistical independence.
  • the spectral subtraction provides an estimate of the signal spectrum as the difference between the noisy spectrum and an estimate of the noise/background spectrum, the latter being obtained during periods of silence (a minimal sketch of this classical scheme is given at the end of this section).
  • Transformation of the estimated speech signal to time domain requires knowledge of the phase of the signal. In most situations one uses the phase of the noisy signal. This works well for high Signal to Noise Ratios (SNR >10 dB). The problem is to use this phase for low SNR, which is a serious drawback of the classical spectral subtraction. A possibility to handle this problem is to use an alternative description of the signal.
  • the speech signal is decomposed into two components: a generator signal (the residual signal) and a filter modelling the vocal tract. This results in a separation of speech into a transient and a quasi-stationary part. Filtering of the smoothed residual signal in the synthesis filter produces the noise reduced output signal (a sketch of this decomposition is given at the end of this section).
  • Determination of the noise free/reduced synthesis filter is done by a combination of classical spectral subtraction and model based characterisation of the difference spectrum of noisy speech and background noise.
  • the noisy speech is separated into a quasi stationary part via Fast Fourier Transform (FFT) and a transient part (residual signal) via inverse filtering, instead of splitting the signal into amplitude and phase.
  • the auto correlation function of the quasi stationary part of the speech is mapped into an LPC model spectrum of order 10, and the so-called f, b and g parameters (f: formant frequency, b: bandwidth, g: gain) are determined from this spectrum. This is a pseudo decomposition of the spectrum into second order sections (a sketch of this parameter extraction is given at the end of this section).
  • a noise robust pitch detector combined with a synthetic glottal pulse generator produces the new residual signal for voiced sounds.
  • for unvoiced sounds, the residual signal is used as is.
  • This residual signal is input to the noise free/reduced synthesis filter with noise free/reduced speech as output.
  • the dynamics of the synthesis filter are now constrained via the f, b and g parameters to the range 1 Hz to 10 Hz eliminating the main part of the usual musical tones and leaving the signal/speech component almost unchanged.
  • the input to the synthesis filter depends on the SNR.
  • a robust pitch detector determines the period of the synthetic glottal pulses used as input to the synthesis filter.
  • the present invention relates to a method of noise reduction in a speech signal, comprising the steps of
  • the frequency is preferably the formant frequency of the speech signal.
  • the speech signal is preferably transmitted via a telecommunications means, most preferably via a cellular phone, but it may alternatively or additionally be transmitted via other means such as a hearing aid or other suitable microphone/speaker arrangements.
  • Such microphone/speaker arrangements may be connected to telephones and/or video conference arrangements, thus allowing the person or persons using such arrangements to move freely within a certain distance from the telephone/video conference arrangement in the room in which the telephone/video conference arrangement is positioned. When using existing similar arrangements this is not possible due to the noise generated in the signals.
  • Other suitable microphone/speaker arrangements may alternatively or additionally be employed.
  • the parameters are preferably smoothed electronically.
  • the dynamic information regarding the f, b and g parameters may comprise information regarding the duration of the speech signal.
  • Such information may relate to how long a sound (e.g. a word being spoken, voiced or unvoiced sounds) has been detectable. This may concern words as they are spoken and/or it may concern words after they have been spoken, so that the information regards, e.g., the time it takes to pronounce a word. If the information concerns how long a sound has been detectable as, e.g., the word is spoken, this information may be compared to knowledge regarding the time normally used to pronounce the word or a similar word, and this comparison may provide information regarding whether the word may be considered to be finished or not.
  • the dynamic information regarding the f, b and g parameters may alternatively or additionally comprise information regarding the difference in frequency between the present speech signal and a previously measured speech signal. If such information is compared to knowledge concerning the capability of the human voice to change frequency within a certain time interval, it may be determined whether the present speech signal is in fact the speech signal that was previously measured, i.e. whether the present speech signal and the previously measured speech signal are in fact one and the same. If the difference in frequency exceeds a certain limit, the limit being determined on the basis of knowledge of the human voice and its capability of changing frequency within a certain time interval, the two signals cannot be the same. If the difference in frequency does not exceed such a limit, the two signals may be the same.
  • the dynamic information regarding the f, b and g parameters most preferably contains knowledge regarding the development of said parameters in time.
  • the a priori knowledge regarding human speech production may comprise knowledge regarding the maximum duration of a speech signal as described above.
  • the a priori knowledge regarding human speech production may alternatively or additionally comprise knowledge regarding the maximum frequency span of a speech signal as described above.
  • the a priori knowledge regarding human speech production may be compared to measured parameters of the present speech signal as described above.
  • the a priori knowledge may be obtained e.g. from knowledge regarding the anatomy of the mouth and throat region and/or of the vocal cords.
  • the a priori knowledge may alternatively or additionally be based on a number of previous measurements of relevant parameters as described above. Such previous measurements, or alternatively or additionally a representative extract of such measurements, may be stored in look up tables.
  • look up tables are preferably stored electronically in a computer or the like, but may alternatively or additionally be stored in a printed medium such as a book or a sheet of paper.
  • the method comprises a step in which the speech signal is deemed to belong to a process, the process being a signal which may extend over one or more measurement frames.
  • the process is preferably a formant process. It may, e.g., correspond to the pronunciation of a word.
  • the process is an active process at a certain time if it extends over one or more preceding measurement frames. Thus, the process is active if there is a detectable signal.
  • a process may also be regarded as active if there is presently no detectable signal, but such a signal has been present for a predefined number of measurement frames preceding the present measurement frame. Thereby a process may be kept artificially alive even though the signal disappears for a short time interval.
  • the smoothing step may comprise the step of determining whether a new formant frequency belongs to an active process. This may be based on a comparison between the a priori knowledge regarding human speech production and the obtained dynamic information.
  • the method may in this case further comprise the step of defining a new process in case the new formant frequency does not belong to an active process, and the new formant frequency is then deemed to belong to said new process.
  • the process may be deemed to be inactive in case no new formant frequency is deemed to belong to said process. Thus, in case the signal is permanently terminated, the process is deemed to be inactive.
  • the method may further comprise the step of artificially maintaining the speech signal for a predetermined number of measurement frames in case the corresponding process is abruptly deemed to be inactive. This makes it possible to keep a process alive in case the signal is temporarily interrupted as described above.
  • the predetermined number of measurement frames may correspond to the maximum duration of the speech signal.
  • a process may be artificially maintained for a time interval corresponding to the time interval it normally takes to produce such a sound.
  • the maximum duration of the speech signal is preferably between 40 ms and 80 ms, such as between 50 ms and 70 ms, such as approximately 60 ms.
  • the new formant may be deemed to belong to an active process if the difference in frequency between said formant and said process does not exceed a predetermined level, as described above (a sketch of this assignment logic is given at the end of this section).
  • the predetermined level is preferably between 200 Hz and 600 Hz, such as between 300 Hz and 500 Hz, such as approximately 400 Hz.
  • the smoothing step preferably comprises the step of filtering the f, b and g parameters.
  • the filtering step is most preferably performed using a first order Infinite Impulse Response (IIR) filter, but it may alternatively or additionally be performed using any other suitable kind of filter.
  • the first order IIR filter is preferably a feedback filter of the form y(n) = a · y(n − 1) + b · x(n), wherein
  • x designates the speech signal (the filter input),
  • y designates the filter output, and
  • a and b are parameters to be determined, the parameters a and b preferably being determined by using model knowledge of the speech process (a sketch of this smoothing filter is given at the end of this section).
  • the method may further comprise the steps of
  • the pitch period may be noise eliminated by using known methods or in the manner described above. Most preferably, it is noise eliminated using known methods as well as in the manner described above.
  • the determining step may comprise the steps of comparing the variance of the speech signal to an upper threshold value and to a lower threshold value. Voiced speech is considered to be present in case the variance of the speech signal exceeds the lower threshold value.
  • in case the variance also exceeds the upper threshold value, the speech signal is considered to contain purely voiced speech, i.e. no unvoiced component is present. In this case the original speech signal is completely replaced by the synthetic glottal pulse.
  • in case the variance lies between the two threshold values, the original speech signal is replaced by a new pulse which is an appropriate combination of the synthetic glottal pulse and the original speech signal (a sketch of this decision and the associated cross-fade is given at the end of this section).
  • the determining step may comprise the steps of comparing the first formant gain of the speech signal to an upper threshold value and to a lower threshold value. Voiced speech is present in case the first formant gain of the speech signal exceeds the lower threshold value.
  • the noise eliminated pitch period is preferably found from a residual signal of the speech signal.
  • the replacing step is preferably performed by fading out a residual signal and fading in the synthetic glottal pulse.
  • the synthetic glottal pulse is to be understood as either a completely synthetic signal or an appropriate combination of the created synthetic pulse and the original speech signal as described above.
  • at least the smoothing step is preferably performed by a computer system, and the speech signal is most preferably generated in a cellular phone.
  • the present invention further relates to an apparatus for performing noise reduction in a speech signal, the apparatus comprising - means for obtaining dynamic information regarding frequency (f), bandwidth (b) and gain (g) parameters of a speech signal in relation to time,
  • the means for obtaining dynamic information regarding f, b and g parameters preferably comprises one or more suitable detectors, such as microphones, and/or one or more computers.
  • the smoothing means preferably comprises one or more computers and/or one or more suitable electronic filters, such as low pass filters and/or high pass filters and/or Infinite Impulse Response (IIR) filters and/or any other suitable kind of filters.
  • the means for obtaining and storing a priori knowledge regarding human speech production preferably comprises one or more computers, most preferably comprising electronic storage means. It may further comprise one or more look up tables, the tables being created by using empirically obtained data (i.e. previous measurements of relevant parameters such as the dynamic information mentioned above) and/or by using theoretical calculation of relevant parameters. Such calculations may be based on knowledge regarding the anatomy of the mouth and throat region of humans as described above.
  • the means for obtaining dynamic information regarding f, b and g parameters of a speech signal in relation to time may comprise means for obtaining information regarding the duration of the speech signal. This may include a timer.
  • the means for obtaining dynamic information regarding f, b and g parameters of a speech signal in relation to time may comprise means for obtaining information regarding the difference in frequency between the present speech signal and a previously measured speech signal.
  • This preferably includes means for measuring the frequency of the signal, such as a frequency meter, and means for comparing the measured frequency and a previously measured frequency.
  • it preferably comprises storing means, such as the storing means of a computer, for storing previously measured frequencies. The comparison is most preferably performed by the computer.
  • the means for obtaining and storing a priori knowledge regarding human speech production may comprise means for obtaining and storing knowledge regarding the maximum duration of a speech signal and/or it may comprise means for obtaining and storing knowledge regarding the maximum frequency span of a speech signal.
  • This preferably includes one or more computers.
  • the smoothing means may comprise means for determining whether a new formant frequency belongs to an active process. Such means may comprise means for comparing obtained relevant parameters of the speech signal (as described above) and theoretical and/or empirical data. It is then determined from a predefined set of criteria whether the new formant frequency belongs to the active process or not. This is further described above.
  • the smoothing means preferably comprises means for filtering the f, b and g parameters. It may comprise one or more electronic filters, such as low pass filters and/or high pass filters and/or IIR filters and/or any other suitable filters.
  • the apparatus may further comprise - determining means for determining whether voiced speech is present,
  • the creating means may comprise one or more tone generators and/or it may comprise a computer.
  • the replacing means may comprise one or more faders, so that the synthetic pulse may be faded in as the original signal is faded out.
  • the determining means may comprise comparing means for comparing the variance of the speech signal to an upper threshold value and to a lower threshold value. Voiced speech is present in case the variance of the speech signal exceeds the lower threshold value.
  • the comparison is most preferably performed by one or more computers having computer storage means for storing information regarding the threshold values.
  • the threshold values may be obtained by theoretical calculations and/or by previous measurements, i.e. they may be obtained empirically.
  • the determining means may comprise comparing means for comparing the first formant gain of the speech signal to an upper threshold value and to a lower threshold value. Voiced speech is present in case the first formant gain of the speech signal exceeds the lower threshold value. The above is equally applicable here.
  • the apparatus may further comprise means for producing a noise eliminated pitch period from a residual signal of the speech signal.
  • a new model based spectral subtraction method has been presented, the method being applicable for noise elimination at high as well as low SNR while avoiding linear and non-linear artefacts such as musical tones.
  • the method is attractive because of its flexibility, modularity and numerical robustness. Furthermore it is very suitable for real time implementation.
  • Fig. 1 shows the frequency of a speech signal as a function of time indicating two different processes
  • Fig. 2 is a flow diagram illustrating part of the smoothing step of the method according to the present invention
  • Fig. 3 is a flow diagram illustrating the control of the maximum duration of a speech signal
  • Fig. 4 is a flow diagram illustrating the allocation of a sign to a process
  • Fig. 5 is a flow diagram illustrating the adjustment of the pitch of a speech signal
  • Fig. 6 is a flow diagram illustrating the step of determining whether voiced speech is present and the step of replacing a noise eliminated pitch period by a synthetic glottal pulse
  • Fig. 7a shows the intonation of a speech signal as a function of time
  • Fig. 7b shows a synthetic internal glottal pulse
  • Fig. 7c shows two pitch periods having different lengths
  • Fig. 8 shows non-filtered f, b and g parameters of a speech signal ('three', 'four', 'five'),
  • Fig. 9 shows filtered f, b and g parameters corresponding to Fig. 8,
  • Fig. 10 shows an original noisy speech signal ('three', 'four', 'five'),
  • Fig. 11 shows the signal of Fig. 10, but with synthetic glottal pulses replacing the pitch periods where voiced speech is present,
  • Fig. 12 shows the speech signal of Figs. 10 and 11, but with the noise eliminated,
  • Fig. 13 shows a frequency spectrum corresponding to Fig. 10
  • Fig. 14 shows a frequency spectrum corresponding to Fig. 12.
  • Fig. 1 shows the frequency of a speech signal as a function of time. Two processes ('process 1' and 'process 2') are present.
  • if the signal of a process disappears, it may be due to the signal "falling out", e.g. due to noise or a bad connection, or it may be due to the fact that the person stops speaking, i.e. the signal actually disappears. If the process has existed for a certain amount of time, it is artificially kept alive for some time after the disappearance of the signal, as indicated by "Δd" and the dotted lines of the figure.
  • at time t1 the signal of 'process 1' disappears and is artificially kept alive as described above.
  • at time t2 a new signal is detected, and this new signal is deemed to belong to 'process 1'. This is because the time elapsed (t2 − t1) is smaller than Δd, and because the difference in frequency between the new signal and the signal of 'process 1' is below a predefined limit indicated by "Δf".
  • the signal of 'process 1' and the new signal are connected as indicated by the grey line.
  • the signal of 'process 2' also appears approximately at time t2. However, it is deemed not to belong to 'process 1', since the difference in frequency between the two signals exceeds the predefined limit (Δf). Since no other active processes are present, a new process ('process 2') is defined, and the signal is deemed to belong to this process.
  • Fig. 2 is a flow diagram illustrating step 1 as defined above. New f, b and g parameters are used as input (FBG new), and the formant frequency nearest to a process is found. It is not possible simply to investigate the new formants one by one, since two or more of the new formants may fulfil the criteria for belonging to a certain process. Therefore, the formant having the frequency nearest to the process is deemed to belong to said process.
  • the speech signal is found to belong to a process which already exists, that process is updated in accordance with the new signal.
  • a new process is created, and the speech signal is deemed to belong to the new process.
  • a new process is also created if the difference in frequency between the new formant and the existing process exceeds a certain predetermined level. The level may be set in the control system.
  • Fig. 3 illustrates step 2 as defined above. It is investigated whether an active process has been updated. In case it has been updated nothing further is done during the present measurement frame.
  • Fig. 4 is a flow diagram illustrating step 3 of the above algorithm. It is first investigated whether the sign of a given process (process number X) is locked. If this is the case, the energy in the process is compared to a lower threshold value ("energilav"). If the energy in the process is below said lower threshold value, the sign is unlocked. Otherwise the sign is maintained in a locked mode. The energy in the process is then compared to an upper threshold value ("energihoej"). If the energy in the process is above said upper threshold value, the process is allocated a sign in relation to locked processes. Otherwise it is investigated whether the process has a sign allocated. If this is not the case, the process is allocated a sign in relation to locked processes. Otherwise the sign is not updated, i.e. the previously allocated sign is maintained.
  • the formants are allocated mutually alternating signs in order to improve the quality of the sound.
  • a process having a frequency between the frequencies of two other processes is allocated a sign according to the process closest in frequency, i.e. it is allocated the sign opposite to the sign of the "closest" process.
  • in step 4 of the algorithm above, the f, b and g parameters are individually filtered.
  • a first order Infinite Impulse Response (IIR) filter is used, the filter most preferably having the feedback form given above.
  • the coefficients a and b are calculated using a time constant in order to ensure that the filtering is independent of the frame shift. The coefficient a is chosen in such a way that said time constant relates to the time constant of the speech signal, and b is chosen in such a way that the DC amplification is 1 (corresponding to the integral of the impulse response being 1).
  • Fig. 5 is a flow diagram illustrating the adjustment of the pitch of a speech signal.
  • This function determines the pitch period of a source sequence.
  • the general idea of the function is based upon the assumption that the pitch is the most dominant periodical component in the speech signal. It is furthermore assumed that the pitch frequency, for physiological reasons, is limited to a certain frequency span. Thus, the main issue of determination of the pitch period is the calculation of the auto correlation and the determination of the pitch as the index of the maximum value of the auto correlation in a limited time interval.
  • the pitch sequence, which is used as an input to the function, is squared in order to avoid negative values and in order to enhance dynamical differences.
  • the pitch sequence should be a frame of the speech signal or of the residual signal.
  • the squared pitch sequence is then rectified. This step emphasises the periodicity of the pitch by using knowledge regarding the structure of the pitch sequences. This is due to the fact that the pitch is much more powerful than other potential periodical components of the speech signal or the residual signal, and due to the fact that said other components are hidden by the rectification.
  • in the auto correlation calculation, x̄ and ȳ designate the mean values of x and y, respectively.
  • the auto correlation is calculated for "allowed" pitch periods, i.e. for pitch periods having a duration which is between a lower threshold value and an upper threshold value, where said threshold values may be set initially.
  • the calculated auto correlation is subsequently scaled using a linear weighting function. This is done in order to obtain a robust pitch detection.
  • the index of the maximum value of the weighted auto correlation function is used as an initial guess for the pitch period.
  • if the initial guess corresponds to a large pitch period, it is investigated whether a shorter pitch period is more likely, and in that case the pitch period is adjusted accordingly.
  • a shorter pitch period may be more likely if e.g. the half pitch period is also an "allowed" pitch period. In this case it is possible that the sub-harmonic period of the pitch has been detected instead of the actual pitch. If the initial pitch period is not large, or if a shorter pitch period is not more likely, the initial guess is maintained as the pitch period. Finally, the pitch period is used as an output (a sketch of this pitch detector is given at the end of this section).
  • Fig. 6 is a flow diagram illustrating the steps of the method according to the present invention in which it is determined whether voiced speech is present, and in which the noise eliminated pitch period is replaced by a synthetic glottal pulse.
  • the input parameter "x", which may be the variance or the gain of a signal, is compared to a lower threshold value ("taerskelnedre") as well as to an upper threshold value ("taerskeloevre").
  • if "x" lies between the two threshold values, the speech signal is considered to be a mixture of voiced and unvoiced signals, and the original signal is replaced by a synthetic signal which is a suitable mixture of the original residual signal and the synthetic pitch pulse, as described below.
  • in this combination, 'synthetic residual' is the resulting synthetic residual signal which replaces the original residual signal, 'residual' is the original residual signal, and 'synthetic pitch pulse' is the purely synthetic pitch pulse which is created.
  • Figs. 7a-c illustrate how the speech signal may be replaced by a synthetic glottal pulse in case voiced speech is present in the signal.
  • Fig. 7a shows the intonation of a speech signal as a function of time.
  • the intonation at time t2 is slightly higher than the intonation at time t1.
  • the period where no signal is detected may be a period of silence, or it may be a period where only completely unvoiced speech is present, i.e. a period during which no intonation can be detected, since an intonation can only be detected when the vocal cords are active, i.e. when voiced speech is present.
  • Fig. 7b shows a synthetic internal glottal pulse.
  • glottal pulses are very dependent upon the person speaking.
  • the glottal pulse shown in Fig. 7b is an "average" pulse which is constructed in such a way that it has a wide spectrum and at the same time has the maximum length.
  • Fig. 7c shows a signal with the synthetic glottal pulse of Fig. 7b phased in instead of a noisy signal.
  • the synthetic glottal pulse has a certain length.
  • the synthetic signal is artificially "extended" by a "zero-signal", so as to match the length of the original pitch period, 'ipitch'.
  • the length of the second pulse is slightly larger than the length of the first pulse ('ipitch(t2)' > 'ipitch(t1)'). This is due to the fact that the intonation at time t2 is slightly higher than the intonation at time t1, as indicated in Fig. 7a. The only difference between the two pulses is the length of the "zero-signal" following the synthetic glottal pulse (a sketch of this zero-padding is given at the end of this section).
  • Figs. 8 and 9 show non-filtered and filtered formant tracks (f, b and g parameters), respectively, for the words 'three', 'four' and 'five'.
  • the parameters have been filtered using a so-called Kalman filter.
  • when comparing Fig. 8 and Fig. 9, it is clear that the noise of the signal is considerably reduced by the filtering process, i.e. the fluctuations are reduced, while the dynamics of the speech are at the same time left nearly unchanged. What is achieved is thus a speech signal wherein the noise initially present is removed, or at least considerably reduced, in such a way that the original signal, i.e. the actual speech, is left nearly unchanged. It is thus possible to remove unwanted components of a signal (i.e. noise) without removing or changing wanted components of the signal (i.e. actual speech).
  • Figs. 10-12 also show speech signals representing the words 'three', 'four' and 'five'.
  • Fig. 10 shows the original speech signal, including noise components. It is clear that this signal is very noisy, i.e. the signal to noise ratio (SNR) is very small.
  • in Fig. 11, part of the signal has been replaced by synthetic glottal pulses; the figure thus represents the output of the flow diagram of Fig. 6.
  • the shifts between the regions in which the original signal has been replaced and the regions in which the original signal has been maintained are very abrupt. This is because the difference between the lower threshold value (“taerskelnedre”) and the upper threshold value (“taerskeloevre”) is relatively small.
  • the variance or gain (whichever parameter is chosen) is either below the lower threshold value or above the upper threshold value, rather than being between the two values. That is, the sound is most often considered to be either completely voiced or completely unvoiced, rather than being considered to contain voiced as well as unvoiced components.
  • Fig. 12 shows a signal in which the noise has been reduced. Conventional methods as well as the method according to the invention have been employed. It is very clear that the SNR has improved considerably as compared to Fig. 10.
  • Fig. 13 and Fig. 14 show frequency spectra corresponding to Fig. 10 and Fig. 12, respectively.
  • it can be seen that the SNR has improved considerably during the filtering process.
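
The classical spectral subtraction that the description takes as its starting point can be summarised in a few lines. The sketch below is a minimal illustration of that baseline, not of the patented method; the frame length, overlap and spectral-floor constant are assumptions.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames, frame_len=256, floor=0.01):
    """Minimal classical spectral subtraction (illustrative baseline only)."""
    # Noise magnitude spectrum, estimated during periods of silence.
    noise_mag = np.mean([np.abs(np.fft.rfft(f, frame_len))
                         for f in noise_frames], axis=0)
    window = np.hanning(frame_len)
    hop = frame_len // 2
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len, hop):
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag = np.abs(spec) - noise_mag            # subtract the noise estimate
        mag = np.maximum(mag, floor * noise_mag)  # floor against negative magnitudes
        # Classical drawback noted in the text: the noisy phase is reused,
        # which only works well for high SNR (> 10 dB).
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame_len)
        out[start:start + frame_len] += clean * window  # (unnormalised) overlap-add
    return out
```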
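
The decomposition into a quasi-stationary part (the order-10 synthesis filter) and a transient part (the residual signal) is obtained by inverse filtering. A minimal sketch using the Levinson-Durbin recursion on the frame auto correlation; the order follows the text, while the frame handling is an assumption.

```python
import numpy as np

def lpc_residual(frame, order=10):
    """Split one frame into an LPC polynomial and a residual by inverse filtering."""
    frame = np.asarray(frame, dtype=float)
    # Auto correlation of the frame (lags 0..order).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    # Levinson-Durbin recursion for the prediction polynomial A(z).
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err   # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        err *= 1.0 - k * k
    residual = np.convolve(frame, a)[:len(frame)]  # inverse filtering with A(z)
    return a, residual
```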
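
The pseudo decomposition of the order-10 LPC spectrum into second order sections can be illustrated by factoring the prediction polynomial: each complex pole pair yields one formant frequency and bandwidth. This is the standard textbook construction; reading the gain off the model spectrum at each formant frequency is an assumption, as the patent does not spell out these details.

```python
import numpy as np

def fbg_parameters(a, fs=8000.0):
    """Estimate f, b and g parameters from LPC coefficients a = [1, a1, ..., a10]."""
    a = np.asarray(a, dtype=float)
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]          # one pole per conjugate pair
    f = np.angle(poles) * fs / (2 * np.pi)     # formant frequencies (Hz)
    b = -np.log(np.abs(poles)) * fs / np.pi    # 3 dB bandwidths (Hz)
    order_idx = np.argsort(f)
    f, b = f[order_idx], b[order_idx]
    # Gain: magnitude of the all-pole model spectrum 1/A(z) at each formant.
    A = np.polyval(a[::-1], np.exp(-1j * 2 * np.pi * f / fs))
    g = 1.0 / np.abs(A)
    return f, b, g
```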
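
The process bookkeeping of Figs. 1-3 can be sketched as follows: each new formant is assigned to the nearest active process if the frequency jump stays below the predetermined level (approximately 400 Hz), a silent process is kept artificially alive for up to the maximum duration (approximately 60 ms), and otherwise a new process is defined. The two thresholds come from the text; the data layout and nearest-first ordering are assumptions consistent with the remark that the formants cannot simply be investigated one by one.

```python
MAX_FREQ_JUMP_HZ = 400.0  # preferred level from the text ("delta f")
MAX_SILENCE_MS = 60.0     # preferred maximum duration from the text ("delta d")

class Process:
    def __init__(self, freq_hz, now_ms):
        self.freq_hz = freq_hz
        self.last_update_ms = now_ms

def assign_formants(formants_hz, processes, now_ms):
    """Assign new formant frequencies to active processes or define new ones."""
    # Deem processes inactive once the signal has been gone too long.
    processes[:] = [p for p in processes
                    if now_ms - p.last_update_ms <= MAX_SILENCE_MS]
    # Handle the formant closest to some process first, since several new
    # formants may fulfil the criteria for belonging to the same process.
    pending = sorted(formants_hz,
                     key=lambda f: min((abs(f - p.freq_hz) for p in processes),
                                       default=float("inf")))
    for f in pending:
        candidates = [p for p in processes
                      if abs(f - p.freq_hz) <= MAX_FREQ_JUMP_HZ
                      and p.last_update_ms < now_ms]   # one formant per process
        if candidates:
            p = min(candidates, key=lambda p: abs(f - p.freq_hz))
            p.freq_hz, p.last_update_ms = f, now_ms    # update existing process
        else:
            processes.append(Process(f, now_ms))       # define a new process
```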
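
Step 4, the individual filtering of the f, b and g tracks, reduces to an exponential smoother once a is derived from a time constant and b is fixed by unity DC gain. A minimal sketch assuming the reconstructed form y(n) = a · y(n − 1) + b · x(n); the time constant value is an assumption.

```python
import numpy as np

def smooth_track(x, frame_shift_ms=10.0, tau_ms=50.0):
    """First order IIR smoothing of one parameter track (f, b or g)."""
    a = np.exp(-frame_shift_ms / tau_ms)  # from the time constant, making the
                                          # filtering independent of frame shift
    b = 1.0 - a                           # DC amplification exactly 1
    y = np.empty(len(x))
    state = x[0]                          # warm start to avoid an onset transient
    for n, xn in enumerate(x):
        state = a * state + b * xn        # y(n) = a*y(n-1) + b*x(n)
        y[n] = state
    return y
```

With a 10 ms frame shift and tau_ms = 50, the cutoff lies near 3 Hz, i.e. within the 1 Hz to 10 Hz range to which the text constrains the synthesis filter dynamics.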
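
The pitch detector of Fig. 5 squares the sequence, "rectifies" it, computes the auto correlation over the allowed pitch periods, applies a linear weighting, and finally checks whether a sub-harmonic was picked. In the sketch below the rectification is interpreted as centre clipping, and the allowed range, weighting slope and sub-harmonic criterion are assumptions; only the overall flow follows the text.

```python
import numpy as np

def detect_pitch(frame, fs=8000, f_lo=60.0, f_hi=400.0):
    """Noise robust pitch period detection (sketch of Fig. 5); returns samples."""
    x = np.asarray(frame, dtype=float) ** 2   # square: non-negative, enhanced dynamics
    x = np.maximum(x - x.mean(), 0.0)         # "rectify": suppress weaker components
    x -= x.mean()
    lag_min = int(fs / f_hi)                  # "allowed" pitch periods,
    lag_max = min(int(fs / f_lo), len(x) - 1) # set initially via f_lo/f_hi

    def corr(lag):                            # normalised auto correlation at one lag
        u, v = x[:-lag], x[lag:]
        denom = np.sqrt(np.dot(u, u) * np.dot(v, v))
        return np.dot(u, v) / denom if denom > 0 else 0.0

    lags = np.arange(lag_min, lag_max + 1)
    rho = np.array([corr(int(l)) for l in lags])
    weights = np.linspace(1.0, 0.8, len(lags))     # linear weighting (slope assumed)
    pitch = int(lags[np.argmax(rho * weights)])    # initial guess
    # If the half period is also allowed and correlates almost as well, the
    # sub-harmonic was probably detected instead of the actual pitch.
    half = pitch // 2
    if half >= lag_min and corr(half) > 0.9 * corr(pitch):
        pitch = half
    return pitch
```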
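
The decision of Fig. 6 compares the input parameter x (variance or first formant gain) to the two thresholds and, in between, mixes the original residual with the synthetic pitch pulse. The linear cross-fade below is a plausible reading of the "appropriate combination" the text describes, not a formula the document states; the threshold values are placeholders.

```python
def synthetic_residual(x, residual, pitch_pulse,
                       taerskel_nedre=0.2, taerskel_oevre=0.8):
    """Decide voiced/unvoiced and build the replacement residual (Fig. 6).

    x : variance or first formant gain of the current frame (the threshold
        values here are placeholders; the patent leaves them to be set).
    """
    if x <= taerskel_nedre:
        return residual        # no voiced speech: keep the original residual
    if x >= taerskel_oevre:
        return pitch_pulse     # purely voiced: completely synthetic pulse
    # Mixture of voiced and unvoiced: fade the residual out and the pulse in.
    alpha = (x - taerskel_nedre) / (taerskel_oevre - taerskel_nedre)
    return (1.0 - alpha) * residual + alpha * pitch_pulse
```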
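
Figs. 7b-c: the synthetic glottal pulse has a fixed length, and the pitch is imposed by appending a "zero-signal" so that each period matches the measured pitch period 'ipitch'. A minimal sketch; the pulse itself is a placeholder, since the text only describes it as a wide-spectrum "average" pulse.

```python
import numpy as np

def glottal_pulse_train(pulse, ipitch_periods):
    """Concatenate zero-padded copies of the glottal pulse (Figs. 7b-c).

    pulse          : the fixed-length synthetic "average" glottal pulse
    ipitch_periods : pitch period lengths in samples, one per period, following
                     the intonation contour of Fig. 7a
    """
    periods = []
    for ipitch in ipitch_periods:
        if ipitch <= len(pulse):
            periods.append(pulse[:ipitch])          # edge case (assumed): truncate
        else:
            zero_signal = np.zeros(ipitch - len(pulse))
            periods.append(np.concatenate([pulse, zero_signal]))
    return np.concatenate(periods)
```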

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Noise Elimination (AREA)
  • Reduction Or Emphasis Of Bandwidth Of Signals (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and an apparatus for noise reduction in speech signals which make use of a priori knowledge regarding speech production and knowledge of speech dynamics, such as knowledge regarding the production of sounds in the mouth and throat regions. The method and apparatus also employ a new model based spectral subtraction algorithm in which part of the original signal containing voiced sounds is replaced by synthetic pulses, which improves the noise reduction without affecting the wanted (speech) components of the signal. The method, which can be used at low SNR (SNR < 10 dB), is advantageous when used in communication employing cellular phones.
PCT/DK2000/000263 1999-05-19 2000-05-16 Procede et dispositif de reduction du bruit dans des signaux vocaux WO2000072305A2 (fr)

Priority Applications (5)

Application Number Priority Date Filing Date Title
DE60017758T DE60017758D1 (de) 1999-05-19 2000-05-16 VERFAHREN UND VORRICHTUNG ZUR GERäUSCHREDUZIERUNG IN SPRACHSIGNALEN
EP00925105A EP1208561B1 (fr) 1999-05-19 2000-05-16 Procede et dispositif de reduction du bruit dans des signaux vocaux
PCT/DK2000/000263 WO2000072305A2 (fr) 1999-05-19 2000-05-16 Procede et dispositif de reduction du bruit dans des signaux vocaux
AT00925105T ATE288121T1 (de) 1999-05-19 2000-05-16 Verfahren und vorrichtung zur geräuschreduzierung in sprachsignalen
AU43943/00A AU4394300A (en) 1999-05-19 2000-05-16 A method and apparatus for noise reduction in speech signals

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
DKPA199900691 1999-05-19
DKPA199900691 1999-05-19
DKPA200000201 2000-02-08
DKPA200000201 2000-02-08
PCT/DK2000/000263 WO2000072305A2 (fr) 1999-05-19 2000-05-16 Procede et dispositif de reduction du bruit dans des signaux vocaux

Publications (2)

Publication Number Publication Date
WO2000072305A2 true WO2000072305A2 (fr) 2000-11-30
WO2000072305A3 WO2000072305A3 (fr) 2008-01-10

Family

ID=26064462

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/DK2000/000263 WO2000072305A2 (fr) 1999-05-19 2000-05-16 Procede et dispositif de reduction du bruit dans des signaux vocaux

Country Status (5)

Country Link
EP (1) EP1208561B1 (fr)
AT (1) ATE288121T1 (fr)
AU (1) AU4394300A (fr)
DE (1) DE60017758D1 (fr)
WO (1) WO2000072305A2 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004059614A2 (fr) * 2002-12-31 2004-07-15 Microsound A/S Procede et appareil permettant d'augmenter la qualite de perception de signaux de parole synthetises
US9666204B2 (en) 2014-04-30 2017-05-30 Qualcomm Incorporated Voice profile management and speech signal generation
US10332520B2 (en) 2017-02-13 2019-06-25 Qualcomm Incorporated Enhanced speech generation
CN112969130A (zh) * 2020-12-31 2021-06-15 维沃移动通信有限公司 音频信号处理方法、装置和电子设备

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0537948A2 (fr) * 1991-10-18 1993-04-21 AT&T Corp. Méthode et appareil pour le lissage des formes d'onde de la période fondamentale
US5479560A (en) * 1992-10-30 1995-12-26 Technology Research Association Of Medical And Welfare Apparatus Formant detecting device and speech processing apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69509555T2 (de) * 1994-11-25 1999-09-02 Fink Verfahren zur veränderung eines sprachsignales mittels grundfrequenzmanipulation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0537948A2 (fr) * 1991-10-18 1993-04-21 AT&T Corp. Méthode et appareil pour le lissage des formes d'onde de la période fondamentale
US5479560A (en) * 1992-10-30 1995-12-26 Technology Research Association Of Medical And Welfare Apparatus Formant detecting device and speech processing apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
1994 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994, pages I/229-I/232, vol. 1, 19-22 April 1994, Gotoh Y. et al: "Using MAP estimated parameters to improve HMM speech recognition performance", abstract, summary, XP002901259 *
20th International Conference on Industrial Electronics, Control and Instrumentation IECON'94, pages 1946-1951, vol. 3, 5-9 Sept. 1994, Witzke L.I. et al: "Speech synthesis based on feature extraction to enhance noise-corrupted speech", abstract, page 1948 - first paragraph of page 1949, XP002901258 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004059614A2 (fr) * 2002-12-31 2004-07-15 Microsound A/S Procede et appareil permettant d'augmenter la qualite de perception de signaux de parole synthetises
WO2004059614A3 (fr) * 2002-12-31 2004-09-23 Microsound As Procede et appareil permettant d'augmenter la qualite de perception de signaux de parole synthetises
US9666204B2 (en) 2014-04-30 2017-05-30 Qualcomm Incorporated Voice profile management and speech signal generation
US9875752B2 (en) 2014-04-30 2018-01-23 Qualcomm Incorporated Voice profile management and speech signal generation
US10332520B2 (en) 2017-02-13 2019-06-25 Qualcomm Incorporated Enhanced speech generation
US10783890B2 (en) 2017-02-13 2020-09-22 Moore Intellectual Property Law, Pllc Enhanced speech generation
CN112969130A (zh) * 2020-12-31 2021-06-15 维沃移动通信有限公司 音频信号处理方法、装置和电子设备

Also Published As

Publication number Publication date
EP1208561A2 (fr) 2002-05-29
ATE288121T1 (de) 2005-02-15
EP1208561B1 (fr) 2005-01-26
AU4394300A (en) 2000-12-12
DE60017758D1 (de) 2005-03-03
WO2000072305A3 (fr) 2008-01-10

Similar Documents

Publication Publication Date Title
JP4764995B2 (ja) 雑音を含む音響信号の高品質化
AU771444B2 (en) Noise reduction apparatus and method
Tchorz et al. SNR estimation based on amplitude modulation analysis with applications to noise suppression
US8521530B1 (en) System and method for enhancing a monaural audio signal
US7454010B1 (en) Noise reduction and comfort noise gain control using bark band weiner filter and linear attenuation
CN109065067A (zh) 一种基于神经网络模型的会议终端语音降噪方法
US6182033B1 (en) Modular approach to speech enhancement with an application to speech coding
US20080031467A1 (en) Echo reduction system
CA2404027A1 (fr) Techniques de calcul de signaux de puissance d'elimination du bruit de systemes de communication
Löllmann et al. Low delay noise reduction and dereverberation for hearing aids
US20080004868A1 (en) Sub-band periodic signal enhancement system
KR20140032354A (ko) 동적 마이크로폰 신호 믹서
US20080219457A1 (en) Enhancement of Speech Intelligibility in a Mobile Communication Device by Controlling the Operation of a Vibrator of a Vibrator in Dependance of the Background Noise
Soon et al. Wavelet for speech denoising
EP1208561B1 (fr) Procede et dispositif de reduction du bruit dans des signaux vocaux
US7392180B1 (en) System and method of coding sound signals using sound enhancement
US6975984B2 (en) Electrolaryngeal speech enhancement for telephony
RU2589298C1 (ru) Способ повышения разборчивости и информативности звуковых сигналов в шумовой обстановке
JP2001249676A (ja) 雑音が付加された周期波形の基本周期あるいは基本周波数の抽出方法
Krini et al. Model-based speech enhancement
Yang et al. Environment-Aware Reconfigurable Noise Suppression
King Enhancing single-channel speech in wind noise using coherent modulation comb filtering
Kurian et al. PNCC based speech enhancement and its performance evaluation using SNR Loss
US20130226568A1 (en) Audio signals by estimations and use of human voice attributes
Tchorz et al. Noise suppression based on neurophysiologically-motivated SNR estimation for robust speech recognition

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

WWE Wipo information: entry into national phase

Ref document number: 2000925105

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 09979060

Country of ref document: US

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWP Wipo information: published in national office

Ref document number: 2000925105

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWG Wipo information: grant in national office

Ref document number: 2000925105

Country of ref document: EP