EP1850328A1 - Enhancement and extraction of formants of voice signals - Google Patents

Enhancement and extraction of formants of voice signals

Publication number
EP1850328A1
EP1850328A1 EP06013126A EP06013126A EP1850328A1 EP 1850328 A1 EP1850328 A1 EP 1850328A1 EP 06013126 A EP06013126 A EP 06013126A EP 06013126 A EP06013126 A EP 06013126A EP 1850328 A1 EP1850328 A1 EP 1850328A1
Authority
EP
European Patent Office
Prior art keywords
formants
filters
filtering
audio signal
frequency conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP06013126A
Other languages
German (de)
French (fr)
Inventor
Frank Joublin Dr.
Martin Heckmann Dr.
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honda Research Institute Europe GmbH
Original Assignee
Honda Research Institute Europe GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honda Research Institute Europe GmbH
Priority to EP06013126A (patent EP1850328A1)
Priority to JP2007061984A (patent JP2007293285A)
Publication of EP1850328A1
Current legal status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method for extracting the formants of an audio signal comprises the following steps:
a.) applying a frequency conversion on an audio signal in order to determine the envelope of the signal,
b.) enhancing the formant tracks via filtering in the spectral domain.

Description

  • The present invention generally relates to the processing of voice signals in order to enhance characteristics which are useful for a further technical use of the processed voice signal. The invention particularly relates to the enhancement and extraction of formants from audio signals such as e.g. speech signals.
  • The proposed processing is useful e.g. for hearing aids, automatic speech recognition and the training of artificial speech synthesis with the extracted formants.
  • "Formants" are the distinguishing or meaningful frequency components of human speech. According to one definition (see e.g. http://en.wikipedia.org/wiki/Formants also for more details and citations) a formant is a peak in an acoustic frequency spectrum which results from the resonant frequencies of any acoustical system (acoustical tube). It is most commonly invoked in phonetics or acoustics involving the resonant frequencies of vocal tracts.
  • The detection of formants is useful e.g. in the framework of speech recognition systems and speech synthesis systems. Today's speech recognition systems work very well in well-controlled, low-noise environments but show severe performance degradation when the distance between the speaker and the microphone varies or noise is present. The formant frequencies, i.e. the resonance frequencies of the vocal tract, are one of the cues for speech recognition.
  • Vowels are mostly recognized based on the formant frequencies and their transitions, and formants also play a very important role for consonants.
  • Known speech recognition systems follow a purely probabilistic approach and use the formant frequencies and transitions only implicitly. The features they use approximate the positions of the formants but no explicit formant extraction or tracking is performed.
  • A different application of the formant transitions is their use for speech synthesis. Currently, synthesis systems based on the concatenation of prerecorded blocks (diphone concatenation) perform significantly better than those that directly use formants or vocal tract filter shapes. This, however, is due to the difficulty of finding the right parameterization of these models rather than to an intrinsic problem of the sound generation. For example, driving such a formant-based synthesis system with parameters extracted from measurements on humans produces natural-sounding speech. Formant extraction algorithms can be used to determine these articulation parameters from large corpora of speech, and a learning algorithm can be developed which determines their correct setting during the speech synthesis process.
  • Because of their key role, a wide variety of algorithms to extract formants has been published. In most known approaches the formant frequencies, hence the poles of the vocal tract, are modeled directly, e.g. via Linear Predictive Coding or via an AM-FM modulation model. Another approach is the evaluation of the phase information to decide whether a spectral peak is a formant.
  • It is also known to use bandpass filters and first order LPC analysis to extract the formants. The bandpass center frequencies are adapted based on the found location in the previous time step. Additionally a voiced/unvoiced decision is incorporated in the formant extraction.
  • Object of the present invention
  • It is the object of the present invention to propose an improved approach to enhance formants of audio signals, preferably speech signals.
  • This object is achieved by means of the features of the independent claims. The dependent claims develop further the central idea of the present invention.
  • A first aspect of the invention relates to a method for enhancing the formants of an audio signal, the method comprising the following steps:
    a.) applying a frequency conversion on an audio signal,
    b.) enhancing the formant tracks via filtering in the spectral domain.
  • The size of the filters used in step b.) can be adapted, in a configuration step, depending on center frequencies of the frequency conversion step.
  • The size of the filters used in step b.) can be adapted corresponding to the spectral resolution of the frequency conversion step.
  • The size of the filters used in step b.) can be adapted corresponding to expected formants which e.g. occur typically in speech signals.
  • Before step b.) the fundamental frequency of the audio signal can be estimated and then essentially eliminated.
  • Before step b.) the spectral distribution of the excitation of the acoustic tube can be estimated and an amplification of the spectrogram with the inverse of this distribution can be performed.
  • After step a.) the envelope of the signal can be determined e.g. via rectification and low-pass filtering.
  • A Gammatone filter bank can be used for the frequency conversion step.
  • A reconstructive filtering can be applied on the result of step b.).
  • The reconstructive filtering can use filters adapted to the expected formants of the supplied audio signal, and the reconstruction is done by adding the impulse response of the used filters weighted with the response when filtering with said filter.
  • Even ("pair") Gabor filters can be used for the reconstructive filtering.
  • The width of the reconstructing filters is adapted corresponding to the spectral resolution of the frequency conversion step or the mean bandwidths of preset formants expected to be present in the supplied audio signal.
  • The enhanced formants can then be extracted from the signal for further use.
  • The method of enhancing the formants can be used e.g. for speech enhancement.
  • The method can be used together with a tracking algorithm in order to carry out a speech recognition on the supplied audio signal.
  • The method can be used to train artificial speech synthesis systems with the extracted formants.
  • The invention also relates to a computer program product, implementing such a method.
  • The invention further relates to a hearing aid comprising a computing unit designed to implement such a method.
  • Further features, objects and advantages of the invention will now be explained with reference to the figures of the enclosed drawings.
    • Figure 1: Spectrogram of a speech signal after application of a Gammatone filter bank and envelope calculation. The sentence "Ich hätte gerne eine Zugverbindung für morgen." is spoken by a male German speaker and taken from the Kiel Corpus of Spontaneous Speech.
    • Figure 2: Spectrogram after elimination of the fundamental frequency for the same sentence as in Fig. 1.
    • Figure 3: Spectrogram after pre-emphasis for the same sentence as in Fig. 1.
    • Figure 4: Spectrogram after filtering with a Mexican Hat whose width is adapted to the center frequency for the same sentence as in Fig. 1.
    • Figure 5: Spectrogram from Fig. 4 after normalization at each sample.
    • Figure 6: Enhanced spectrogram of a synthesized speech signal ("Five women played basketball"). The true formant tracks of the first four formants are depicted by black and yellow dashed lines.
    • Figure 7: Enhanced spectrogram of a synthesized speech signal when babble noise was added at 20 dB. The true formant tracks of the first four formants are depicted by dashed lines.
    • Figure 8: Enhanced spectrogram of a synthesized speech signal when babble noise was added at 10 dB. The true formant tracks of the first four formants are depicted by dashed lines.
    • Figure 9: Enhanced spectrogram of a synthesized speech signal when babble noise was added at 0 dB. The true formant tracks of the first four formants are depicted by dashed lines.
    • Figure 10: Schematic flow chart of the method.
    • Figure 11: Schematic flow chart of a reconstructive filtering.
  • The invention proposes a method and a system (see figure 10) which enhances the formants in the spectrogram and allows a subsequent extraction of the enhanced formants.
  • Frequency conversion:
  • The invention proposes to apply e.g. a Gammatone filter bank on a supplied audio signal representation to obtain a spectro-temporal representation of the signal. In any case the audio signal is converted into the frequency domain.
  • The first stage in the system as shown in figure 10 is the application of a Gammatone filter bank on the signal. The filter bank has e.g. 128 channels ranging from e.g. 80 Hz to 5 kHz. From the output of each channel the envelope is calculated via rectification and low-pass filtering. In Fig. 1 the results of this processing can be seen.
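  • The frequency conversion and envelope calculation described above can be sketched as follows. This is a minimal illustration and not the applicants' implementation: the ERB-rate spacing of the center frequencies, the fourth-order gammatone impulse response, the unit-gain normalization and the moving-average low-pass (parameter lp_ms) are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def erb_space(f_lo, f_hi, n):
    """n center frequencies equally spaced on the ERB-rate scale (assumed spacing)."""
    e = 21.4 * np.log10(1.0 + 0.00437 * np.array([f_lo, f_hi]))
    return (10.0 ** (np.linspace(e[0], e[1], n) / 21.4) - 1.0) / 0.00437

def gammatone_ir(fc, fs, dur=0.05, order=4):
    """Truncated gammatone impulse response, scaled to unit gain at fc."""
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * (24.7 + 0.108 * fc)  # bandwidth derived from the ERB at fc
    ir = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    gain = np.abs(np.sum(ir * np.exp(-2j * np.pi * fc * t)))
    return ir / gain

def gammatone_envelopes(x, fs, n_ch=128, f_lo=80.0, f_hi=5000.0, lp_ms=10.0):
    """Return (center frequencies, envelopes of shape (n_ch, len(x)))."""
    cfs = erb_space(f_lo, f_hi, n_ch)
    win = np.ones(max(1, int(lp_ms * 1e-3 * fs)))
    win /= win.size  # moving-average low-pass (assumed filter type)
    env = np.empty((n_ch, x.size))
    for k, fc in enumerate(cfs):
        y = np.convolve(x, gammatone_ir(fc, fs))[: x.size]  # band-pass channel
        env[k] = np.convolve(np.abs(y), win, mode="same")   # rectify, then low-pass
    return cfs, env
```

    The input is assumed to be longer than the low-pass window and sampled at more than twice f_hi. Row 0 of the result is the lowest-frequency channel.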
  • Estimation and subsequent Elimination of the fundamental frequency:
  • In order to reduce the impact of the fundamental frequency on the position of the formants, especially the first formant, the fundamental frequency of voiced signal parts can be estimated and subsequently eliminated from the spectrogram.
  • In the excitation signal of the vocal tract the energy of the fundamental frequency is normally much higher than that of the harmonics. As a consequence of this unbalanced excitation of the first formant, with high energy at the fundamental frequency and significantly lower energy at the adjacent harmonics, it is difficult to extract its correct location. For this reason the invention proposes to eliminate the fundamental frequency of voiced signal parts from the spectrogram.
  • For example, an algorithm based on a histogram of zero crossing distances can be used to estimate the fundamental frequency. In principle, any pitch estimation algorithm can be used for this purpose.
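  • A histogram of zero crossing distances could be used roughly as sketched below. The text does not specify the algorithm in detail, so this is only one plausible reading: distances between successive upward zero crossings are collected, restricted to an assumed pitch range (f_min, f_max), and the most frequent distance is taken as the period. The function name and all parameters are hypothetical.

```python
import numpy as np

def f0_from_zero_crossings(x, fs, f_min=60.0, f_max=400.0):
    """Estimate F0 in Hz from the most frequent distance between upward zero crossings.

    Returns 0.0 when no plausible period is found (e.g. silence).
    """
    neg = x < 0
    # indices where the signal goes from negative to non-negative
    up = np.flatnonzero(neg[:-1] & ~neg[1:]) + 1
    if up.size < 2:
        return 0.0
    d = np.diff(up)                          # distances in samples
    lo, hi = int(fs / f_max), int(fs / f_min)
    d = d[(d >= lo) & (d <= hi)]             # keep plausible pitch periods only
    if d.size == 0:
        return 0.0
    hist = np.bincount(d, minlength=hi + 1)  # histogram of distances
    return fs / int(np.argmax(hist))
```

    In practice such an estimate would be computed on short frames of a low-frequency channel so that one value per time step is obtained; that framing is left out here.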
  • For the elimination of the fundamental frequency, the filter channels in the neighborhood of the found fundamental frequency are set to the noise floor. In order to recreate smooth transitions after the elimination of the fundamental frequency and to reduce the computational load, a smoothing in the time domain and an optional sub-sampling are performed.
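  • The elimination step can be illustrated as follows. The sketch assumes an envelope spectrogram of shape (channels, frames), one F0 value per frame (0 for unvoiced frames), a neighborhood of plus or minus rel_width times F0, a noise floor estimated as a low percentile of the spectrogram, and a moving-average smoothing in time. None of these choices is given in the text; they are placeholders.

```python
import numpy as np

def eliminate_f0(env, cfs, f0_track, rel_width=0.2, smooth=5, step=2):
    """Set channels near F0 to the noise floor, smooth in time, then sub-sample.

    env      : array (channels, frames) of envelope values
    cfs      : center frequency of each channel in Hz
    f0_track : F0 in Hz per frame, 0 where the frame is unvoiced
    """
    out = env.astype(float).copy()
    floor = np.percentile(env, 5)            # assumed noise-floor estimate
    for n, f0 in enumerate(f0_track):
        if f0 > 0:
            near = np.abs(cfs - f0) <= rel_width * f0
            out[near, n] = floor
    kernel = np.ones(smooth) / smooth        # moving average along the time axis
    out = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, out)
    return out[:, ::step]                    # optional sub-sampling
```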
  • The results of this processing can be seen in Fig. 2.
  • Filtering in the spectro-temporal domain in order to enhance the formants
  • In a next step the high frequencies are emphasized.
  • A filtering along the channel axis is performed. During the filtering the size of the filtering kernel is varied with position, i.e. with wide kernels at low frequencies and narrow kernels at high frequencies. This takes into account the logarithmic arrangement of the center frequencies in the Gammatone filter bank.
  • The energy of the glottal excitation signal shows a general decay with frequency. Therefore low formants, which are excited by low harmonics, have much more energy than high formants. In a similar way the noise-like excitation, whose energy lies mostly in the high frequencies, has a much lower overall energy than the harmonic excitation. As a consequence, in a speech signal the energy in the low frequencies is much higher than in the high frequencies. To overcome this problem we perform a pre-emphasis of the spectrogram. This pre-emphasis raises the energy of the high frequencies (compare Fig. 3).
  • A known way is to use a high-pass filter, but as the audio signal is already represented in the spectro-temporal domain, the invention proposes to weight the energy of the filter channels with a weight that decreases exponentially from the high to the low frequencies. Subsequently a smoothing along the frequency axis is carried out. Via this smoothing the energy of the single harmonics is spread and peaks form at the formant locations.
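  • The channel weighting used for the pre-emphasis can be sketched as below. The weight is 1 at the highest channel and falls off exponentially toward the lowest channel; the decay constant is a hypothetical parameter, since the text gives no value, and channel 0 is assumed to be the lowest frequency.

```python
import numpy as np

def pre_emphasis(spec, decay=3.0):
    """Weight channels with a factor decreasing exponentially from high to low frequency.

    spec : array (channels, frames), channel 0 being the lowest frequency
    """
    n_ch = spec.shape[0]
    k = np.arange(n_ch)
    weights = np.exp(-decay * (n_ch - 1 - k) / (n_ch - 1))  # 1.0 at the top channel
    return spec * weights[:, None]
```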
  • When using a filter bank with a logarithmic arrangement of center frequencies, as in the case of the Gammatone filter bank, the size of the smoothing kernel (measured in channels) has to be set depending on the center frequencies. It has to be wide at low frequencies, where the filter bandwidths and hence the increments of the center frequencies are small, in order to cover the necessary frequency range.
  • In contrast it is made small at high frequencies, where the filter bandwidths are large. As smoothing kernel it is possible to use a Gaussian kernel, but we achieved better results with a Mexican Hat (Difference of Gaussians). The Mexican Hat operator enhances line-like structures and suppresses the regions in between these line-like structures.
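  • A position-dependent Mexican Hat filtering along the channel axis might look like the sketch below. It assumes that the kernel should cover a roughly constant range in Hz (bw_hz, a hypothetical value), so the kernel width in channels is bw_hz divided by the local channel spacing. The ratio of 1.6 between the two Gaussians and the edge replication at the borders are likewise assumptions.

```python
import numpy as np

def kernel_sigmas(cfs, bw_hz=150.0, min_sigma=0.5):
    """Kernel width in channels: wide where channels are densely spaced in Hz."""
    return np.maximum(bw_hz / np.gradient(cfs), min_sigma)

def dog_kernel(sigma, ratio=1.6):
    """Difference-of-Gaussians (Mexican Hat) kernel with zero mean."""
    half = int(np.ceil(3 * ratio * sigma))
    x = np.arange(-half, half + 1)
    gauss = lambda s: np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2 * np.pi))
    k = gauss(sigma) - gauss(ratio * sigma)
    return x, k - k.mean()

def enhance_along_channels(spec, cfs, bw_hz=150.0):
    """Filter each channel position with a Mexican Hat of its own width."""
    n_ch = spec.shape[0]
    sigmas = kernel_sigmas(cfs, bw_hz)
    out = np.zeros(spec.shape, dtype=float)
    for c in range(n_ch):
        x, k = dog_kernel(sigmas[c])
        idx = np.clip(c + x, 0, n_ch - 1)  # replicate edge channels
        out[c] = k @ spec[idx]             # weighted sum over neighboring channels
    return out
```

    Because the kernel has zero mean, flat regions map to values near zero and the output can become negative next to a ridge.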
  • Figure 4 shows the results of this operation. The resulting spectrogram contains negative values due to the application of the Mexican Hat. Depending on the further processing it can be beneficial to set these negative values to zero, but in our case they have been kept as they permit a better enhancement of the formant tracks.
  • As can be seen the formant structure is now clearly visible as dark ridges in the spectrogram. Finally, a normalization of the values to the maximum at each sample is performed (compare Fig. 5). By doing so the formants are also visible in signal parts where the energy is relatively low.
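  • The normalization to the maximum at each sample can be written in a few lines. The sketch assumes a spectrogram of shape (channels, frames) and leaves frames without a positive maximum unscaled, a safeguard that is not mentioned in the text.

```python
import numpy as np

def normalize_per_frame(spec, eps=1e-12):
    """Divide every frame (column) by its maximum over the channels."""
    peak = spec.max(axis=0, keepdims=True)
    scale = np.where(peak > eps, peak, 1.0)  # leave frames without a positive peak as-is
    return spec / scale
```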
  • In order to demonstrate the performance of the formant enhancement process, the formants were extracted from a synthetically generated speech signal for which we know the correct formant positions. The system used for speech synthesis is ipox (see A. Dikensen and J. Coleman, "All-prosodic speech synthesis" in Progress in Speech Synthesis, J. P. H. van Santen, R. W. Sproat, J. P. Olive, and J. Hirschberg, Eds., pp. 91-108. Springer, New York, 1995). The ipox system is a compilation of rules which allows words to be generated from phonemic input. More precisely, it produces parameter files for the Klatt80 formant synthesizer. The test sentence "Five women played basketball", generated with a male voice, has been used. By mixing babble noise from the Noisex database into the signal at varying SNR, we can evaluate the robustness of the formant extraction process to additional noise.
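  • Mixing noise into a signal at a prescribed SNR, as done for this evaluation, amounts to scaling the noise so that the power ratio matches the target. The sketch below assumes the SNR is defined over the whole utterance and that the noise is looped or trimmed to the speech length; the function name is hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)  # repeat or trim the noise to fit
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```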
  • Results for the clean signal can be seen in Fig. 6. The correct formant tracks for the first four formants are given by the dashed line. As can be seen from the plot our algorithm represents the formants quite accurately.
  • In Fig. 7 the result of the same signal with babble noise added at an SNR of 20 dB is shown. The location of the peaks is hardly affected by the additional noise.
  • In Fig. 8 we further increased the noise level, to an SNR of 10 dB. As a consequence the ridges in the enhanced spectrogram show more discontinuities, but their location is still correct.
  • Finally we added babble noise at an SNR of 0 dB in Fig. 9. The discontinuities increase further with the decreasing SNR, but the location of the ridges does not change significantly even for such low SNR values.
  • Reconstruction step
  • In order to further enhance the formant frequencies an optional reconstruction step can be performed on the result of the filtering in the spectral domain. Figure 11 shows a block diagram of the reconstructive filtering.
  • For doing so the result of the previous enhancement (smoothing along the frequency axis) is filtered with a set of n parallel filters whose impulse responses are adapted to the expected structure. This can for example be a set of n even Gabor filters with different orientations and frequencies.
  • Gabor filters are known to the skilled person and can be defined as linear filters whose impulse response is defined by a harmonic function multiplied by a Gaussian function.
  • For the reconstruction the impulse responses (receptive fields) of these filters are then respectively weighted with the corresponding response when applying the filter to the data, i.e. the result of the preceding filtering step. Therefore during the reconstruction the filter does not only generate one single point at the center of the filter but a structure corresponding to the whole area of the impulse response.
  • Finally, all these responses are added up to form the resulting spectral representation, which is the result of the reconstruction. As a consequence the result will show structures in accordance with the impulse responses of the filters (e.g. lines when even Gabor filters are used). This is due, among other things, to the fact that the set of Gabor filters used is not complete and hence cannot reconstruct the original data perfectly, but only a subset with properties defined by the filters used (line structures in our case). The width of these reconstruction filters can also be adapted in accordance with the spectral resolution or the expected formant bandwidth.
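  • The reconstruction described above can be sketched as follows: each filter is applied to the enhanced spectrogram, and a copy of its impulse response, weighted by the response, is then added at every position. The kernel size, wavelength, orientations and Gaussian width are hypothetical parameters, zero padding at the borders is an assumption, and the raw (signed) responses are used as weights because the text does not mention any rectification.

```python
import numpy as np

def even_gabor(size, wavelength, theta, sigma):
    """Even (cosine-phase) Gabor kernel; theta is the direction across the line."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * u / wavelength)
    g -= g.mean()                   # remove the DC response
    return g / np.linalg.norm(g)

def correlate2d_same(img, k):
    """Plain 2-D correlation with zero padding, output the size of img."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    pad = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * pad[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def reconstruct(spec, bank):
    """Sum of impulse responses, each weighted by the filter response at its center."""
    rec = np.zeros(spec.shape, dtype=float)
    for k in bank:
        resp = correlate2d_same(spec, k)                # filter responses
        rec += correlate2d_same(resp, k[::-1, ::-1])    # place weighted kernels
    return rec
```

    With theta close to pi/2 the kernels respond to ridges running along the time axis, which is the orientation of slowly varying formant tracks in a (channels, frames) spectrogram.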
  • Applications of the invention
  • The invention can be applied to the enhancement of speech signals, especially for the hearing impaired as it is known that enhancing the formants increases intelligibility for them.
  • Combined with a tracking algorithm it is possible to use it for speech recognition or the learning of parameters for formant based speech synthesis.

Claims (19)

  1. A method for enhancing the formants of an audio signal, the method comprising the following steps:
    a.) applying a frequency conversion on an audio signal,
    b.) enhancing the formant tracks via filtering in the spectral domain.
  2. The method according to claim 1,
    wherein step b.) is carried out using a smoothening with a defined smoothening kernel.
  3. The method according to claim 1 or 2,
    wherein the size of filters used in step b.) is adapted depending on center frequencies of the frequency conversion step.
  4. The method according to claim 3,
    wherein the size of filters used in step b.) is adapted corresponding to the spectral resolution of the frequency conversion step.
  5. The method according to claim 4,
    wherein the size of filters used in step b.) is adapted corresponding to preset expected formants.
  6. The method according to any of the preceding claims,
    wherein before step b.) the fundamental frequency of the audio signal is estimated and then eliminated.
  7. The method according to any of the preceding claims,
    wherein before step b.) the spectral distribution of the excitation of the acoustic tube is estimated and an amplification of the spectrogram with the inverse of this distribution is performed.
  8. The method according to any of the preceding claims,
    wherein after step a.) the envelope of the signal is determined e.g. via rectification and low-pass filtering.
  9. The method according to any of the preceding claims,
    wherein a Gammatone filter bank is used for the frequency conversion step a.).
  10. The method according to any of the preceding claims,
    wherein a reconstructive filtering is applied on the result of step b.).
  11. The method according to claim 10,
    wherein the reconstructive filtering uses filters adapted to expected formants of the supplied audio signal, and
    the reconstruction is done by adding the impulse response of the used filters weighted with the response when filtering with said filter.
  12. The method according to claim 11,
    wherein even Gabor filters are used for the reconstructive filtering.
  13. The method according to claim 11 or 12,
    wherein the width of the reconstructing filters is adapted corresponding to the spectral resolution of the frequency conversion step or the mean bandwidths of preset formants expected to be present in the supplied audio signal.
  14. The method according to any of the preceding claims, comprising the further step of extracting the enhanced formants.
  15. Use of a method according to any of the preceding claims for speech enhancement.
  16. Use of a method according to any of claims 1 to 14 together with a tracking algorithm in order to carry out automatic speech recognition on the supplied audio signal.
  17. Use of a method according to any of claims 1 to 14 to train artificial speech synthesis systems with the extracted formants.
  18. A computer program product, implementing a method according to any of the claims 1 to 14 when run on a computing device.
  19. A hearing aid comprising a computing unit designed to implement a method according to any of claims 1 to 14.
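
As an illustration of the processing chain named in claims 1, 2 and 8, the following sketch shows envelope extraction by rectification and low-pass filtering, followed by smoothing of one spectral frame with a defined kernel. It is a minimal sketch under assumptions and not the claimed implementation: the frequency conversion itself (for example the Gammatone filter bank of claim 9) is assumed to have been applied already, the one-pole low-pass and the Gaussian kernel are stand-ins chosen for the example, and all function names and parameter values are hypothetical.

```python
import math


def envelope(samples, alpha=0.05):
    """Envelope of one filter-bank channel.

    Full-wave rectification followed by a one-pole low-pass filter.
    alpha is an illustrative smoothing coefficient, not a value taken
    from the application.
    """
    env = []
    state = 0.0
    for s in samples:
        state += alpha * (abs(s) - state)
        env.append(state)
    return env


def gaussian_kernel(sigma, half_width):
    """Normalised Gaussian smoothing kernel (an assumed kernel shape)."""
    weights = [
        math.exp(-(k * k) / (2.0 * sigma * sigma))
        for k in range(-half_width, half_width + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_across_channels(frame, kernel):
    """Smooth one spectral frame along the channel (frequency) axis.

    At the edges only the part of the kernel that overlaps the frame is
    used, and the weights are renormalised accordingly.
    """
    half = len(kernel) // 2
    n = len(frame)
    out = []
    for c in range(n):
        acc = 0.0
        weight_sum = 0.0
        for k, w in enumerate(kernel):
            i = c + k - half
            if 0 <= i < n:
                acc += frame[i] * w
                weight_sum += w
        out.append(acc / weight_sum)
    return out


# Example with made-up data: a frame whose channels alternate between
# strong and weak energy, as harmonics of the fundamental might produce.
frame = [1.0 if c % 2 == 0 else 0.0 for c in range(20)]
smoothed = smooth_across_channels(frame, gaussian_kernel(sigma=2.0, half_width=6))
```

In line with claims 3 to 5, the kernel width (sigma here) would be chosen per channel from the center frequencies, the spectral resolution or the expected formants; the sketch uses a single fixed width for brevity.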

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP06013126A EP1850328A1 (en) 2006-04-26 2006-06-26 Enhancement and extraction of formants of voice signals
JP2007061984A JP2007293285A (en) 2006-04-26 2007-03-12 Enhancement and extraction of formants of voice signal

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP06008675 2006-04-26
EP06013126A EP1850328A1 (en) 2006-04-26 2006-06-26 Enhancement and extraction of formants of voice signals

Publications (1)

Publication Number Publication Date
EP1850328A1 (en) 2007-10-31

Family

ID=36968222

Family Applications (1)

Application Number Title Priority Date Filing Date
EP06013126A Withdrawn EP1850328A1 (en) 2006-04-26 2006-06-26 Enhancement and extraction of formants of voice signals

Country Status (2)

Country Link
EP (1) EP1850328A1 (en)
JP (1) JP2007293285A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5313528B2 (en) * 2008-03-18 2013-10-09 リオン株式会社 Hearing aid signal processing method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4477925A (en) * 1981-12-11 1984-10-16 Ncr Corporation Clipped speech-linear predictive coding speech processor
US4820059A (en) * 1985-10-30 1989-04-11 Central Institute For The Deaf Speech processing apparatus and methods
EP0742548A2 (en) * 1995-05-12 1996-11-13 Mitsubishi Denki Kabushiki Kaisha Speech coding apparatus and method using a filter for enhancing signal quality
US6223151B1 (en) * 1999-02-10 2001-04-24 Telefon Aktie Bolaget Lm Ericsson Method and apparatus for pre-processing speech signals prior to coding by transform-based speech coders
US20050165608A1 (en) * 2002-10-31 2005-07-28 Masanao Suzuki Voice enhancement device
US20050197832A1 (en) * 2003-12-31 2005-09-08 Hearworks Pty Limited Modulation depth enhancement for tone perception
EP1600947A2 (en) * 2004-05-26 2005-11-30 Honda Research Institute Europe GmbH Subtractive cancellation of harmonic noise

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
POTAMIANOS A ET AL: "Speech formant frequency and bandwidth tracking using multiband energy demodulation", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1995. ICASSP-95., 1995 INTERNATIONAL CONFERENCE ON DETROIT, MI, USA 9-12 MAY 1995, NEW YORK, NY, USA,IEEE, US, vol. 1, 9 May 1995 (1995-05-09), pages 784 - 787, XP010625350, ISBN: 0-7803-2431-5 *
RAO P ET AL: "Speech formant frequency estimation: evaluating a nonstationary analysis method", SIGNAL PROCESSING, AMSTERDAM, NL, vol. 80, no. 8, August 2000 (2000-08-01), pages 1655 - 1667, XP004222586, ISSN: 0165-1684 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2232700A4 (en) * 2007-12-21 2012-10-10 Dts Llc System for adjusting perceived loudness of audio signals
US8315398B2 (en) 2007-12-21 2012-11-20 Dts Llc System for adjusting perceived loudness of audio signals
EP2232700A1 (en) * 2007-12-21 2010-09-29 Srs Labs, Inc. System for adjusting perceived loudness of audio signals
US9264836B2 (en) 2007-12-21 2016-02-16 Dts Llc System for adjusting perceived loudness of audio signals
US8538042B2 (en) 2009-08-11 2013-09-17 Dts Llc System for increasing perceived loudness of speakers
US9820044B2 (en) 2009-08-11 2017-11-14 Dts Llc System for increasing perceived loudness of speakers
US10299040B2 (en) 2009-08-11 2019-05-21 Dts, Inc. System for increasing perceived loudness of speakers
US9312829B2 (en) 2012-04-12 2016-04-12 Dts Llc System for adjusting loudness of audio signals in real time
US9559656B2 (en) 2012-04-12 2017-01-31 Dts Llc System for adjusting loudness of audio signals in real time
WO2014039028A1 (en) * 2012-09-04 2014-03-13 Nuance Communications, Inc. Formant dependent speech signal enhancement
CN104704560B (en) * 2012-09-04 2018-06-05 纽昂斯通讯公司 The voice signals enhancement that formant relies on
DE112012006876B4 (en) * 2012-09-04 2021-06-10 Cerence Operating Company Method and speech signal processing system for formant-dependent speech signal amplification
CN104704560A (en) * 2012-09-04 2015-06-10 纽昂斯通讯公司 Formant dependent speech signal enhancement
US9805738B2 (en) 2012-09-04 2017-10-31 Nuance Communications, Inc. Formant dependent speech signal enhancement
US9613633B2 (en) 2012-10-30 2017-04-04 Nuance Communications, Inc. Speech enhancement
US9934793B2 (en) 2014-01-24 2018-04-03 Foundation Of Soongsil University-Industry Cooperation Method for determining alcohol consumption, and recording medium and terminal for carrying out same
US9899039B2 (en) 2014-01-24 2018-02-20 Foundation Of Soongsil University-Industry Cooperation Method for determining alcohol consumption, and recording medium and terminal for carrying out same
US9916844B2 (en) 2014-01-28 2018-03-13 Foundation Of Soongsil University-Industry Cooperation Method for determining alcohol consumption, and recording medium and terminal for carrying out same
EP3113183A4 (en) * 2014-02-28 2017-07-26 National Institute of Information and Communications Technology Voice clarification device and computer program therefor
US9842607B2 (en) 2014-02-28 2017-12-12 National Institute Of Information And Communications Technology Speech intelligibility improving apparatus and computer program therefor
US9907509B2 (en) 2014-03-28 2018-03-06 Foundation of Soongsil University—Industry Cooperation Method for judgment of drinking using differential frequency energy, recording medium and device for performing the method
US9916845B2 (en) 2014-03-28 2018-03-13 Foundation of Soongsil University—Industry Cooperation Method for determining alcohol use by comparison of high-frequency signals in difference signal, and recording medium and device for implementing same
US9943260B2 (en) 2014-03-28 2018-04-17 Foundation of Soongsil University—Industry Cooperation Method for judgment of drinking using differential energy in time domain, recording medium and device for performing the method
CN106133833A (en) * 2014-03-28 2016-11-16 崇实大学校产学协力团 The method drunk is determined and for realizing record medium and the device of the method by the comparison of the high-frequency signal in differential signal
WO2015147362A1 (en) * 2014-03-28 2015-10-01 숭실대학교산학협력단 Method for determining alcohol use by comparison of high-frequency signals in difference signal, and recording medium and device for implementing same
WO2015147363A1 (en) * 2014-03-28 2015-10-01 숭실대학교산학협력단 Method for determining alcohol use by comparison of frequency frame of difference signal, and recording medium and device for implementing same
CN106486110A (en) * 2016-10-21 2017-03-08 清华大学 A kind of gamma bandpass filter group chip system supporting voice real-time decomposition/synthesis
CN106486110B (en) * 2016-10-21 2019-11-08 清华大学 It is a kind of to support voice real-time decomposition/synthesis gamma bandpass filter group chip system

Also Published As

Publication number Publication date
JP2007293285A (en) 2007-11-08

Similar Documents

Publication Publication Date Title
EP1850328A1 (en) Enhancement and extraction of formants of voice signals
CN103503060B (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
Shrawankar et al. Techniques for feature extraction in speech recognition system: A comparative study
Magi et al. Stabilised weighted linear prediction
RU2557469C2 (en) Speech synthesis and coding methods
Wang et al. Robust speaker recognition using denoised vocal source and vocal tract features
Mowlaee et al. Interspeech 2014 special session: Phase importance in speech processing applications
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
Faundez-Zanuy et al. Nonlinear speech processing: overview and applications
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Shahnaz et al. Pitch estimation based on a harmonic sinusoidal autocorrelation model and a time-domain matching scheme
CN108682432B (en) Speech emotion recognition device
Cabral et al. Glottal spectral separation for parametric speech synthesis
JPH10133693A (en) Speech recognition device
Narendra et al. Robust voicing detection and F 0 estimation for HMM-based speech synthesis
Hansen et al. Robust estimation of speech in noisy backgrounds based on aspects of the auditory process
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
Shukla et al. Spectral slope based analysis and classification of stressed speech
van Santen et al. Estimating phrase curves in the general superpositional intonation model.
Bhukya et al. Robust methods for text-dependent speaker verification
Babacan et al. Parametric representation for singing voice synthesis: A comparative evaluation
Prakash et al. Fourier-Bessel cepstral coefficients for robust speech recognition
Ou et al. Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis
López et al. Normal-to-shouted speech spectral mapping for speaker recognition under vocal effort mismatch
Ramabadran et al. The ETSI extended distributed speech recognition (DSR) standards: server-side speech reconstruction

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA HR MK YU

AKX Designation fees paid
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20080503

REG Reference to a national code

Ref country code: DE

Ref legal event code: 8566