EP1339045A1 - Method for pre-processing speech - Google Patents
Method for pre-processing speech
- Publication number
- EP1339045A1 EP1339045A1 EP02004143A EP02004143A EP1339045A1 EP 1339045 A1 EP1339045 A1 EP 1339045A1 EP 02004143 A EP02004143 A EP 02004143A EP 02004143 A EP02004143 A EP 02004143A EP 1339045 A1 EP1339045 A1 EP 1339045A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- parameters
- speech
- likelihoods
- feature data
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 238000000034 method Methods 0.000 title claims abstract description 46
- 238000007781 pre-processing Methods 0.000 title claims abstract description 24
- 239000000654 additive Substances 0.000 claims abstract description 7
- 230000000996 additive effect Effects 0.000 claims abstract description 7
- 238000012545 processing Methods 0.000 claims description 15
- 238000004590 computer program Methods 0.000 claims description 5
- 230000003595 spectral effect Effects 0.000 description 7
- 238000001228 spectrum Methods 0.000 description 7
- 230000006870 function Effects 0.000 description 6
- 238000004364 calculation method Methods 0.000 description 4
- 230000008569 process Effects 0.000 description 4
- 230000005477 standard model Effects 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 238000012935 Averaging Methods 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000000354 decomposition reaction Methods 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 230000006866 deterioration Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000011084 recovery Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/0204—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition
- G10L19/0208—Subband vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Definitions
- the present invention relates to a method for pre-processing speech and in particular to a method for pre-processing speech to be employed in a method for recognizing speech. More particularly, the invention relates to a method for pre-processing speech using noise-robust acoustic modeling by Union models with back-off.
- the object is achieved by a method for pre-processing speech according to the features of claim 1. Additionally, the object is achieved by an apparatus and a computer program product according to the features of claims 19 and 20, respectively. Preferred embodiments of the present invention are within the scope of the dependent subclaims.
- the inventive method for pre-processing speech comprises the steps of receiving a speech signal, analyzing said speech signal with respect to a given number of predetermined frequency bands and thereby generating acoustic feature data which are at least in part representative for said speech signal with respect to said frequency bands.
- the inventive method further comprises the step of deriving likelihoods for occurrences of speech elements or of sequences thereof within said speech signal based on said acoustic feature data or a derivative thereof, wherein before deriving said likelihoods parts of said acoustic feature data being representative for frequency bands which are at least assumed to be disturbed or distorted by an additive and band-limited noise signal are exchanged by exchange feature data, so as to generate modified acoustic feature data, said exchange feature data being representative for undisturbed and/or average speech.
- a set of frequency domain parameters is generated at least as a part of said acoustic feature data. Further, subsets of said frequency domain parameters are assigned to different frequency bands. According to that measure first of all the time domain signal, i.e. the amplitude of the voice signal as a function of time, is converted into a frequency domain signal, upon which certain frequency domain parameters may be obtained being representative for or describing the speech signal in the time domain and/or in the frequency domain.
- the frequency domain parameters are generated and designed so as to be representative for different frequency bands of the speech signal in the frequency domain. In principle, both the time and the frequency domain carry exactly the same information.
- melscale parameters are used as frequency domain parameters. Therefore, the complete frequency range of the received speech signal may be subdivided in and/or covered by a set of frequency bands taking into account the different information contents of the frequency bands or frequency intervals with respect to human perception capabilities.
- said exchange feature data are chosen to include exchange frequency domain parameters and in particular they may contain or include exchange melscale parameters to exchange frequency domain parameters and in particular melscale parameters which belong or which are assumed to belong to disturbed or distorted frequency bands. Thereby, a modified set of frequency domain parameters is generated.
- frequency domain parameters and in particular melscale parameters are used in the process of deriving likelihoods for the occurrence of speech elements within the received speech signal.
- first acoustic model set which is based on the entire frequency range of the speech signal.
- said first model set may comprise submodels which are based solely on respective frequency bands. In any case, it may be advantageous to involve information from as many frequency bands as possible.
- a further aspect of the present invention is to use as a first acoustic model set an acoustic model set which is based on average speech and/or on undisturbed or undistorted speech. According to that particular measure said derived exchange feature data are free from disturbance compared to the received speech signal, thereby ensuring better recognition and a higher recognition rate.
- time domain parameters or time domain like parameters, and in particular cepstral parameters, are used for the generation of said likelihoods.
- the corresponding frequency domain parameters and in particular the corresponding melscale parameters are exchanged by exchange frequency domain parameters which correspond to a speech element P1, ..., Pm to be tested and in particular taken from said first acoustic model to generate modified acoustic feature data.
- time domain parameters or time domain like parameters, and in particular said cepstral parameters, are derived, in particular by involving an inverse Fourier transform or the like.
- cepstral parameters are not true time domain parameters, they may be referred to as time domain like parameters as they are derived by involving an inverse Fourier transform leading back from the frequency domain.
- the likelihoods are derived based on said time domain parameters, time domain like parameters and/or in particular they are based on said cepstral parameters.
- these total likelihoods are derived for each number M of assumed distorted frequency bands, said number M satisfying the relation 0 ≤ M < N. I.e., it is assumed that the number of assumed distorted frequency bands is lower than the total number N of frequency bands of the speech signal.
- the case M = 0 represents the case without noise.
- a global likelihood is derived from said total likelihoods for each M and for each combination of assumed distorted or disturbed frequency bands.
- to illustrate the total likelihoods LMtotal and the global likelihood L, a decomposition of the whole frequency range F of a received speech input into three frequency subbands F1, F2, and F3 is assumed. It is further assumed that the occurrence of a particular speech element X in said received speech input is tested.
- Lj is representative for the likelihood of the occurrence of speech element X in said speech input based on frequency range or subband Fj. Therefore, single likelihoods L1, L2, and L3 are calculated.
- this calculation scheme can be used to describe the likelihood of the occurrence of a speech element X in the received speech input. This scheme is based on the fact that with a proper and appropriate replacement by modified acoustic feature data the approximation L12 ≈ L1 · L2 holds, i.e. the likelihood computed jointly over the two undistorted subbands factorizes into the single subband likelihoods.
- a system, an apparatus, a device, a dialog system, and/or the like is provided which is in each case adapted to realize, to carry out and/or to perform a method for pre-processing speech according to the present invention and/or the steps thereof.
- the first acoustic model operates in the frequency domain and is used to exchange distorted information for undistorted information. Based on the thereby modified acoustic feature data, information in the time domain is obtained. The likelihoods for different speech elements are extracted based on said time domain information using the second acoustic model which operates in the time domain or time-like domain.
- the present invention overcomes this limitation and makes the full information usable by replacing the corrupted band by an estimate of the average speech information inside that band.
- the Union model suffers less than the standard model does.
- the increased robustness is not sufficient to compensate for the lack of performance in the baseline, i.e. no-noise, case.
- the models K can be of different complexity. Namely, they can have one mean vector per HMM-state-model of the recognizer, e.g. for a monophone recognizer, one mean vector per phoneme. Or they can have only one global mean vector for all speech and silence, or one for speech and one for silence, or any other degree of tying between these two extremes.
- the number of streams or frequency bands is 3, and the number of corrupted streams is 1.
- the number of log spectral coefficients totals e. g. 21.
- the final likelihood is computed as (L1*L2) + (L1*L3) + (L2*L3), where e. g. L2 is the likelihood computed only on the second stream or band, using some cepstral parameters derived from the mid part of the spectrum or coefficients 8-14 of the 21 log spectral coefficients.
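This combination rule can be transcribed directly. The sketch below is illustrative (the function name and example values are not from the patent): it sums, over every subset of `kept` streams, the product of the single-stream likelihoods.

```python
# Union-model likelihood combination: for three streams with one assumed
# corrupted (kept = 2) this yields (L1*L2) + (L1*L3) + (L2*L3), as stated
# in the text above.
from itertools import combinations
from math import prod

def final_likelihood(stream_likelihoods, kept):
    """Sum of products of stream likelihoods over all `kept`-sized subsets."""
    return sum(prod(subset) for subset in combinations(stream_likelihoods, kept))
```

For example, stream likelihoods 0.2, 0.3, and 0.5 give 0.2·0.3 + 0.2·0.5 + 0.3·0.5 = 0.31.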
- band 3 - corresponding to the coefficients 15-21 in log spectral domain - is corrupted, then it conveys no useful information any more. It should therefore not be used to compute the likelihoods. However, we can "reconstruct" the corrupted information, by replacing it with the average information that is contained in this band as averaged over all the speech, or by the average information usually found in this band for this phone.
- as the artificial data inserted into band 3 is the same for all phone models being evaluated, it cannot discriminate between them and adds no information whatsoever to the discriminative power of the model. But it allows the computation of lower order cepstra which relate all other bands with each other.
- the net effect of the replacement is to blur information in band 3 up to the point where it is unusable, but to keep the information in bands 1 and 2 available, and also the information about the relationship of bands 1 and 2 available which is not available in the union model. Since the lower order cepstral parameters contain exactly this type of information, they cannot be used in the union model, but they can be computed in the proposed model.
- Section A of Fig. 1 shows the envelope of a speech signal S as a function of time S(t).
- the logarithmic power spectrum is generated. This is done by first applying a Fourier transform to the speech signal S. Then the logarithm of the absolute square of the Fourier transformed signal is generated.
- the result of the generation of the logarithmic power spectrum is shown in section B of Fig. 1, wherein for simplicity also the logarithm log(f) of the frequency f is taken.
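The two steps just described can be sketched as follows; the frame length, window choice, and spectral floor are practical assumptions, not given in the patent:

```python
import numpy as np

def log_power_spectrum(frame, n_fft=512):
    """Fourier transform of one speech frame, then the logarithm of the
    squared magnitude, as described above. A Hamming window and a small
    floor before the log are common practical choices (assumed here)."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed, n_fft)   # zero-pads the frame to n_fft
    return np.log(np.abs(spectrum) ** 2 + 1e-10)
```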
- the whole frequency range F is built up by a union of three distinct frequency bands F1, F2, and F3.
- the frequency bands F1 to F3 are subdivided. For each subdivision the average of the logarithmic power spectrum is taken, weighted by a weighting function, which is in the case of section B of Fig. 1 a piecewise triangular weighting function.
- the result of piecewise averaging the logarithmic power spectrum with the triangular weighting function is shown in section C of Fig. 1, where the average values Mj of the distinct subdivisions numbered by j are shown on the ordinate as single values. According to section C of Fig. 1, to each subdivision with number j a parameter Mj in the frequency domain is assigned. These parameters Mj are called melscale parameters, and they are examples of the frequency domain parameters in the sense of the invention.
- while the derived melscale parameters Mj may be used as said frequency domain parameters MSj in the sense of the invention, and in particular for generating the likelihoods of the distinct speech elements P1, ..., Pm within the speech signal S, it is often more appropriate to generate from the melscale parameters Mj the so-called cepstral parameters Cj which build up the cepstrum corresponding to the spectrum. This is done by essentially applying a discrete inverse Fourier transform to the set of melscale parameters Mj of section C. The result is shown in section D of Fig. 1.
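The triangular averaging of section C and the inverse transform of section D can be sketched as below. The number of subdivisions and the linear placement of the triangles are simplifying assumptions (a true melscale warping would space the triangles non-linearly), and a DCT-II stands in for the discrete inverse Fourier transform:

```python
import numpy as np

def melscale_parameters(log_power, n_filters=12):
    """Average the log power spectrum under triangular weighting windows,
    one per subdivision j, yielding melscale parameters M_j.
    Linear band placement is a simplification assumed here."""
    n = len(log_power)
    centers = np.linspace(0, n - 1, n_filters + 2)
    x = np.arange(n)
    params = np.empty(n_filters)
    for j in range(n_filters):
        lo, c, hi = centers[j], centers[j + 1], centers[j + 2]
        # triangular weight: rises from lo to 1 at c, falls back to 0 at hi
        w = np.clip(np.minimum((x - lo) / (c - lo), (hi - x) / (hi - c)), 0, None)
        params[j] = np.sum(w * log_power) / np.sum(w)  # weighted average
    return params

def cepstral_parameters(mel_params):
    """DCT-II of the melscale parameters, the usual discrete stand-in for
    the inverse Fourier transform back from the log spectrum."""
    n = len(mel_params)
    j = np.arange(n)
    basis = np.cos(np.pi * np.outer(j, j + 0.5) / n)
    return basis @ mel_params
```

For a flat log power spectrum every melscale parameter equals the common level, and all cepstral parameters except C0 vanish.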
- cepstral parameters Cj are used as input values for an acoustic model P operating in the time domain, time-like domain or the domain of the cepstral parameters Cj.
- a likelihood is obtained being descriptive for the chance of occurrences of the tested speech element X within the speech signal S(t) or a part thereof.
- the melscale domain or frequency domain is subdivided with respect to the given and predetermined frequency bands F1 to F3.
- Fig. 3 shows an embodiment of the inventive pre-processing of a speech signal S.
- in section B of Fig. 3 it is shown that in frequency band F2 noise components NS(f) are added.
- the scattered frequency components show the components which are comparable with section B of Fig. 1, i. e. the noise-free case.
- a weighting function is applied to the logarithmic power spectrum of section B leading to a piecewise average with respect to the frequency subdivisions of the frequency bands F1 - F3.
- the resulting melscale parameters M5 - M8, i.e. the frequency domain parameters MS5 - MS8 in the sense of the invention, are replaced by exchange melscale parameters EMS5 - EMS8 which are represented by filled symbols and which are taken from an acoustic model K with respect to undisturbed speech in the frequency domain.
- the union model strategy is incorporated into the processing of Fig. 3 by taking into account all combinations of likelihoods for distorted frequency bands F1 - F3. By doing so, it is not necessary to know which frequency band is actually distorted by noise.
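Putting the pieces together, the union strategy with exchanged feature data can be sketched as follows. All function names are hypothetical; the scoring model and the summation of the per-hypothesis likelihoods are simplifying assumptions:

```python
from itertools import combinations
import numpy as np

def to_cepstra(mel_params):
    """DCT-II as a stand-in for the inverse transform to cepstral parameters."""
    n = len(mel_params)
    j = np.arange(n)
    return np.cos(np.pi * np.outer(j, j + 0.5) / n) @ mel_params

def union_with_exchange(mel, band_slices, exchange_mel, score, m=1):
    """For every choice of m assumed-distorted bands: exchange their melscale
    parameters with the model-K values, recompute cepstra, score them with
    the time-domain model P (`score`), and sum the resulting likelihoods,
    so no knowledge of which band is actually noisy is needed."""
    total = 0.0
    for distorted in combinations(range(len(band_slices)), m):
        mel_mod = mel.copy()
        for b in distorted:
            mel_mod[band_slices[b]] = exchange_mel[band_slices[b]]
        total += score(to_cepstra(mel_mod))
    return total
```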
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
Speech recognition in the presence of noise is a difficult problem of great practical importance. Recently, the Union model has been proposed that tries to overcome the signal quality deterioration by the assumption of band-limited additive noise, and by effectively ignoring the contribution of the distorted signal band in the likelihood computation.
Basically, a signal is split up into N (let N = 5 for clarity from now on; however, N is arbitrary) frequency bands. Under the assumption that M (M < N) bands are distorted (let M = 1 for clarity from now on, although the algorithm is not limited to M = 1), the likelihood for a speech element, e.g. a phoneme, can be computed as the sum of the likelihood contributions of all combinations of N - M (= 4) bands.
- Fig. 1
- shows a pre-processing sequence known in the art.
- Fig. 2
- shows the standard Union model processing in addition to the processing of Fig. 1.
- Fig. 3
- shows a preferred embodiment of the inventive method for pre-processing speech.
Claims (20)
- Method for pre-processing speech, in particular in a method for recognizing speech, comprising the steps of: receiving a speech signal (S), analyzing said speech signal (S) with respect to a given number (N) of predetermined frequency bands (F1, ..., FN), thereby generating acoustic feature data (AFD) which are at least in part representative for said speech signal with respect to said frequency bands (F1, ..., FN), deriving likelihoods for occurrences of speech elements (P1, ..., Pm) or of sequences thereof within said speech signal (S) based on said acoustic feature data (AFD) or a derivative thereof, wherein before deriving said likelihoods parts of said acoustic feature data (AFD) being representative for frequency bands (F1, ..., FN) which are at least assumed to be disturbed by an additive and band-limited noise signal (NS) are exchanged by exchange feature data (EFD) so as to generate modified acoustic feature data (MAFD), said exchange feature data (EFD) being representative for undisturbed and/or average speech.
- Method according to claim 1, wherein a set of frequency domain parameters (MS1, ..., MS12) is generated as a part of said acoustic feature data (AFD), and wherein subsets of said frequency domain parameters (MS1, ..., MS12) are assigned to different frequency bands (F1, ..., FN).
- Method according to claim 2,
wherein melscale parameters are used as frequency domain parameters (MS1, ..., MS12). - Method according to any one of claims 2 or 3,
wherein said exchange feature data (EFD) are chosen to include exchange frequency domain parameters (EMS1, ..., EMS12) and in particular exchange melscale parameters,
to exchange frequency domain parameters (MS1, ..., MS12) and in particular melscale parameters belonging to disturbed frequency bands (F1, ..., FN) to obtain a modified set of frequency domain parameters. - Method according to any one of the preceding claims,
wherein at least a part of said exchange feature data (EFD) and in particular said exchange frequency domain parameters (EMS1, ..., EMS12) are derived and/or taken from a first acoustic model (K) operating on the frequency domain and in particular on the space of said frequency domain parameters (MS1, ..., MS12). - Method according to claim 5,
wherein a first acoustic model set (K) is used which is based on the entire frequency range (F) of the speech signal (S) or which includes submodels solely based on respective frequency bands (F1, ..., FN). - Method according to any one of the claims 5 or 6,
wherein a first acoustic model set (K) is used which is based on average and/or undisturbed speech. - Method according to any one of the preceding claims,
wherein frequency domain parameters (MS1, ..., MS12) and in particular melscale parameters are used in deriving said likelihoods. - Method according to any one of the preceding claims,
wherein time domain parameters or time domain like parameters (C1, ..., C12) are used in deriving said likelihoods. - Method according to claim 9,
wherein cepstral parameters (C1, ..., C12) are used as said time domain parameters or said time domain like parameters. - Method according to any one of the preceding claims,
wherein said likelihoods are derived using a union-model-like strategy andwherein it is assumed that a given and fixed number (M) of frequency bands (F1, ..., FN) lower than said number (N) of the frequency bands (F1, ..., FN) are disturbed or distorted by a band-limited and additive noise signal (NS). - Method according to claim 11,
wherein for each assumed disturbed or distorted frequency band (F1, ..., FN) corresponding or assigned frequency domain parameters (MS1, ..., MS12) and in particular corresponding melscale parameters are exchanged by exchange frequency domain parameters (EMS1, ..., EMS12) which correspond to a speech element (P1, ..., Pn) to be tested and in particular taken from said first acoustic model (K) to generate modified acoustic feature data (MAFD). - Method according to claim 12,
wherein based on said frequency domain parameters (MS1, ..., MS12) and said exchange frequency domain parameters (EMS1, ..., EMS12) said time domain parameters or time domain like parameters (C1, ..., C12) and in particular said cepstral parameters are derived, in particular by involving an inverse Fourier transform or the like. - Method according to any one of the claims 11 to 13,
wherein said likelihoods are derived based on said time domain parameters or time domain like parameters (C1, ..., C12) and in particular based on said cepstral parameters. - Method according to any one of the claims 11 to 14,wherein single likelihoods (Lj) or substream likelihoods for each combination of fixed M assumed disturbed or distorted frequency bands (F1, ..., FN) are generated andwherein a total likelihood (LMtotal) is derived from said single likelihoods (Lj).
- Method according to any one of the claims 11 to 15,wherein total likelihoods (LMtotal) are derived for each M fulfilling 0 ≤ M < N andwherein a global likelihood (L) is derived from said total likelihoods (LMtotal) for each M and for each combination of assumed distorted or disturbed frequency bands (F1, ..., FN).
- Method according to any one of the preceding claims,
wherein for deriving said likelihoods (Lj, LMtotal, L) a second acoustic model (P) operating on the time domain or time-like domain and in particular operating on the time domain parameter space, time domain like parameter space and/or the cepstral parameter space is used, evaluating the complete cepstral information. - Method according to any one of the preceding claims,
wherein the first acoustic model (K) has different complexities compared to said second acoustic model (P). - Apparatus which is capable of realizing a method for pre-processing speech according to any one of the claims 1 to 18 and/or the steps thereof.
- Computer program product, comprising computer program means adapted to perform and/or to realize a method for pre-processing speech according to any one of the claims 1 to 18 and/or the steps thereof when it is executed on a computer, a digital signal processing means and/or the like.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP02004143A EP1339045A1 (en) | 2002-02-25 | 2002-02-25 | Method for pre-processing speech |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP02004143A EP1339045A1 (en) | 2002-02-25 | 2002-02-25 | Method for pre-processing speech |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1339045A1 true EP1339045A1 (en) | 2003-08-27 |
Family
ID=27635830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP02004143A Withdrawn EP1339045A1 (en) | 2002-02-25 | 2002-02-25 | Method for pre-processing speech |
Country Status (1)
Country | Link |
---|---|
EP (1) | EP1339045A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109346097A (en) * | 2018-03-30 | 2019-02-15 | 上海大学 | A kind of sound enhancement method based on Kullback-Leibler difference |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0789349A2 (en) * | 1996-02-09 | 1997-08-13 | Canon Kabushiki Kaisha | Pattern matching method and apparatus and telephone system |
US20010021905A1 (en) * | 1996-02-06 | 2001-09-13 | The Regents Of The University Of California | System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech |
WO2002095730A1 (en) * | 2001-05-21 | 2002-11-28 | Queen's University Of Belfast | Interpretation of features for signal processing and pattern recognition |
-
2002
- 2002-02-25 EP EP02004143A patent/EP1339045A1/en not_active Withdrawn
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20010021905A1 (en) * | 1996-02-06 | 2001-09-13 | The Regents Of The University Of California | System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech |
EP0789349A2 (en) * | 1996-02-09 | 1997-08-13 | Canon Kabushiki Kaisha | Pattern matching method and apparatus and telephone system |
WO2002095730A1 (en) * | 2001-05-21 | 2002-11-28 | Queen's University Of Belfast | Interpretation of features for signal processing and pattern recognition |
Non-Patent Citations (5)
Title |
---|
JANCOVIC P. ET AL: "COMBINING MULTI-BAND AND FREQUENCY-FILTERING TECHNIQUES FOR SPEECH RECOGNITION IN NOISY ENVIRONMENTS", TEXT, SPEECH AND DIALOGUE. INTERNATIONAL WORKSHOP, TSD.PROCEEDINGS,, no. 1902, 13 September 2000 (2000-09-13), BERLIN HEIDELBERG, pages 265 - 270, XP008006658 * |
JANCOVIC P.; MING J.: "Combining the union model and missing feature method to improve noise robustness in ASR", 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. PROCEEDINGS. (ICASSP). ORLANDO, FL, MAY 13 - 17, 2002; [IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP)], vol. 1, 13 May 2003 (2003-05-13), NEW YORK, NY : IEEE, US, pages I-69 - I-72 * |
JI MING ET AL: "Union: a new approach for combining sub-band observations for noisy speech recognition", SPEECH COMMUNICATION,1-1-2001, ELSEVIER SCIENCE PUBLISHERS, vol. 34, no. 1-2, 1 January 2001 (2001-01-01), AMSTERDAM, NETHERLANDS, pages 41 - 55, XP002209287 * |
MACHO D.; NADEU C.: "ON THE INTERACTION BETWEEN TIME AND FREQUENCY FILTERING OF SPEECH PARAMETERS FOR ROBUST SPEECH RECOGNITION", PROC. ICSLP '98, 1 October 1998 (1998-10-01), pages 1487 - 1490, XP007000817 * |
NADEU C.; HERNANDO J.; GORRICHO M.: "ON THE DECORRELATION OF FILTER-BANK ENERGIES IN SPEECH RECOGNITION", 4TH EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. EUROSPEECH '95. MADRID, SPAIN, SEPT. 18 - 21, 1995; [EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. (EUROSPEECH)], vol. 2, 18 September 1995 (1995-09-18), MADRID : GRAFICAS BRENS, ES, pages 1381 - 1384, XP000854958 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109346097A (en) * | 2018-03-30 | 2019-02-15 | 上海大学 | A kind of sound enhancement method based on Kullback-Leibler difference |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Michelsanti et al. | Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification | |
Tan et al. | Low-complexity variable frame rate analysis for speech recognition and voice activity detection | |
EP1536414B1 (en) | Method and apparatus for multi-sensory speech enhancement | |
KR101224755B1 (en) | Multi-sensory speech enhancement using a speech-state model | |
Vaseghi | Multimedia signal processing: theory and applications in speech, music and communications | |
JP5127754B2 (en) | Signal processing device | |
US7313518B2 (en) | Noise reduction method and device using two pass filtering | |
EP2164066A1 (en) | Noise spectrum tracking in noisy acoustical signals | |
EP2416315A1 (en) | Noise suppression device | |
KR20060044629A (en) | Isolating speech signals utilizing neural networks | |
JP2010055000A (en) | Signal band extension device | |
CN104067339A (en) | Noise suppression device | |
US7454338B2 (en) | Training wideband acoustic models in the cepstral domain using mixed-bandwidth training data and extended vectors for speech recognition | |
KR20130057668A (en) | Voice recognition apparatus based on cepstrum feature vector and method thereof | |
EP1995722B1 (en) | Method for processing an acoustic input signal to provide an output signal with reduced noise | |
Sanam et al. | A semisoft thresholding method based on Teager energy operation on wavelet packet coefficients for enhancing noisy speech | |
Saleem et al. | Spectral phase estimation based on deep neural networks for single channel speech enhancement | |
JP5443547B2 (en) | Signal processing device | |
JP2002268698A (en) | Voice recognition device, device and method for standard pattern generation, and program | |
Hammam et al. | Blind signal separation with noise reduction for efficient speaker identification | |
EP1339045A1 (en) | Method for pre-processing speech | |
Ding | Speech enhancement in transform domain | |
WO2006114100A1 (en) | Estimation of signal from noisy observations | |
Korany | Application of wavelet transform for classification of underwater acoustic signals | |
Hirsch | Automatic speech recognition in adverse acoustic conditions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR |
|
AX | Request for extension of the european patent |
Extension state: AL LT LV MK RO SI |
|
17P | Request for examination filed |
Effective date: 20040123 |
|
AKX | Designation fees paid |
Designated state(s): DE FR GB |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: SONY DEUTSCHLAND GMBH |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: SONY DEUTSCHLAND GMBH |
|
17Q | First examination report despatched |
Effective date: 20081105 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20090317 |