EP1339045A1 - Method for pre-processing speech - Google Patents

Method for pre-processing speech Download PDF

Info

Publication number
EP1339045A1
Authority
EP
European Patent Office
Prior art keywords
parameters
speech
likelihoods
feature data
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP02004143A
Other languages
German (de)
French (fr)
Inventor
Thomas Kemp, c/o Advanced Tech. Ctr. Stuttgart
Raquel Tato, c/o Advanced Tech. Ctr. Stuttgart
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Deutschland GmbH
Original Assignee
Sony International Europe GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony International Europe GmbH filed Critical Sony International Europe GmbH
Priority to EP02004143A priority Critical patent/EP1339045A1/en
Publication of EP1339045A1 publication Critical patent/EP1339045A1/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition
    • G10L19/0208: Subband vocoders
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method for pre-processing speech, in particular for a method for recognizing speech, is suggested in which likelihoods (Lj, LMtotal, L) for occurrences of speech elements (P1, ..., Pm) are derived based on acoustic feature data (AFD), wherein before deriving said likelihoods (Lj, LMtotal, L) parts of said acoustic feature data (AFD) being representative for frequency bands (F1, ..., FN) which are assumed to be distorted by an additive and band-limited noise signal (NS) are exchanged by exchange feature data (EFD) so as to generate modified acoustic feature data (MAFD), said exchange feature data (EFD) being representative for undisturbed and/or average speech.

Description

The present invention relates to a method for pre-processing speech, and in particular to a method for pre-processing speech to be employed in a method for recognizing speech. More particularly, the invention relates to a method for pre-processing speech using noise-robust acoustic modeling by Union models with back-off.
Nowadays, electronic equipment and appliances which use speech dialog systems, for instance as a man-machine interface between a user and the device, are becoming more and more important. Therefore, speech recognition is an important aspect of such appliances. As the distinct applications may also take place in noisy environments, the problem of recognizing speech from a speech signal which is disturbed or distorted by a noise signal has to be faced.
Although many approaches try to separate the noise signal from the speech signal in advance of the recognition process, these known methods are far from achieving a satisfactory recognition rate when noise signals interfere with speech signals.
Therefore, it is an object of the present invention to provide a method for pre-processing speech, in particular within a method of recognizing speech, which is capable of taking into account noise signals in superposition with speech signals in a particularly simple and reliable manner.
The object is achieved by a method for pre-processing speech according to the features of claim 1. Additionally, the object is achieved by an apparatus and a computer program product according to the features of claims 19 and 20, respectively. Preferred embodiments of the present invention are within the scope of the dependent subclaims.
The inventive method for pre-processing speech, in particular in a method for recognizing speech, comprises the steps of receiving a speech signal, analyzing said speech signal with respect to a given number of predetermined frequency bands, and thereby generating acoustic feature data which are at least in part representative for said speech signal with respect to said frequency bands. Further, the inventive method comprises the step of deriving likelihoods for occurrences of speech elements or of sequences thereof within said speech signal based on said acoustic feature data or a derivative thereof, wherein before deriving said likelihoods parts of said acoustic feature data being representative for frequency bands which are at least assumed to be disturbed or distorted by an additive and band-limited noise signal are exchanged by exchange feature data, so as to generate modified acoustic feature data, said exchange feature data being representative for undisturbed and/or average speech.
It is therefore a basic idea of the present invention to exchange information extracted from the speech signal which is assumed to be disturbed or distorted by a noise signal by undisturbed and undistorted information. Consequently, the whole content of information upon which the recognition process is based, i.e. on which at least the generation and calculation of likelihoods of speech elements is based, is modified so as to exchange disturbed or distorted information by artificial information. It is therefore assumed that the content of artificial information leads to a better recognition process and in particular to a more reliable calculation of likelihoods.
According to a preferred embodiment of the inventive method of pre-processing speech, a set of frequency domain parameters is generated at least as a part of said acoustic feature data. Further, subsets of said frequency domain parameters are assigned to different frequency bands. According to that measure, first of all the time domain signal, i.e. the amplitude of the voice signal as a function of time, is converted into a frequency domain signal, from which certain frequency domain parameters may be obtained which are representative for or describe the speech signal in the time domain and/or in the frequency domain. The frequency domain parameters are generated and designed so as to be representative for different frequency bands of the speech signal in the frequency domain. In principle, both the time and the frequency domain carry exactly the same information.
In a preferred embodiment of the present invention melscale parameters are used as frequency domain parameters. Thereby, the complete frequency range of the received speech signal may be subdivided into and/or covered by a set of frequency bands, taking into account the different information contents of the frequency bands or frequency intervals with respect to human perception capabilities.
According to a further preferred embodiment of the present invention, said exchange feature data are chosen to include exchange frequency domain parameters, and in particular exchange melscale parameters, in order to exchange those frequency domain parameters, and in particular melscale parameters, which belong or are assumed to belong to disturbed or distorted frequency bands. Thereby, a modified set of frequency domain parameters is generated.
It is of particular advantage to derive at least a part of said exchange feature data and in particular at least a part of said exchange frequency domain parameters based on a first acoustic model set which operates on the frequency domain and/or on the space of said frequency domain parameters.
It is further preferred to use frequency domain parameters and in particular melscale parameters in the process of deriving likelihoods for the occurrence of speech elements within the received speech signal.
It is a further aspect of the present invention to use a first acoustic model set which is based on the entire frequency range of the speech signal. Alternatively, said first model set may comprise submodels which are based solely on respective frequency bands. In any case, it may be advantageous to involve information from as many frequency bands as possible.
A further aspect of the present invention is to use as a first acoustic model set an acoustic model set which is based on average speech and/or on undisturbed or undistorted speech. According to that particular measure, said derived exchange feature data are free from disturbance compared to the received speech signal, thereby ensuring better recognition and a higher recognition rate.
According to an alternative, it is also possible to use time domain parameters or time domain like parameters, and in particular cepstral parameters, for the generation of said likelihoods.
According to another aspect of the present invention it is suggested to derive said likelihoods using a union-model-like strategy, wherein it is assumed that a given and fixed number of frequency bands or frequency intervals, which is lower than the total number of frequency bands, are disturbed or distorted by a band-limited and additive noise signal.
It is further preferred that for each assumed disturbed or distorted frequency band the corresponding frequency domain parameters, and in particular the corresponding melscale parameters, are exchanged by exchange frequency domain parameters which correspond to a speech element P1, ..., Pm to be tested, and which are in particular taken from said first acoustic model, to generate modified acoustic feature data.
It is further provided in the inventive method to derive, based on said frequency domain parameters and said exchange frequency domain parameters, said time domain parameters or time domain like parameters, and in particular said cepstral parameters, in particular by involving an inverse Fourier transform or the like. Although it follows from actual calculation schemes that e.g. cepstral parameters are not true time domain parameters, they may be referred to as time domain like parameters as they are derived by involving an inverse Fourier transform leading back from the frequency domain.
According to a further advantageous embodiment of the present invention the likelihoods are derived based on said time domain parameters, time domain like parameters and/or in particular they are based on said cepstral parameters.
It is a further aspect to generate single likelihoods for each combination of M assumed and fixed distorted frequency bands. Based on these single likelihoods for the variety of M assumed distorted frequency bands, a total likelihood is derived.
According to a further aspect, these total likelihoods are derived for each number M of assumed distorted frequency bands, said number M satisfying the relation 0 ≤ M < N, i.e. it is assumed that the number of assumed distorted frequency bands is lower than the total number of frequency bands of the speech signal. The case M = 0 represents the case without noise. Under these circumstances a global likelihood is derived from said total likelihoods for each M and for each combination of assumed distorted or disturbed frequency bands.
Both of these latter aspects, together with the inventive principle of exchanging corrupted frequency domain data, represent a modification of the union-model-like strategy, which is also a basic aspect of the present invention. According to these aspects the global likelihood for the occurrence of a speech element is calculated from the entirety of single likelihoods for each combination and for each number of assumed distorted frequency bands.
In the following, the construction and generation of the single likelihoods Lj, the total likelihoods LMtotal and the global likelihood L are illustrated by an example. For this example, a decomposition of the whole frequency range F of a received speech input into three frequency subbands F1, F2, and F3 is assumed. It is further assumed that the occurrence of a particular speech element X in said received speech input is tested.
Further, in the following Lj is representative for the likelihood of the occurrence of speech element X in said speech input based on frequency range or subband Fj. Therefore, single likelihoods L1, L2, and L3 are calculated.
For each assumption of M < N = 3 distorted frequency bands a total likelihood L1total, L2total can be derived in an approximative way:

for M = 1: L1total ≈ L1·L2 + L1·L3 + L2·L3
for M = 2: L2total ≈ L1 + L2 + L3.
For a simple estimation the global likelihood L can be set to L := L1total or L := L2total.
If, however, the number of distorted subbands is unknown, all terms from the equations listed above can be added to contribute to the global likelihood L:

L := L1·L2 + L1·L3 + L2·L3 + L1 + L2 + L3 = L1total + L2total.
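The combination scheme generalizes directly to arbitrary N and M. The following is a minimal sketch (Python; all names are illustrative, not taken from the patent) assuming the single likelihoods Lj are given as a list. Note that the sum over all 0 ≤ M < N also contains the M = 0 term L1·L2·L3, whereas the worked example above adds only the M = 1 and M = 2 contributions.

    from itertools import combinations
    from math import prod

    def total_likelihood(L, M):
        # L_M_total: sum over all combinations of N - M kept bands of
        # the product of the single likelihoods Lj of the kept bands.
        N = len(L)
        return sum(prod(L[j] for j in kept)
                   for kept in combinations(range(N), N - M))

    def global_likelihood(L):
        # Global likelihood in the sense of claim 16: sum of the total
        # likelihoods over all assumed numbers of distorted bands M.
        return sum(total_likelihood(L, M) for M in range(len(L)))

    # With N = 3 this reproduces the example above:
    # total_likelihood([L1, L2, L3], 1) == L1*L2 + L1*L3 + L2*L3
    # total_likelihood([L1, L2, L3], 2) == L1 + L2 + L3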
It is a further aspect of the present invention that this calculation scheme can be used to describe the likelihood of the occurrence of a speech element X in the received speech input. This scheme is based on the fact that, with a proper and appropriate replacement by modified acoustic feature data, the approximation L1·L2 ≈ L12 holds, where L12 denotes the likelihood obtained from the complete modified acoustic feature data in which the third band has been replaced by exchange feature data.
It is of further advantage to use for deriving said likelihoods a second acoustic model which operates on the time domain or time-like domain, and in particular on the space of the time domain parameters or time domain like parameters, and/or more particularly on the space of the cepstral parameters. Additionally, in deriving said likelihoods, the complete cepstral information is used according to the invention.
To increase the performance of the inventive method it is suggested to use a first acoustic model in the frequency domain having a different complexity compared to said second acoustic model operating in the time domain.
According to a further aspect of the invention a system, an apparatus, a device, a dialog system, and/or the like is provided which is in each case adapted to realize, to carry out and/or to perform a method for pre-processing speech according to the present invention and/or the steps thereof.
It is a further aspect of the present invention to provide a computer program product, comprising computer program means which is adapted to perform and/or to realize the inventive method for pre-processing speech and/or the steps thereof, when it is executed on a computer, a digital signal processing means, and/or the like.
It has to be emphasized that, according to the preferred embodiment of the present invention, two acoustic models are involved in the inventive method. The first acoustic model operates in the frequency domain and is used to exchange distorted information by undistorted information. Based on the thereby modified acoustic feature data, information in the time domain is obtained. The likelihoods for different speech elements are extracted based on said time domain information using the second acoustic model, which operates in the time domain or time-like domain.
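As a rough illustration of this two-model interplay, the following sketch shows one likelihood term being computed (Python; model_K.exchange_parameters and model_P.likelihood are hypothetical interfaces assumed for illustration, not defined by the patent): the melscale parameters of the assumed distorted bands are exchanged using the frequency domain model K, a full-range cepstrum is computed, and the cepstral model P is evaluated.

    import numpy as np
    from scipy.fftpack import dct

    def modified_likelihood(mel_params, band_slices, distorted,
                            model_K, model_P, element):
        # Exchange the melscale parameters of the assumed distorted
        # bands by exchange feature data from the frequency domain
        # model K, yielding modified acoustic feature data.
        modified = np.array(mel_params, dtype=float)
        for b in distorted:
            modified[band_slices[b]] = model_K.exchange_parameters(element, b)
        # Full-range cepstrum from the modified feature data; the DCT
        # stands in for the inverse Fourier transform of the text.
        cepstra = dct(modified, type=2, norm='ortho')
        # Score with the second model P in the cepstral domain.
        return model_P.likelihood(cepstra, element)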
These and further aspects of the present invention are discussed taking reference to the following remarks:
Speech recognition in the presence of noise is a difficult problem of great practical importance. Recently, the Union model has been proposed, which tries to overcome the signal quality deterioration by assuming band-limited additive noise and by effectively ignoring the contribution of the distorted signal band in the likelihood computation.
However, this model suffers from the reduction of the information usable by the likelihood computation.
The present invention overcomes this limitation and makes the full information usable by replacing the corrupted band by an estimate of the average speech information inside that band.
To achieve this object an improved Union model for the likelihood combination in the presence of band-limited additive noise is proposed.
Basically, a signal is split up into N frequency bands (let N = 5 for clarity from now on; however, N is arbitrary). Under the assumption that M (M < N) bands are distorted (let M = 1 for clarity from now on, although the algorithm is not limited to M = 1), the likelihood for a speech element, e.g. a phoneme, can be computed as the sum of the likelihood contributions of all combinations of N - M (= 4) bands.
The principal idea is that if a combination includes the corrupted, i.e. noisy, band then its likelihood is very low, and therefore the sum of the individual likelihood contributions is dominated by the one combination of bands from which the noisy band is excluded.
The interesting property of the Union model is now that it is not necessary to know which of the bands is corrupted.
In the Union model, in order to run the pre-processing on the individual subbands, it is no longer possible to compute features that use input from different sub-bands. Specifically, in the Union model it is only possible to compute a limited number of cepstral parameters, e.g. 3 in a specific implementation, which can be associated with C4, C5 and C6 in the standard model. Since the parameters C1, C2 and C3 carry important information, this is a strong drawback of the standard Union model as such, which leads to a reduced performance of the Union model if there is no noise present.
In the presence of noise, the Union model suffers less than the standard model does. However, in many cases the increased robustness is not sufficient to compensate for the lack of performance in the baseline, i.e. no noise, case.
It is a basic idea of the invention to allow the full number of cepstral parameters to be computed by changing the pre-processing significantly, while maintaining the principal advantage of the Union model, i.e. the independence of the identity of the corrupted channel.
The reason for the intrinsic inability of the Union model to make use of more than a few cepstra, and of no lower order cepstra at all, is its very idea: the sub-band processing. If the full range of, say, Q (typically 15 < Q < 40) spectral energy parameters is split up into N bands of size Q/N each, then the discrete cosine transform that is used to decorrelate the features can only be computed up to order Q/2N, rather than to order Q/2 as in the original model. In order to compute low-order cepstra, it is necessary to take into account information from all bands, which is not feasible in the Union model.
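A quick back-of-the-envelope check of this order limitation, using the figures of the example below (Q = 21 log spectral coefficients, N = 3 sub-bands):

    Q, N = 21, 3
    print(Q // 2)        # about 10 usable cepstra from the full spectrum
    print(Q // (2 * N))  # only about 3 cepstra per sub-band in the Union model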
It is suggested to have two sets of acoustic models, i. e. the standard models P in the cepstral domain - without any sub-streams, using the whole information - and another set of models K in the log spectral domain.
The models K can be of different complexity. Namely, they can have one mean vector per HMM-state-model of the recognizer, e.g. for a monophone recognizer, one mean vector per phoneme. Or they can have only one global mean vector for all speech and silence, or one for speech and one for silence, or any other degree of tying between these two extremes.
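The tying alternatives for the models K can be sketched as a simple mean estimation over training material (illustrative only; frames and labels are assumed training data structures, not defined in the patent):

    import numpy as np

    def estimate_model_K(frames, labels, tying="per_phoneme"):
        # frames: (T, Q) array of log spectral vectors; labels: T
        # phoneme labels. Two tying extremes are shown: one global
        # mean vector for all speech, or one mean vector per phoneme
        # as for a monophone recognizer.
        frames = np.asarray(frames, dtype=float)
        labels = np.asarray(labels)
        if tying == "global":
            return {"speech": frames.mean(axis=0)}
        return {p: frames[labels == p].mean(axis=0) for p in set(labels)}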
Suppose now that in the union model the number of streams or frequency bands is 3, and the number of corrupted streams is 1. The number of log spectral coefficients totals e.g. 21. Then, in the standard union model the final likelihood is computed as (L1*L2) + (L1*L3) + (L2*L3), where e.g. L2 is the likelihood computed only on the second stream or band, using some cepstral parameters derived from the mid part of the spectrum, i.e. coefficients 8-14 of the 21 log spectral coefficients.
In this invention, it is suggested to compute the terms, e.g. the one comparable to L1*L3, differently. Rather than constructing models for streams L1 and L3, evaluating them separately, and multiplying the likelihoods, it is proposed to use the original cepstral models P, which are based on all 21 log spectral coefficients, but to replace, before deriving the cepstrum, the coefficients 8-14 of the log-spectral domain parameters by the coefficients 8-14 of the model K mentioned above. After the replacement has been done, the pre-processing is continued as usual, i.e. the cepstral parameters are computed and evaluated using the standard models.
To compute the term comparable to L1*L2, similarly, the standard pre-processing of the input data is done until the point where the 21 log spectral coefficients are computed. Then, the coefficients 15-21 are replaced by the values taken from the model K, and the pre-processing is continued to compute ordinary cepstral parameters as usual.
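For these concrete figures, one replacement term can be sketched as follows (illustrative; K_mean is an assumed name for a mean log-spectral vector of model K, and score_cepstra stands for evaluating the standard cepstral models P):

    import numpy as np
    from scipy.fftpack import dct

    # 21 log spectral coefficients split into three bands of 7:
    BANDS = {1: slice(0, 7), 2: slice(7, 14), 3: slice(14, 21)}

    def term_without(band, log_spec, K_mean, score_cepstra):
        # Term comparable to e.g. L1*L2 (band 3 assumed corrupted):
        # replace the band's coefficients by the model-K values and
        # continue the pre-processing as usual.
        modified = np.array(log_spec, dtype=float)
        modified[BANDS[band]] = K_mean[BANDS[band]]
        cepstra = dct(modified, type=2, norm='ortho')  # full cepstrum
        return score_cepstra(cepstra)

    # Final likelihood, analogous to the union model combination:
    # term_without(3, ...) + term_without(2, ...) + term_without(1, ...)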
A basic idea is the following. If, say, band 3 - corresponding to the coefficients 15-21 in the log spectral domain - is corrupted, then it conveys no useful information any more. It should therefore not be used to compute the likelihoods. However, we can "reconstruct" the corrupted information by replacing it with the average information that is contained in this band as averaged over all speech, or by the average information usually found in this band for this phone.
So, the net effect of the replacement is to blur the information in band 3 up to the point where it is unusable, but to keep the information in bands 1 and 2 available, and also to keep available the information about the relationship of bands 1 and 2, which is not available in the union model. Since the lower order cepstral parameters contain exactly this type of information, they cannot be used in the union model, but they can be computed in the proposed model.
No information loss is incurred in addition to the information loss caused by the noise, which is theoretically unavoidable. This recovery of lost information is the main advantageous difference between the invention and the state of the art.
The aspects of the present invention will be discussed in further detail taking reference to the accompanying drawings.
Fig. 1
shows a pre-processing sequence known in the art.
Fig. 2
shows the standard Union model processing in addition to the processing of Fig. 1.
Fig. 3
shows a preferred embodiment of the inventive method for pre-processing speech.
In the following, the same reference symbols refer to comparable elements and aspects.
First of all, the standard pre-processing of a received speech signal S is described taking reference to sections A to D of Fig. 1.
Section A of Fig. 1 shows the envelope of a speech signal S as a function of time S(t).
Based on this particular amplitude-time relationship of the speech signal S, the logarithmic power spectrum is generated. This is done by first applying a Fourier transform to the speech signal S. Then the logarithm of the absolute square of the Fourier transformed signal is generated.
The result of the generation of the logarithmic power spectrum is shown in section B of Fig. 1, wherein for simplicity also the logarithm log(f) of the frequency f is taken. The whole frequency range F is built up by a union of three distinct frequency bands F1, F2, and F3.
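A minimal sketch of this step (Python with NumPy; the window choice and FFT length are assumptions, the text only prescribes Fourier transform, squared magnitude and logarithm):

    import numpy as np

    def log_power_spectrum(frame, n_fft=512):
        # Fourier transform a windowed speech frame and take the
        # logarithm of the absolute square of the result.
        windowed = frame * np.hamming(len(frame))
        spectrum = np.fft.rfft(windowed, n=n_fft)
        return np.log(np.abs(spectrum) ** 2 + 1e-10)  # floor avoids log(0)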
The frequency bands F1 to F3 are subdivided. For each subdivision the average of the logarithmic power spectrum is taken, taking into account a weighting function, which in the case of section B of Fig. 1 is a piecewise triangular weighting function.
The result of piecewise averaging the logarithmic power spectrum with the triangular weighting function is shown in section C of Fig. 1, where on the ordinate the average values Mj of the distinct subdivisions numbered by j are shown as single values. According to section C of Fig. 1, a parameter Mj in the frequency domain is assigned to each subdivision with number j. These parameters Mj are called melscale parameters, and they are examples of the frequency domain parameters in the sense of the invention.
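The piecewise averaging can be sketched as follows (illustrative; the text only requires a triangular weighting per subdivision, and the equal spacing used here is a simplification of a true mel spacing):

    import numpy as np

    def melscale_parameters(log_spec, n_params=12):
        # Weighted average of the logarithmic power spectrum under
        # n_params triangular windows, giving one melscale parameter
        # Mj per subdivision.
        bins = np.arange(len(log_spec), dtype=float)
        edges = np.linspace(0, len(log_spec) - 1, n_params + 2)
        params = np.empty(n_params)
        for j in range(n_params):
            lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
            rising = (bins - lo) / max(mid - lo, 1e-9)
            falling = (hi - bins) / max(hi - mid, 1e-9)
            tri = np.clip(np.minimum(rising, falling), 0.0, 1.0)
            params[j] = np.sum(tri * log_spec) / max(np.sum(tri), 1e-9)
        return params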
Although the derived melscale parameters Mj may be used as said frequency domain parameters MSj in the sense of the invention, and in particular for generating the likelihoods of the distinct speech elements P1, ..., Pm within the speech signal S, it is often more appropriate to generate from the melscale parameters Mj the so-called cepstral parameters Cj, which build up the cepstrum corresponding to the spectrum. This is done by essentially applying a discrete inverse Fourier transform to the set of melscale parameters Mj of section C. The result is shown in section D of Fig. 1.
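The transition from section C to section D then reduces to a single transform; in practice the discrete inverse Fourier transform is commonly realized as a type-II discrete cosine transform, which is the convention assumed in this sketch:

    from scipy.fftpack import dct

    def cepstral_parameters(mel_params):
        # Cepstral parameters Cj from the melscale parameters Mj; the
        # DCT plays the role of the discrete inverse Fourier transform.
        return dct(mel_params, type=2, norm='ortho')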
The so derived cepstral parameters Cj are used as input values for an acoustic model P operating in the time domain, the time-like domain, or the domain of the cepstral parameters Cj. Upon input of the cepstral parameters Cj and a speech element X to be tested, a likelihood is obtained which is descriptive for the chance of occurrence of the tested speech element X within the speech signal S(t) or a part thereof.
It is emphasized that the distinct amplitudes of the signals and the derived parameters of sections A to D of Fig. 1 have no strict mathematical correspondence to each other but are only used as a simplified explanation for the relationship between them.
In the union model approach of the prior art shown in Fig. 2, the pre-processing scheme differs with respect to the transition from section C to section D of the processing of Fig. 1.
After calculating the melscale parameters MSj as frequency domain parameters according to section C of Fig. 2 the melscale domain or frequency domain is subdivided with respect to the given and predetermined frequency bands F1 to F3.
The result is shown in section E of Fig. 2. For each frequency band F1 to F3 a separate set of melscale parameters M1 - M4, M5 - M8 and M9 - M12 is derived. In contrast to the processing shown in Fig. 1, where the entirety of all melscale parameters M1 - M12 is used to calculate the cepstral parameters C1 - C12 by applying an inverse Fourier transform, cepstral parameters are calculated for each of the subdivided melscale domains separately.
That means that by applying an inverse Fourier transform to the set M1 - M4, cepstral parameters C11 - C41 are derived. Because of the properties of the inverse Fourier transform, and because of the subdivision and separation of the melscale domain, this first parameter set C11 - C41 does not contain information from the other frequency bands F2 and F3. The parameters C11 - C41 are at most comparable to the cepstral parameters C5 - C8 of the processing of Fig. 1, the latter of which take into account the whole frequency information of all frequency bands F1 - F3.
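This per-band processing of the prior art union model can be contrasted with the full-range transform of Fig. 1 in a short sketch (illustrative band slicing for twelve melscale parameters):

    import numpy as np
    from scipy.fftpack import dct

    def union_subband_cepstra(mel,
                              band_slices=(slice(0, 4), slice(4, 8), slice(8, 12))):
        # One separate transform per sub-band: the set derived from
        # M1 - M4 carries no information about the other bands, so the
        # low-order cross-band cepstra of the Fig. 1 processing are lost.
        mel = np.asarray(mel, dtype=float)
        return [dct(mel[s], type=2, norm='ortho') for s in band_slices]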
As a result, in the processing of Fig. 2 information is lost with respect to the lower cepstral parameters C1 to C4 of Fig. 1, although the aspect of noise is taken into account by the union model processing of Fig. 2.
Fig. 3 shows an embodiment of the inventive pre-processing of a speech signal S.
In section B of Fig. 3 it is shown that in frequency band F2 noise components NS(f) are added. The scattered frequency components show the components which are comparable with section B of Fig. 1, i.e. the noise-free case.
In the transition from section B to section C of Fig. 3, again a weighting function is applied to the logarithmic power spectrum of section B, leading to a piecewise average with respect to the frequency subdivisions of the frequency bands F1 - F3. As shown in section C of Fig. 3, the resulting melscale parameters M5 - M8, i.e. the frequency domain parameters MS5 - MS8 in the sense of the invention, are replaced by exchange melscale parameters EMS5 - EMS8, which are represented by filled symbols and which are taken from an acoustic model K with respect to undisturbed speech in the frequency domain.
The set of melscale parameters of section C, in which the original melscale parameters MS5 - MS8 are replaced by corrected melscale parameters EMS5 - EMS8, is used to again generate a complete and full-range cepstrum as shown in section D of Fig. 3.
The major difference between the processing of the prior art according to Fig. 1 and the inventive processing of Fig. 3 is that a noise aspect is taken into account by replacing melscale parameters which are assumed to belong to distorted signal components by parameters which are undisturbed. The difference between the prior art processing of Fig. 2 and the inventive processing of Fig. 3 is that the complete frequency domain information of all sub-bands F1 - F3 is used to calculate the time domain parameters or cepstral parameters Cj.
The union model strategy is incorporated into the processing of Fig. 3 by taking into account all combinations of likelihoods for distorted frequency bands F1 - F3. As this is done, it is not necessary to know which frequency band actually is distorted by noise.
Again, all numbers of assumed distorted frequency bands can be taken into account, as long as the number of assumed distorted frequency bands is lower than the total number of frequency bands.

Claims (20)

  1. Method for pre-processing speech, in particular in a method for recognizing speech, comprising the steps of:
    receiving a speech signal (S),
    analyzing said speech signal (S) with respect to a given number (N) of predetermined frequency bands (F1, ..., FN),
    thereby generating acoustic feature data (AFD) which are at least in part representative for said speech signal with respect to said frequency bands (F1, ..., FN),
    deriving likelihoods for occurrences of speech elements (P1, ..., Pm) or of sequences thereof within said speech signal (S) based on said acoustic feature data (AFD) or a derivative thereof,
    wherein before deriving said likelihoods parts of said acoustic feature data (AFD) being representative for frequency bands (F1, ..., FN) which are at least assumed to be disturbed by an additive and band-limited noise signal (NS) are exchanged by exchange feature data (EFD) so as to generate modified acoustic feature data (MAFD), said exchange feature data (EFD) being representative for undisturbed and/or average speech.
  2. Method according to claim 1,
    wherein a set of frequency domain parameters (MS1, ..., MS12) is generated as a part of said acoustic feature data (AFD), and
    wherein subsets of said frequency domain parameters (MS1, ..., MS12) are assigned to different frequency bands (F1, ..., FN).
  3. Method according to claim 2,
    wherein melscale parameters are used as frequency domain parameters (MS1, ..., MS12).
  4. Method according to any one of claims 2 or 3,
    wherein said exchange feature data (EFD) are chosen to include exchange frequency domain parameters (EMS1, ..., EMS12) and in particular exchange melscale parameters,
    to exchange frequency domain parameters (MS1, ..., MS12) and in particular melscale parameters belonging to disturbed frequency bands (F1, ..., FN) to obtain a modified set of frequency domain parameters.
  5. Method according to any one of the preceding claims,
    wherein at least a part of said exchange feature data (EFD) and in particular said exchange frequency domain parameters (EMS1, ..., EMS12) are derived and/or taken from a first acoustic model (K) operating on the frequency domain and in particular on the space of said frequency domain parameters (MS1, ..., MS12).
  6. Method according to claim 5,
    wherein a first acoustic model set (K) is used which is based on the entire frequency range (F) of the speech signal (S) or which includes submodels solely based on respective frequency bands (F1, ..., FN).
  7. Method according to any one of the claims 5 or 6,
    wherein a first acoustic model set (K) is used which is based on average and/or undisturbed speech.
  8. Method according to any one of the preceding claims,
    wherein frequency domain parameters (MS1, ..., MS12) and in particular melscale parameters are used in deriving said likelihoods.
  9. Method according to any one of the preceding claims,
    wherein time domain parameters or time domain like parameters (C1, ..., C12) are used in deriving said likelihoods.
  10. Method according to claim 9,
    wherein cepstral parameters (C1, ..., C12) are used as said time domain parameters or said time domain like parameters.
  11. Method according to any one of the preceding claims,
    wherein said likelihoods are derived using a union-model-like strategy and
    wherein it is assumed that a given and fixed number (M) of frequency bands (F1, ..., FN) lower than said number (N) of the frequency bands (F1, ..., FN) are disturbed or distorted by a band-limited and additive noise signal (NS).
  12. Method according to claim 11,
    wherein for each assumed disturbed or distorted frequency band (F1, ..., FN) corresponding or assigned frequency domain parameters (MS1, ..., MS12) and in particular corresponding melscale parameters are exchanged by exchange frequency domain parameters (EMS1, ..., EMS12) which correspond to a speech element (P1, ..., Pm) to be tested and in particular taken from said first acoustic model (K) to generate modified acoustic feature data (MAFD).
  13. Method according to claim 12,
    wherein based on said frequency domain parameters (MS1, ..., MS12) and said exchange frequency domain parameters (EMS1, ..., EMS12) said time domain parameters or time domain like parameters (C1, ..., C12) and in particular said cepstral parameters are derived, in particular by involving an inverse Fourier transform or the like.
  14. Method according to any one of the claims 11 to 13,
    wherein said likelihoods are derived based on said time domain parameters or time domain like parameters (C1, ..., C12) and in particular based on said cepstral parameters.
  15. Method according to any one of the claims 11 to 14,
    wherein single likelihoods (Lj) or substream likelihoods for each combination of fixed M assumed disturbed or distorted frequency bands (F1, ..., FN) are generated and
    wherein a total likelihood (LMtotal) is derived from said single likelihoods (Lj).
  16. Method according to any one of the claims 11 to 15,
    wherein total likelihoods (LMtotal) are derived for each M fulfilling 0 ≤ M < N and
    wherein a global likelihood (L) is derived from said total likelihoods (LMtotal) for each M and for each combination of assumed distorted or disturbed frequency bands (F1, ..., FN).
  17. Method according to any one of the preceding claims,
    wherein for deriving said likelihoods (Lj, LMtotal, L) a second acoustic model (P) operating on the time domain or time-like domain and in particular operating on the time domain parameter space, time domain like parameter space and/or the cepstral parameter space is used, evaluating the complete cepstral information.
  18. Method according to any one of the preceding claims,
    wherein the first acoustic model (K) has different complexities compared to said second acoustic model (P).
  19. Apparatus which is capable of realizing a method for pre-processing speech according to any one of the claims 1 to 18 and/or the steps thereof.
  20. Computer program product, comprising computer program means adapted to perform and/or to realize a method for pre-processing speech according to any one of the claims 1 to 18 and/or the steps thereof when it is executed on a computer, a digital signal processing means and/or the like.
EP02004143A 2002-02-25 2002-02-25 Method for pre-processing speech Withdrawn EP1339045A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP02004143A EP1339045A1 (en) 2002-02-25 2002-02-25 Method for pre-processing speech


Publications (1)

Publication Number Publication Date
EP1339045A1 true EP1339045A1 (en) 2003-08-27

Family

ID=27635830

Family Applications (1)

Application Number Title Priority Date Filing Date
EP02004143A Withdrawn EP1339045A1 (en) 2002-02-25 2002-02-25 Method for pre-processing speech

Country Status (1)

Country Link
EP (1) EP1339045A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346097A (en) * 2018-03-30 2019-02-15 上海大学 A kind of sound enhancement method based on Kullback-Leibler difference


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010021905A1 (en) * 1996-02-06 2001-09-13 The Regents Of The University Of California System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech
EP0789349A2 (en) * 1996-02-09 1997-08-13 Canon Kabushiki Kaisha Pattern matching method and apparatus and telephone system
WO2002095730A1 (en) * 2001-05-21 2002-11-28 Queen's University Of Belfast Interpretation of features for signal processing and pattern recognition

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JANCOVIC P. ET AL: "COMBINING MULTI-BAND AND FREQUENCY-FILTERING TECHNIQUES FOR SPEECH RECOGNITION IN NOISY ENVIRONMENTS", TEXT, SPEECH AND DIALOGUE. INTERNATIONAL WORKSHOP, TSD.PROCEEDINGS,, no. 1902, 13 September 2000 (2000-09-13), BERLIN HEIDELBERG, pages 265 - 270, XP008006658 *
JANCOVIC P.; MING J.: "Combining the union model and missing feature method to improve noise robustness in ASR", 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. PROCEEDINGS. (ICASSP). ORLANDO, FL, MAY 13 - 17, 2002; [IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP)], vol. 1, 13 May 2003 (2003-05-13), NEW YORK, NY : IEEE, US, pages I-69 - I-72 *
JI MING ET AL: "Union: a new approach for combining sub-band observations for noisy speech recognition", SPEECH COMMUNICATION,1-1-2001, ELSEVIER SCIENCE PUBLISHERS, vol. 34, no. 1-2, 1 January 2001 (2001-01-01), AMSTERDAM, NETHERLANDS, pages 41 - 55, XP002209287 *
MACHO D.; NADEU C.: "ON THE INTERACTION BETWEEN TIME AND FREQUENCY FILTERING OF SPEECH PARAMETERS FOR ROBUST SPEECH RECOGNITION", PROC. ICSLP '98, 1 October 1998 (1998-10-01), pages 1487 - 1490, XP007000817 *
NADEU C.; HERNANDO J.; GORRICHO M.: "ON THE DECORRELATION OF FILTER-BANK ENERGIES IN SPEECH RECOGNITION", 4TH EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. EUROSPEECH '95. MADRID, SPAIN, SEPT. 18 - 21, 1995; [EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY. (EUROSPEECH)], vol. 2, 18 September 1995 (1995-09-18), MADRID : GRAFICAS BRENS, ES, pages 1381 - 1384, XP000854958 *



Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Extension state: AL LT LV MK RO SI

17P Request for examination filed

Effective date: 20040123

AKX Designation fees paid

Designated state(s): DE FR GB

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: SONY DEUTSCHLAND GMBH

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: SONY DEUTSCHLAND GMBH

17Q First examination report despatched

Effective date: 20081105

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20090317