EP2011114A1 - Signal analysis method with non-gaussian auto-regressive model - Google Patents

Signal analysis method with non-gaussian auto-regressive model

Info

Publication number
EP2011114A1
Authority
EP
European Patent Office
Prior art keywords
model
signal
gaussian
state
analysis method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP07722544A
Other languages
German (de)
French (fr)
Inventor
Li Chunjian
Søren Vang ANDERSEN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aalborg Universitet AAU
Original Assignee
Aalborg Universitet AAU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aalborg Universitet AAU filed Critical Aalborg Universitet AAU
Publication of EP2011114A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]

Definitions

  • the present invention relates to the field of signal processing, more specifically the invention relates to signal modelling and system identification, e.g. blind system identification as used within speech analysis or communication channels with inter-symbol interference.
  • a wide range of signal processing applications and methods perform signal analysis as an important step.
  • a parametric analysis of a signal consists of designing a signal model and identifying the model parameters.
  • Blind system identification of non-Gaussian auto-regressive (AR) models is known in the art, e.g. within speech analysis. Examples of such methods are:
  • the known methods have a number of disadvantages.
  • the EMAX method, as described in 1), tries to solve the parameter estimation problem using a Maximum Likelihood (ML) criterion.
  • ML Maximum Likelihood
  • the SAR model, as described in 2), does not exploit the finite state structure in the excitation since it assumes the mean of the excitation process to be constantly zero. Also, the way it updates the AR filter coefficients is inefficient because in many signals the AR filter coefficients change more slowly than the excitation.
  • the invention provides a signal analysis method including a non-Gaussian auto-regressive model, wherein an input to the auto-regressive model is modelled as a sequence of symbols from a finite alphabet by a finite state stochastic model, where probability density functions of the input at each time instant are Gaussian probability density functions having the same variance for each symbol and with their means representing values of the symbols.
  • a preferred embodiment of the FSSM is the Hidden Markov Model (HMM).
  • HMM Hidden Markov Model
  • GMM Gaussian Mixture Model
  • HMARM Hidden Markov-Auto Regressive Model
  • GMARM Gaussian Mixture-Auto Regressive Model
  • This method is advantageous for implementation into algorithms for many applications that involve signal analysis using blind system identification of a source-filter type model, e.g. within speech analysis/processing or telecommunication. Due to the design of the HMARM/GMARM with the constraint on the variance of emission pdfs, an exact ML parameter estimation can be done by solving a set of linear equations iteratively. Thus, the method is computationally efficient and therefore suited also for miniature equipment and real time applications, e.g. hearing devices and mobile communication devices, where only a limited processing power is available and real time performance is critical.
  • the FSSM may in principle have any finite number of states, i.e. the number being an integer larger than one.
  • the FSSM is a two-state model.
  • the method performs an expectation-maximization (EM) algorithm for system identification.
  • the EM algorithm may involve performing an optimal smoothing on the signal with a multi-state minimum mean-square error smoother as the E-step.
  • a multi-state minimum mean-square error smoother is a soft-decision switching Kalman filter. The smoothing reduces noise in the signal and gives all the necessary statistics that are needed in the M-step.
  • the signal analysis method according to the first aspect may form part of an algorithm for applications such as: speech analysis, speech enhancement, blind source separation, blind channel equalization, blind channel estimation, or blind system identification.
  • the invention provides a device including a processor arranged to perform the method according to the first aspect.
  • the device may be one of: a mobile phone, a communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, and a monitoring system.
  • the device may also be one of: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, and a headphone with a built-in microphone.
  • the device may be arranged to receive a telecommunication signal, wherein the device includes a channel equalizer arranged to perform a blind channel equalization according to a method of the first aspect.
  • the invention provides a computer executable program code adapted to perform the method according to the first aspect.
  • program may be present on a computer or processor memory or represented as data on a data carrier, e.g. hard disk or mobile data carrier (CD, DVD, memory card or memory stick etc.).
  • FIG. 1 illustrating a generative data structure of a HMARM embodiment
  • FIG. 2 illustrating a generative data structure of an E-HMARM embodiment
  • Fig. 3 illustrating the shift invariance property of the HMARM method of the invention compared to the LPC method as the prior art.
  • the two graphs show: (a) log-spectral distortion (LSD) of AR spectra of 50 shifted frames, (b) synthetic signal waveform used in the experiment resulting in (a), Figs. 4a and 4b illustrating four panels of graphs with AR spectra estimated for different vowels to illustrate the result of LPC analysis according to prior art compared with the result of the HMARM according to the invention, each panel illustrating: HMARM (top), LPC analysis (middle), and original signal waveform (bottom),
  • Fig. 5 illustrating prediction residuals by the HMARM (top), by LPC analysis (middle), the example signal waveform being shown in the bottom graph
  • Fig. 7 illustrating an example of recovered symbol sequences, where dots indicate transmitted symbols, circles indicate recovered symbols by a HMARM embodiment, while stars indicate recovered symbols by a least-squares (LS) method,
  • Fig. 8 illustrating true and estimated spectra with the HMARM and the LS method for the example shown in Fig. 7, where it is noted that the HMARM result is overlapping with the true spectrum and is therefore hardly visible,
  • FIG. 9 illustrating another example of recovered symbol sequences with the same type of indications as in Fig. 7,
  • Fig. 10 illustrating true and estimated spectra for the example of Fig. 9, where the difference between the HMARM result and the true spectrum is only marginal (SNR is 15 dB),
  • FIG. 11 illustrating yet another example of recovered symbol sequences with the same type of indications as in Fig. 7,
  • Fig. 12 illustrating true and estimated spectra for the example of Fig. 11, and again the difference between the HMARM result and the true spectrum is hardly visible (SNR is 18 dB)
  • Fig. 13 illustrating a block diagram of a generative signal model used in a preferred signal analysis embodiment
  • FIG. 14 illustrating a preferred headset with speech enhancement.
  • the finite-state stochastic model is embodied as a hidden Markov model (HMM), and an exact expectation-maximization (EM) algorithm is proposed that incorporates a switching Kalman smoother which provides optimum nonlinear MMSE estimates of the system output based on the HMM.
  • exact EM algorithms can only be obtained by appropriate constraints in the model design, and have better convergence properties than algorithms employing generalized EM algorithms or empirical iterative schemes.
  • the method embodiments also provide good data efficiency since only second-order statistics are involved in the computation.
  • the signal models are general and suitable to numerous important signals, e.g. speech signals and base-band communication signals. Two system identification algorithm embodiments will be described, and experiment results will be given for embodiments with applications within speech analysis and channel equalization.
  • Embodiments of the signal analysis method according to the invention will be described applied to speech analysis, blind channel equalization of a communication channel, spectrum estimation for voiced speech in background noise, and blind noisy channel equalization.
  • the first system model consists of a linear time-invariant AR filter excited by a first- order discrete-state Hidden Markov process.
  • the AR filter models the resonant property of the vocal tract
  • a two-state Hidden Markov process models the excitation to the filter as a noisy impulse train.
  • the task of system identification here is to jointly estimate the AR coefficients and the excitation dynamics, which contains information about the impulse position, the impulse amplitude, and the noise variance, under a certain optimum criterion. By the joint estimation, the highly non-Gaussian impulse train structure of the excitation no longer affects the AR estimation as it does in the classic Least Squares (LS) solution.
  • LS Least Squares
  • the LS methods, such as the auto-correlation method, also known as the LPC analysis, assume a Gaussian signal model.
  • the consequence of the mismatch of the Gaussian model to non-Gaussian signals is an unnecessarily large variation in the estimates. This is supported by the fact that the Cramer-Rao bound for the variances of the AR estimators is lower in the non-Gaussian case than in the Gaussian case [1].
  • Estimating the AR parameters taking into account the impulse structure of the excitation can also reduce bias. This bias is present in the LPC analysis because of the spectral sampling effect of the impulse train. We will show that the AR spectra estimated by our method have smaller variance and bias and a better shift invariance property than the LPC analysis.
  • the algorithm exploits the underlying dynamics and non-Gaussianity of the finite alphabet symbol sequence to accomplish system identification.
  • An example of equalizing an MA channel is also demonstrated.
  • observation noise is taken into account.
  • the model consists of a linear time-invariant AR filter excited by a first-order discrete-state Hidden Markov process, and the measurements of the system output are perturbed by white Gaussian noise.
  • the identification algorithm must jointly estimate the AR parameters, the excitation dynamics, and the measurement noise variance.
  • the introduction of measurement noise complicates the problem significantly. This is because the simplicity of the first algorithm partly comes from the fact that the AR model aggregates the state information in the most recent system output samples, which are now hidden due to the presence of measurement noise.
  • the EM algorithm thus involves a nonlinear MMSE smoother, which provides estimates of the conditional first and second moments of the system output needed in the parameter estimations.
  • a nonlinear MMSE smoother that can be seen as a variant of the soft-decision Switching Kalman Filter [4], where the states control the discrete inputs to the AR filter, and the switching relies on the a posteriori probability of states estimated by a forward-backward algorithm.
  • the EM algorithm thus iterates between the nonlinear MMSE smoothing and the ML parameter estimations.
  • the introduction of measurement noise modeling in the second system model is a major extension to the first system model.
  • the second method is thus noise robust and applicable in adverse environments, although at the price of higher computational complexity.
  • the algorithm gives better estimates of the signal spectra than reference methods do, under moderate noise conditions.
  • Established iterative estimators based on Gaussian AR models are known to have convergence problems, thus an empirical termination is required [5] [6]. They also require prior knowledge of measurement noise statistics.
  • the proposed algorithm does not require prior knowledge of the noise statistics, and its convergence is guaranteed. Applications to channel equalization under moderate noise conditions are also demonstrated. Simulations show that the proposed algorithm has better estimates of the channel response and the transmitted symbols than the Least Squares method.
  • SAR Switching Auto Regression
  • HMM Hidden Markov Model
  • An EM algorithm is derived for parameter estimation. This is a very general formulation, since its dynamics has vector states and all parameters are time dependent. But if the system dynamics takes the specific form of all-pole filtering, which is the focus of this work, the SAR becomes less efficient since it is hard to impose any structure on the state transition matrix. Besides, the system noise in the SAR is assumed to be zero mean, while in our models the mean of the system noise is also controlled by the state, which is found particularly useful in the applications discussed in this paper.
  • the following Section introduces the two signal models and derives the EM algorithms for blind system identification.
  • the proposed algorithms are applied to solving problems in speech analysis, noise robust spectrum estimation, and blind channel equalizations with and without measurement noise.
  • This noisy impulse train structure can be characterized by a two-state symbol sequence, while an M-ary Pulse Amplitude Modulation (PAM) signal can be characterized by an M-state symbol sequence.
  • the probability density functions (pdfs) of these discrete state excitations are thus multi-modal, and possibly asymmetric (as is the case for the impulse train).
  • a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM) is suitable to characterize the statistics of such excitations.
  • GMM Gaussian Mixture Model
  • HMM Hidden Markov Model
  • HMARM Hidden Markov-Auto Regressive Model
  • GMARM Gaussian Mixture-Auto Regressive Model
  • the HMM is preferable in modeling the excitation because of its capability of modeling the underlying dynamics that is not captured by the GMM, which is a static model. Therefore, the following presentation will mainly focus on the HMARM with a brief discussion on the advantage of the HMM over the GMM in modeling temporal structure.
  • we present the HMARM and its identification without measurement noise.
  • the subsequent section deals with the identification of HMARM with its output perturbed by white Gaussian noise, which is termed the Extended-HMARM.
  • x(t) is the observed signal (system output)
  • g(k) is the kth AR coefficient
  • r(t) is the excitation.
  • the excitation is a Hidden Markov process, i.e., a first-order Markov chain υ(t) plus white Gaussian noise u(t) with zero mean and variance σ².
  • a diagram of the data structure of the HMARM is shown in Fig. 1.
  • the state q_t at time t selects, according to the state transition probability a_ij, one of M states.
  • the emission pdfs of the states are Gaussian pdfs with the same variance σ² and means m_r(j), j ∈ {1, ..., M}, respectively.
  • the emission outcome constitutes the excitation sequence r(t), which is independent of r(l) for l ≠ t and dependent only on the state q_t.
  • the excitation r(t) is then convolved with an AR(p) filter with coefficients [g(1), ..., g(p)] to produce the observation x(t).
  • From (1), (2) and (3), b_x(j, t) can be shown to be a Gaussian pdf with a time-varying mean m_x(j, t),
  • the quantity Σ_t γ(i, t) represents the expected number of transitions made from state i, and Σ_t ξ(i, j, t) represents the expected number of transitions from state i to state j [11].
  • the first term in (9) concerns only a_ij and the second term concerns the rest of the parameters. Thus the optimization can be done on the two terms separately.
  • the re-estimation equation of a_ij is found by the Lagrange multiplier method, and is identical to the standard Baum-Welch re-estimation algorithm [13]:
  • Equations (12) and (13) form p + M coupled linear equations which can be solved analytically, wherein m_x(j, t) is calculated by (5). Then (14) can be solved by inserting the estimated g(k) and m_r(j).
  • Equation (12) is a multi-state version of the orthogonality principle
  • Equation (13) tells that the prediction error weighted by state posterior is of zero mean
  • (14) calculates the mean of the prediction error power weighted by the state posterior as the variance of the stochastic element of the signal.
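  • As a non-authoritative illustration of the three statements above, re-estimation equations of the following form are consistent with them; the exact equations (12)-(14) of the original text are not reproduced here, and the notation below (state posteriors γ(j, t), excitation means m_r(j)) is assumed:

$$\sum_{t}\sum_{j}\gamma(j,t)\Big[x(t)-\sum_{k=1}^{p}g(k)\,x(t-k)-m_r(j)\Big]\,x(t-i)=0,\qquad i=1,\dots,p$$

$$\sum_{t}\gamma(j,t)\Big[x(t)-\sum_{k=1}^{p}g(k)\,x(t-k)-m_r(j)\Big]=0,\qquad j=1,\dots,M$$

$$\sigma^{2}=\frac{1}{T}\sum_{t}\sum_{j}\gamma(j,t)\Big[x(t)-\sum_{k=1}^{p}g(k)\,x(t-k)-m_r(j)\Big]^{2}$$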
  • a GMM with a similar constraint can be used in place of the HMM in our signal model, and the EM equations can be derived in the same way as shown above with proper changes in the definitions of a and γ (the ξ(i, j, t) used in the HMM is not needed in the GMM).
  • the derivation of the GMARM is briefly described in Appendix 0.4.
  • the advantage of the GMARM is a lighter computational load than that of the HMARM. However, the lack of dynamic modeling makes the GMARM converge more slowly and estimate less accurately than the HMARM whenever there is a temporal dependency in the excitation that is invisible to the GMM, since the GMM is still a static model.
  • the complexity of the HMARM identification is similar to that of the standard HMM, since the forward-backward evaluations of state posterior and the re-estimation equations are analogous to that of the HMM.
  • the extra complexity lies in solving the p+M linear equations (that is, (12) and (13)) for the AR coefficients in every iteration, which accounts for only a mild increase in complexity.
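  • The following Python sketch illustrates how such a p + M linear system could be assembled and solved in one iteration, given state posteriors gamma from a forward-backward pass (not shown). It is a hedged reading of the equations described above, not the patent's literal algorithm; the function name and the exact normal-equation form are assumptions.

```python
import numpy as np

def hmarm_m_step(x, gamma, p):
    """One parameter-update step of an HMARM-style identification (sketch).
    x     : observed signal, shape (T,)
    gamma : state posteriors gamma[t, j] from a forward-backward pass, shape (T, M)
    p     : AR order
    Returns (g, m_r, sigma2). Illustrative only."""
    T, M = gamma.shape
    t_idx = np.arange(p, T)
    X = np.column_stack([x[t_idx - k] for k in range(1, p + 1)])  # past samples x(t-1..t-p)
    G = gamma[t_idx]                                              # posteriors on the same range
    y = x[t_idx]                                                  # current samples x(t)

    # (p + M) x (p + M) linear system A @ theta = b, theta = [g(1..p), m_r(1..M)]
    A = np.zeros((p + M, p + M))
    b = np.zeros(p + M)
    A[:p, :p] = X.T @ X            # orthogonality-type equations, one per AR lag
    A[:p, p:] = X.T @ G
    b[:p] = X.T @ y
    A[p:, :p] = G.T @ X            # zero-mean weighted prediction error, one per state
    A[p:, p:] = np.diag(G.sum(axis=0))
    b[p:] = G.T @ y
    theta = np.linalg.solve(A, b)
    g, m_r = theta[:p], theta[p:]

    # posterior-weighted prediction error power as the excitation variance
    err = y[:, None] - (X @ g)[:, None] - m_r[None, :]
    sigma2 = np.sum(G * err ** 2) / len(t_idx)
    return g, m_r, sigma2
```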
  • y(t) is the observations
  • z(t) is the measurement noise
  • g(k) is the kth AR coefficient
  • r(t) is the non-Gaussian process noise, or, the filter excitation.
  • r(t) is the sum of v(t), a sequence of M-state symbols, and a white Gaussian noise sequence u(t) with zero mean and variance
  • the excitation r(t) is actually a Hidden Markov process with M states.
  • these states have Gaussian emission pdfs with means m_r(j), j ∈ {1, ..., M}, and identical variance
  • the observation noise is assumed to be white Gaussian noise with zero mean and variance
  • the mean of y(t) would be x(t) if x(t) were known. But since x(t) is not available, a proper choice for the mean of y(t) is the mean of x(t) given y.
  • So m_y(j, t) can be obtained by calculating the smoothing estimate of x(t) using the observations y and the current state q_t.
  • the variance of the emission pdf is therefore the sum of the smoothing error variance and the measurement noise variance.
  • the smoothing estimates and the error variance can be calculated with a nonlinear MMSE smoother, which will be described later. It can be summarized as follows:
  • Equation (21) follows from the first order Markovian property of the layered data model:
  • Q T involves only the top hidden layer parameters
  • Q B involves only the bottom hidden layer parameters
  • Qy involves only the visible (observation) layer parameters.
  • the maximization of the Q function can now be done by maximizing the three terms in (21) separately.
  • According to the Gaussian assumption of the observation noise, Qy can be written as:
  • the transition probability can be estimated in the same way as in the standard HMM:
  • Equations (26), (29), and (30) form a set of 1 + p + M linear equations with the same number of unknowns, and can be solved by straightforward linear algebra. Then (32) can be solved by inserting the newly updated parameter estimates.
  • the quantities needed in these equations include: the state posteriors ξ(i, j, t) and γ(i, t), which are calculated by the forward-backward algorithm; and the first and second moments of x(t), which are estimated by a nonlinear MMSE fixed-interval smoother.
  • the nonlinear MMSE smoother consists of a forward sweep and a backward sweep.
  • in the forward sweep, the M state-conditional estimates are weighted by the state a posteriori probabilities γ(i, t) to get an MMSE filtering estimate conditioned only on y.
  • the backward sweep calculates the smoothing estimates and MSE matrices using the filtering estimates and MSE matrices obtained in the forward sweep. The backward sweep equations are identical to those of the two-pass Kalman smoother, and can be found in, e.g., [14, p.572].
  • the algorithm thus iterates between the nonlinear MMSE smoother and the estimation of ξ(i, j, t) and γ(i, t).
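  • The forward sweep can be pictured with the hedged Python sketch below: a bank of M Kalman updates, one per discrete excitation state, combined with the state posteriors γ(i, t) obtained from the forward-backward algorithm. The state-space conventions, names, and the moment-matched combination are assumptions made for illustration; the backward (smoothing) sweep is omitted.

```python
import numpy as np

def switching_kalman_forward(y, g, m_r, sigma2_u, sigma2_z, gamma):
    """Soft-decision switching Kalman filtering, forward sweep only (sketch).
    y        : noisy observations, shape (T,)
    g        : AR coefficients g(1..p) of the filter
    m_r      : excitation mean per discrete state, shape (M,)
    sigma2_u : excitation (process) noise variance
    sigma2_z : measurement noise variance
    gamma    : state posteriors gamma[t, j], shape (T, M)"""
    T, M = gamma.shape
    p = len(g)
    F = np.zeros((p, p))            # companion-form transition matrix of the AR state
    F[0, :] = g
    F[1:, :-1] = np.eye(p - 1)
    h = np.zeros(p); h[0] = 1.0     # the newest sample is observed (plus noise)
    q = np.zeros(p); q[0] = 1.0     # the excitation enters the newest sample
    Q = sigma2_u * np.outer(q, q)

    m, P = np.zeros(p), np.eye(p)
    means, covs = np.zeros((T, p)), np.zeros((T, p, p))
    for t in range(T):
        m_mix, P_mix = np.zeros(p), np.zeros((p, p))
        for j in range(M):
            m_pred = F @ m + m_r[j] * q          # predict with state-dependent input mean
            P_pred = F @ P @ F.T + Q
            S = h @ P_pred @ h + sigma2_z        # innovation variance
            K = P_pred @ h / S                   # Kalman gain
            m_j = m_pred + K * (y[t] - h @ m_pred)
            P_j = P_pred - np.outer(K, h @ P_pred)
            m_mix += gamma[t, j] * m_j           # soft decision: weight by state posterior
            P_mix += gamma[t, j] * (P_j + np.outer(m_j, m_j))
        P_mix -= np.outer(m_mix, m_mix)          # moment-matched mixture covariance
        m, P = m_mix, P_mix
        means[t], covs[t] = m, P
    return means, covs
```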
  • the algorithm stacks two dynamic state estimators together, i.e., the nonlinear MMSE smoother and the HMM estimator.
  • a unifying view of the Kalman-type state estimator and the HMM state estimator can be found in [15].
  • the nonlinear smoother uses a continuous state model, where the state vector is the output of the AR(p) filter, x_{t-p+1:t}, and the state transition is ruled by the auto-regressive property of the AR(p) filter.
  • the HMM uses a discrete state model, where the states are the input symbols, and the state transition is ruled by the underlying mechanism that produces the symbols.
  • the proposed nonlinear MMSE smoother falls in the category of Switching Kalman Filter (SKF) with soft-decision, as is defined in [4]. But there are major differences between the model and identification proposed in [4] and the ones proposed here.
  • the parameters of a vector AR model (named Switching AR model, or SAR) switch their values over time, and the switching is modeled by an HMM.
  • the parameter estimation is done by an EM algorithm.
  • the major differences between the SAR and the E-HMARM are: 1) When modeling an AR(p) signal, the state transition matrix has a sparse structure, but the transition matrix in the SAR has no structure.
  • the zero-valued elements of the matrix are set to zero after the estimation of the matrix. This is analogous to projecting the estimates onto the correct parameter space.
  • the AR model is explicitly expressed in terms of AR coefficients, thus neither matrix estimation nor projection is needed.
  • the system noise process is modeled as a zero-mean process with a varying variance
  • the system noise is modeled as a process with a varying mean and a constant variance.
  • the SAR assumes no observation noise.
  • the complexity of the E-HMARM identification is significantly higher than that of the HMARM. This is due to the need for Kalman smoothing in each iteration.
  • the Kalman smoothing has a complexity of O(Tp³), whereas the complexity of the HMARM is O(TM²). Therefore, the complexity of the E-HMARM is O(Tp³).
  • Least Squares methods, such as the LPC analysis (implemented as an autocorrelation method), have been the standard methods of analyzing AR models.
  • the Gaussian assumption taken by the LS method results in simple analytic solutions. But when applied to non-Gaussian signals such as voiced speech signals, the mismatch of assumption brings in undesirably large variance and bias.
  • the large variance implies a bad shift-invariance property of the LPC analysis. This means that, when a sustained vowel is segmented into several frames, the LPC estimates of the AR parameters for each frame can be very different.
  • the synthetic signal is made by filtering a noisy impulse train with an AR(10) filter. 50 realizations of this signal are analyzed. To get the 50 realizations we shift a rectangular window along the signal by one sample at a time, 50 times. The window length is 320 samples. The estimated AR spectra of the 50 realizations are compared to the true AR spectrum, and the difference is measured by the Log-Spectral Distortion (LSD) measure.
  • LSD Log-Spectral Distortion
  • L is the number of spectral bins.
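  • A minimal sketch of such an LSD computation between a true and an estimated AR spectrum, under the assumption that the measure is the RMS difference of the log-spectra in dB over L frequency bins (the patent's exact formula is not reproduced here):

```python
import numpy as np

def log_spectral_distortion(g_true, g_est, n_bins=512):
    """RMS log-spectral difference (dB) between two AR spectra (sketch)."""
    w = np.linspace(0, np.pi, n_bins, endpoint=False)
    def ar_logspec_db(g):
        # AR magnitude spectrum: 1 / |1 - sum_k g(k) exp(-j*w*k)|  (assumed sign convention)
        k = np.arange(1, len(g) + 1)
        A = 1.0 - np.exp(-1j * np.outer(w, k)) @ np.asarray(g, dtype=float)
        return -20.0 * np.log10(np.abs(A))
    d = ar_logspec_db(g_true) - ar_logspec_db(g_est)
    return np.sqrt(np.mean(d ** 2))
```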
  • the LSD versus the shift is shown in Fig. 3. It is clear that the proposed method has a flat distortion surface and this surface is lower than that of the LPC. It is important to note that the LPC estimates encounter huge deviations from the true values in the second half of the plot. This is where a large "hump" in the signal comes into the analysis frame. The large humps in the signal are caused by the impulses in the excitation, which represent the non-Gaussian/nonlinear structure of the signal.
  • the bias of the estimates is also calculated by taking the difference between the true parameters and the sample mean of the estimates, and the variance of the estimates is calculated using the sample variance of the estimates. Table 1 compares the biases and variances of the HMARM and LPC estimates.
  • the spectra are plotted in Fig. 4.
  • the estimates by the HMARM show good consistency, while the consistency of the LPC analysis appears to be poor.
  • the residual of the HMARM analysis also has different properties than the LPC analysis.
  • in Fig. 5 we show the prediction residual of a voiced speech signal using the AR parameters estimated by the HMARM and the LPC respectively. It is clear that the residual of the HMARM has more prominent impulses, and the noise between the impulses appears to be less correlated.
  • the residual of the HMARM has a smaller L1 norm than that of the LPC analysis.
  • the proposed method provides a sparser representation of the voice signal than the one given by LPC analysis.
  • sparse representation is achieved by minimizing the L1 norm with numerical optimizations (see [16] for a review, and [17] for an application in speech analysis), or by using Bayesian inference with a super-Gaussian pdf as the prior [18].
  • the HMARM method proposed here provides a computationally simple alternative to the sparse coding of voiced speech signals.
  • Another known LS method is the covariance method [19, Ch. 5.3].
  • the covariance method is known to give more accurate estimates of the AR coefficients than the autocorrelation method when the data length is small. In our experiments, this is the case when the analysis window is rectangular. When a Hamming window is used, the covariance method gives similar results to the autocorrelation method.
  • the channel response includes the responses of the transmitter filter, the medium, the receiver filter, and the symbol-rate sampler.
  • the channel can be well characterized by an AR model, and no measurement noise is present (or, the channel has a very high SNR).
  • the transmitted symbols are quaternary PAM symbols.
  • the channel distortion is compensated and the transmitted symbols are decoded.
  • the receiver has no prior knowledge about the channel, the alphabet of the transmitted symbols, and the probability distribution of the symbols.
  • the equalization and decoding are done jointly.
  • the channel is AR(10) with coefficients
  • A [1, -1.223, -0.120, 1.016, 0.031, -0.542, -0.229, 0.659, 0.307, -0.756, 0.387].
  • the equalizer output and the estimated channel spectra are shown in Fig. 7 and Fig. 8, respectively.
  • the LS method is used as the reference method. It is clear from the figures that the recovered symbol sequence by the HMARM method coincides with the transmitted symbols very well, and the spectrum estimated by the HMARM method completely overlaps with the true channel spectrum, whereas the LS method has a much larger estimation error on both the recovered symbols and the channel spectrum. More precisely, the estimation error variance of the recovered symbol sequence is 1.06 × 10⁻²⁶ for the HMARM method and 0.36 for the LS method, which represents a 255 dB gain of the HMARM method over the LS method.
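  • For reference, the quoted gain follows directly from the ratio of the two error variances, assuming the gain is defined as 10·log10 of that ratio:

```python
import numpy as np

gain_db = 10 * np.log10(0.36 / 1.06e-26)  # LS error variance over HMARM error variance
print(round(gain_db, 1))                  # ~255.3 dB, matching the figure quoted above
```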
  • the alphabet A is the same as before, and the 3rd order MA channel coefficients are B = [1.0, -0.65, 0.06, 0.41].
  • the recovered symbol sequence is shown in Fig. 9.
  • the estimation error variance of the recovered symbol sequence is 0.0023 for the HMARM method, and 0.4212 for the LS method.
  • the gain of the HMARM method over the LS method is 22.6 dB.
  • the performance of the HMARM method degrades.
  • the gains of the HMARM method over the LS method are 27.5 dB, 17.5 dB, and 8 dB, respectively. From 30 dB down, the performance of the HMARM is similar to that of the LS method.
  • Fig. 10 shows the signal spectrum and its estimates given by the E-HMARM and LS, respectively, at 15 dB input SNR.
  • Table 2 shows the averaged values of parameters of 50 estimations. The results show that the E-HMARM algorithm gives much better estimates of the signal spectra than the LS method. The estimates of the impulse amplitude and measurement noise variance are also quite accurate.
  • the estimated process noise variance is always larger than the true value, especially when the SNR is low. This is because in the E-HMARM algorithm, the modeling error is included as part of the process noise.
  • Like all EM-type algorithms, it is possible for the E-HMARM algorithm to converge towards a local maximum. A good initialization can prevent convergence to a local maximum.
  • the LS estimates of the AR coefficients are used as initial values. The convergence criterion is set such that the iteration stops when the norm of the difference in the parameter vectors is smaller than 10⁻⁴. No divergence has ever been observed in extensive experiments.
  • the E-HMARM algorithm works best at SNRs above 15 dB. From 10 dB and below, the algorithm converges to the LS solution.
  • PPM Pulse Position Modulation
  • UWB ultra-wide-band
  • the spectrum of a PPM symbol sequence is high pass and has a strong DC component (Instead of defining the whole frame as a symbol, here we treat the pulse duration as the symbol duration.
  • a time frame consists of M symbols, and the sampler at the receiver samples M times per frame. This is why the received symbol sequence has a strong DC component and a high pass spectrum.).
  • a signal frame thus has 8 time slots, each corresponding to one symbol in the alphabet.
  • a pulse is put at the kth time slot, and zeros elsewhere.
  • the transmitted signal is modeled as a "1" at the symbol position and "0" at the other 7 positions.
  • the channel is modeled as an AR(10) filter.
  • White Gaussian noise is added to the output of the AR(10) filter.
  • the E-HMARM equalizer estimates the channel response and the noise variance, and does inverse filtering to recover the transmitted symbols.
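  • A hedged sketch of the inverse-filtering and decoding step described above: the estimated AR channel is inverted by prediction, and each excitation sample is mapped to the nearest symbol. Function and variable names are illustrative assumptions.

```python
import numpy as np

def recover_symbols(x_hat, g, alphabet):
    """Inverse-filter an estimated AR channel and make hard symbol decisions (sketch)."""
    p, T = len(g), len(x_hat)
    r_hat = np.array(x_hat, dtype=float)
    for t in range(p, T):
        past = x_hat[t - p:t][::-1]                # x(t-1), ..., x(t-p)
        r_hat[t] = x_hat[t] - np.dot(g, past)      # excitation estimate r(t)
    alphabet = np.asarray(alphabet, dtype=float)
    # nearest-symbol decision, e.g. alphabet = [0.0, 1.0] for the PPM example above
    symbols = alphabet[np.argmin(np.abs(r_hat[:, None] - alphabet[None, :]), axis=1)]
    return r_hat, symbols
```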
  • the standard LS method is used as a reference method. It is shown in Fig. 11 that the recovered symbol sequence by the E-HMARM method has much smaller error variance than that of the LS method. In Fig. 12 it is shown that the E-HMARM gives a very good estimate of the channel spectrum, while the LS estimate is far off.
  • the channel SNR in this example is 18 dB, and the signal length is 400 samples.
  • the E-HMARM equalizer works best at SNRs above 18 dB. At SNRs below 18 dB its performance degrades fast. At SNRs below 15 dB the E-HMARM algorithm converges to the LS solution.
  • This GMARM algorithm has a lighter computational load than the HMARM presented in Section 0.2.1 since the calculation of the state posterior probability has a simpler form.
  • Table 1: A comparison of the biases and variances of the estimates by the HMARM and LPC. The improvement of the HMARM over the LPC is shown in the Imprv. column.
  • Table 2: The true and estimated parameters. Results are the average of 50 estimations.
  • Fig. 13 illustrates in block diagram form a preferred embodiment of the signal model used in system identification.
  • the signal is modelled as the output of an AR filter, here illustrated as an AR(10) filter, excited by the input X.
  • the input X is modelled by a Finite State Stochastic Model FSSM.
  • the constraint on the variance is due to the fact that the input X is modelled as a sequence of symbols I selected from a finite alphabet added with stationary white Gaussian noise.
  • the parameters of this model are identified by an EM algorithm and are used in a later stage for processing of the signal.
  • the FSSM uses a Gaussian probability density function with a variance value σ₁² for the first symbol which equals the variance value σ₂² for the second symbol.
  • Mean values μ₁, μ₂ in the Gaussian probability density functions are determined by the symbols, and in this example the values 0 and 1 are selected.
  • the algorithm is suited e.g. for input signals X such as speech which can be modelled as a train of impulses, and hence only a sequence of symbols indicating "pulse” and "no pulse” is needed as input to the AR.
  • Fig. 14 illustrates in block diagram form an example of a device according to the invention, namely a headset.
  • the headset includes a digital signal processor arranged to receive a signal representing noisy speech, e.g. the voice of a remote speaker distorted by background acoustic noise in a noisy environment.
  • System identification of a signal model e.g. the one described in Fig. 13, is then performed on this input signal to estimate parameters of the model.
  • a speech enhancement procedure filters out the noise from the speech using the estimated parameters.
  • the loudspeaker produces an acoustic signal representing an enhanced version of the noisy input speech signal and thus provides the headset user with an improved speech quality.
  • the headset may include many additional signal processing features and additional electronic equipment. E.g. analog-to-digital and digital-to-analog converters are not illustrated.
  • the input noisy speech signal is analyzed using a 2-state E-HMARM model where the emission probability density functions for the two states have the same variance.
  • the noise reduction block reduces noise by performing a multi-state Kalman smoothing, given the estimated model parameters.
  • this enhanced speech signal is applied, e.g. after digital-to-analog conversion, to a loudspeaker in the headset that converts the signal to an enhanced acoustic speech signal.
  • the illustrated speech analysis and enhancement procedure may additionally or alternatively be used for the speech picked up by the microphone in the headset before it is transmitted from the headset, so as to clean speech picked up by the microphone, e.g. if the user is located in a noisy environment.
  • the individual elements of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way such as in a single unit, in a plurality of units or as part of separate functional units.
  • the invention may be implemented in a single unit, or be both physically and functionally distributed between different units and processors.

Abstract

A signal analysis method including a non-Gaussian auto-regressive model, wherein an input to the auto-regressive model (AR) is modelled as a sequence of symbols (I) from a finite alphabet by a finite state stochastic model (FSSM). Probability density functions (pdfs) of an input (X) at each time instant are Gaussian pdfs with the same variance (σ₁², σ₂²) for each symbol and with their means (μ₁, μ₂) decided by the symbols. In preferred embodiments, the FSSM is a Hidden Markov Model (HMM), which, on certain occasions depending on the signal characteristic, can be reduced to a Gaussian Mixture Model (GMM). The method preferably includes an identification step of performing an expectation-maximization (EM) algorithm. In the case that the observed signal is noisy, such an EM algorithm may involve an optimal smoothing of the noisy signal with a multi-state minimum mean-square error smoother, such as a soft-decision switching Kalman filter. The method is suited for analysis of both clean and noisy non-Gaussian AR signals that have an impulsive structure in their excitation. Due to its low complexity, it is suitable for application in equipment with limited signal processing capacity. In particular, the method may form part of an algorithm for applications such as: speech analysis, speech enhancement, blind source separation, blind channel equalization, blind channel estimation, or blind system identification.

Description

SIGNAL ANALYSIS METHOD WITH NON-GAUSSIAN AUTO-REGRESSIVE MODEL
FIELD OF THE INVENTION
The present invention relates to the field of signal processing, more specifically the invention relates to signal modelling and system identification, e.g. blind system identification as used within speech analysis or communication channels with inter-symbol interference.
BACKGROUND OF THE INVENTION
A wide range of signal processing applications and methods perform signal analysis as an important step. A parametric analysis of a signal consists of designing a signal model and identifying the model parameters. Blind system identification of non-Gaussian auto-regressive (AR) models is known in the art, e.g. within speech analysis. Examples of such methods are:
1) The method as described in "Parameter estimation for autoregressive Gaussian-Mixture processes: the EMAX algorithm", IEEE Trans. on Signal Processing, vol. 46, No. 10, pp. 2744-2756, 1998. In this method, the non-Gaussian input to the AR model is modelled by a Gaussian Mixture Model (GMM).
2) The Switching Auto Regression (SAR) model described in "Switching Kalman filters", Technical report, U. C. Berkeley, 1998, by K. Murphy. The SAR model is actually an AR model whose parameters are time varying and controlled by a Hidden Markov Model (HMM), but the mean of the excitation is a constant zero.
However, the known methods have a number of disadvantages. The EMAX method, as described in 1), tries to solve the parameter estimation problem using a Maximum Likelihood (ML) criterion. But the ML solution of this model is mathematically intractable. So it resorts to approximations and the algorithm has to run more iterations than an exact ML algorithm should do. The SAR model, as described in 2), does not exploit the finite state structure in the excitation since it assumes the mean of the excitation process to be constantly zero. Also, the way it updates the AR filter coefficients is inefficient because in many signals the AR filter coefficients change more slowly than the excitation.
Generally, the known methods or algorithms are computationally complex and therefore require a large amount of processing power to be performed in real time. Thus, such algorithms are not suited for miniature equipment and real time applications.
SUMMARY OF THE INVENTION
Thus, following the above explanation, it may be seen as an object of the present invention to provide a signal analysis method applicable for blind system identification of non-Gaussian AR signals with reduced requirement for computation power and improved speed.
In a first aspect, the invention provides a signal analysis method including a non-Gaussian auto-regressive model, wherein an input to the auto-regressive model is modelled as a sequence of symbols from a finite alphabet by a finite state stochastic model, where probability density functions of the input at each time instant are Gaussian probability density functions having the same variance for each symbol and with their means representing values of the symbols.
A preferred embodiment of the FSSM is the Hidden Markov Model (HMM). However, the HMM may be reduced, according to the signal's characteristics, to the form of a Gaussian Mixture Model (GMM). When the HMM is used, the signal model is termed the Hidden Markov-Auto Regressive Model (HMARM), and when the GMM is used, the signal model is termed the Gaussian Mixture-Auto Regressive Model (GMARM). The design of the signal model and the constraint posed on the variance of the emission pdfs give rise to very efficient blind system identification algorithms.
This method is advantageous for implementation into algorithms for many applications that involve signal analysis using blind system identification of a source-filter type model, e.g. within speech analysis/processing or telecommunication. Due to the design of the HMARM/GMARM with the constraint on the variance of emission pdfs, an exact ML parameter estimation can be done by solving a set of linear equations iteratively. Thus, the method is computationally efficient and therefore suited also for miniature equipment and real time applications, e.g. hearing devices and mobile communication devices, where only a limited processing power is available and real time performance is critical.
The FSSM may in principle have any finite number of states, i.e. the number being an integer larger than one. In a simple embodiment, the FSSM is a two-state model.
In preferred embodiments, the method performs an expectation-maximization (EM) algorithm for system identification. When the method is performed on a signal with noise, the EM algorithm may involve performing an optimal smoothing on the signal with a multi-state minimum mean-square error smoother as the E-step. Especially, such a multi-state minimum mean-square error smoother is a soft-decision switching Kalman filter. The smoothing reduces noise in the signal and gives all the necessary statistics that are needed in the M-step.
The signal analysis method according to the first aspect may form part of an algorithm for applications such as: speech analysis, speech enhancement, blind source separation, blind channel equalization, blind channel estimation, or blind system identification.
In a second aspect, the invention provides a device including a processor arranged to perform the method according to the first aspect.
Especially, the device may be one of: a mobile phone, a communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, and a monitoring system.
The device may also be one of: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, and a headphone with a built-in microphone. In a special embodiment, the device may be arranged to receive a telecommunication signal, wherein the device includes a channel equalizer arranged to perform a blind channel equalization according to a method of the first aspect.
In a third aspect the invention provides a computer executable program code adapted to perform the method according to the first aspect. Especially, such program may be present on a computer or processor memory or represented as data on a data carrier, e.g. hard disk or mobile data carrier (CD, DVD, memory card or memory stick etc.).
It is appreciated that each individual aspect or embodiment of the present invention may be combined with any of the other aspects or embodiments. These and other aspects of the invention will be apparent from the following description with reference to the described embodiments.
BRIEF DESCRIPTION OF THE FIGURES
The invention will now be described in more detail with regard to the accompanying figures. The figures show specific ways of implementing the present invention and these are not to be construed as being limiting to other possible embodiments falling within the scope of the attached claim set.
In the following reference will be given to the following figures:
Fig. 1 illustrating a generative data structure of a HMARM embodiment,
Fig. 2 illustrating a generative data structure of an E-HMARM embodiment,
Fig. 3 illustrating the shift invariance property of the HMARM method of the invention compared to the LPC method as the prior art. The two graphs show: (a) log-spectral distortion (LSD) of AR spectra of 50 shifted frames, (b) synthetic signal waveform used in the experiment resulting in (a), Figs. 4a and 4b illustrating four panels of graphs with AR spectra estimated for different vowels to illustrate the result of LPC analysis according to prior art compared with the result of the HMARM according to the invention, each panel illustrating: HMARM (top), LPC analysis (middle), and original signal waveform (bottom),
Fig. 5 illustrating prediction residuals by the HMARM (top), by LPC analysis (middle), the example signal waveform being shown in the bottom graph,
Fig. 6 illustrating a discrete time channel model with additive noise,
Fig. 7 illustrating an example of recovered symbol sequences, where dots indicate transmitted symbols, circles indicate recovered symbols by a HMARM embodiment, while stars indicate recovered symbols by a least-squares (LS) method,
Fig. 8 illustrating true and estimated spectra with the HMARM and the LS method for the example shown in Fig. 7, where it is noted that the HMARM result is overlapping with the true spectrum and is therefore hardly visible,
Fig. 9 illustrating another example of recovered symbol sequences with the same type of indications as in Fig. 7,
Fig. 10 illustrating true and estimated spectra for the example of Fig. 9, where the difference between the HMARM result and the true spectrum is only marginal (SNR is 15 dB),
Fig. 11 illustrating yet another example of recovered symbol sequences with the same type of indications as in Fig. 7,
Fig. 12 illustrating true and estimated spectra for the example of Fig. 11, and again the difference between the HMARM result and the true spectrum is hardly visible (SNR is 18 dB), Fig. 13 illustrating a block diagram of a generative signal model used in a preferred signal analysis embodiment, and
Fig. 14 illustrating a preferred headset with speech enhancement.
DETAILED DESCRIPTION OF EMBODIMENTS
In the following, the invention will be described in details, and examples of results achieved with specific embodiments will be given and compared to prior art methods.
Signal analysis embodiments will be described where the finite-state stochastic model is embodied as a hidden Markov model (HMM), and an exact expectation-maximization (EM) algorithm is proposed that incorporates a switching Kalman smoother which provides optimum nonlinear MMSE estimates of the system output based on the HMM. Exact EM algorithms can only be obtained by appropriate constraints in the model design, and have better convergence properties than algorithms employing generalized EM algorithms or empirical iterative schemes. The method embodiments also provide good data efficiency since only second-order statistics are involved in the computation. The signal models are general and suitable for numerous important signals, e.g. speech signals and base-band communication signals. Two system identification algorithm embodiments will be described, and experimental results will be given for embodiments with applications within speech analysis and channel equalization.
Embodiments of the signal analysis method according to the invention will be described applied to speech analysis, blind channel equalization of a communication channel, spectrum estimation for voiced speech in background noise, and blind noisy channel equalization.
Literature references will be indicated as [X], where X refers to a number in a reference list that can be found on pages 31-33. One of the recent trends in signal processing is to exploit non-Gaussianity or non- stationarity of the signals to accomplish tasks that are generally impossible for traditional linear estimators, e.g., blind source separation, blind channel equalization, and blind system identification. Blind system identification (BSI) solves the fundamental problem residing in most signal processing fields: estimating the system parameters from system output only. In this definition of BSI, the model selection is a preliminary step to the actual identification process. Model selection is usually done according to prior knowledge of the underlying physics of the system. So the task of the BSI is to extract a posteriori information from the system output. A good model selection should facilitate the identification process without compromising the validity of the model much.
In this work, we present two signal models that have efficient identification solutions. On one hand, they are general enough to accommodate many important signals such as speech signals and base-band communication signals in the presence of Inter-Symbol Interference (ISI). On the other hand, the efficiency of the algorithms comes from the prior knowledge of the specific signal structure carried by the model.
The first system model consists of a linear time-invariant AR filter excited by a first-order discrete-state Hidden Markov process. In the speech analysis application, the AR filter models the resonant property of the vocal tract, and a two-state Hidden Markov process models the excitation to the filter as a noisy impulse train. The task of system identification here is to jointly estimate the AR coefficients and the excitation dynamics, which contain information about the impulse position, the impulse amplitude, and the noise variance, under a certain optimum criterion. By the joint estimation, the highly non-Gaussian impulse train structure of the excitation no longer affects the AR estimation as it does in the classic Least Squares (LS) solution. The LS methods, such as the auto-correlation method, also known as the LPC analysis, assume a Gaussian signal model. The consequence of the mismatch of the Gaussian model to non-Gaussian signals is an unnecessarily large variation in the estimates. This is supported by the fact that the Cramer-Rao bound for the variances of the AR estimators is lower in the non-Gaussian case than in the Gaussian case [1]. Estimating the AR parameters taking into account the impulse structure of the excitation can also reduce bias. This bias is present in the LPC analysis because of the spectral sampling effect of the impulse train. We will show that the AR spectra estimated by our method have smaller variance and bias and a better shift invariance property than the LPC analysis. These properties are useful in a wide range of speech processing fields, such as speech coding, pitch modification, speech recognition, and speech synthesis. The identification is done through an exact EM algorithm that consists of forward-backward calculations of state posteriors and solving a small linear equation system iteratively. Initialized with the LPC estimates, using only a few dozen samples, the algorithm converges in typically 3 to 5 iterations.

Application of this model to the blind channel equalization problem is also demonstrated in this paper. To combat ISI in a dispersive channel, channel equalizers are used in many communication systems before decoding the signal. When neither the channel response nor the transmitted-symbol statistics are known a priori, hence the name blind equalization, the channel response and transmitted symbols need to be estimated jointly. Most established blind equalization methods presume the channel to be FIR. Our blind equalization method, instead, is based on an assumption of an IIR all-pole channel model with the following arguments: 1) The use of an AR channel model can reduce the computational complexity dramatically by exploiting the Markovian property of the channel; 2) In channels that exhibit a resonance property, such as wireline channels, an AR model is probably more realistic than an FIR model; 3) An AR model with a sufficiently high order can approximate any ARMA or MA model to an arbitrary accuracy, in terms of the spectral density and the covariance function [2, p.411]. To be specific, the AR filter models the channel response, and the Hidden Markov process models the sampled base-band signals. The algorithm exploits the underlying dynamics and non-Gaussianity of the finite alphabet symbol sequence to accomplish system identification. An example of equalizing an MA channel is also demonstrated.
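The LPC baseline and initializer referred to above can be sketched as follows. This is one standard autocorrelation-method implementation via the Levinson-Durbin recursion, offered only as a plausible illustration of the Least Squares estimator mentioned in the text; windowing and other implementation details are assumptions.

```python
import numpy as np

def lpc_autocorrelation(x, p):
    """LPC via the autocorrelation (Least Squares) method, Levinson-Durbin recursion (sketch)."""
    x = np.asarray(x, dtype=float)
    # biased autocorrelation estimates R(0..p)
    R = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)]) / len(x)
    a = np.zeros(p)            # predictor: x(t) ~ sum_k a[k-1] x(t-k)
    E = R[0]                   # prediction error power
    for i in range(p):
        acc = R[i + 1] - np.dot(a[:i], R[i:0:-1][:i])
        k = acc / E            # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]
        a, E = a_new, (1.0 - k * k) * E
    return a, E
```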
In the second system model, observation noise is taken into account. Now, the model consists of a linear time-invariant AR filter excited by a first-order discrete-state Hidden Markov process, and the measurements of the system output are perturbed by white Gaussian noise. The identification algorithm must jointly estimate the AR parameters, the excitation dynamics, and the measurement noise variance. The introduction of measurement noise complicates the problem significantly. This is because the simplicity of the first algorithm partly comes from the fact that the AR model aggregates the state information in the most recent system output samples, which are now hidden due to the presence of measurement noise. We adopted a layered data structure with a Markov property between layers, which is analogous to the one used in the Independent Factor Analysis [3]. The EM algorithm thus involves a nonlinear MMSE smoother, which provides estimates of the conditional first and second moments of the system output needed in the parameter estimations. We propose a nonlinear MMSE smoother that can be seen as a variant of the soft-decision Switching Kalman Filter [4], where the states control the discrete inputs to the AR filter, and the switching relies on the a posteriori probability of states estimated by a forward-backward algorithm. The EM algorithm thus iterates between the nonlinear MMSE smoothing and the ML parameter estimations.
The introduction of measurement noise modeling in the second system model is a major extension to the first system model. The second method is thus noise robust and applicable in adverse environments, although at the price of higher computational complexity. In its application to robust spectrum estimation of speech signals, the algorithm gives better estimates of the signal spectra than reference methods do, under moderate noise conditions. Established iterative estimators based on Gaussian AR models are known to have convergence problems, thus an empirical termination is required [5] [6]. They also require prior knowledge of measurement noise statistics. The proposed algorithm does not require prior knowledge of the noise statistics, and its convergence is guaranteed. Applications to channel equalization under moderate noise conditions are also demonstrated. Simulations show that the proposed algorithm has better estimates of the channel response and the transmitted symbols than the Least Squares method.

In the literature, there have been a few signal models that take the form of combinations of HMM and AR models. One is the Mixture Autoregressive Hidden Markov Model [7], where the HMM has the so-called Gaussian Autoregressive Density (GAD) as its emission pdf. Every state of the HMM is associated with k observation samples. When k is large, the GAD can be approximated with a function of the AR coefficients. Thus re-estimation equations can be derived for the HMM parameters including the AR coefficients. There are major differences between this model and our proposal: the GAD pdf is approximated based on the assumption that each state is observed via a large number of samples, while in our models, each state is observed via only one sample, and there is no approximation of the pdf involved. Another similar signal model is the Switching Auto Regression (SAR) model [4]. The SAR model is an AR model that changes all its parameters at every time instant, and the switching is controlled by an HMM. An EM algorithm is derived for parameter estimation. This is a very general formulation, since its dynamics has vector states and all parameters are time dependent. But if the system dynamics takes the specific form of all-pole filtering, which is the focus of this work, the SAR becomes less efficient since it is hard to impose any structure on the state transition matrix. Besides, the system noise in the SAR is assumed to be zero mean, while in our models the mean of the system noise is also controlled by the state, which is found particularly useful in the applications discussed in this paper.
The following Section introduces the two signal models and derives the EM algorithms for blind system identification. In the next section, the proposed algorithms are applied to solving problems in speech analysis, noise robust spectrum estimation, and blind channel equalization with and without measurement noise.
Method
We consider the stochastic source-filter model, in which a linear time invariant (LTI) filter is excited by a stochastic process with a certain statistical property. When the excitation is stationary and Gaussian, the Least Squares method provides an optimum solution to the system identification problem. Nevertheless, many important signals are far from Gaussian. Voiced speech signals and modulated communication signals transmitted through a dispersive channel are just two examples of such signals. A common characteristic of the above mentioned two non-Gaussian signals is that the excitation can be viewed as a sequence of symbols drawn from a finite alphabet, with possibly additive noise. More specifically, for voiced speech, the excitation is well modeled by an impulse train with additive white Gaussian noise [8]. This noisy impulse train structure can be characterized by a two-state symbol sequence, while an M-ary Pulse Amplitude Modulation (PAM) signal can be characterized by an M-state symbol sequence. The probability distribution functions (pdfs) of these discrete state excitations are thus multi-modal, and possibly asymmetric (as is the case for the impulse train). Based on this observation, either a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM) is suitable for characterizing the statistics of such excitations. The GMM has long been used for modeling multi-modal distributions (e.g., [1,9,10]). We contend also that if there is dependency between consecutive states, modeling the state sequence with an HMM can impose temporal constraints on the model, which is beneficial to solving the identification problem. When the non-Gaussian excitation (modeled by either an HMM or a GMM) is filtered by an AR filter, we term the system model a Hidden Markov-Auto Regressive Model (HMARM) or a Gaussian Mixture-Auto Regressive Model (GMARM), respectively. We will show in the following sections that when the emission pdfs of all states are constrained to be Gaussian pdfs with identical variances, both the HMARM and the GMARM have exact EM algorithms for their identification. However, the HMM is preferable for modeling the excitation because of its capability of modeling the underlying dynamics that is not captured by the GMM, which is a static model. Therefore, the following presentation will mainly focus on the HMARM, with a brief discussion of the advantage of the HMM over the GMM in modeling temporal structure. In the next section, we present the HMARM and its identification without measurement noise. The subsequent section deals with the identification of the HMARM with its output perturbed by white Gaussian noise, which is termed the Extended-HMARM.
The HMARM and its identification
For an AR(p) filter excited by a Hidden Markov sequence, we have the following system model:
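The display equations are not reproduced in this text; a plausible reconstruction from the definitions that follow (the equation numbers are assumed) is

    x(t) = \sum_{k=1}^{p} g(k) x(t-k) + r(t),        (1)
    r(t) = v(t) + u(t),   u(t) ~ N(0, \sigma^2),     (2)

i.e., an all-pole recursion driven by the excitation r(t).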
where x(t) is the observed signal (system output), g(k) is the kth AR coefficient, and r(t) is the excitation. The excitation is a Hidden Markov process, i.e., a first order Markov chain v(t) plus white Gaussian noise u(t) with zero mean and variance σ². A diagram of the data structure of the HMARM is shown in Fig. 1. The state q_t at time t selects one of M states according to the state transition probability a_ij. The emission pdfs of the states are Gaussian pdfs with the same variance σ² and means m_r(j), j ∈ (1, ..., M), respectively. The emission outcome constitutes the excitation sequence r(t), which is independent of r(l) for l ≠ t and dependent only on the state q_t. The excitation r(t) is then convolved with an AR(p) filter with coefficients [g(1), ..., g(p)] to produce the observation x(t). The objective of the identification algorithm is to learn the model parameters φ = [A, m_r(1), ..., m_r(M), σ², g(1), ..., g(p)] given a frame of signal with length T, where the state transition matrix is denoted by A = (a_ij), i, j ∈ (1, ..., M).
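As a concrete illustration (not part of the original text), the following Python sketch generates a synthetic signal with exactly this layered structure; the transition matrix, means, variance, and AR coefficients are arbitrary example values.

    import numpy as np

    def simulate_hmarm(T, A, m_r, sigma2, g, seed=0):
        """Generate a synthetic HMARM signal: an AR(p) filter driven by a
        hidden Markov excitation with Gaussian emissions of equal variance."""
        rng = np.random.default_rng(seed)
        M, p = len(m_r), len(g)
        q = np.zeros(T, dtype=int)
        r = np.zeros(T)
        x = np.zeros(T)
        q[0] = rng.integers(M)
        for t in range(T):
            if t > 0:
                q[t] = rng.choice(M, p=A[q[t - 1]])          # first-order Markov chain
            r[t] = m_r[q[t]] + rng.normal(0.0, np.sqrt(sigma2))  # state emission
            # AR recursion: x(t) = sum_k g(k) x(t-k) + r(t)
            x[t] = r[t] + sum(g[k] * x[t - 1 - k] for k in range(p) if t - 1 - k >= 0)
        return x, r, q

    # Example: a two-state excitation resembling a noisy impulse train
    A = np.array([[0.9, 0.1], [0.95, 0.05]])
    x, r, q = simulate_hmarm(320, A, m_r=[0.0, 1.0], sigma2=0.01, g=[1.2, -0.6])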
We now define some HMM-type notation. Let α(j, t) and β(i, t) denote the forward and backward likelihoods as defined in [11], let a_ij denote the state transition probability (from state q_t = i to state q_{t+1} = j), and let b_r(j, t) denote the emission pdf of state q_t = j observed at the intermediate layer r(t). It follows from (2) that the emission pdf b_r(j, t) takes on a Gaussian distribution
Now, let b_x(j, t) denote the emission pdf of state q_t = j observed at the observation data layer x(t). It is difficult to deduce this pdf from the top layer down to the bottom layer because of the filtering. But we can use the autoregressive property of the filter, i.e., the p most recent system outputs and the current state emission define the current output uniquely. From (1), (2) and (3), b_x(j, t) can be shown to be a Gaussian pdf with a time varying mean m_x(j, t),
where
The forward and backward likelihood inductions are given by
respectively. Now define ξ(i, j, t) to be the probability of being in state i at time t and in state j at time t + 1, i.e., ξ(i, j, t) = p(q_t = i, q_{t+1} = j | x, φ̂). One can evaluate
Define γ(i, t) = Σ_j ξ(i, j, t). It can then be shown that the quantity Σ_t γ(i, t) represents the expected number of transitions made from state i, and Σ_t ξ(i, j, t) represents the expected number of transitions from state i to state j [11].
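For concreteness, a minimal (unscaled) forward-backward sketch in Python; in practice the recursions need scaling, and b would hold the emission likelihoods b_x(j, t) evaluated from (4)-(5). The routine and its variable names are illustrative, not taken from the original text.

    import numpy as np

    def forward_backward(A, b, pi):
        """A: (M, M) transition matrix; b: (M, T) emission likelihoods b_x(j, t);
        pi: (M,) initial state probabilities. Returns gamma (M, T) and xi (M, M, T-1)."""
        M, T = b.shape
        alpha = np.zeros((M, T))
        beta = np.zeros((M, T))
        alpha[:, 0] = pi * b[:, 0]
        for t in range(1, T):                          # forward induction
            alpha[:, t] = (A.T @ alpha[:, t - 1]) * b[:, t]
        beta[:, -1] = 1.0
        for t in range(T - 2, -1, -1):                 # backward induction
            beta[:, t] = A @ (b[:, t + 1] * beta[:, t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=0, keepdims=True)      # state posterior gamma(j, t)
        xi = np.zeros((M, M, T - 1))
        for t in range(T - 1):
            xi[:, :, t] = alpha[:, t, None] * A * (b[:, t + 1] * beta[:, t + 1])[None, :]
            xi[:, :, t] /= xi[:, :, t].sum()           # pairwise posterior xi(i, j, t)
        return gamma, xi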
Now we are ready to derive the EM algorithm for identification. Let boldface letters x and q denote a frame of signal and the state vector of the corresponding frame of excitation, respectively. We define the complete data to be the concatenation of the observation data and the hidden data, (x, q), as indicated in Fig. 1. The excitation r(t) cannot be treated as hidden data because, once the parameters φ are known, r(t) is linearly dependent on the observation data. Hence we term it the intermediate data. Following the EM paradigm [12], we maximize, instead of the log-likelihood log p(x|φ) directly, the expectation of the complete data log-likelihood log p(x, q|φ) over the states q given the observation x and the current estimate of φ, which is denoted by φ̂. So the function to be maximized in each iteration is written as follows (in the following, the summation notation is abbreviated to show only the variable's name when the summation interval is over the whole range of the variable; otherwise the summation interval is shown explicitly):
The first term in (9) concerns only a_ij and the second term concerns the rest of the parameters. Thus the optimization can be done on the two terms separately. The re-estimation equation of a_ij is found by the Lagrange multiplier method, and is identical to the standard Baum-Welch re-estimation algorithm [13]:
We denote the second term of (9) by Q_b(φ, φ̂). Following (1) and (4) we can write
The re-estimation equations of the rest of the parameters are found by setting the partial derivatives of (11) with respect to the parameters to zero and solving the resulting equation system. Define γ(j, t) as before, which is now interpreted as the posterior probability of state j at time t given the observation x and φ̂. For g(k), we have p equations:
For m_r(j), we have M equations:
For σ², we get
Equations (12) and (13) form p + M coupled linear equations which can be solved analytically, wherein m_x(j, t) is calculated by (5). Then (14) can be solved by inserting the estimated g(k) and m_r(j).
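A sketch of this joint solve in Python, assuming that (12)-(13) take the weighted normal-equation form suggested by the interpretations given below; the equation layout and all variable names are reconstructions for illustration, not the original implementation.

    import numpy as np

    def m_step_linear(x, gamma, p):
        """Jointly solve for g(1..p) and m_r(1..M), then sigma^2, given the
        state posteriors gamma (shape (M, T)) and the observed frame x."""
        T = len(x)
        M = gamma.shape[0]
        n = p + M
        H = np.zeros((n, n))
        c = np.zeros(n)
        xlag = lambda t, k: x[t - k] if t - k >= 0 else 0.0
        for k in range(1, p + 1):                      # p orthogonality-type equations
            c[k - 1] = sum(x[t] * xlag(t, k) for t in range(T))
            for l in range(1, p + 1):
                H[k - 1, l - 1] = sum(xlag(t, l) * xlag(t, k) for t in range(T))
            for j in range(M):
                H[k - 1, p + j] = sum(gamma[j, t] * xlag(t, k) for t in range(T))
        for j in range(M):                             # M weighted zero-mean equations
            c[p + j] = sum(gamma[j, t] * x[t] for t in range(T))
            for l in range(1, p + 1):
                H[p + j, l - 1] = sum(gamma[j, t] * xlag(t, l) for t in range(T))
            H[p + j, p + j] = gamma[j, :].sum()
        theta = np.linalg.solve(H, c)
        g, m_r = theta[:p], theta[p:]
        # sigma^2 as the posterior-weighted mean prediction-error power
        e = np.array([x[t] - sum(g[k - 1] * xlag(t, k) for k in range(1, p + 1))
                      for t in range(T)])
        sigma2 = sum(gamma[j, t] * (e[t] - m_r[j]) ** 2
                     for j in range(M) for t in range(T)) / T
        return g, m_r, sigma2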
In this model, m_x(j, t) can be interpreted as the linear prediction of x(t) taking into account the mean of the state q_t = j. The re-estimation equations also have intuitive interpretations. In (10), a_ij equals the expected number of transitions from state i to state j divided by the expected number of transitions made from state i; equation (12) is a multi-state version of the orthogonality principle; equation (13) states that the prediction error weighted by the state posterior has zero mean; and (14) calculates the posterior-weighted mean of the prediction error power as the variance of the stochastic element of the signal. The existence of linear solutions to the maximization of the Q function makes for fast convergence. This is a direct benefit of the HMM modeling of the excitation, where the HMM is constrained to have states with identical emission variance. Without this constraint, the resulting maximization equations would be a set of nonlinear equations. GMM-based, general purpose identification methods do not have this constraint, e.g. [10]; thus they have to resort to numerical maximization of the Q function, which is known as the Generalized EM algorithm.
A GMM with a similar constraint can be used in place of the HMM in our signal model, and the EM equations can be derived in the same way as shown above with proper changes in the definitions of α and β (the ξ(i, j, t) used in the HMM is not needed in the GMM). The derivation of the GMARM is briefly described in Appendix 0.4. The advantage of the GMARM is a lighter computational load than that of the HMARM. However, the lack of dynamic modeling makes the GMARM converge more slowly and estimate less accurately than the HMARM whenever there is a temporal dependency in the excitation that is invisible to the GMM, since the GMM is a static model. Examples of this discrete-state temporal dependency include the impulse train structure in voiced speech signals and Pulse Position Modulation (PPM) signals, and trellis-coded modulation signals. Their dependencies between states can be conveniently represented by state transition matrices. The GMARM, on the other hand, cannot exploit this useful information in its estimation. For excitations that have no temporal dependency, the two algorithms perform similarly.
Remark: An advantage of this two-layer structure is that the AR model extracts the linear temporal structure from the signal, and the HMM takes care of the nonlinear (discrete-state) temporal structure overlooked by the AR model. Thus it is a more efficient way of modeling complex temporal structures than using the AR model or the HMM alone.
The complexity of the HMARM identification is similar to that of the standard HMM, since the forward-backward evaluations of state posterior and the re-estimation equations are analogous to that of the HMM. The extra complexity lies in solving the p+M linear equations (that is, (12) and (13)) for the AR coefficients in every iteration, which accounts for only a mild increase in complexity.
The Extended-HMARM and its identification
In the previous signal model, the output of the AR filter is assumed to be exactly measurable. In many applications, however, measurement noise is inevitable. To be robust against noise, the signal model needs to be extended to incorporate a noise model. Assuming stationary white Gaussian measurement noise, we have a new system model whose structure is depicted in Fig. 2. We term this model the Extended-HMARM (E-HMARM).
In this extended data model, we define two hidden data layers: the state q_t and the filter output x(t). Observe that r(t) is not hidden because it is linearly dependent on x(t). The system model can be expressed in the following equations:
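The display equations are again not reproduced here; a plausible reconstruction consistent with the definitions that follow (equation numbers assumed) is

    x(t) = \sum_{k=1}^{p} g(k) x(t-k) + r(t),        (15)
    r(t) = v(t) + u(t),                              (16)
    y(t) = x(t) + z(t),                              (17)

i.e., the HMARM of the previous section observed through additive measurement noise z(t).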
where y(t) is the observation, z(t) is the measurement noise, g(k) is the kth AR coefficient, and r(t) is the non-Gaussian process noise, or, the filter excitation. We write r(t) as the sum of v(t), a sequence of M-state symbols, and a white Gaussian noise sequence u(t) with zero mean and fixed variance. Thus the excitation r(t) is actually a Hidden Markov process with M states. In HMM terms, these states have Gaussian emission pdfs with means m_r(j), j ∈ [1, ..., M], and identical variance. The state transition matrix is denoted by A = (a_ij). The observation noise is assumed to be white Gaussian noise with zero mean and constant variance.
The HMM used here is different from the standard HMM and from the HMM used in the HMARM in that the emission pdf of the state q_t = j observed at the observation data layer is a Gaussian pdf with a time varying mean m_y(j, t) and a time varying variance. This can be written as:
From (17), the mean of y(t) would be x(t) if x(t) were known. But since x(t) is not available, a proper choice for the mean of y(t) is the mean of x(t) given y.
So m_y(j, t) can be obtained by calculating the smoothing estimate of x(t) using the observations y and the current state q_t. The variance of the emission pdf is therefore the sum of the smoothing error variance and the measurement noise variance. The smoothing estimates and the error variance can be calculated with a nonlinear MMSE smoother, which will be described later. This can be summarized as follows:
being the smoothing error variance of x(t) given q_t = j. In (19) and in the following, we use the angle bracket ⟨φ|ψ⟩ to denote the expectation of φ conditioned on ψ. The forward and backward likelihoods, denoted by α(j, t) and β(j, t), are defined in the same way as in the HMARM, and can be calculated recursively.
The parameters to be estimated are the AR coefficients, the excitation means and transition probabilities, and the process and measurement noise variances. Applying the EM methodology again, we write the Q function as follows:
Equation (21) follows from the first order Markovian property of the layered data model:
Denote the first, second, and third terms in (21) by Q_T, Q_B, and Q_Y, respectively. Thus Q_T involves only the top hidden layer parameters, Q_B involves only the bottom hidden layer parameters, and Q_Y involves only the visible (observation) layer parameters. The maximization of the Q function can now be done by maximizing the three terms in (21) separately.
According to the Gaussian assumption on the observation noise, Q_Y can be written as:
Note that all the conditional means should also be conditioned on φ̂, but this is omitted here and in the sequel for brevity.
From (15) and (21), Q_B can be written as:
where q_t = j, and j ∈ (1, ..., M). Here, the posterior mean is conditioned on both the state at the present time q_t and the observation y. Q_T can be written as:
where a_{q_{t-1} q_t} is the state transition probability (from state q_{t-1} to q_t).
Now we maximize the Q functions by setting the derivatives with respect to the parameters to zero. For the first parameter we get:
from which we get
For the AR parameters g(k), we get:
Here, the posterior probability of the state being j at time t appears, and it is denoted in the sequel by γ(j, t). In (27), the posterior mean ⟨ · | q_t = j, y⟩ summed over the states and weighted by the state posterior can be expressed as the posterior mean conditioned only on y. That is,
Therefore, (27) can be re-written as
For m_r(j) we have
For we have
from which we get
where
The transition probability can be estimated in the same way as in the standard HMM:
where ξ(i, j, t) and γ(i, t) are defined in the same way as in the HMARM.
Equations (26), (29), and (30) constitute a set of 1 + p + M linear equations with the same number of unknowns, and can be solved by straightforward linear algebra. Then (32) can be solved by inserting the newly updated parameter estimates. The quantities needed in these equations include: the state posteriors ξ(i, j, t) and γ(i, t), which are calculated by the forward-backward algorithm; and the first and second moments of x(t), which are estimated by a nonlinear MMSE fixed-interval smoother.
The nonlinear MMSE smoother consists of a forward sweep and a backward sweep. In the forward sweep, at time t, a Kalman filter produces M estimates of the mean and correlation matrix of x(t) conditioned on q_t = j, j = 1, ..., M, and y. We combine the M estimates weighted by the state a posteriori probabilities γ(j, t) to get an MMSE filtering estimate conditioned only on y. Then the backward sweep calculates the smoothing estimates and MSE matrices using the filtering estimates and MSE matrices obtained in the forward sweep. The backward sweep equations are identical to those of the two-pass Kalman smoother, and can be found in, e.g., [14, p. 572]. The algorithm thus iterates between the nonlinear MMSE smoother and the estimation of φ and γ(i, t).
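As an illustration of the soft-decision combination step at a single time instant (a sketch under assumed variable names, not the original implementation):

    import numpy as np

    def combine_soft_decision(means, covs, gamma_t):
        """means[j], covs[j]: Kalman filtering estimate and MSE matrix of the state
        vector conditioned on q_t = j; gamma_t[j]: state posterior gamma(j, t).
        Returns the MMSE estimate conditioned only on y and its MSE matrix."""
        x_hat = sum(gamma_t[j] * means[j] for j in range(len(means)))
        # mixture MSE: posterior-weighted covariances plus the spread of the means
        P = sum(gamma_t[j] * (covs[j] + np.outer(means[j] - x_hat, means[j] - x_hat))
                for j in range(len(means)))
        return x_hat, P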
The algorithm stacks two dynamic state estimators together, i.e., the nonlinear MMSE smoother and the HMM estimator. A unifying view of the Kalman-type state estimator and the HMM state estimator can be found in [15]. The nonlinear smoother uses a continuous state model, where the state vector is the output of the AR(p) filter, x_{t-p+1:t}, and the state transition is ruled by the auto-regressive property of the AR(p) filter. The HMM uses a discrete state model, where the states are the input symbols, and the state transition is ruled by the underlying mechanism that produces the symbols.
Remark: The proposed nonlinear MMSE smoother falls in the category of Switching Kalman Filters (SKF) with soft decision, as defined in [4]. But there are major differences between the model and identification proposed in [4] and the ones proposed here. In [4], the parameters of a vector AR model (named the Switching AR model, or SAR) switch their values over time, and the switching is modeled by an HMM. The parameter estimation is done by an EM algorithm. The major differences between the SAR and the E-HMARM are: 1) When modeling an AR(p) signal, the state transition matrix has a sparse structure, but the transition matrix in the SAR has no structure. Thus in the SAR identification, the zero-valued elements of the matrix are set to zero after the estimation of the matrix. This is analogous to projecting the estimates onto the correct parameter space. In the E-HMARM, the AR model is explicitly expressed in terms of AR coefficients, thus neither matrix estimation nor projection is needed. 2) In the SAR model, the system noise process is modeled as a zero-mean process with a varying variance, whereas in the E-HMARM, the system noise is modeled as a process with a varying mean and a constant variance. 3) The SAR assumes no observation noise.
The complexity of the E-HMARM identification is significantly higher than that of the HMARM. This is due to the need for Kalman smoothing in each iteration. The Kalman smoothing has a complexity of O(Tp³), whereas the complexity of the HMARM is O(TM²). Therefore, the complexity of the E-HMARM is O(Tp³).
Applications and results
We apply the proposed system models and their identification algorithms to tackle problems in speech analysis and channel equalization. In the speech analysis examples, we show that the proposed non-Gaussian AR system identification method can provide better estimates of the AR coefficients, and a better structured residual, than those given by the classical LPC analysis. We also show that under mild noise conditions, robust AR analysis can be achieved without knowing the noise variance. In the channel equalization examples, we show that joint channel estimation and symbol estimation can be done efficiently and to a high accuracy when the SNR is high. When the SNR is moderate, the joint estimation can be done at the cost of extra computational complexity.
Efficient non-Gaussian speech analysis
In a vast variety of speech processing applications, AR coefficients or AR spectra, and the linear prediction residual, need to be calculated. Least Squares methods, such as the LPC analysis (implemented as an autocorrelation method), have been the standard methods for analyzing AR models. The Gaussian assumption taken by the LS method results in simple analytic solutions. But when applied to non-Gaussian signals such as voiced speech signals, the mismatch of assumptions brings in undesirably large variance and bias. The large variance implies a bad shift-invariance property of the LPC analysis. This means that, when a sustained vowel is segmented into several frames, the LPC estimates of the AR parameters for each frame can be very different. This causes, as an example, more bits than necessary to be transmitted in a CELP coding application, and difficulty in interpolating a missing frame in a packet loss concealment application. Here we apply the HMARM method to AR analysis, and compare the bias and the variance of the estimates to those given by the LPC analysis. With both the LPC analysis and the HMARM analysis, the AR(10) coefficients are estimated. This model order is selected because the vocal tract system is typically a 10th order AR system [8, p. 285]. For voiced speech analysis, we used a two-state HMARM, since the impulse train structure in the excitation is best modeled by a two-state HMM. First, we use a synthetic signal that resembles a sustained voiced speech signal. The synthetic signal is made by filtering a noisy impulse train with an AR(10) filter. 50 realizations of this signal are analyzed. To get the 50 realizations we shift a rectangular window along the signal one sample at a time, 50 times. The window length is 320 samples. The estimated AR spectra of the 50 realizations are compared to the true AR spectrum, and the difference is measured by the Log-Spectral Distortion (LSD), i.e., the root-mean-square difference between the true and estimated log power spectra over L spectral bins.
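The exact formula used in the original is not reproduced in this text; assuming the standard definition just stated, a small Python sketch of the measure (with the AR models given as denominator coefficient vectors) is:

    import numpy as np
    from scipy.signal import freqz

    def lsd_db(a_true, a_est, L=256):
        """Log-spectral distortion between two all-pole spectra 1/|A(e^jw)|^2,
        evaluated on L frequency bins; a_true and a_est are the AR polynomials
        [1, -g(1), ..., -g(p)]."""
        _, H_true = freqz([1.0], a_true, worN=L)
        _, H_est = freqz([1.0], a_est, worN=L)
        d = 10.0 * np.log10(np.abs(H_true) ** 2) - 10.0 * np.log10(np.abs(H_est) ** 2)
        return np.sqrt(np.mean(d ** 2))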
The LSD versus the shift is shown in Fig. 3. It is clear that the proposed method has a flat distortion surface and that this surface is lower than that of the LPC. It is important to note that the LPC estimates deviate hugely from the true values in the second half of the plot. This is where a large "hump" in the signal comes into the analysis frame. The large humps in the signal are caused by the impulses in the excitation, which represent the non-Gaussian/nonlinear structure of the signal. The bias of the estimates is calculated by taking the difference between the true parameters and the sample mean of the estimates, and the variance of the estimates is calculated as the sample variance of the estimates. Table 1 compares the biases and variances of the HMARM and LPC estimates.
Now, we test the shift-invariance property with real speech signals. For real speech signals, there is an implementation issue that needs to be pointed out. Since the HMARM is a causal dynamic model, and the analysis is usually frame-based, the ringing from the last impulse of the previous frame has an undesired impact on the current frame estimates. This is because the estimator does not see the previous impulse although its effect persists. This can sometimes degrade the performance mildly. We therefore suggest a pre-processing step that removes the ringing from the previous frame, or simply setting the signal before the first impulse to zero. The latter is used in our experiments. The AR spectra of four different voiced phonemes are estimated 50 times with a one sample shift each time. The frame length is set to 256 samples. The spectra are plotted in Fig. 4. The estimates by the HMARM show good consistency, while the consistency of the LPC analysis appears to be poor. We observed the same tendency when we varied the segment length and compared the estimates from different data lengths. These results show that the LPC analysis is sensitive to the difference in the waveforms of different realizations of the same process, while the HMARM is significantly less sensitive. The residual of the HMARM analysis also has different properties from that of the LPC analysis. In Fig. 5 we show the prediction residual of a voiced speech signal using the AR parameters estimated by the HMARM and the LPC, respectively. It is clear that the residual of the HMARM has more prominent impulses, and the noise between the impulses appears to be less correlated. In general, the residual of the HMARM has a smaller L1 norm than that of the LPC analysis. From a sparse coding point of view, the proposed method provides a sparser representation of the voiced signal than the one given by the LPC analysis. Traditionally, sparse representation is achieved by minimizing the L1 norm with numerical optimizations (see [16] for a review, and [17] for an application in speech analysis), or by using Bayesian inference with a super-Gaussian pdf as prior [18]. The HMARM method proposed here provides a computationally simple alternative for the sparse coding of voiced speech signals. Another known LS method is the covariance method [19, Ch. 5.3]. The covariance method is known to give more accurate estimates of the AR coefficients than the autocorrelation method when the data length is small. In our experiments, this is the case when the analysis window is rectangular. When a Hamming window is used, the covariance method gives similar results to the autocorrelation method.
Blind channel equalization
We consider a discrete-time communication channel model as shown in Fig. 6, where the channel response includes the response of the transmitter filter, the medium, the receiver filter, and the symbol-rate sampler. We assume that the channel can be well characterized by an AR model, and that no measurement noise is present (or, the channel has a very high SNR). The transmitted symbols are quaternary PAM symbols. At the receiver end, the channel distortion is compensated and the transmitted symbols are decoded. The receiver has no prior knowledge about the channel, the alphabet of the transmitted symbols, or the probability distribution of the symbols.
Using the HMARM, the equalization and decoding are done jointly. In the first experiment, 200 symbols generated randomly from a four-symbol alphabet A = {−3, −1, 1, 3} are transmitted. The channel is AR(10) with coefficients
A = [1, −1.223, −0.120, 1.016, 0.031, −0.542, −0.229, 0.659, 0.307, −0.756, 0.387]. The equalizer output and the estimated channel spectra are shown in Fig. 7 and Fig. 8, respectively. Here we again use the LS method as the reference method. It is clear from the figures that the symbol sequence recovered by the HMARM method coincides with the transmitted symbols very well, and the spectrum estimated by the HMARM method completely overlaps with the true channel spectrum, whereas the LS method has a much larger estimation error on both the recovered symbols and the channel spectrum. More precisely, the estimation error variance of the recovered symbol sequence is 1.06 × 10⁻²⁶ for the HMARM method and 0.36 for the LS method, which represents a 255 dB gain of the HMARM method over the LS method.
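To make the simulation setup concrete, the following Python sketch reproduces the transmit side of this experiment; the alphabet and channel coefficients are taken from the text, while the interpretation of the coefficient vector as the all-pole denominator, and everything else, is an illustrative assumption.

    import numpy as np
    from scipy.signal import lfilter

    rng = np.random.default_rng(0)
    alphabet = np.array([-3.0, -1.0, 1.0, 3.0])              # quaternary PAM alphabet
    symbols = rng.choice(alphabet, size=200)                  # 200 random symbols
    a = [1, -1.223, -0.120, 1.016, 0.031, -0.542, -0.229,     # AR(10) channel polynomial
         0.659, 0.307, -0.756, 0.387]
    received = lfilter([1.0], a, symbols)                     # all-pole channel filtering
    # A blind equalizer must estimate 'a' and recover 'symbols' from 'received' alone.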
In the second experiment, we consider an FIR channel model. In most of the channel equalization literature, channels are modeled by MA models. A major advantage of MA modeling in channel equalization is the simplicity of algorithm design, whereas most realistic channels have both an MA part and an AR part. When the channel response is IIR, the drawback of an MA model is obvious: it requires a very large number of coefficients to approximate an IIR channel, while the AR model can approximate an MA channel with only a mildly larger order. Equalization of MA channels using AR models has been shown before, e.g. [10]. In this example we use the same experimental setup as in [10] to demonstrate the applicability of our method to MA channel equalization. The alphabet A is the same as before, and the 3rd order MA channel coefficients are B = [1.0, −0.65, 0.06, 0.41]. The recovered symbol sequence is shown in Fig. 9. The estimation error variance of the recovered symbol sequence is 0.0023 for the HMARM method, and 0.4212 for the LS method. The gain of the HMARM method over the LS method is 22.6 dB.
When white Gaussian measurement noise exists in the system, the performance of the HMARM method degrades. For channel SNRs of 60 dB, 50 dB, and 40 dB, the gains of the HMARM method over the LS method are 27.5 dB, 17.5 dB, and 8 dB, respectively. From 30 dB down, the performance of the HMARM is similar to that of the LS method.
Noise robust spectrum estimation for voiced speech
When measurement noise is present in an AR system, the classic Least Squares method performs poorly because it includes no noise modeling. The LS method can be extended to model both process noise and measurement noise; this is known as the Extended Least Squares (XLS) method [20]. Examples of Gaussian AR model identification are given in [20]. On another thread, EM-type AR model estimation in noisy environments has been extensively studied, especially in the speech processing literature. Pioneered by Lim and Oppenheim [5], and followed by Hansen and Clements [21], Weinstein and Oppenheim [22], Gannot [23], etc., the paradigm of EM-type algorithms is an iterative ML or MAP estimation. These algorithms are all based on a Gaussian signal assumption and succeed in achieving noise robust AR estimation with low complexity. Yet a common drawback of the Gaussian EM-type algorithms is that convergence is not guaranteed. Often an empirical stopping criterion is needed, or certain constraints based on knowledge of speech signals are needed [21]. Using the E-HMARM method, we show that the observation noise strength, the AR parameters, and the excitation statistics of a voiced speech signal can be jointly estimated, and convergence is guaranteed.
The synthetic signal used in Section Efficient non-Gaussian speech analysis is added with white Gaussian noise, such that the SNR equals 15 dB and 20 dB. Fig. 10 shows the signal spectrum and its estimates given by the E-HMARM and LS, respectively, at 15 dB input SNR. Table 2 shows the averaged parameter values over 50 estimations. The results show that the E-HMARM algorithm gives much better estimates of the signal spectra than the LS method. The estimates of the impulse amplitude and the measurement noise variance are also quite accurate. The estimated process noise variance is always larger than the true value, especially when the SNR is low. This is because, in the E-HMARM algorithm, the modeling error is included as part of the process noise.
Like all EM-type algorithms, it is possible for the E-HMARM algorithm to converge towards a local maximum. A good initialization can prevent convergence to a local maximum. In our implementation of the E-HMARM algorithm, the LS estimates of the AR coefficients are used as initial values. The convergence criterion is set such that the iteration stops when the norm of the difference between consecutive parameter vectors is smaller than 10⁻⁴. No divergence has been observed in extensive experiments. The E-HMARM algorithm works best at SNRs above 15 dB. From 10 dB and below, the algorithm converges to the LS solution.
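Schematically, the outer loop just described can be sketched in Python as follows; e_step, m_step and lpc_init are placeholders standing in for the routines derived above, not functions defined in the text.

    import numpy as np

    def run_em(y, p, M, e_step, m_step, lpc_init, tol=1e-4, max_iter=100):
        """Outer EM loop: LS/LPC initialization and a stopping rule on the
        norm of the change in the parameter vector."""
        theta = lpc_init(y, p, M)                 # LS estimates of g(k) as initial values
        for _ in range(max_iter):
            stats = e_step(y, theta)              # posteriors and smoothed moments
            theta_new = m_step(y, stats)          # re-estimation equations
            converged = np.linalg.norm(theta_new - theta) < tol
            theta = theta_new
            if converged:
                break
        return theta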
Blind noisy channel equalization
In Section Blind channel equalization we showed the performance of the HMARM blind channel equalization in a high SNR communication system. We now show that in a lower SNR range, the E-HMARM algorithm can do the job better. In this example, we consider a Pulse Position Modulation (PPM) signal. PPM is a modulation scheme in which M messages are encoded by transmitting a single pulse in one of M possible time-shifts within a time frame. PPM is typically used in optical communications and recently in ultra-wide-band (UWB) systems [24] and indoor infrared communications [25]. PPM is known to be vulnerable to ISI because of its very large signal bandwidth, and equalization is necessary for high speed transmission. Different from the white spectrum of a PAM symbol sequence, the spectrum of a PPM symbol sequence is high pass and has a strong DC component. (Instead of defining the whole frame as a symbol, here we treat the pulse duration as the symbol duration. Thus a time frame consists of M symbols, and the sampler at the receiver samples M times per frame. This is why the received symbol sequence has a strong DC component and a high pass spectrum.) The smaller the M, the more high pass the spectrum. This imposes a difficulty on the system identification, i.e., the auto-correlation in the symbol sequence can be absorbed into the AR spectrum estimates, resulting in biased estimates of the channel response. In the E-HMARM, this difficulty can be circumvented by exploiting the known symbol amplitudes. That is, if the transmitted symbol amplitudes are known to the receiver, as is the case in most communication systems, we can constrain the m_r to be equal to the known values. This not only speeds up the convergence, but also makes the algorithm robust against the non-whiteness of the symbol sequence.
In the experiment, the transmitted symbols are randomly generated from an M-ary alphabet with M = 8. A signal frame thus has 8 time slots, each corresponding to one symbol in the alphabet. When the kth symbol is to be transmitted, a pulse is put at the kth time slot, and zeros elsewhere. We again use an equivalent discrete-time channel model to simplify the analysis. Without loss of generality, the transmitted signal is modeled as a "1" at the symbol position and a "0" at the other 7 positions. The channel is modeled as an AR(10) filter. White Gaussian noise is added to the output of the AR(10) filter. The E-HMARM equalizer estimates the channel response and the noise variance, and performs inverse filtering to recover the transmitted symbols. The standard LS method is used as a reference method. It is shown in Fig. 11 that the symbol sequence recovered by the E-HMARM method has a much smaller error variance than that of the LS method. In Fig. 12 it is shown that the E-HMARM gives a very good estimate of the channel spectrum, while the LS estimate is far off. The channel SNR in this example is 18 dB, and the signal length is 400 samples. The E-HMARM equalizer works best at SNRs above 18 dB. At SNRs below 18 dB its performance degrades fast. At SNRs below 15 dB the E-HMARM algorithm converges to the LS solution.
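A minimal Python sketch of this PPM mapping (illustrative only; the random messages are arbitrary example values):

    import numpy as np

    def ppm_encode(messages, M=8):
        """Map each M-ary message to a frame of M time slots containing a single
        '1' at the message position and '0' elsewhere, as described above."""
        frames = np.zeros((len(messages), M))
        frames[np.arange(len(messages)), messages] = 1.0
        return frames.ravel()                     # symbol-rate sequence seen by the channel

    tx = ppm_encode(np.random.default_rng(0).integers(0, 8, size=50))  # 50 frames = 400 samples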
Here we show how to combine a GMM with an AR model in the two-layer data structure. The forward-backward algorithm used in HMM parameter learning is a convenient and insightful way of calculating the state posterior probability, so we can modify the HMM learning algorithm to obtain a GMM learning algorithm.
Assume the GMM has M Gaussian terms. Denote the vector of the weights of the Gaussian terms by A = [a_j], where j ∈ 1, ..., M. Denote the emission pdf given the state q_t = j by b_x(j, t). Define the forward and backward likelihoods α(j, t) and β(i, t) in the same way as in the HMM. The induction equations can then be written, analogously to those of the HMM, as: and
Now, we can derive the EM algorithm. The Q function can be written as
Comparing (38) with (9), only the first terms are different. So all the re-estimation equations are identical except the one for a_j. For a_j we have the following re-estimation equation:
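For illustration, a sketch of the corresponding GMM-style posterior and weight update, assuming the standard mixture-model form (the exact re-estimation equation is not reproduced in this text):

    import numpy as np

    def gmm_posteriors_and_weights(a, b):
        """a: (M,) mixture weights; b: (M, T) emission likelihoods b_x(j, t).
        With no state memory, the posterior of component j at time t is a
        normalized weighted likelihood, and the weight re-estimate is its
        time average."""
        post = a[:, None] * b                      # a_j * b_x(j, t)
        post /= post.sum(axis=0, keepdims=True)    # gamma(j, t)
        a_new = post.mean(axis=1)                  # re-estimated mixture weights
        return post, a_new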
This GMARM algorithm has a lighter computational load than the HMARM presented in Section 0.2.1, since the calculation of the state posterior probability has a simpler form.
Table 1: A comparison of the biases and variances of the estimates by the HMARM and LPC. The improvement of the HMARM over the LPC is shown in the Imprv. column.
Table 2: The true and estimated parameters. Results are the average of 50 estimations.
Reference list
[1] D. Sengupta and S. Kay, "Efficient estimation of parameters for non-Gaussian autoregressive processes," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 37, no. 6, pp. 785-794, 1989.
[2] T. W. Anderson, The Statistical Analysis of Time Series. New York: John Wiley & Sons, 1973.
[3] H. Attias, "Independent factor analysis," Neural Computation, vol. 13, no. 4, pp. 803-851, 1999.
[4] K. Murphy, "Switching Kalman filters," Technical report, U. C. Berkeley, 1998.
[5] J. S. Lim and A. V. Oppenheim, "All-pole modeling of degraded speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-26, pp. 197-209, June 1978.
[6] J. D. Gibson, B. Koo, and S. D. Gray, "Filtering of colored noise for speech enhancement," IEEE Trans. on Signal Processing, vol. 39, pp. 1732-1742, 1991.
[7] B.-H. Juang and L. R. Rabiner, "Mixture autoregressive Hidden Markov Models for speech signals," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-33, no. 6, pp. 1404-1413, 1985.
[8] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. Wiley-Interscience-IEEE, 1993.
[9] Y. Zhao, X. Zhuang, and S.-J. Ting, "Gaussian mixture density modeling of non-Gaussian source for autoregressive process," IEEE Trans. on Signal Processing, vol. 43, no. 4, pp. 894-903, 1995.
[10] S. M. Verbout, J. M. Ooi, J. T. Ludwig, and A. V. Oppenheim, "Parameter estimation for autoregressive Gaussian-mixture processes: the EMAX algorithm," IEEE Trans. on Signal Processing, vol. 46, no. 10, pp. 2744-2756, 1998.
[11] L. R. Rabiner and B. H. Juang, "An introduction to Hidden Markov Models," IEEE ASSP Magazine, pp. 4-16, Jan. 1986.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc., Series B, pp. 1-38, 1977.
[13] L. E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains," Ann. Math. Stat., vol. 37, pp. 1554-1563, 1966.
[14] G. Strang and K. Borre, Linear Algebra, Geodesy and GPS. Wellesley-Cambridge Press, 1997.
[15] S. Roweis and Z. Ghahramani, "A unifying review of linear Gaussian models," Neural Computation, vol. 11, no. 2, 1999.
[16] Y. Li, A. Cichocki, and S.-I. Amari, "Analysis of sparse representation and blind source separation," Neural Computation, vol. 16, pp. 1193-1234, 2004.
[17] M. Namba, H. Kamata, and Y. Ishida, "Neural networks learning with L1 criteria and its efficiency in linear prediction of speech signals," Proc. ICSLP '96, vol. 2, pp. 1245-1248, 1996.
[18] B. Olshausen and D. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1," Vision Research, vol. 37, no. 23, pp. 3311-3325, 1997.
[19] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. Wiley-Interscience, 2002.
[20] A. Yeredor, "The extended least squares criterion: minimization algorithms and applications," IEEE Trans. on Signal Processing, vol. 49, no. 1, pp. 74-86, 2000.
[21] J. H. L. Hansen and M. A. Clements, "Constrained iterative speech enhancement with application to speech recognition," IEEE Trans. Signal Processing, vol. 39, pp. 795-805, 1991.
[22] E. Weinstein, A. V. Oppenheim, and M. Feder, "Signal enhancement using single and multi-sensor measurements," RLE Tech. Rep. 560, MIT, Cambridge, MA, pp. 1-14, 1990.
[23] S. Gannot, D. Burshtein, and E. Weinstein, "Iterative and sequential Kalman filter-based speech enhancement algorithms," IEEE Trans. on Speech and Audio, vol. 6, pp. 373-385, July 1998.
[24] K. Siwiak, P. Withington, and S. Phelan, "Ultra-wide band radio: The emergence of an important new technology," IEEE Proc. Veh. Tech. Conf., vol. 2, pp. 1169-1172, 2001.
[25] M. D. Audeh, J. M. Kahn, and J. R. Barry, "Performance of pulse-position modulation on measured non-directed indoor infrared channels," IEEE Trans. Communications, vol. 44, no. 6, pp. 654-659, 1996.
Fig. 13 illustrates in block diagram form a preferred embodiment of the signal model used in system identification. The signal is modelled as the output of an AR filter, here illustrated as an AR(10) filter, excited by the input X. The input X is modelled by a Finite State Stochastic Model FSSM. Here the preferred embodiment of the FSSM is an HMM with 2 states, wherein the emission pdfs of the HMM are Gaussian pdfs with the same variance σ1² = σ2². The constraint on the variance is due to the fact that the input X is modelled as a sequence of symbols I selected from a finite alphabet with added stationary white Gaussian noise. The parameters of this model are identified by an EM algorithm and are used in a later stage for processing of the signal.
The FSSM according to this embodiment uses Gaussian probability density functions with a variance value σ1² for the first symbol which equals the variance value σ2² for the second symbol. Mean values μ1, μ2 in the Gaussian probability density functions are determined by the symbols, and in this example the values 0 and 1 are selected. With this choice of mean values μ1, μ2, the algorithm is suited e.g. for input signals X such as speech which can be modelled as a train of impulses, and hence only a sequence of symbols indicating "pulse" and "no pulse" is needed as input to the AR.
Fig. 14 illustrates in block diagram form an example of a device according to the invention, namely a headset. The headset includes a digital signal processor arranged to receive a signal representing noisy speech, e.g. the voice of a remote speaker distorted by background acoustic noise in a noisy environment. System identification of a signal model, e.g. the one described in Fig. 13, is then performed on this input signal to estimate parameters of the model. Then, a speech enhancement procedure filters out the noise from the speech using the estimated parameters.
Finally, the loudspeaker produces an acoustic signal representing an enhanced version of the noisy input speech signal and thus provides the headset user with an improved speech quality. For simplicity, only the vital parts with respect to speech analysis and enhancement are illustrated, however, it is to be understood that the headset may include many additional signal processing features and additional electronic equipment. E.g. analog-to-digital and digital-to-analog converters are not illustrated.
The input noisy speech signal is analyzed using a 2-state E-HMARM model where the emission probability density functions for the two states have the same variance. The noise reduction block reduces noise by performing a multi-state Kalman smoothing, given the estimated model parameters. Finally, this enhanced speech signal is applied, e.g. after digital-to-analog conversion, to a loudspeaker in the headset that converts the signal to an enhanced acoustic speech signal.
In the headset of Fig. 14, the illustrated speech analysis and enhancement procedure may additionally or alternatively be used for the speech picked up by the microphone in the headset before being transmitted from the headset, so as to clean speech picked up by the microphone, e.g. if the user is located in a noisy environment.
In general, it is to be understood that the methods and devices described can be implemented by means of hardware, software, firmware or any combination of these. The invention or some of the features thereof can also be implemented as software running on one or more data processors and/or digital signal processors.
The individual elements of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way such as in a single unit, in a plurality of units or as part of separate functional units. The invention may be implemented in a single unit, or be both physically and functionally distributed between different units and processors.
Although the present invention has been described in connection with the specified embodiments, it should not be construed as being in any way limited to the presented examples. The scope of the present invention is to be interpreted in the light of the accompanying claim set. In the context of the claims, the terms "comprising" or "comprises" do not exclude other possible elements or steps. Also, the mentioning of references such as "a" or "an" etc. should not be construed as excluding a plurality. The use of reference signs in the claims with respect to elements indicated in the figures shall also not be construed as limiting the scope of the invention. Furthermore, individual features mentioned in different claims may possibly be advantageously combined, and the mentioning of these features in different claims does not exclude that a combination of features is possible and advantageous.

Claims

1. A signal analysis method including a non-Gaussian auto-regressive model (AR), wherein an input (X) to the auto-regressive model (AR) is modelled as a sequence of symbols (I) from a finite alphabet by a finite state stochastic model (FSSM), where probability density functions of the input at each time instant are Gaussian probability density functions having the same variance (σ1², σ2²) for each symbol and with their means (μ1, μ2) representing values of the symbols.
2. Signal analysis method according to claim 1, wherein the input (X) of the auto- regressive model (AR) is modelled as a sequence of symbols (I) from the finite alphabet added with Gaussian noise.
3. Signal analysis method according to claim 1 or 2, wherein the finite state stochastic model (FSSM) is a Hidden Markov Model.
4. Signal analysis method according to claim 3, wherein the Hidden Markov Model has the form of a Gaussian Mixture Model.
5. Signal analysis method according to any of the preceding claims, wherein the finite state stochastic model is a two-state model.
6. Signal analysis method according to any of the preceding claims, including an identification step of performing an expectation-maximization algorithm.
7. Signal analysis method according to claim 6, wherein the expectation- maximization algorithm involves performing a smoothing on the signal with a multi-state minimum mean-square error smoother.
8. Signal analysis method according to claim 7, wherein the multi-state minimum mean-square error smoother is a soft-decision switching Kalman filter.
9. A speech analysis method including the signal analysis method according to any of claims 1-8.
10. A speech enhancement method including the method according to any of claims 1-8.
11. A blind source separation method according to any of claims 1-8.
12. A blind channel equalization method according to any of claims 1-8.
13. A blind channel estimation method according to any of claims 1-8.
14. A blind system identification method according to any of claims 1-8.
15. A device including a processor arranged to perform the method according to any of claims 1-14.
16. A device according to claim 15, the device being selected from the group consisting of: a mobile phone, a communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, and a monitoring system.
17. Device according to claim 15, the device being selected from the group consisting of: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, and a headphone with a built-in microphone.
18. Device according to claim 15, arranged to receive a telecommunication signal, the device including a channel equalizer arranged to perform a blind channel equalization according to any of claims 1-8.
19. Computer executable program code adapted to perform the method according to any of claims 1-14.
EP07722544A 2006-04-04 2007-03-30 Signal analysis method with non-gaussian auto-regressive model Withdrawn EP2011114A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DKPA200600476 2006-04-04
PCT/DK2007/000163 WO2007112749A1 (en) 2006-04-04 2007-03-30 Signal analysis method with non-gaussian auto-regressive model

Publications (1)

Publication Number Publication Date
EP2011114A1 true EP2011114A1 (en) 2009-01-07

Family

ID=38134633

Family Applications (1)

Application Number Title Priority Date Filing Date
EP07722544A Withdrawn EP2011114A1 (en) 2006-04-04 2007-03-30 Signal analysis method with non-gaussian auto-regressive model

Country Status (2)

Country Link
EP (1) EP2011114A1 (en)
WO (1) WO2007112749A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859467A (en) * 2019-01-30 2019-06-07 银江股份有限公司 A kind of mining analysis method of Environmental Factors in traffic model

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8185480B2 (en) 2008-04-02 2012-05-22 International Business Machines Corporation System and method for optimizing pattern recognition of non-gaussian parameters
TWI465122B (en) 2009-01-30 2014-12-11 Dolby Lab Licensing Corp Method for determining inverse filter from critically banded impulse response data
US10394985B2 (en) * 2017-01-11 2019-08-27 Samsung Electronics Co., Ltd. Apparatus and method for modeling random process using reduced length least-squares autoregressive parameter estimation
CN109389979B (en) * 2018-12-05 2022-05-20 广东美的制冷设备有限公司 Voice interaction method, voice interaction system and household appliance
US11428839B2 (en) * 2018-12-28 2022-08-30 Carbo Ceramics Inc. Systems and methods for detecting a proppant in a wellbore
CN116866124A (en) * 2023-07-13 2023-10-10 中国人民解放军战略支援部队航天工程大学 Blind separation method based on baseband signal time structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2007112749A1 *


Also Published As

Publication number Publication date
WO2007112749A1 (en) 2007-10-11

Similar Documents

Publication Publication Date Title
EP0689194B1 (en) Method of and apparatus for signal recognition that compensates for mismatching
EP0886263B1 (en) Environmentally compensated speech processing
US9881631B2 (en) Method for enhancing audio signal using phase information
KR100549133B1 (en) Noise reduction method and device
JP3919287B2 (en) Method and apparatus for equalizing speech signals composed of observed sequences of consecutive input speech frames
CN101647061B (en) Noise variance estimator for speech enhancement
EP2011114A1 (en) Signal analysis method with non-gaussian auto-regressive model
WO2011037587A1 (en) Downsampling schemes in a hierarchical neural network structure for phoneme recognition
US8737460B2 (en) Equalizer and detector arrangement employing joint entropy-based calibration
González et al. MMSE-based missing-feature reconstruction with temporal modeling for robust speech recognition
JP4891805B2 (en) Reverberation removal apparatus, dereverberation method, dereverberation program, recording medium
KR20110024969A (en) Apparatus for filtering noise by using statistical model in voice signal and method thereof
Li et al. Efficient blind system identification of non-Gaussian autoregressive models with HMM modeling of the excitation
Akarsh et al. Speech enhancement using non negative matrix factorization and enhanced NMF
Zhao An EM algorithm for linear distortion channel estimation based on observations from a mixture of gaussian sources
Tang et al. Speech Recognition in High Noise Environment.
White et al. Reduced computation blind equalization for FIR channel input Markov models
Li Non-Gaussian, Non-stationary and Nonlinear Signal Processing Methods-with Applications to Speech Processing and Channel Estimation
Li et al. Paper F
Niazadeh et al. ISI sparse channel estimation based on SL0 and its application in ML sequence-by-sequence equalization
Wang et al. RVAE-EM: Generative speech dereverberation based on recurrent variational auto-encoder and convolutive transfer function
Yang et al. HB-DTW: Hyperdimensional Bayesian dynamic time warping for non-uniform Doppler
Siohan et al. Iterative noise and channel estimation under the stochastic matching algorithm framework
Vaseghi Channel equalization and blind deconvolution
Li et al. Blind identification of non-Gaussian Autoregressive models for efficient analysis of speech signals

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20081104

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA HR MK RS

RIN1 Information on inventor provided before grant (corrected)

Inventor name: ANDERSEN, SOREN, VANG

Inventor name: CHUNJIAN, LI

RIN1 Information on inventor provided before grant (corrected)

Inventor name: ANDERSEN, SOREN, VANG

Inventor name: CHUNJIAN, LI

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20110226