WO2018138543A1 - Probabilistic method for fundamental frequency estimation - Google Patents
- Publication number: WO2018138543A1 (application PCT/IB2017/050352)
- Authority: WO (WIPO, PCT)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
Definitions
- Step A22. Evaluate the conditional probability density of F0 specified by the conditional GMM on a finite grid.
- The purpose is to facilitate various operations, for example integration and entropy computation, in later stages of the present method.
- Care has to be taken to prevent the standard deviation of any mixture component from becoming smaller than the interval of the finite grid used for approximate inference, by setting a flooring limit on the conditional variances.
- Fig. 7 shows an example of the log conditional probability of F0, at the 20th band, estimated from a speech sample.
- The final steps of stage A combine the conditional probabilities of F0 from all bands into a posterior likelihood function, from which various useful statistical properties can be computed in stage B and stage C.
- Step A31. Compute the biased posterior likelihood function L'_n(f0) by multiplying the conditional probabilities of F0 from all bands, based on an independence assumption and a uniform prior assumption.
- The result L'_n(f0) could be biased toward a certain range of frequencies due to the prior distribution of F0 and/or the non-linear arrangement of band central frequencies.
- Step A32. Divide the biased posterior likelihood function by the mean posterior likelihood, estimated by averaging multiple estimates from a white noise signal, to remove the bias.
- Step A33. Normalize the unbiased posterior likelihood function so that the probability density integrates to 1.
- Fig. 8 shows an example of a normalized posterior likelihood function estimated from a speech sample.
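Steps A31 to A33 can be sketched as follows. This is a minimal illustration assuming a uniform f0 grid and per-band conditional densities already evaluated on it; the function and variable names are mine, not the patent's.

```python
import numpy as np

def combine_bands(band_densities, f0_grid, noise_reference=None):
    """band_densities: (K, G) array holding p_k(f0 | features) for K bands
    on a uniform grid of G candidate frequencies; noise_reference: (G,)
    mean likelihood measured on white noise (step A32), or None to skip."""
    # Step A31: product over bands (independence assumption), in log domain.
    log_L = np.log(np.maximum(band_densities, 1e-300)).sum(axis=0)
    # Step A32: divide by the white-noise reference to remove the bias.
    if noise_reference is not None:
        log_L -= np.log(np.maximum(noise_reference, 1e-300))
    # Step A33: normalize so the density integrates to 1 on the grid.
    L = np.exp(log_L - log_L.max())
    df = f0_grid[1] - f0_grid[0]
    return L / (L.sum() * df)
```

Working in the log domain avoids numerical underflow when many bands are multiplied together.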
- Stage B (Fig. 3) in the example implementation, detecting the presence of a sinusoidal component given the posterior likelihood function obtained in stage A, consists of the following steps.
- Step B01. Compute the entropy H_n of the normalized posterior likelihood function for each frame.
- The entropy indicates the uncertainty about F0 at each frame, from which the probability of the presence of a sinusoidal component can be derived.
- Step B02. Compute the difference between the entropy H_n and its upper bound.
- Step B03. Convert the entropy H_n into a posterior probability P_n(V | H_n) indicating the voicing status (i.e. the presence of a sinusoidal component).
- The entropy measured from unvoiced frames (i.e. frames where the sinusoidal component is absent) tends to stay close to its upper bound.
- Fig. 9 shows a plot of the entropy difference, overlaid with the posterior voicing probability P_n(V | H_n), estimated from a speech sample.
- Step B04. Estimate a binary sequence indicating the status of presence of the sinusoidal component.
- Alternatively, step B03 and step B04 can be replaced by a simple linear decision on the entropy H_n: the n-th frame is labeled as voiced (i.e. with some sinusoidal component present) if the entropy difference of step B02 exceeds a user-specified threshold.
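The simple thresholded variant of stage B can be sketched as follows; for a discrete posterior on G grid points, the entropy upper bound is log(G). The threshold value and all names here are illustrative assumptions.

```python
import numpy as np

def voicing_from_entropy(posterior_grid, theta):
    """posterior_grid: (N, G) array of normalized posteriors, one row per
    frame; theta: user-specified threshold on the entropy difference.
    Returns (entropy_diff, voiced) per frame."""
    p = posterior_grid / posterior_grid.sum(axis=1, keepdims=True)
    # Step B01: entropy of each frame's posterior.
    H = -(p * np.log(np.maximum(p, 1e-300))).sum(axis=1)
    # Step B02: difference from the upper bound log(G); a flat (uncertain)
    # posterior gives a difference near zero, a peaked one a large difference.
    diff = np.log(posterior_grid.shape[1]) - H
    # Simplified B03/B04: linear decision on the entropy difference.
    return diff, diff > theta
```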
- Stage C (Fig. 4) in the example implementation, adaptive Bayesian tracking of the time-varying F0 based on the posterior likelihood functions computed in stage A, consists of the following steps.
- First, the time-varying process variance of F0 is estimated.
- The posterior likelihood function is usually unimodal, and thus it is statistically meaningful to assume a normal distribution in the early stages of adaptive tracking.
- A Kalman filter with time-varying process and measurement noise distributions can be used to efficiently decode the F0 trajectory and also to evaluate the total likelihood of the trajectory.
- The scaling factor can be determined under the maximum-likelihood principle.
- Step C01. Compute the expectation and variance of F0 from the normalized posterior likelihood function.
- M is the order of the moving variance, which is empirically found to be around 10.
- Step C03. Find the scaling factor for the moving variance such that the scaled variance, when used as the process variance, maximizes the total likelihood of Kalman filtering. For such a small search space with one free variable, a search over a finite grid is possible.
- Fig. 10 shows a plot comparing the moving standard deviation of the expected F0 with the estimated standard deviation of the process noise; the upper part is the 2D visualization of the normalized posterior likelihood function of F0 estimated from a speech sample.
- Step C11. Estimate the maximum a posteriori (MAP) F0 trajectory given the normalized posterior likelihood functions L_n(f0) and the process variance determined in step C03.
- MAP inference can be done using either a finite grid approximation (decoded using the Viterbi algorithm) or particle filtering.
- Fig. 11 shows a plot of the MAP-estimated F0 trajectory overlaid on the 2D visualization of the normalized posterior likelihood function, estimated from a speech sample.
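The grid-based MAP decoding of step C11 can be sketched with a Viterbi pass over the f0 grid. This is a simplified stand-in, not the patent's exact formulation: the transition score is a Gaussian in the frequency jump with the time-varying process variance from step C03, and all names are mine.

```python
import numpy as np

def viterbi_f0(log_obs, f0_grid, proc_var):
    """log_obs: (N, G) log posterior likelihoods on the f0 grid;
    proc_var: (N,) per-frame process variances; returns the MAP
    trajectory as an array of N frequencies."""
    N, G = log_obs.shape
    d = f0_grid[None, :] - f0_grid[:, None]       # (G, G) frequency jumps
    delta = log_obs[0].copy()                     # best score ending in each state
    psi = np.zeros((N, G), dtype=int)             # backpointers
    for n in range(1, N):
        trans = -0.5 * d ** 2 / proc_var[n]       # Gaussian log transition score
        score = delta[:, None] + trans            # score[i, j]: from i to j
        psi[n] = np.argmax(score, axis=0)
        delta = score[psi[n], np.arange(G)] + log_obs[n]
    path = np.empty(N, dtype=int)
    path[-1] = np.argmax(delta)
    for n in range(N - 1, 0, -1):                 # backtrack
        path[n - 1] = psi[n, path[n]]
    return f0_grid[path]
```

A small process variance penalizes large jumps and thereby suppresses spurious octave errors, which is the stated purpose of stage C.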
- The present method involves an additional training stage (stage T) for determining the GMM parameters used to convert speech features into the conditional distribution of F0. While it is well known that GMM parameters can be trained, using the Expectation-Maximization (EM) algorithm, from speech data with known F0, the resulting model could be biased toward the particular training data set, and its accuracy is not guaranteed on different speakers and noise levels. To solve this problem, stage T in the present method also applies EM training of the GMM, but on speech signals generated from a Monte-Carlo simulation, assuming a signal model with parameters following high-entropy distributions. In the following, stage T (Fig. 5) in the example implementation is described in detail from step T01 to step T11.
- Step T01. Draw random samples of signal model parameters from their respective prior distributions. The following parameters are considered in this example:
- amplitude of the k-th harmonic (a_k), following a log-uniform distribution from 1/3 to 3, expressed as a ratio to the amplitude of the fundamental (a_1);
- amplitude of the additive noise (a_g), following a log-uniform distribution from -50 dB to 50 dB, as a ratio to the amplitude of the fundamental.
- Step T02. Generate a synthetic speech signal s[n] from the sampled parameters.
- A harmonic plus noise model of the following form is used.
- g[n] is a Gaussian white noise sequence with zero mean and unit standard deviation; the noise sequence is uncorrelated with the sinusoidal component.
- The harmonic plus noise signal model used in this example implementation can represent a wide variety of speech and non-speech (for example, monophonic musical instrument) signals.
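Steps T01 and T02 can be sketched as follows. Since the model equation itself is not reproduced in this text, the concrete form below (a sum of sinusoids at integer multiples of f0 plus scaled white Gaussian noise) is an assumption consistent with the description, and all names are mine.

```python
import numpy as np

def synth_harmonic_noise(f0, fs, n_samples, n_harmonics, rng):
    """Draw amplitudes from the stated log-uniform priors and synthesize
    one harmonic-plus-noise sample at fundamental f0 (Hz)."""
    a = np.ones(n_harmonics)
    # Harmonics 2..K: log-uniform in [1/3, 3] relative to the fundamental.
    a[1:] = np.exp(rng.uniform(np.log(1 / 3), np.log(3), n_harmonics - 1))
    # Noise amplitude: log-uniform in [-50 dB, +50 dB] re the fundamental.
    a_g = 10 ** (rng.uniform(-50, 50) / 20)
    t = np.arange(n_samples) / fs
    phases = rng.uniform(0, 2 * np.pi, n_harmonics)   # random phases (assumed)
    s = sum(a[k] * np.sin(2 * np.pi * (k + 1) * f0 * t + phases[k])
            for k in range(n_harmonics))
    # Uncorrelated zero-mean, unit-variance Gaussian noise g[n], scaled by a_g.
    return s + a_g * rng.standard_normal(n_samples)
```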
- Step T03. For each group (i.e. "band" in this example implementation), extract features from the synthetic signal generated in step T02, in the identical way as in stage A. Store the set of features and the actual F0 of the synthetic signal, in a format consistent with the GMM modeling the joint distribution of speech features and F0. The features may be repeatedly extracted from different positions in the synthetic signal.
- Repeat step T01 through step T03 for around 1000 iterations or more, until enough data is collected for GMM training.
- Step T11. For each group of features, apply the EM algorithm to train a GMM modeling the joint distribution of features and F0, on the data collected from step T01 through step T03.
- A typical configuration uses GMMs with 16 mixture components and a full covariance matrix.
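Step T11 can be sketched with scikit-learn's EM implementation; using this particular library is an assumption (any EM-based GMM trainer would do). Each row of the training matrix stacks one band's features with the true F0 of the generating sample.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_band_gmm(data, n_components=16):
    """data: (n_samples, d) rows of [features..., f0] for one band.
    Fits a full-covariance GMM by EM, as in the typical configuration."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=0)
    gmm.fit(data)
    return gmm
```

The fitted `weights_`, `means_`, and `covariances_` are exactly the w, mu, and Sigma parameters consumed by step A21.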
Abstract
A probabilistic method for estimating the fundamental frequency (F0) of a digital signal. The method comprises: estimating the posterior distribution of F0; determining the voicing status at each frame; adaptively estimating the most likely F0 trajectory; and training the probabilistic model from simulated data. A GMM models the relation between F0 and features extracted from speech, with model parameters trained on synthetic signals generated by Monte-Carlo simulation. The method is robust against additive noise and can easily be adapted to model a wide variety of speech and non-speech signals.
Description
Probabilistic Method for Fundamental Frequency Estimation
FIELD OF THE INVENTION This invention relates to a probabilistic method for the estimation of fundamental frequency and detection of the presence of sinusoidal component in digital signals. The method is primarily designed for speech signals but is also applicable to other periodic digital signals.
BACKGROUND OF THE INVENTION The present invention primarily concerns the estimation of fundamental frequency in clean and noisy speech signals. In recent years there has been a rising trend of formulating the fundamental frequency (F0) estimation problem in a Bayesian framework, and the convenience of expressing noise in probabilistic notation generally leads to significant improvements in noise-robustness.
A class of Bayesian F0 estimation approaches assume a time-domain signal model with additive Gaussian noise. A likelihood function of observing the input signal can then be defined and converted to a posterior distribution of F0. For example, Nielsen et al. (J. K. Nielsen, M. G. Christensen, and S. H. Jensen, "An Approximate Bayesian Fundamental Frequency Estimator," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE Press, 2012.) used a harmonic plus noise model to model windowed speech; F0 and model order are jointly determined, with prior distributions set under the maximum-entropy principle. More recently, Hajimolahoseini et al. (H.
Hajimolahoseini, R. Amirfattahi, S. Gazor, and H. Soltanian-Zadeh, "Robust estimation and tracking of pitch period using an efficient Bayesian filter," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1219-1229, Jul. 2016.) used a simple stochastic time-delay model, with amplitude decay and period length controlled by another stochastic system, to track the pitch period; inference is carried out using particle filtering. Despite the flexibility in model configuration, an issue commonly associated with the signal modeling approach is the validity of the assumed signal model: a simple model may not represent the speech generation mechanism well, while a complicated model requires exotic approximation and inference techniques, or otherwise becomes computationally intractable.
Another category of probabilistic approaches to F0 estimation do not directly model the speech signal but rather attempt to estimate F0 from features computed from the speech. Garner et al. (P. N. Garner, M. Cernak, and P. Motlicek, "A simple continuous pitch estimation algorithm," IEEE Signal Processing Letters, vol. 20, no. 1, pp. 102-105, Jan. 2013.) proposed a simple F0 tracker based on a Kalman filter, where the measurement distribution is determined from the signal autocorrelation, which serves as an indication of both period length and uncertainty of the measurement. Tsanas et al. (A. Tsanas, M. Zanartu, M. A. Little, C. Fox, L. O. Ramig, and G. D. Clifford, "Robust fundamental frequency estimation in sustained vowels: Detailed algorithmic comparisons and information fusion with adaptive Kalman filtering," The Journal of the Acoustical Society of America, vol. 135, no. 5, pp. 2885-2901, May 2014.) considered combining numerous F0 estimation algorithms, also using a Kalman filter; in their approach the measurement noise variance is converted from a high-level feature of algorithmic robustness, for each contributing F0 estimator. However, the conversion from speech feature to probability distribution often relies on empirically designed heuristics which are not guaranteed to faithfully represent the statistics, and sometimes involves many manually specified parameters, which may overfit to the speech dataset used for parameter tuning.
The method disclosed in this patent belongs to the second category of F0 estimation methods. The novelty of the present method includes (1) the use of a GMM for modeling the relation between F0 and features extracted from speech; (2) estimation of the GMM parameters by Monte-Carlo simulation; (3) an adaptive Bayesian filtering method for smoothing the estimated F0 trajectory, with very few hyper-parameters introduced; (4) detection of the presence of a sinusoidal component, or voicing detection in the context of speech analysis, based on the entropy of the posterior distribution of F0. The present invention is mostly free from the use of heuristics and rule-based decisions, and is very robust against additive noise.
SUMMARY OF THE INVENTION This patent discloses a multi-stage probabilistic method for fundamental frequency (F0) estimation and detection of the presence of a sinusoidal component in digital signals. The present method involves the following stages (Fig. 1).
Stage A. Compute the posterior likelihood function L(f0) of F0 from an input digital signal at each time instant (or equivalently, frame); the posterior likelihood function is proportional to the posterior probability of F0 given the input around the time instant of interest;
Stage B. Determine the presence of a sinusoidal component at each time instant based on the entropy of the posterior likelihood function computed in Stage A; in the context of speech processing, this relates to the problem of voicing detection;
Stage C. Adaptively track the time-varying F0 based on the posterior likelihood functions computed in Stage A; the tracking is based on adaptive Bayesian tracking, its purpose being to reduce spurious estimation errors.
Concretely, stage A involves three major steps, summarized as follows.
First step of stage A. One or a plurality of groups of features are extracted from the input signal. Each group may contain one or a plurality of features. For analysis of time-varying F0, such feature extraction is performed at each analysis time instant. The feature can be, for example, the instantaneous frequency at a certain frequency or the signal-to-noise ratio (SNR) at a certain frequency. It is also possible to use time-domain features such as the peak position of the autocorrelation function.
Second step of stage A. Compute p_k(f0 | x1, x2, ..., xN), the conditional probability of F0 given the extracted features in each group (indexed by k) of features. For each group of features, a Gaussian Mixture Model (GMM) modeling the joint distribution of F0 and the related features is assumed to be given in advance. The GMM is then converted to a conditional GMM of lower dimension, yielding p_k(f0 | x1, x2, ..., xN).
Third step of stage A. Combine the conditional distributions of F0 computed from each group of features into a posterior likelihood function.
The present method involves an additional training stage (stage T), based on Monte-Carlo simulation, for estimating the parameters of the GMM corresponding to each group of features. Stage T involves three major steps, summarized as follows.
First step of stage T. Draw samples of signals from a signal model and prior distributions of model parameters.
Second step of stage T. From each signal sample, extract one or a plurality of groups of features at each time instant, being consistent with the groups of features in stage A.
Third step of stage T. For each group of features, train a GMM modeling the joint distribution of features and fundamental frequency, based on the samples and F0 values generated in the previous steps.
Details about each stage of the present method are presented in the rest of this description. The present method is flexible in that it can be trivially extended to support digital signals of any sampling rate, any kind of signal model and prior distribution, and any kind of feature or combination of features that can be deterministically computed from the input signal. Thus the choice of sampling rate, signal model, prior distributions and features in this description is for illustrative purposes only and does not limit the scope of this patent in regard to the choice of sampling rate, signal model, prior distributions, features and combination of features.
BRIEF DESCRIPTION OF DRAWINGS Fig. 1 is a flow chart outlining the connection between major stages involved in the present invention.
Fig. 2 is a diagram outlining the data flow between the steps in stage A of an example implementation of the present invention.
Fig. 3 is a flow chart outlining the steps in stage B of an example implementation of the present invention.
Fig. 4 is a flow chart outlining the steps in stage C of an example implementation of the present invention.
Fig. 5 is a flow chart outlining the steps in stage T of an example implementation of the present invention.
Fig. 6 is the 2D visualization of the local SNR estimated from a speech sample, in stage A of an example implementation.
Fig. 7 is the 2D visualization of the log conditional probability density of F0 estimated from a speech sample, at the 20th band, in stage A of an example implementation.
Fig. 8 is the 2D visualization of the normalized posterior likelihood function of F0 estimated from a speech sample, in stage A of an example implementation.
Fig. 9 is the plot of the entropy difference, overlaid with the posterior voicing probability, estimated from a speech sample, in stage B of an example implementation.
Fig. 10 is the plot comparing the moving standard deviation of the expected F0 with the estimated standard deviation of the process noise, with the upper part being the 2D visualization of the normalized posterior likelihood function of F0 estimated from a speech sample, in stage C of an example implementation.
Fig. 11 is the plot of the maximum a posteriori estimated F0 trajectory overlaid on the 2D visualization of the normalized posterior likelihood function, estimated from a speech sample, in stage C of an example implementation.
DETAILED DESCRIPTION OF THE INVENTION The following is a detailed description of an example implementation of each stage in the present invention, for estimating F0 in speech signals. The relation between the A, B, C and T stages has been outlined in the summary and visualized in the flow chart in Fig. 1.
The example implementation uses a multi-band feature extractor with the band central frequency defined according to an array of 36 frequency values spanning from 40Hz to 1000Hz, with uniform spacing on the logarithmic scale.
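The band central frequencies described above (36 values from 40 Hz to 1000 Hz, uniform on a log scale) amount to a geometric progression, which can be computed as follows; the function name is mine.

```python
import numpy as np

def band_center_frequencies(n_bands=36, f_lo=40.0, f_hi=1000.0):
    # Geometric (log-uniform) spacing: constant frequency ratio between
    # adjacent bands, as in the example implementation.
    return np.geomspace(f_lo, f_hi, n_bands)

fc = band_center_frequencies()
```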
Stage A (Fig. 2), computation of the posterior likelihood function L(f0), in the example implementation consists of the following detailed steps.
Step A01. Resample the input signal to a new sampling rate of 8000Hz. The new signal is denoted as x [ n] .
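Step A01 can be sketched as follows. The patent does not specify a resampling method, so the linear-interpolation resampler below is a simplified stand-in; a production implementation would use an anti-aliasing polyphase resampler instead. Names are mine.

```python
import numpy as np

def resample_to_8k(x, fs_in, fs_out=8000):
    """Resample x from fs_in Hz to fs_out Hz by linear interpolation
    (illustrative only; no anti-aliasing filter is applied)."""
    n_out = int(round(len(x) * fs_out / fs_in))
    t_out = np.arange(n_out) / fs_out      # output sample times (seconds)
    t_in = np.arange(len(x)) / fs_in       # input sample times (seconds)
    return np.interp(t_out, t_in, x)
```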
In the following steps (step A11 through step A15) instantaneous frequency and time-frequency local SNR features are extracted from the resampled signal. The feature extraction procedure in this example implementation is mostly identical to the recent work by
Kawahara et al. (H. Kawahara, Y. Agiomyrgiannakis, and H. Zen, "Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis, " in 9th ISCA Workshop on Speech Synthesis, 2016.)
Step A11. For each band with central frequency ω_ck, bandpass filter the resampled signal using a complex filter obtained by frequency-modulating a normalized Nuttall window,

w_k[m] = Σ_{i=1..4} a_i cos(0.25 (i - 1) ω_ck m), ∀ |m| < 4π/ω_ck
h_k[m] = w_k[m] exp(j ω_ck m)
y_k[n] = x[n] * h_k[m]
where "*" denotes convolution and a = [0.338946, 0.481973, 0.161054, 0.018027].
Step A12. Estimate the time-frequency local SNR a_k[n] by applying the bandpass filter again on the normalized y_k[n] signal, followed by renormalization, subtraction and smoothing. Fig. 6 shows an example of the local SNR estimated from a speech sample.
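Step A11's band filter can be sketched as follows: the four-term Nuttall window with the coefficients given above, frequency-modulated to the band center ω_ck (in radians per sample). This is an illustration with my own names, not the patent's exact implementation.

```python
import numpy as np

NUTTALL_A = np.array([0.338946, 0.481973, 0.161054, 0.018027])

def nuttall_bandpass(x, omega_ck):
    m_max = int(np.floor(4 * np.pi / omega_ck))
    m = np.arange(-m_max, m_max + 1)
    i = np.arange(1, 5)[:, None]                     # i = 1..4
    # w_k[m] = sum_i a_i cos(0.25 (i-1) omega_ck m), |m| < 4 pi / omega_ck
    w = (NUTTALL_A[:, None] * np.cos(0.25 * (i - 1) * omega_ck * m)).sum(axis=0)
    w /= w.sum()                                     # normalize to unit DC gain
    h = w * np.exp(1j * omega_ck * m)                # modulate to band center
    return np.convolve(x, h, mode="same")            # y_k[n] = x[n] * h_k[m]
```

Filtering a sinusoid at the band center yields an analytic-like output of roughly half the input amplitude, while energy at distant frequencies is strongly attenuated, which is what makes the per-band SNR of step A12 measurable.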
Step A 13. For each band with central frequency a>ck , bandpass filter the resampled
x[n] using the time derivative of the complex filter used in step A02.
w′_k[m] = −0.25 ω_ck Σ_{l=1..4} (l − 1) a_l sin(0.25 (l − 1) ω_ck m), ∀ |m| ≤ 4π/ω_ck

h′_k[m] = (w′_k[m] + j w_k[m] ω_ck) f_s exp(j ω_ck m)

y′_k[n] = x[n] * h′_k[m]

where f_s is the sampling rate.
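The text jumps from step A13 to step A15, so the step that combines y_k and y′_k into an instantaneous frequency (presumably step A14) is not spelled out. A common construction, used for example in analytic-signal processing, is the one sketched below; the formula and function names are assumptions on our part, not taken from the text.

```python
import numpy as np

def instantaneous_frequency(y, y_prime):
    """Instantaneous frequency (Hz) of a complex band signal y given its
    time derivative y_prime (in units of 1/second). This step is absent
    from the text (step A14); the standard construction is assumed:

        d/dt arg(y(t)) = Im(y'(t) * conj(y(t))) / |y(t)|^2   [rad/s]
    """
    eps = 1e-12  # guards against division by zero in silent regions
    return np.imag(y_prime * np.conj(y)) / (2 * np.pi * (np.abs(y) ** 2 + eps))

# Usage sketch: a complex tone at 100 Hz; its derivative is approximated
# numerically here instead of via the h'_k filter.
fs = 8000.0
n = np.arange(1000)
y = np.exp(1j * 2 * np.pi * 100.0 * n / fs)
y_prime = np.gradient(y) * fs
f_inst = instantaneous_frequency(y, y_prime)  # ~100 Hz away from the edges
```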
Step A15. In practice it may not be necessary to extract features for every sample, as the signal is usually quasi-stationary over short periods of time. The SNR and instantaneous frequency features are downsampled by a factor of T, and the SNR is converted to the decibel scale. Subsequently, in steps A21 and A22, the conditional distribution of F0 given the extracted features in each band, at each time instant, is determined.
For each band, a Gaussian Mixture Model (GMM) modeling the joint distribution of F0 and the associated features is given. It is assumed that the GMM parameters have been estimated in stage T and are stored for use in stage A.
For the k-th band, the random variables modeled by the GMM, aside from F0, include the following features:
local SNR in dB (a_{k,n});
local SNR taken at the band with central frequency close to half the central frequency of the k-th band (in this example implementation, which uses 36 bands spanning from 40Hz to 1000Hz, a′_{k,n} = a_{k−7,n}; for k ≤ 7, we define a′_{k,n} = a_{k,n});
instantaneous frequency (ν_{k,n}).
The joint density modeled by the GMM is

p_k(x) = Σ_l w_lk N(x | μ_lk, Σ_lk)

where x = [a_{k,n}, a′_{k,n}, ν_{k,n}, f_0]^T; N(·|·,·) is the probability density function of a normal distribution; w_lk is the weight of the l-th mixture component; μ_lk is the l-th mean vector; Σ_lk is the l-th covariance matrix in the following block matrix form,

Σ = [ Σ_11  Σ_12 ]
    [ Σ_21  Σ_22 ]
where Σ_11 has dimension 3x3; Σ_12 has dimension 3x1; Σ_21 = Σ_12^T; Σ_22 is a scalar. For notational simplicity, the subscripts l and k in their respective representations are dropped. Step A21. Convert the GMM modeling the joint distribution of F0 and associated features to a GMM of a single dimension modeling the conditional distribution of F0 given the features.
μ′_l = μ_2 + Σ_21 Σ_11^{−1} ([a_{k,n}, a′_{k,n}, ν_{k,n}]^T − μ_1)

Σ′_l = Σ_22 − Σ_21 Σ_11^{−1} Σ_12

w′_l = w_l N([a_{k,n}, a′_{k,n}, ν_{k,n}]^T | μ_1, Σ_11) / Σ_m w_m N([a_{k,n}, a′_{k,n}, ν_{k,n}]^T | μ_1, Σ_11)

where μ_1 is the 3x1 sub-vector in μ_lk; μ_2 is the last element in μ_lk (in the denominator of w′_l, μ_1 and Σ_11 refer to those of the m-th component).
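The joint-to-conditional conversion of step A21 is standard Gaussian conditioning applied per mixture component, with the weights re-scaled by each component's marginal feature likelihood. A sketch (variable names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_joint_to_conditional(weights, means, covs, feat):
    """Convert a GMM over [features, f0] (3 features, f0 last) into a 1-D
    GMM over f0 conditioned on the observed 3-vector `feat` (step A21).
    Returns (new weights, conditional means, conditional variances)."""
    new_w, new_mu, new_var = [], [], []
    for w, mu, S in zip(weights, means, covs):
        mu1, mu2 = mu[:3], mu[3]            # feature block / f0 element
        S11, S12 = S[:3, :3], S[:3, 3]      # 3x3 and 3x1 blocks
        S21, S22 = S[3, :3], S[3, 3]        # 1x3 block and scalar
        S11_inv = np.linalg.inv(S11)
        # Conditional mean and variance of f0 given the features
        new_mu.append(mu2 + S21 @ S11_inv @ (feat - mu1))
        new_var.append(S22 - S21 @ S11_inv @ S12)
        # Re-weight by the marginal feature likelihood of this component
        new_w.append(w * multivariate_normal.pdf(feat, mean=mu1, cov=S11))
    new_w = np.array(new_w)
    return new_w / new_w.sum(), np.array(new_mu), np.array(new_var)
```

For a single component with identity covariance, the features carry no information about f0, so the conditional collapses to the f0 marginal of that component.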
Step A22. Evaluate the conditional probability density of F0 specified by the conditional GMM on a finite grid. The purpose is to facilitate various operations, for example integration and entropy computation, in later stages of the present method. In some cases care has to be taken to prevent the standard deviation of any mixture component from being smaller than the interval of the finite grid used for approximate inference, by setting a flooring limit on the conditional variances Σ′_l. For notational simplicity, the rest of this description uses continuous representations of F0 and the associated probabilities and functions. Fig. 7 shows an example of the log conditional probability of F0, at the 20th band, estimated from a speech sample.
The final steps of stage A (step A31 through step A33) combine the conditional probabilities of F0 from all bands into a posterior likelihood function, from which various useful statistical properties can be computed in stage B and stage C.
Step A31. Compute the biased posterior likelihood function L′_n(f_0) by multiplying the conditional probabilities of F0 from all bands, based on an independence assumption and a uniform prior assumption.

L′_n(f_0) = [ Π_{k=1..K} p(f_0 | a_{k,n}, a′_{k,n}, ν_{k,n}) ]^{1/γ}

Optionally, take the result to the power of 1/γ to control the kurtosis (i.e. sharpness) of the resulting distribution. When γ = K, the above equation is identical to computing the geometric mean of the conditional probabilities of F0 from all bands.
The result L′_n(f_0) could be biased toward a certain range of frequencies due to the prior distribution of F0, and/or the non-linear arrangement of band central frequencies.
Step A32. Divide the biased posterior likelihood function by the mean posterior likelihood estimated by averaging multiple estimates from a white noise signal to remove the bias.
Step A33. Normalize the unbiased posterior likelihood function so that the probability density integrates to 1. Fig. 8 shows an example of normalized posterior likelihood function estimated from a speech sample.
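Steps A31 through A33 can be sketched as below. The log-domain product and the small epsilon are numerical-stability choices of ours; `noise_mean` stands for the mean biased likelihood pre-computed from white noise in step A32.

```python
import numpy as np

def posterior_likelihood(cond_probs, noise_mean, gamma, f_grid):
    """Combine per-band conditional densities of f0, all evaluated on the
    same finite grid, into a normalized posterior likelihood (steps
    A31-A33). cond_probs has shape (n_bands, n_grid)."""
    # A31: product over bands, done in the log domain, then the optional
    # 1/gamma power controlling the kurtosis of the result.
    log_l = np.sum(np.log(cond_probs + 1e-300), axis=0) / gamma
    l_biased = np.exp(log_l - log_l.max())
    # A32: divide by the white-noise mean likelihood to remove the bias.
    l_unbiased = l_biased / noise_mean
    # A33: normalize so the density integrates to 1 over the grid.
    df = np.diff(f_grid, prepend=f_grid[0] - (f_grid[1] - f_grid[0]))
    return l_unbiased / np.sum(l_unbiased * df)
```

With flat inputs the result is the uniform density over the grid, which is a convenient sanity check of the normalization.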
Stage B (Fig. 3) in the example implementation, detecting the presence of a sinusoidal component given the posterior likelihood function obtained in stage A, consists of the following steps.
Step B01. Compute the entropy H_n of the normalized posterior likelihood function for each frame. The entropy is an indication of the uncertainty about F0 at each frame, from which the probability of presence of a sinusoidal component can be derived.

H_n = −∫_{f_min}^{f_max} L_n(f_0) log(L_n(f_0)) df_0
Step B02. Compute the difference between the entropy H_n and its upper bound.

ΔH_n = H_max − H_n

where H_max = −log(1 / (f_max − f_min))
Step B03. Convert the entropy H_n into a posterior probability P_n(V | H_n) indicating the voicing status (i.e. presence of a sinusoidal component). The entropy measured from unvoiced frames (i.e. frames where the sinusoidal component is absent) is modeled using an exponential distribution with parameter λ set according to the standard deviation of the entropy of F0 measured from a white noise signal.

P_n(V | H_n) = α / (α + λ exp(−λ ΔH_n))

where α is a user-specified parameter controlling the sensitivity of sinusoidal component detection. Fig. 9 shows a plot of the entropy difference, overlaid with the posterior voicing probability P_n(V | H_n), estimated from a speech sample.
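Steps B01 through B03 can be sketched as below. Note that the exact form of the entropy-to-probability mapping is partly reconstructed here (the published text of the equation is garbled), so treat the last line as an assumption rather than the definitive formula.

```python
import numpy as np

def voicing_probability(L_n, f_grid, lam, alpha):
    """B01-B03 sketch: entropy of the normalized posterior likelihood of
    f0 on a uniform grid, its gap to the maximum-entropy bound, and a
    posterior voicing probability (mapping partly reconstructed)."""
    df = f_grid[1] - f_grid[0]                  # uniform grid assumed
    p = L_n * df                                # probability mass per bin
    H = -np.sum(p * np.log(L_n + 1e-300))       # B01: differential entropy
    H_max = np.log(f_grid[-1] - f_grid[0])      # B02: bound from uniform density
    dH = H_max - H
    # B03: exponential model for the entropy gap of unvoiced frames
    return alpha / (alpha + lam * np.exp(-lam * dH))
```

A density concentrated at a single frequency (strong voicing evidence) maps close to 1, while a near-uniform density maps to roughly alpha / (alpha + lam).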
Step B04. Estimate a binary sequence q_n indicating the status of presence of the sinusoidal component, by applying Viterbi decoding to the sequence of posterior probabilities P_n(V | H_n) under a two-state Markov chain, where p_t is the user-specified parameter of transition probability.
Alternatively, step B03 and step B04 can be replaced by a simple linear decision on the entropy H_n. For example, the n-th frame is labeled as voiced (i.e. with some sinusoidal component being present) if ΔH_n > H_θ, where H_θ is a user-specified thresholding parameter.
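The Viterbi decoding of step B04 can be sketched as below; since the equation is missing from the text, the symmetric two-state transition model is our reconstruction of the step.

```python
import numpy as np

def viterbi_voicing(p_voiced, p_transition):
    """B04 sketch: decode a binary voicing sequence q_n from per-frame
    voicing probabilities, with a symmetric 2-state Markov chain whose
    off-diagonal transition probability is p_transition."""
    p = np.asarray(p_voiced)
    n = len(p)
    log_emit = np.log(np.stack([1.0 - p, p]) + 1e-12)            # (2, n)
    log_trans = np.log(np.array([[1 - p_transition, p_transition],
                                 [p_transition, 1 - p_transition]]))
    delta = log_emit[:, 0].copy()
    back = np.zeros((2, n), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + log_trans       # (from-state, to-state)
        back[:, t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_emit[:, t]
    q = np.zeros(n, dtype=int)
    q[-1] = int(np.argmax(delta))
    for t in range(n - 1, 0, -1):                 # backtrack
        q[t - 1] = back[q[t], t]
    return q
```

With a small transition probability, an isolated dip in the voicing probability is smoothed over rather than toggling the state.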
Stage C (Fig. 4) in the example implementation, adaptive Bayesian tracking of the time-varying F0 based on the posterior likelihood functions computed in stage A, consists of the following steps.
From step C01 to step C03, the time-varying process variance of F0 is estimated. In practice the posterior likelihood function is usually unimodal, and thus it is statistically meaningful to assume a normal distribution in the early stages of adaptive tracking. In such a case, a Kalman filter with time-varying process and measurement noise distributions can be used to efficiently decode the F0 trajectory and also to evaluate the total likelihood of the trajectory. When assuming that the process variance is proportional to the moving variance of the expected F0, the scaling factor can be determined under the maximum-likelihood principle.
Step C01. Compute the expectation and variance of F0 from the normalized posterior likelihood function.

f̄_n = ∫ f_0 L_n(f_0) df_0
σ²_{f0,n} = ∫ (f_0 − f̄_n)² L_n(f_0) df_0

Step C02. Compute the moving variance σ²_{f̄,n} of f̄_n, but only for those frames whose voicing status (i.e. the status of presence of the sinusoidal component) is q_n = 1; otherwise the moving variance is set to a reasonably large constant.
Step C03. Find the best scaling factor for the moving variance such that the scaled variance, when used as the process variance, maximizes the total likelihood of Kalman filtering. For such a small search space with one free variable, a finite-grid-based search is possible:
(1) generate an array of candidate scaling factors β_k logarithmically distributed on the order of 0.1 to 10, for example β = [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4]; in practice the actual candidate array could be of a larger size;
(2) for each candidate scaling factor, multiply it with the moving variance σ²_{f̄,n} and run Kalman filtering with the process mean set to zero, the process variance set to β_k σ²_{f̄,n}, and the measurement mean and variance set to f̄_n and σ²_{f0,n}, respectively;
(3) find the best β̂ such that the total likelihood of Kalman filtering is maximized.
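The grid search of step C03 can be sketched as below. This is a minimal scalar random-walk Kalman filter; the function names and the exact likelihood bookkeeping are our own choices.

```python
import numpy as np

def kalman_loglik(meas_mean, meas_var, proc_var):
    """Total log-likelihood of a scalar random-walk Kalman filter given
    per-frame measurement means/variances and process variances
    (zero process mean, as in step C03)."""
    mu, P = meas_mean[0], meas_var[0]
    loglik = 0.0
    for n in range(1, len(meas_mean)):
        P_pred = P + proc_var[n]                 # predict
        S = P_pred + meas_var[n]                 # innovation variance
        innov = meas_mean[n] - mu
        loglik += -0.5 * (np.log(2 * np.pi * S) + innov ** 2 / S)
        K = P_pred / S                           # Kalman gain (update)
        mu = mu + K * innov
        P = (1 - K) * P_pred
    return loglik

def best_scaling(meas_mean, meas_var, moving_var,
                 betas=(0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4)):
    """Step C03: pick the scaling factor maximizing the total likelihood
    when the process variance is beta * moving_var."""
    scores = [kalman_loglik(meas_mean, meas_var, b * moving_var) for b in betas]
    return betas[int(np.argmax(scores))]
```

On data simulated from a random walk, the selected factor lands near the ratio of the true process variance to the supplied moving variance, which is the behavior the step relies on.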
Fig. 10 shows a plot comparing the moving standard deviation of expected F0 with the estimated standard deviation of process noise, with the upper part being the 2D visualization of the normalized posterior likelihood function of F0 estimated from a speech sample.
Step C11. Estimate the maximum a posteriori (MAP) F0 trajectory given the normalized posterior likelihood functions L_n(f_0) and the process variance β̂ σ²_{f̄,n} determined in step C03. The MAP inference can be done using either a finite grid approximation (decoded using the Viterbi algorithm) or particle filtering. Fig. 11 shows a plot of the maximum a posteriori estimated F0 trajectory overlaid on the 2D visualization of the normalized posterior likelihood function, estimated from a speech sample.
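The finite-grid variant of step C11 can be sketched as a Viterbi pass whose transition scores come from a Gaussian random walk with the per-frame process variance. The grid and the toy posterior in the usage below are our own illustrative choices.

```python
import numpy as np

def map_f0_trajectory(log_post, f_grid, proc_var):
    """Step C11 sketch: MAP F0 trajectory by Viterbi decoding on a finite
    grid. log_post has shape (n_frames, n_bins); proc_var[n] is the
    process variance for the transition into frame n."""
    n_frames, n_bins = log_post.shape
    delta = log_post[0].copy()
    back = np.zeros((n_frames, n_bins), dtype=int)
    d = f_grid[None, :] - f_grid[:, None]          # (from, to) frequency steps
    for t in range(1, n_frames):
        # Gaussian random-walk log transition density for every grid pair
        log_trans = -0.5 * (d ** 2 / proc_var[t] + np.log(2 * np.pi * proc_var[t]))
        scores = delta[:, None] + log_trans
        back[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_post[t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return f_grid[path]

# Usage sketch: sharp per-frame posteriors moving slowly across the grid.
f_grid = np.linspace(50.0, 200.0, 16)
target = np.array([100.0, 100.0, 110.0, 110.0, 120.0])
log_post = -0.5 * (f_grid[None, :] - target[:, None]) ** 2
traj = map_f0_trajectory(log_post, f_grid, np.full(5, 25.0))
```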
The present invention involves a training stage (stage T) for determining the GMM parameters for converting speech features into the conditional distribution of F0. While it is well-known that the GMM parameters can be trained, using the Expectation-Maximization (EM) algorithm, from speech data with known F0, the resulting model could have a bias toward the particular training data set and the accuracy is not guaranteed on different speakers and noise levels. To solve this problem, stage T in the present method also applies EM training of the GMM, but on speech signals generated from a Monte-Carlo simulation, assuming a signal model with parameters following high-entropy distributions. In the following, stage T (Fig. 5) in the example implementation is described in detail from step T01 to step T11.
Step T01. Draw random samples of signal model parameters from their respective prior distributions. The following parameters are considered in this example.
Fundamental frequency (f_0), following a uniform distribution from 40Hz to 1000Hz;
Amplitude of the k-th harmonic (a_k), following a log-uniform distribution from 1/3 to 3, expressed as a ratio to the amplitude of the fundamental (a_1 = 1);
Initial phase of the k-th harmonic (p_k), following a uniform distribution from 0 to 2π;
Amplitude of the additive noise (a_g), following a log-uniform distribution from −50dB to 50dB, as a ratio to the amplitude of the fundamental.
Step T02. Generate a synthetic speech signal s[n] from the sampled parameters. In this example a harmonic plus noise model of the following form is used.

s[n] = Σ_k a_k sin(2π k f_0 n / f_s + p_k) + a_g g[n]

where g[n] is a Gaussian white noise sequence with zero mean and standard deviation equal to 1; the noise sequence is uncorrelated with the sinusoidal component.
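Steps T01 and T02 can be sketched together as below. The number of harmonics and the signal duration are our own choices (the text does not specify them); the priors follow step T01.

```python
import numpy as np

def sample_synthetic_speech(rng, fs=8000.0, duration=0.5, n_harmonics=8):
    """T01-T02 sketch: draw signal-model parameters from high-entropy
    priors and synthesize a harmonic-plus-noise signal. Returns the
    signal and its true f0 (the training label for stage T)."""
    f0 = rng.uniform(40.0, 1000.0)                       # uniform F0 prior
    n = np.arange(int(duration * fs))
    s = np.zeros(len(n))
    for k in range(1, n_harmonics + 1):
        if k * f0 >= fs / 2:                             # keep below Nyquist
            break
        # a_1 = 1; higher harmonics log-uniform in [1/3, 3] relative to it
        a_k = 1.0 if k == 1 else np.exp(rng.uniform(np.log(1 / 3), np.log(3.0)))
        p_k = rng.uniform(0.0, 2 * np.pi)                # uniform phase prior
        s += a_k * np.sin(2 * np.pi * k * f0 * n / fs + p_k)
    a_g = 10 ** (rng.uniform(-50.0, 50.0) / 20.0)        # log-uniform noise amp
    return s + a_g * rng.standard_normal(len(n)), f0

rng = np.random.default_rng(1)
s, f0 = sample_synthetic_speech(rng)
```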
The harmonic plus noise signal model used in this example implementation is able to represent a wide variety of speech and non-speech (for example, monophonic musical instrument) signals.
Step T03. For each group (i.e. "band" as in this example implementation), extract features from the synthetic signal generated in step T02, in the identical way as in stage A. Store the set of features and the actual F0 of the synthetic signal, in the format coherent with the
GMM modeling the joint distribution of speech features and FO. The features may be repeatedly extracted from different positions in the synthetic signal.
Repeat step T01 through step T03 for around 1000 iterations or more, until enough data is collected for GMM training.
Step T11. For each group of features, apply the EM algorithm to train a GMM modeling the joint distribution of features and F0, on the data collected from step T01 through step T03. A typical configuration is to use GMMs with 16 mixture components and full covariance matrices.
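Step T11's EM fit can be sketched with an off-the-shelf implementation; the synthetic rows below merely stand in for the [features, F0] data collected in steps T01 through T03 for one band, and are not the actual feature distributions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in training rows of the form [SNR, half-band SNR, inst. freq, f0];
# in the real stage T these come from the Monte-Carlo feature collection.
rng = np.random.default_rng(0)
f0 = rng.uniform(40.0, 1000.0, size=2000)
features = np.column_stack([
    20 * np.log10(1 + rng.random(2000)),   # stand-in local SNR (dB)
    20 * np.log10(1 + rng.random(2000)),   # stand-in half-band SNR (dB)
    f0 + rng.normal(0, 5, size=2000),      # noisy instantaneous frequency
])
data = np.column_stack([features, f0])     # joint [features, f0] rows

# The typical configuration stated in the text: 16 components, full
# covariance, fitted by EM.
gmm = GaussianMixture(n_components=16, covariance_type="full",
                      max_iter=200, random_state=0).fit(data)
```

The fitted `means_` and `covariances_` are exactly the μ_lk and Σ_lk (with their block structure) consumed by step A21.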
Claims
1. A probabilistic method for estimating the fundamental frequency of a digital signal. The method comprises at least the following steps,
a. from the input signal, extract one or a plurality of groups of features at each time instant; each feature group may contain one or a plurality of features;
b. for each group of features, convert the associated Gaussian Mixture Model (GMM) modeling the joint distribution of features and the fundamental frequency into a GMM representing the conditional distribution of fundamental frequency given the group of features;
c. compute the posterior likelihood function of fundamental frequency by combining the conditional distributions estimated in step b;
d. estimate the fundamental frequency from its posterior likelihood function.
2. A method for detection of the presence of sinusoidal component in a digital signal, given the posterior likelihood function of fundamental frequency at one or a plurality of time instants. The method involves computing the entropy of the posterior distribution of fundamental frequency from the normalized posterior likelihood function at each time instant. Decision about the presence of sinusoidal component is made by at least one of the following methods,
a. thresholding the entropy of the posterior distribution of fundamental frequency;
b. converting the entropy into a probability value indicating the presence of sinusoidal component and applying Viterbi decoding to the sequence of probabilities.
3. A method for adaptively tracking the time-varying fundamental frequency of a digital signal, given the posterior likelihood function of fundamental frequency at each time instant. The method comprises at least the following steps,
a. from the input signal, estimate the posterior mean and variance of fundamental
frequency at each time instant;
b. compute the moving variance of the posterior mean;
c. determine the time-varying process variance of the fundamental frequency by scaling the moving variance of the posterior mean and finding the scaling factor that maximizes the total likelihood of Bayesian filtering on the fundamental frequency traj ectory;
d. apply Bayesian inference on the fundamental frequency trajectory using the process variance estimated in step c and the posterior likelihood function.
4. A Monte-Carlo method for estimating GMM parameters for use in a probabilistic fundamental frequency estimator. The method comprises at least the following steps,
a. draw samples of signals from a signal model and prior distributions of model parameters;
b. from each sampled signal, extract one or a plurality of groups of features at each time instant; each feature group may contain one or a plurality of features;
c. for each group of features, train a GMM modeling the joint distribution of features and fundamental frequency, based on the samples generated in step a and step b.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2017/050352 WO2018138543A1 (en) | 2017-01-24 | 2017-01-24 | Probabilistic method for fundamental frequency estimation |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018138543A1 true WO2018138543A1 (en) | 2018-08-02 |
Family
ID=62978372
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2017/050352 WO2018138543A1 (en) | 2017-01-24 | 2017-01-24 | Probabilistic method for fundamental frequency estimation |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2018138543A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060080088A1 (en) * | 2004-10-12 | 2006-04-13 | Samsung Electronics Co., Ltd. | Method and apparatus for estimating pitch of signal |
US7643988B2 (en) * | 2003-03-27 | 2010-01-05 | France Telecom | Method for analyzing fundamental frequency information and voice conversion method and system implementing said analysis method |
US20100049522A1 (en) * | 2008-08-25 | 2010-02-25 | Kabushiki Kaisha Toshiba | Voice conversion apparatus and method and speech synthesis apparatus and method |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110855246A (en) * | 2019-11-08 | 2020-02-28 | 成都天奥测控技术有限公司 | Method for generating Gaussian white noise with arbitrary variance |
CN110855246B (en) * | 2019-11-08 | 2023-04-11 | 成都天奥测控技术有限公司 | Method for generating Gaussian white noise with arbitrary variance |
CN113205827A (en) * | 2021-05-05 | 2021-08-03 | 张茜 | High-precision extraction method and device for baby voice fundamental frequency and computer equipment |
CN113205827B (en) * | 2021-05-05 | 2022-02-15 | 张茜 | High-precision extraction method and device for baby voice fundamental frequency and computer equipment |
KR102292091B1 (en) * | 2021-06-02 | 2021-08-20 | 국방과학연구소 | Sparse Frequency Analysis method for Passive SONAR System and System thereof |
CN114067784A (en) * | 2021-11-24 | 2022-02-18 | 云知声智能科技股份有限公司 | Training method and device of fundamental frequency extraction model and fundamental frequency extraction method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ghahremani et al. | A pitch extraction algorithm tuned for automatic speech recognition | |
US10510363B2 (en) | Pitch detection algorithm based on PWVT | |
JP5411936B2 (en) | Speech signal section estimation apparatus, speech signal section estimation method, program thereof, and recording medium | |
EP3070714B1 (en) | Noise variance estimation for speech enhancement | |
Un et al. | A pitch extraction algorithm based on LPC inverse filtering and AMDF | |
KR100192854B1 (en) | Method for spectral estimation to improve noise robustness for speech recognition | |
US8831942B1 (en) | System and method for pitch based gender identification with suspicious speaker detection | |
Turner et al. | Time-frequency analysis as probabilistic inference | |
EP2151822A1 (en) | Apparatus and method for processing and audio signal for speech enhancement using a feature extraction | |
WO2018138543A1 (en) | Probabilistic method for fundamental frequency estimation | |
US20040158462A1 (en) | Pitch candidate selection method for multi-channel pitch detectors | |
Khanagha et al. | Phonetic segmentation of speech signal using local singularity analysis | |
KR20070007684A (en) | Pitch information extracting method of audio signal using morphology and the apparatus therefor | |
Mellahi et al. | LPC-based formant enhancement method in Kalman filtering for speech enhancement | |
Abrol et al. | Voiced/nonvoiced detection in compressively sensed speech signals | |
JP4496378B2 (en) | Restoration method of target speech based on speech segment detection under stationary noise | |
Korkmaz et al. | Unsupervised and supervised VAD systems using combination of time and frequency domain features | |
CN104036785A (en) | Speech signal processing method, speech signal processing device and speech signal analyzing system | |
JPWO2007094463A1 (en) | Signal distortion removing apparatus, method, program, and recording medium recording the program | |
Hendriks et al. | An MMSE estimator for speech enhancement under a combined stochastic–deterministic speech model | |
Kumar et al. | A new pitch detection scheme based on ACF and AMDF | |
JP6724290B2 (en) | Sound processing device, sound processing method, and program | |
Messaoud et al. | Using multi-scale product spectrum for single and multi-pitch estimation | |
Rao et al. | PSFM—a probabilistic source filter model for noise robust glottal closure instant detection | |
Auvinen et al. | Automatic glottal inverse filtering with the Markov chain Monte Carlo method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 17894011 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 17894011 Country of ref document: EP Kind code of ref document: A1 |