WO2018138543A1

WO2018138543A1 - Probabilistic method for fundamental frequency estimation

Info

Publication number: WO2018138543A1
Application number: PCT/IB2017/050352
Authority: WO
Inventors: Kanru HUA
Original assignee: Hua Kanru
Priority date: 2017-01-24
Filing date: 2017-01-24
Publication date: 2018-08-02

Abstract

A probabilistic method for estimating the fundamental frequency (F0) of a digital signal, the method comprises: estimating the posterior distribution of the F0; determining the voicing status at each frame; adaptively estimating the most likely F0 trajectory; and training the probabilistic model from simulated data. The method uses GMM to model the relation between F0 and features extracted from speech, with model parameters trained from synthetic signals generated from Monte-Carlo simulation, the method is robust against additive noise and can be easily adapted to model a wide variety of speech and non-speech signals.

Description

Probabilistic Method for Fundamental Frequency Estimation

FIELD OF THE INVENTION This invention relates to a probabilistic method for the estimation of fundamental frequency and detection of the presence of sinusoidal component in digital signals. The method is primarily designed for speech signals but is also applicable to other periodic digital signals.

BACKGROUND OF THE INVENTION The present invention primarily concerns about the estimation of fundamental frequency in clean and noisy speech signals. In recent years there is a rising trend of formulating the fundamental frequency (F0) estimation problem in Bayesian framework and the

convenience of expressing noise in probabilistic notations in general leads to significant improvement in noise-robustness.

A class of Bayesian F0 estimation approaches assume a time-domain signal model with additive Gaussian noise. A likelihood function of observing the input signal can then be defined and converted to a posterior distribution of F0. For example, Nielsen et al (J. K. Nielsen, M. G. Christensen, and S. H. Jensen, "An Approximate Bayesian Fundamental Frequency Estimator, " in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE Press., 2012.) used a harmonic plus noise model to model windowed speech; F0 and model order are j ointly determined with prior distribution set under maximum-entropy principle. More recently, Hajimolahoseini et al. (H.

Hajimolahoseini, R. Amirfattahi, S. Gazor, and H. Soltanian-Zadeh, "Robust estimation and tracking of pitch period using an efficient Bayesian filter, " IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1219-1229, Jul. 2016.) used a simple stochastic time-delay model with amplitude decay and period length controlled by another stochastic system to track the pitch period; inference is carried out using particle filtering. Despite the flexibility in model configuration, an issue commonly associated with the signal modeling approach is the validity of the assumed signal model: a simple model may not well-represent the speech generation mechanism, while a complicated model requires exotic approximation and inference techniques, or otherwise being computationally intractable.

Another category of probabilistic approaches to F0 estimation do not directly model the speech signal but rather attempt to estimate F0 from features computed from the speech. Garner et al. (P. N. Garner, M. Cernak, and P. Motlicek, "A simple continuous pitch estimation algorithm, " IEEE Signal Processing Letters, vol. 20, no. 1, pp. 102-105, Jan. 2013.) proposed a simple F0 tracker based on Kalman filter, where the measurement distribution is determined from signal autocorrelation, which serves as an indication of both period length and uncertainty of the measurement. Tsanas et al. (A. Tsanas, M. Zanartu, M. A Little, C. Fox, L. O. Ramig, and G. D. Clifford, "Robust fundamental frequency estimation in sustained vowels: Detailed algorithmic comparisons and information fusion with adaptive Kalman filtering, " The Journal of the Acoustical Society of America, vol. 135, no. 5, pp. 2885-2901, May 2014.) considered combining numerous F0 estimation algorithms also using a Kalman filter; in their approach the measurement noise variance is converted from a high level feature of algorithmic robustness, for each contributing F0 estimator. However, the conversion from speech feature to probability distribution often relies on empirically-designed heuristics which are not guaranteed to faithfully represent the statistics, and sometimes involves lots of manually-specified parameters, which may overfit to the speech dataset used for parameter tunning.

The method disclosed in this patent belongs to the second category of F0 estimation methods. The novelty of the present method includes (1) use of GMM for modeling the relation between FO and features extracted from speech; (2) estimation of GMM parameters by Monte-Carlo simulation; (3) an adaptive Bayesian filtering method for smoothing the estimated FO trajectory, with very few hyper-parameters introduced; (4) detection of the presence of sinusoidal component, or voicing detection in the context of speech analysis, based on entropy of the posterior distribution of FO. The present invention is mostly free from the use of heuristics and rule-based decisions, and is very robust against additive noise.

SUMMARY OF THE INVENTION This patent discloses a multi-stage probabilistic method for fundamental frequency (FO) estimation and detection of the presence of sinusoidal component in digital signals. The present method involves the following stages (Fig. 1).

Stage A. Compute the posterior likelihood function L (f₀) of FO from an input digital signal at each time instant (or equivalently, frame); the posterior likelihood function is proportional to the posterior probability of FO given the input around the time instant of interest;

Stage B. Determine the presence of sinusoidal component at each time instant based on the entropy of posterior likelihood function computed in Stage A; in the context of speech processing, this relates to the problem of voicing detection;

Stage C. Adaptively tracking the time-varying FO based on the posterior likelihood functions computed in Stage A; the tracking is based on adaptive Bayesian tracking with its purpose being reducing spurious estimation errors.

Concretely, stage A involves three maj or steps, summarized as the following.

First step of stage A. One or a plurality groups of features are extracted from the input signal. Each group may contain one or a plurality of features. For analysis of time-varying FO, such feature extraction is performed for each analysis time instant. The feature, for example, can be the instantaneous frequency at a certain frequency or the signal-to-noise ratio (SNR) at a certain frequency. It is also possible to use time domain features such as the peak position of autocorrelation function.

Second step of stage A. Compute p_k ( f ₀|xi_; x_2,— ^XN ) _> the conditional probability of FO given the extracted features in each group (indexed by k) of features. For each group of features, a Gaussian Mixture Model (GMM) modeling the joint distribution of FO and the related features is assumed to be given in advance. The GMM is then converted to a conditional GMM of lower dimension, yielding p_k {

x_2, ···■¾) ·

Third step of stage A. Combine the conditional distribution of FO computed from each group of features into a posterior likelihood function.

The present method involves an additional training stage (stage T), based on Monte-Carlo simulation, for estimating the parameters for the GMM corresponding to each group of features. Stage T involves three major steps, summarized as the following.

First step of stage T. Draw samples of signals from a signal model and prior distributions of model parameters.

Second step of stage T. From each signal sample, extract one or a plurality of groups of features at each time instant, being consistent with the groups of features in stage A.

Third step of stage T. For each group of features, train a GMM modeling the j oint distribution of features and fundamental frequency, based on the samples and FO generated in the previous steps.

Details about each stage of the present method are presented in the rest of this description. The present method is flexible in that it can be trivially extended to support digital signals of any sampling rate, any kind of signal model and prior distribution, any kind of feature and any combination of features that can be deterministically computed from the input signal. Thus the choice of sampling rate, signal model, prior distributions and features in this description is for illustrative purposes only and does not limit the scope of this patent in regard to the choice of sampling rate, signal model, prior distributions, features and combination of features.

BRIEF DESCRIPTION OF DRAWINGS Fig. 1 is a flow chart outlining the connection between maj or stages involved in the present invention.

Fig. 2 is a diagram outlining the data flow between the steps in stage A of an example implementation of the present invention.

Fig. 3 is a flow chart outlining the steps in stage B of an example implementation of the present invention.

Fig. 4 is a flow chart outlining the steps in stage C of an example implementation of the present invention.

Fig. 5 is a flow chart outlining the steps in stage T of an example implementation of the present invention.

Fig. 6 is the 2D visualization of the local SNR estimated from a speech sample, in stage A of an example implementation.

Fig. 7 is the 2D visualization of the log conditional probability density of FO estimated from a speech sample, at the 20^th band, in stage A of an example implementation.

Fig. 8 is the 2D visualization of the normalized posterior likelihood function of FO estimated from a speech sample, in stage A of an example implementation.

Fig. 9 is the plot of the entropy difference, overlaid with the posterior voicing probability, estimated from a speech sample, in stage B of an example implementation.

Fig. 10 is the plot comparing the moving standard deviation of expected FO with the estimated standard deviation of process noise, with the upper part being the 2D visualization of the normalized posterior likelihood function of FO estimated from a speech sample, in stage C of an example implementation.

Fig. 11 is the plot of the maximum a posteriori estimated FO traj ectory overlaid on the 2D visualization of the normalized posterior likelihood function, estimated from a speech sample, in stage C of an example implementation.

DETAILED DESCRIPTION OF THE INVENTION The following is the detailed description of an example implementation of each stage in the present invention, for estimating F0 in speech signal. The relation between A, B, C and T stages has been outlined in the summary and visualized in the flow chart in Fig. 1.

The example implementation uses a multi-band feature extractor with the band central frequency defined according to an array of 36 frequency values spanning from 40Hz to 1000Hz, with uniform spacing on the logarithmic scale.

Stage A (Fig. 2), computation of the posterior likelihood function L ( ₀) in the example implementation consists of the following detailed steps.

Step A01. Resample the input signal to a new sampling rate of 8000Hz. The new signal is denoted as x [ n] .

In the following steps (step All through step A15) instantaneous frequency and time- frequency local SNR features are extracted from the resampled signal. The feature extraction procedure in this example implementation is mostly identical as the recent work by

Kawahara et al. (H. Kawahara, Y. Agiomyrgiannakis, and H. Zen, ^"Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis, " in 9th ISCA Workshop on Speech Synthesis, 2016.)

Step All. For each band with central frequency a>_ck , bandpass filter the resampled signal using a complex filter obtained by frequency-modulating a normalized Nuttall window.

4

wj m] =∑ a_;cos (θ.25 ( /— l) )_ckm ) , V \m\< 4 π/ω^

ί = 1 vi_ckm)

n

y_k[n] = x[n]*h_k[m]

where "*" is the symbol for convolution, a = [0.338946, 0.481973, 0.161054, 0.018027]. Step A12. Estimate the time-frequency local SNR a_k[n] by applying the bandpass filter again on the normalized y_k[n] signal, followed by renormalization, subtraction and smoothing. Fig.6 shows an example of local SNR estimated from a speech sample.

Step A 13. For each band with central frequency a>_ck , bandpass filter the resampled

x[n] using the time derivative of the complex filter used in step A02.

4

w [m] = -0.25_{Wc;( s}X(/-l)a_;sin(0.25(/-l)_Wc;(m)_>V|m|<4n/_Wc;(

h [m]={w'_k[m]+jw_k[m] _ckf_s)exp{j _ckm)

y'_k[n] = x[n]*h'_k[m]

where f_s is sampling rate.

Step A14. Estimate the time-frequency local instantaneous frequency (¾ [n] using

Step A15. In practice, it may not be necessary to extract features on all samples as the signal is usually quasi-stationary in short period of time. The SNR and instantaneous frequency features are downsampled by a factor of T; the SNR is converted to decibel scale. Subsequently, from step A21 to step A22, the conditional distribution of F0 given extracted features in each band, at each time instant is determined.

For each band, a Gaussian Mixture Model (GMM) modeling the joint distribution of F0 and associated features is given. It is assumed that the GMM parameters have been estimated in stage T and are stored for the use in stage A.

For the k-th band, The random variables modeled by the GMM, aside from F0, include the following features:

local SNR in dB ( a_{k n} );

local SNR taken at the band with central frequency close to half the central frequency of the k-th band (in this example implementation which uses 36 bands spanning from 40Hz to 1000Hz, a '_ktn=a_k-_7tn , ' for k <= 7, we define that a '_{k n}=a_{k n} );

instantaneous frequency ( (¾ _n ).

For the k-th band, the GMM is defined as,

where x = [a_{k n},a _:„,vi_{k :}„,f₀f ; Λ/(·|·,·) is the probability density function of a normal distribution; w_lk is the weight of the 1-th mixture component; μ_¾ is the 1-th mean vector; ∑_lk is the 1-th covariance matrix in the following block matrix form, ∑11 ∑12

^ ∑21 ^∑22

where∑_u has dimension 3x3; ∑₁₂ has dimension 3x1; ∑₂₁=∑₂ ^" ₁ ; ∑₂2 is a scalar. For notational simplicity, the subscript 1 and k in their respective representation are dropped. Step A21. Convert the GMM modeling the joint distribution of FO and associated features to a GMM of single dimension modeling the conditional distribution of FO given the features.

, n

! !f<, n ⁼ ! 2 ⁺ ¾l∑ll ( \- ^ak, n > ^a k . n ' ^h. n ]

∑ !f<, n ⁼∑22^— ¾1∑11∑12

Wft A/ ( [ a_k„, a '_k w_k„]^Γ|μι.^∑ιι)

w

∑ ^^Λ/ ([α„_ιΠ , α '„ _π, ω„ _π]^Γ|μ₁, Σ₁₁)

m

where μ₁ is the 3x1 sub-vector in μ_¾ ; μ₂ is the last element in μ_¾ .

Step A22. Evaluate the conditional probability density of FO specified by the conditional GMM on a finite grid. The purpose is to facilitate various operations, for example integration and entropy computation, in later stages of the present method. In some cases care has to be taken to prevent the standard deviation of any mixture component to be smaller than the interval of the finite grid for approximated inference, by setting a flooring limit to the conditional variances ∑'_{lt ll} . For notational simplicity, the rest of this description uses continuous representation of FO and associated probabilities and functions. Fig. 7 shows an example of the log conditional probability of FO, at the 20^th band, estimated from a speech sample.

The final steps of stage A (step A31 and step A33) combine the conditional probabilities of FO from all bands into a posterior likelihood function, from which various useful statistical properties can be computed in stage B and stage C.

Step A31. Compute the biased posterior likelihood function L '„( ₀) by multiplying the conditional probabilities of FO from all bands, based on independence assumption and uniform prior assumption.

Optionally, take the result to the power of l/γ to control the kurtosis (i.e. sharpness) of the resulting distribution. When γ = K, the above equation is identical to computing the geometric mean of conditional probabilities of FO from all bands.

The result L '„( ₀) could be biased toward a certain range of frequencies due to the prior distribution of FO, and/or the non-linear arrangement of band central frequencies.

Step A32. Divide the biased posterior likelihood function by the mean posterior likelihood estimated by averaging multiple estimates from a white noise signal to remove the bias.

Step A33. Normalize the unbiased posterior likelihood function so that the probability density integrates to 1. Fig. 8 shows an example of normalized posterior likelihood function estimated from a speech sample.

Stage B (Fig. 3) in the example implementation, detecting the presence of sinusoidal component given the posterior likelihood function obtained in stage A, consists of the following steps.

Step BOl. Compute the entropy H_n of the normalized posterior likelihood function for each frame. The entropy is an indication of the uncertainty about FO at each frame, from which the probability of presence of sinusoidal component can be derived.

f_

H = - f L_n{ f₀ ) log { L_n { f₀ )) df₀

Step B02. Compute the difference between the entropy H_n and its upperbound.

A H_n= H_max- H_n

1

where H_mnx= -log

/ max I min

Step B03. Convert the entropy H_N into a posterior probability P_n (V \H_N) indicating the voicing status (i.e. presence of sinusoidal component). The entropy measured from unvoiced frames (i.e. frames at where the sinusoidal component is absent) is modeled using an exponential distribution with parameter λ set according to the standard deviation of entropy of FO measured from white noise signal.

P (V \H ) =

α +λ e

where is a user-specified parameter controlling the sensitivity of sinusoidal component detection. Fig. 9 shows a plot of the entropy difference, overlaid with the posterior voicing probability P_n ( V \H_n) , estimated from a speech sample.

Step B04. Estimate a binary sequence indicating the status of presence of sinusoidal

where p_t is the user-specified parameter of transition probability.

Alternatively, step B03 and step B04 can be replaced by a simple linear decision on the entropy H_n . For example, the n-th frame is labeled as voiced (i.e. with some sinusoidal component being present) if Δ Η_η>Η_θ where Η_θ is a user-specified thresholding parameter.

Stage C (Fig. 4) in the example implementation, adaptive Bayesian tracking of the time- varying FO based on the posterior likelihood functions computed in Stage A, consists of the following steps.

From step COl to step C03, the time-varying process variance of FO is estimated. In practice the posterior likelihood function is usually unimodal and thus it is statistically meaningful to assume normal distribution in early stages of adaptive tracking. In such case, a Kalman filter with time-varying process and measurement noise distributions can be used to efficiently decode the FO traj ectory and also to evaluate the total likelihood of the traj ectory. When assuming that the process variance is proportional to the moving variance of the expected FO, the scaling factor can be determined under maximum-likelihood principle.

Step COl. Compute the expectation and variance of FO from the normalized posterior likelihood function.

Step C02. Compute the moving variance of _n , but only for those frames whose voicing status (i.e. the status of presence of sinusoidal component) is q„= l ; otherwise the moving variance is set to a reasonably large constant. , for n : q_n= l

where M is the order of moving variance, which is empirically found to be around 10.

Step C03. Find out the best scaling factor to the moving variance so that the scaled variance, when used as the process variance, maximizes the total likelihood of Kalman filtering. For such small search space with one free variable, a finite grid based search is possible:

(1) generate an array of candidate scaling factors β^ logarithmically distributed on the order from 0.1 to 10, for example β= [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4] ; in practice the actual candidate array could be of a larger size;

(2) for each candidate scaling factor, multiply it with the moving variance o² _{f n} and run Kalman filtering with process mean set to zero, process variance set to β_¾σ^ „ ,

measurement mean and variance set to _n and , respectively;

(3) find the best β^ such that the total likelihood of Kalman filtering is maximized.

Fig. 10 shows a plot comparing the moving standard deviation of expected F0 with the estimated standard deviation of process noise, with the upper part being the 2D visualization of the normalized posterior likelihood function of F0 estimated from a speech sample.

Step Cll. Estimate the maximum a posteriori (MAP) F0 traj ectory given the normalized posterior likelihood functions L_n (f₀) and the process variance fi_ko² _n determined in step

C03. The MAP inference can be done using either a finite grid approximation (decoded using Viterbi algorithm) or particle filtering. Fig. 11 shows a plot of the maximum a posteriori estimated F0 trajectory overlaid on the 2D visualization of the normalized posterior likelihood function, estimated from a speech sample.

The present invention involves a training stage (stage T) for determining the GMM parameters for converting speech features into the conditional distribution of F0. While it is well-known that the GMM parameters can be trained, using Expectation-Maximization (EM) algorithm, from speech data with known F0, the resulting model could have a bias toward the particular training data set and the accuracy is not guaranteed on different speakers and noise levels. To solve this problem, stage T in the present method also applies EM training of GMM, but on speech signals generated from a Monte-Carlo simulation, assuming a signal model with parameters following high-entropy distributions. In the following, stage T (Fig. 5) in the example implementation is described in detail from step T01 to step Til.

Step T01. Draw random samples of signal model parameters from their respective prior distributions. The following parameters are considered in this example.

Fundamental frequency ( f ₀ ), following uniform distribution from 40Hz to 1000Hz;

Amplitude of the k-th harmonic ( a_k ), following log-uniform distribution from 1/3 to 3, expressed as a ratio to the amplitude of the fundamental ( a_x = 1 );

Initial phase of the k-th harmonic ( p_k ), following uniform distribution from 0 to 2π;

Amplitude of the additive noise ( a_g ), following log-uniform distribution from -50dB to 50dB, as a ratio to the amplitude of the fundamental.

Step T02. Generate a synthetic speech signal s [ n] from the sampled parameters. In this example a harmonic plus noise model of the following form is used.

where g [n ] is a Gaussian white noise sequence with zero mean and standard deviation equal to 1; the noise sequence is uncorrelated with the sinusoidal component.

The harmonic plus noise signal model used in this example implementation is able to represent a wide variety of speech and non-speech (for example, monophonic musical instrument) signals.

Step T03. For each group (i.e. "band" as in this example implementation), extract features from the synthetic signal generated in step T02, in the identical way as in stage A. Store the set of features and the actual F0 of the synthetic signal, in the format coherent with the GMM modeling the joint distribution of speech features and FO. The features may be repeatedly extracted from different positions in the synthetic signal.

Repeat step TOl through step T03 for around 1000 iterations or more, until enough data is collected for GMM training.

Step Til. For each group of features, apply EM algorithm to train a GMM modeling the joint distribution of features and FO, on the data collected from step TOl through step T03. A typical configuration is to use GMMs with 16 mixture components and full covariance matrix.

Claims

1. A probabilistic method for estimating the fundamental frequency of a digital signal. The method comprises of at least the following steps,

a. from the input signal, extract one or a plurality of groups of features at each time instant; each feature group may contain one or a plurality of features;

b. for each group of features, convert the associated Gaussian Mixture Model (GMM) modeling the joint distribution of features and the fundamental frequency into a GMM representing the conditional distribution of fundamental frequency given the group of features;

c. compute the posterior likelihood function of fundamental frequency by combining the conditional distributions estimated in step b;

d. estimate the fundamental frequency from its posterior likelihood function.

2. A method for detection of the presence of sinusoidal component in a digital signal, given the posterior likelihood function of fundamental frequency at one or a plurality of time instants. The method involves computing the entropy of the posterior distribution of fundamental frequency from the normalized posterior likelihood function at each time instant. Decision about the presence of sinusoidal component is made by at least one of the following methods,

a. thresholding the entropy of the posterior distribution of fundamental frequency;

b. converting the entropy into a probability value indicating the presence of sinusoidal component and applying Viterbi decoding to the sequence of probabilities.

3. A method for adaptively tracking the time-varying fundamental frequency of a digital signal, given the posterior likelihood function of fundamental frequency at each time instant. The method comprises of at least the following steps,

a. from the input signal, estimate the posterior mean and variance of fundamental

frequency at each time instant;

b. compute the moving variance of the posterior mean;

c. determine the time-varying process variance of the fundamental frequency by scaling the moving variance of the posterior mean and finding the scaling factor that maximizes the total likelihood of Bayesian filtering on the fundamental frequency traj ectory;

d. apply Bayesian inference on the fundamental frequency trajectory using the process variance estimated in step c and the posterior likelihood function.

4. A Monte-Carlo method for estimating GMM parameters for the use in a probabilistic

fundamental frequency estimator. The method comprises of at least the following steps, a. draw samples of signals from a signal model and prior distributions of model

parameters;

b. from each sampled signal, extract one or a plurality of groups of features at each time instant; each feature group may contain one or a plurality of features;

c. for each group of features, train a GMM modeling the joint distribution of features and fundamental frequency, based on the samples generated in step a and step b.