CN101853661A

CN101853661A - Noise spectrum estimation and voice activity detection method based on unsupervised learning

Info

Publication number: CN101853661A
Application number: CN201010178166A
Authority: CN
Inventors: 应冬文; 颜永红; 付强; 潘接林
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2010-05-14
Filing date: 2010-05-14
Publication date: 2010-10-06
Anticipated expiration: 2030-05-14
Also published as: CN101853661B

Abstract

The present invention relates to a noise power spectrum estimation and voice activity detection method based on unsupervised learning, comprising the following steps: 1) for the log-amplitude feature of the speech signal at each frequency bin, a GMM is established; 2) for a segment of speech data, an M-frame buffer is set, the first M frames of the input signal are stored in the buffer, the log-amplitude spectra of the M buffered frames are extracted, and the GMM of step 1) is initialized with them to obtain the initialized model λ_{0, k}; 3) after the initialized model λ_{0, k} is obtained, starting from frame M+1, the GMM is updated frame by frame by incremental learning, recursively yielding λ_{i, k} (i = 1, 2, 3, ...), from which the noise estimate μ_{i, k}^{(0)} and the probability of speech presence at the k-th frequency bin of the i-th frame are obtained. The invention is a tightly coupled solution to noise power spectrum estimation and voice activity detection, and can improve the adaptability of speech application systems to noise environments. The invention does not rely on the "initial noise" assumption, and it can also describe voice activity in the two-dimensional time-frequency space.

Description

Noise spectrum estimation and voice activity detection method based on unsupervised learning

Technical field

The present invention relates to the field of speech processing technology, and in particular to a noise power spectrum estimation and voice activity detection method based on unsupervised learning. Voice activity detection is an algorithm that determines whether speech is present along the time dimension; it can report presence as a "yes" or "no" decision, or describe it with a speech presence probability.

Background technology

Most speech application systems have to cope with interference from ambient noise. Many methods have been proposed to remove the effect of noise on speech systems, and almost all of them depend on voice activity detection and noise power spectrum estimation. These two modules are closely related, and their accuracy directly affects the overall noise robustness of the system. Traditional solutions have the following problems:

1. In typical noise-robust algorithms, voice activity detection and noise power spectrum estimation are loosely coupled in cascade: voice activity is computed first, and the noise power spectrum is then estimated from it. The sensitivity of the voice activity detector to the speech signal therefore directly affects the accuracy of the noise power spectrum estimate. If the detector is too sensitive, the noise power spectrum tends to be underestimated; if it is too insensitive, the noise power spectrum tends to be overestimated. Traditional schemes therefore often need to tune the detector sensitivity to the noise environment, which limits the system's adaptability to noise environments.

2. Traditional solutions are based on semi-supervised learning. In the initial stage, the system generally has to make the "initial noise" assumption, namely that a sentence always begins with a segment of non-speech signal. This non-speech segment can be regarded as manually labeled background noise samples, from which the initial noise model is built; this is a supervised learning method. Its drawback is that the assumption is hard to satisfy in some applications: when a sentence begins with speech, for example, initialization of the noise model fails, and both speech detection and noise power spectrum estimation become inaccurate. In the subsequent stage, after the initial noise model has been built, most traditional solutions update the model with the detection and estimation results. This decision-directed learning method is a form of unsupervised learning: the outputs of the estimator/detector are fed back to update the model. However, it easily feeds incorrect results back into the model, which lowers the model's accuracy, and the degraded model in turn lowers the accuracy of estimation and detection. Errors thus accumulate over time, and system performance gradually degrades. Supervised learning in the initial stage plus unsupervised learning in the subsequent stage constitutes a semi-supervised learning process, and the problems in both stages are caused by this semi-supervised learning mode.

3. Most previous voice activity detectors describe voice activity only along the time dimension and not along the frequency dimension, so the noise cannot be handled in a finer-grained way.

Summary of the invention

To address the shortcomings of previous voice activity detectors and noise power spectrum estimators, the present invention proposes a tightly coupled solution that unifies voice activity detection and noise power spectrum estimation under one unsupervised learning framework, thereby improving the adaptability of speech application systems to noise environments. In addition, the invention does not rely on the "initial noise" assumption and is therefore more practical than traditional methods. The invention also describes voice activity in the time-frequency space, which helps handle noise in a finer-grained way.

To achieve the above object, the invention provides a noise power spectrum estimation and voice activity detection method based on unsupervised learning which, as shown in Fig. 2, comprises the following steps:

1) For the log-amplitude feature of the speech signal at each frequency bin, a GMM is established, expressed mathematically as:

p (x_{i, k} | λ_{i, k}) = w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k});

where the Gaussian components are expressed as:

p (x_{i, k} | h, λ_{i, k}) = \frac{1}{\sqrt{2 π κ_{i, k}^{(h)}}} \exp \{- \frac{(x_{i, k} - μ_{i, k}^{(h)})^{2}}{2 κ_{i, k}^{(h)}}\};

where x_{i, k} denotes the log-amplitude spectrum at the k-th frequency bin of the i-th frame; h denotes the Gaussian component, h ∈ {0, 1}; w_{i, k}^{(h)} denotes the weight coefficient of the GMM; μ_{i, k}^{(h)} and κ_{i, k}^{(h)} denote the mean and variance respectively, with h = 1 denoting the speech component and h = 0 the noise component; and λ_{i, k} denotes the parameter set of the Gaussian mixture model;

2) for one section speech data, set the M frame buffer, preceding M frame input signal is deposited in the buffer memory, extract the logarithm amplitude spectrum of M frame in the buffer memory, the GMM model of substitution step 1) carries out initialization, obtains initialized model λ _{0, k}Initialization procedure adopts constraint EM algorithm;

3) After the initialized model λ_{0, k} is obtained, starting from frame M+1, the GMM is updated frame by frame by incremental learning, recursively yielding λ_{i, k}, from which the noise estimate μ_{i, k}^{(0)} and the probability of speech presence at the k-th frequency bin of the i-th frame are derived:

p (h = 1 | x_{i, k}, λ_{i, k}) = \frac{w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})}{w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})},

where i = 1, 2, 3, ...
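The mixture density of step 1) and the speech presence probability of step 3) can be sketched for a single frequency bin as follows. This is a minimal illustration, not the patented implementation: the function names and the parameter values (log-amplitudes in dB) are made up for the example, and the Gaussian is written in its standard form with κ as the variance.

```python
import math

def gaussian_pdf(x, mean, var):
    """Standard univariate Gaussian density with variance `var`."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Two-component GMM density; index 0 = noise, index 1 = speech."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def speech_presence_probability(x, weights, means, variances):
    """Posterior p(h = 1 | x); assumes the mixture density at x is non-zero."""
    speech = weights[1] * gaussian_pdf(x, means[1], variances[1])
    return speech / mixture_pdf(x, weights, means, variances)

# Hypothetical model for one bin: noise around 20 dB, speech around 45 dB.
weights, means, variances = (0.7, 0.3), (20.0, 45.0), (9.0, 36.0)
p_low = speech_presence_probability(20.0, weights, means, variances)   # near 0
p_high = speech_presence_probability(45.0, weights, means, variances)  # near 1
```

An observation close to the noise mean gives a posterior near 0, and one close to the speech mean gives a posterior near 1; a hard "yes/no" decision could be obtained by thresholding this value, although the source does not fix a threshold.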

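The incremental learning method of the GMM comprises a recursive weight coefficient, a recursive mean and a recursive variance.

The recursive weight coefficient is:

w_{i + 1, k}^{(h)} = α w_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k});

The recursive mean is:

μ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} μ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k}}{w_{i + 1, k}^{(h)}};

or

μ_{i + 1, k}^{(h)} = α_{μ} μ_{i, k}^{(h)} + (1 - α_{μ}) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k};

The recursive variance is:

κ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} κ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2}}{w_{i + 1, k}^{(h)}};

or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2};

or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i, k}^{(h)})}^{2};

where α, α_{μ} and α_{κ} are smoothing factors.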
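Two properties of the first forms of the weight and mean recursions (as stated in claim 2) can be checked in a few lines. The shorthand p^{(h)} and the symbol β^{(h)} are introduced here for the derivation only and do not appear in the source; the check assumes the weights at frame i already sum to 1.

```latex
% Shorthand: p^{(h)} := p(h | x_{i+1,k}, \lambda_{i,k}), so p^{(0)} + p^{(1)} = 1.
\begin{align*}
w_{i+1,k}^{(0)} + w_{i+1,k}^{(1)}
  &= \alpha \bigl(w_{i,k}^{(0)} + w_{i,k}^{(1)}\bigr)
     + (1-\alpha)\bigl(p^{(0)} + p^{(1)}\bigr)
   = \alpha + (1-\alpha) = 1, \\
\mu_{i+1,k}^{(h)}
  &= \bigl(1-\beta^{(h)}\bigr)\,\mu_{i,k}^{(h)} + \beta^{(h)}\,x_{i+1,k},
  \qquad
  \beta^{(h)} := \frac{(1-\alpha)\,p^{(h)}}{w_{i+1,k}^{(h)}} \in [0,1].
\end{align*}
```

The weights therefore stay normalized without an explicit renormalization step, and each new mean is a convex combination of the old mean and the new observation, with a step size that shrinks when the component is unlikely to have produced the observation.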

Compared with the prior art, the present invention has the following technical effects:

The invention tightly couples voice activity detection and noise power spectrum estimation, and can improve the adaptability of speech application systems to noise environments. In addition, the invention does not rely on the "initial noise" assumption and is therefore more practical. The invention can also describe voice activity in the two-dimensional time-frequency space, which helps handle noise in a finer-grained way.

Description of drawings

Fig. 1 shows the time-domain waveform and spectrogram of a speech segment corrupted by noise;

Part (a) is the spectrogram of a speech segment corrupted by white noise at a signal-to-noise ratio of 0 dB; part (b) is the speech presence probability map, in which the gray level represents the probability that the speech signal is present. Comparing (a) and (b) shows that the presence probability output by the method accurately describes the structure of the spectrogram.

Fig. 2 is a flow chart of the noise power spectrum estimation and voice activity detection method based on unsupervised learning according to the present invention.

Embodiment

The present invention proposes a noise power spectrum estimation and voice activity detection method based on an unsupervised learning framework. The main characteristic of the framework is that the models of noise and speech are built in an unsupervised manner: neither model initialization nor model updating relies on manually labeled information. Specifically, it has the following characteristics:

● In the initialization stage it does not rely on the initial noise assumption, so the invention is applicable more widely than general solutions.

● In the update process no feedback information is needed, so the problem of error accumulation is alleviated to some extent.

● Voice activity information and noise power spectrum information are provided simultaneously and are tightly coupled, so the system can be tuned with only a few parameters. In a loosely coupled system, by contrast, the voice activity detection module and the noise estimation module each have their own tuning parameters; with more parameters, the system is harder to tune.

● Voice activity is described as two-dimensional "time-frequency" information, whereas other voice activity detection algorithms describe the presence of speech only along the time dimension.

In one embodiment, the unsupervised learning framework is carried by a two-component Gaussian mixture model (Gaussian Mixture Model, abbreviated GMM). One component represents the distribution of speech energy, and the other the distribution of noise energy. The invention divides the frequency band into 8 subbands according to the Mel scale, extracts the energy envelope on each subband, and establishes a corresponding GMM. The GMM is first initialized with the EM algorithm and then updated progressively by incremental learning. From the GMM, the voice activity on the subband and the power spectrum information of the noise are derived respectively.
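The Mel-scale split into 8 subbands can be sketched as below. The source does not say which Mel formula or which upper frequency is used; the sketch assumes the common mapping mel = 2595 log10(1 + f / 700) and an upper limit of 4000 Hz (8 kHz sampling), and the function names are illustrative.

```python
import math

def hz_to_mel(freq_hz):
    """Hz to Mel, using the widely used 2595 * log10(1 + f / 700) mapping (assumed)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(num_bands, f_max_hz):
    """Band edges in Hz that are equally spaced on the Mel scale from 0 to f_max_hz."""
    mel_max = hz_to_mel(f_max_hz)
    return [mel_to_hz(mel_max * b / num_bands) for b in range(num_bands + 1)]

# 8 subbands up to an assumed 4000 Hz; consecutive edges delimit one subband each.
edges = mel_band_edges(8, 4000.0)
```

Because the Mel scale compresses high frequencies, the resulting subbands are narrow at low frequencies and widen toward the upper limit.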

The present invention fits the spectral envelope of speech with a GMM subject to constraints.

During fitting, the means, weights and variances of the GMM are each constrained. Both in the EM algorithm and in the incremental learning process, it is required that

And

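The incremental learning method of the GMM specifically comprises computing the recursive weight coefficients, the recursive means and the recursive variances.

1) Recursive weight coefficient:

w_{i + 1, k}^{(h)} = α w_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k});

where α is a smoothing factor less than 1 but close to 1, for example α = 0.99.

2) Recursive mean:

μ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} μ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k}}{w_{i + 1, k}^{(h)}};

or

μ_{i + 1, k}^{(h)} = α_{μ} μ_{i, k}^{(h)} + (1 - α_{μ}) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k};

where α_{μ} is a smoothing factor less than 1 but close to 1, for example α_{μ} = 0.99.

3) Recursive variance:

κ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} κ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2}}{w_{i + 1, k}^{(h)}};

or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2};

or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i, k}^{(h)})}^{2};

where α_{κ} is a smoothing factor less than 1 but close to 1, for example α_{κ} = 0.99.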

The present invention is further described below with reference to a preferred embodiment.

The principle of the invention is as follows:

For the log-amplitude feature of the speech signal at each frequency bin, a Gaussian mixture model (GMM) is established; the model changes over time with the input signal. The model is expressed mathematically as:

p (x_{i, k} | λ_{i, k}) = w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k});

where the Gaussian components are expressed as:

p (x_{i, k} | h, λ_{i, k}) = \frac{1}{\sqrt{2 π κ_{i, k}^{(h)}}} \exp \{- \frac{(x_{i, k} - μ_{i, k}^{(h)})^{2}}{2 κ_{i, k}^{(h)}}\};

Here x_{i, k} denotes the log-amplitude spectrum at the k-th frequency bin of the i-th frame; h denotes the Gaussian component, h ∈ {0, 1}; w_{i, k}^{(h)} denotes the weight coefficient of the GMM; and μ_{i, k}^{(h)} and κ_{i, k}^{(h)} denote the mean and variance respectively. h = 1 denotes the speech component and h = 0 the noise component. λ_{i, k} denotes the parameter set of the Gaussian mixture model.

In this model, μ_{i, k}^{(0)} is the noise to be estimated. At the same time, the probability of speech presence at the k-th frequency bin of the i-th frame can be derived:

p (h = 1 | x_{i, k}, λ_{i, k}) = \frac{w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})}{w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})}

Based on the above principle, according to one embodiment of the present invention, as shown in Fig. 2, the noise power spectrum estimation and voice activity detection method comprises the following steps:

Step 100: Set an M-frame buffer, store the first M frames of the input signal in the buffer, and extract the amplitude spectra of the M buffered frames. The amplitude spectrum of a frame is extracted as follows:

First, the digitized audio signal of the frame is preprocessed (depending on the actual system, this may include windowing, pre-emphasis, etc.). Let each frame be F points long; it is first zero-padded to N points (where N ≥ F, N = 2^j, j is an integer and j ≥ 8), and an N-point discrete Fourier transform is applied to obtain the discrete spectrum

Y_{i, k} = Σ_{n = 0}^{N - 1} y_{i, n} \exp (- 2 π \sqrt{- 1} n k / N),

where y_{i, n} denotes the n-th sample of the i-th frame in the buffer and Y_{i, k} denotes the k-th Fourier transform value of the i-th frame in the buffer (k = 0, 1, ..., N-1). Its amplitude |Y_{i, k}| can then be calculated.
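Step 100 can be sketched as follows for one frame: zero-pad to N = 2^j with j ≥ 8, take the N-point DFT, and convert to the log-amplitude 20 log10 |Y| used later in the update stage. The direct O(N²) DFT is chosen only to keep the sketch dependency-free; the function names and the small floor that avoids log(0) are assumptions, and preprocessing such as windowing or pre-emphasis is omitted.

```python
import cmath
import math

def padded_length(frame_length, min_exponent=8):
    """Smallest N = 2**j with j >= min_exponent and N >= frame_length."""
    j = min_exponent
    while (1 << j) < frame_length:
        j += 1
    return 1 << j

def log_amplitude_spectrum(frame, floor=1e-10):
    """Return 20*log10|Y_k| for k = 0..N-1 of the zero-padded frame."""
    n_fft = padded_length(len(frame))
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spectrum = []
    for k in range(n_fft):
        # Direct evaluation of the DFT sum for bin k.
        y_k = sum(y * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                  for n, y in enumerate(padded))
        spectrum.append(20.0 * math.log10(max(abs(y_k), floor)))
    return spectrum

# A 200-sample cosine centred on bin 16 of a 256-point transform.
frame = [math.cos(2.0 * math.pi * 16 * n / 256) for n in range(200)]
spec = log_amplitude_spectrum(frame)
peak_bin = max(range(128), key=lambda k: spec[k])
```

In a real system the direct sum would normally be replaced by an FFT routine; the result per bin is the same.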

Step 200: Initialization of the GMM. A two-component Gaussian mixture model λ_{i, k} is initialized at each frequency bin k, where the subscript i denotes time and λ_{0, k} denotes the initialized model. Initialization uses a constrained EM algorithm; at a given frequency bin k, the initialization steps are as follows:

Step 201: By clustering (for example LBG unsupervised clustering, fuzzy clustering, etc.), the M+1 samples are divided into two classes, where M_{0} + M_{1} - 1 = M. The class with the larger mean is denoted by the superscript (1), and the other class by the superscript (0). The means of the two classes are

The mean of the class with the lower energy is

where

The variances of the two classes are, respectively:

The initial weight coefficients of the two classes are: The likelihood of the new model is then calculated, In the following iterative process, the old model parameter set is denoted λ'_{0, k}, and the new model parameters are:

Before the iteration begins,

L'_{k} is set to a negative number of very large magnitude, for example L'_{k} = -10000. The iterative computation then begins.

Step 202: Calculate the probabilities that noise and speech are present,

p (h | x_{i, k}, λ_{0, k}^{'}) = \frac{w_{0, k}^{(h)} p (x_{i, k} | h, λ_{0, k}^{'})}{Σ_{h} w_{0, k}^{(h)} p (x_{i, k} | h, λ_{0, k}^{'})}, h ∈ {0, 1};

Step 203: calculate new weight coefficient:

Step 204: If

the iteration stops and λ_{0, k} = λ'_{0, k} is set, where υ is a positive number close to 0, for example υ = 0.05.

Step 205: calculate new average:

Step 206: new average is retrained:

where δ is a constant with a value between 1 and 10.

Step 207: calculate new variance,

Step 208: new variance is retrained,

Step 209: Calculate the likelihood of the new model

Step 210: If the condition

is satisfied, the iteration terminates, where ε is a very small number, for example ε = 0.1. If

the iteration continues and jumps to Step 202.

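The initialization of Step 200 can be approximated by the sketch below: a two-class split of the buffered log-amplitudes for one bin, followed by EM iterations that stop when the log-likelihood changes by less than ε. It is a simplified stand-in, not the patented procedure: a plain 1-D two-means split replaces LBG or fuzzy clustering, the mean and weight constraints of the source are not reproduced, and a variance floor (`var_floor`, an assumption) takes the place of the variance constraint.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def two_class_init(samples, iterations=20, var_floor=1e-3):
    """Split 1-D samples into a low class (0, noise) and a high class (1, speech)."""
    centers = [min(samples), max(samples)]
    if centers[0] == centers[1]:
        raise ValueError("samples must not all be identical")
    for _ in range(iterations):
        threshold = 0.5 * (centers[0] + centers[1])
        classes = ([x for x in samples if x <= threshold],
                   [x for x in samples if x > threshold])
        centers = [sum(c) / len(c) for c in classes]
    weights = [len(c) / len(samples) for c in classes]
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), var_floor)
                 for c, m in zip(classes, centers)]
    return weights, centers, variances

def em_fit(samples, weights, means, variances, eps=0.1, max_iter=100, var_floor=1e-3):
    """Unconstrained EM for a two-component 1-D GMM."""
    weights, means, variances = list(weights), list(means), list(variances)
    prev_ll = -1.0e4  # large negative starting likelihood, as in the text
    for _ in range(max_iter):
        # E-step: posterior of each component for every sample, plus log-likelihood.
        resp, ll = [], 0.0
        for x in samples:
            lik = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(lik)
            ll += math.log(total)
            resp.append([value / total for value in lik])
        if abs(ll - prev_ll) < eps:
            break
        prev_ll = ll
        # M-step: re-estimate weight, mean and variance of each component.
        for h in range(2):
            n_h = sum(r[h] for r in resp)
            weights[h] = n_h / len(samples)
            means[h] = sum(r[h] * x for r, x in zip(resp, samples)) / n_h
            variances[h] = max(
                sum(r[h] * (x - means[h]) ** 2 for r, x in zip(resp, samples)) / n_h,
                var_floor)
    return weights, means, variances

# Made-up buffered log-amplitudes (dB) for one bin: five noise-like, three speech-like.
samples = [18.0, 19.0, 20.0, 21.0, 22.0, 43.0, 45.0, 47.0]
init_weights, init_means, init_variances = two_class_init(samples)
fit_weights, fit_means, fit_variances = em_fit(samples, init_weights, init_means, init_variances)
```

Note that the buffer may begin with either speech or noise; the split is made on amplitude alone, which is what removes the need for a noise-only lead-in.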
Step 300: Progressive update of the GMM. After the initialized model λ_{0, k} has been established, the GMM is updated frame by frame by incremental learning, starting from frame M+1. The iteration can be stated as follows: at each frequency bin k, given λ_{i, k} and the current observation x_{i + 1, k}, infer λ_{i + 1, k}. A Fourier transform of frame i+1 gives Y_{i + 1, k}, where 0 ≤ k < N, and the log-amplitude spectrum x_{i + 1, k} = 20 \log_{10} |Y_{i + 1, k}| is calculated at each frequency bin k. For the k-th frequency bin, the iteration steps are as follows:

Step 301: Calculate the probabilities that noise and speech are present,

p (h | x_{i + 1, k}, λ_{i, k}) = \frac{w_{i, k}^{(h)} p (x_{i + 1, k} | h, λ_{i, k})}{Σ_{h} w_{i, k}^{(h)} p (x_{i + 1, k} | h, λ_{i, k})}, h ∈ {0, 1}.

Step 302: Calculate the new weight coefficients:

w_{i + 1, k}^{(h)} = α w_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k});

where α is a smoothing factor less than 1 but close to 1, for example α = 0.99.

Step 303: new weight coefficient is retrained, And

Step 304: calculate new average,

Step 305: new average is retrained:

Step 306: calculate new variance,

Step 307: new variance is retrained,

From the above sub-steps, all parameters in λ_{i + 1, k} are obtained, and with them the corresponding speech presence probability p (h | x_{i + 1, k}, λ_{i, k}) and the estimate μ_{i + 1, k}^{(0)} of the noise signal power spectrum.
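The frame-by-frame recursion of Step 300 can be sketched for one frequency bin as a generator that consumes log-amplitudes and yields, per frame, the speech presence probability and the current noise-component mean. It uses the first forms of the weight, mean and variance recursions from claim 2 and leaves out the constraint steps (303, 305, 307), whose formulas are not available in this text; the function name, the starting model and the input values are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def track_bin(frames, weights, means, variances, alpha=0.99):
    """Yield (speech_probability, noise_mean) for each new log-amplitude x."""
    w, m, v = list(weights), list(means), list(variances)
    for x in frames:
        # Step 301: posteriors under the previous model lambda_i.
        lik = [w[h] * gaussian_pdf(x, m[h], v[h]) for h in range(2)]
        total = sum(lik)
        post = [value / total for value in lik]
        for h in range(2):
            # Steps 302, 304, 306: recursive weight, mean and variance.
            w_next = alpha * w[h] + (1.0 - alpha) * post[h]
            m_next = (alpha * w[h] * m[h] + (1.0 - alpha) * post[h] * x) / w_next
            v[h] = (alpha * w[h] * v[h]
                    + (1.0 - alpha) * post[h] * (x - m_next) ** 2) / w_next
            w[h], m[h] = w_next, m_next
        yield post[1], m[0]

# Mostly noise-like frames near 20 dB with two speech-like frames in the middle.
frames = [20.0, 21.0, 19.0, 20.0, 46.0, 44.0, 20.0, 19.0, 21.0, 20.0]
track = list(track_bin(frames, (0.7, 0.3), (20.0, 45.0), (9.0, 36.0)))
```

The speech-like frames barely move the noise mean because their posterior for the noise component is close to zero, which is how a single model yields both the activity decision and the noise estimate.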

The noise power spectrum estimation performance of the algorithm of the above embodiment was evaluated. Speech from the TIMIT database, 8 sentences each from male and female speakers, was mixed with white Gaussian noise, F16 cockpit noise and babble noise from the NOISEX-92 noise database at signal-to-noise ratios such as 0, 5 and 10 dB. The evaluation metric is the segmental error SegError, defined as:

SegError = \frac{1}{M} Σ_{l = 1}^{M} 10 \log_{10} \frac{Σ_{k = 0}^{N - 1} D^{2} (k, l)}{Σ_{k = 0}^{N - 1} {[D (k, l) - \hat{D} (k, l)]}^{2}}

where D (k, l) denotes the actual noise amplitude spectrum and \hat{D} (k, l) the estimated noise amplitude spectrum. Note that the smaller the SegError value, the closer the estimate is to the true value and the more accurate the estimation. The algorithm was compared with three current mainstream noise power spectrum estimation algorithms: MS denotes the minimum statistics algorithm, MCRA the minima controlled recursive averaging algorithm, IMCRA the improved minima controlled recursive averaging algorithm, and TV-GMM the algorithm of the present invention. Table 1 lists the SegError results.
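The metric can be computed directly from the formula as printed, frame by frame. The sketch below follows that printed form literally (per-frame ratio of noise energy to squared estimation error, in dB, averaged over frames); the function name and the toy spectra are illustrative, and a frame with a perfect estimate would need special handling because the denominator becomes zero.

```python
import math

def seg_error(true_spectra, estimated_spectra):
    """Average over frames l of 10*log10(sum_k D^2 / sum_k (D - D_hat)^2)."""
    total = 0.0
    for d, d_hat in zip(true_spectra, estimated_spectra):
        energy = sum(value * value for value in d)
        squared_error = sum((value - est) ** 2 for value, est in zip(d, d_hat))
        total += 10.0 * math.log10(energy / squared_error)
    return total / len(true_spectra)

# Two frames, two bins each; every estimate is 10 % below the true amplitude.
true_spectra = [[1.0, 1.0], [2.0, 2.0]]
estimated_spectra = [[0.9, 0.9], [1.8, 1.8]]
result = seg_error(true_spectra, estimated_spectra)
```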

Table 1

As can be seen from the above table, the algorithm proposed by the present invention has clear advantages over the three current mainstream algorithms.

Claims

1. A noise power spectrum estimation and voice activity detection method based on unsupervised learning, comprising the following steps:

p (x_{i, k} | λ_{i, k}) = w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k});

where the Gaussian components are expressed as:

p (x_{i, k} | h, λ_{i, k}) = \frac{1}{\sqrt{2 π κ_{i, k}^{(h)}}} \exp \{- \frac{(x_{i, k} - μ_{i, k}^{(h)})^{2}}{2 κ_{i, k}^{(h)}}\},

w_{i, k}^{(h)} denotes the weight coefficient of the GMM, μ_{i, k}^{(h)} and κ_{i, k}^{(h)} denote the mean and variance respectively, and λ_{i, k} denotes the parameter set of the Gaussian mixture model;

3) after the initialized model λ_{0, k} is obtained, starting from frame M+1, the GMM of each frequency band is updated frame by frame by incremental learning, recursively yielding λ_{i, k}, from which the noise estimate μ_{i, k}^{(0)} and the probability of speech presence at the k-th frequency bin of the i-th frame are derived:

p (h = 1 | x_{i, k}, λ_{i, k}) = \frac{w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})}{w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})},

where i = 1, 2, 3, ...

2. The noise power spectrum estimation and voice activity detection method according to claim 1, characterized in that the incremental learning method of the GMM comprises a recursive weight coefficient, a recursive mean and a recursive variance;

The recursive weight coefficient is:

w_{i + 1, k}^{(h)} = α w_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k});

The recursive mean is:

μ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} μ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k}}{w_{i + 1, k}^{(h)}};

or

μ_{i + 1, k}^{(h)} = α_{μ} μ_{i, k}^{(h)} + (1 - α_{μ}) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k};

The recursive variance is:

κ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} κ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2}}{w_{i + 1, k}^{(h)}};

or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2};

or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i, k}^{(h)})}^{2};

where α is a smoothing factor.