CN102800322A - Method for estimating noise power spectrum and voice activity - Google Patents

Info

Publication number
CN102800322A
Authority
CN
China
Prior art keywords
lambda
overbar
alpha
voice
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101411375A
Other languages
Chinese (zh)
Other versions
CN102800322B (en)
Inventor
应冬文
颜永红
付强
潘接林
李军锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and Beijing Kexin Technology Co Ltd
Priority to CN201110141137.5A
Publication of CN102800322A
Application granted
Publication of CN102800322B
Legal status: Expired - Fee Related

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to a method for estimating the noise power spectrum and voice activity. The method describes the temporal correlation of speech on each frequency component with a sequential hidden Markov model (SHMM) based on first-order recursion, and from it infers the speech presence probability on each frequency subband and the power spectrum of the noise. The method comprises the following steps: 1) on each frequency component, extracting the log-amplitude spectrum envelope of the speech signal and constructing a corresponding two-state hidden Markov model (HMM), in which each state is represented by a Gaussian distribution; 2) for a segment of speech data, setting up an M-frame buffer, storing the first M frames of the input signal in the buffer, extracting the M frames of log-amplitude spectra from the buffer, and constructing an initialization model by maximum likelihood estimation; and 3) after the initialization model $\lambda_M$ is obtained, starting from frame M+1, updating the HMM of each frequency band frame by frame by incremental learning, and recursively obtaining the noise value and the speech presence probability.

Description

A method for noise power spectrum estimation and voice activity detection
Technical Field
The present invention relates to the field of speech signal processing, and specifically to a method for noise spectrum estimation and voice activity detection based on a sequential hidden Markov model. Voice activity detection is an algorithm that decides, along the time dimension, whether speech is present. It can answer with a hard "yes" or "no", or it can describe the presence of speech with a speech presence probability.
Background
Voice activity detection and noise power spectrum estimation are indispensable components of noise reduction algorithms. Their performance directly affects the performance of the noise reduction algorithm and, particularly in adverse noise environments, indirectly affects the performance of speech processing systems (such as speech recognition and speaker identification systems).
Most speech application systems have to cope with interference from ambient noise. Many methods have been proposed to remove the effect of noise on speech systems, and nearly all of them depend on voice activity detection and noise power spectrum estimation. These two modules are closely linked, and their accuracy directly affects the overall noise robustness of the system. Although traditional estimation methods perform well, two aspects can still be improved:
1. Making full use of the temporal correlation of the continuous speech/non-speech signal on a given frequency component. Existing algorithms do not fully exploit temporal correlation: they usually smooth the amplitude spectrum envelope with a simple first-order recursive smoother whose smoothing factor is fixed. A speech signal, however, is piecewise stationary, and its statistical characteristics, including its temporal correlation, change continually over time; a fixed model cannot reflect this time-varying behaviour. If an adaptive model is used to model the temporal correlation, the performance of the algorithm can be expected to improve. This approach has not been mentioned in the earlier literature.
2. Parameter adaptation in a traditional sequential HMM uses high-order recursive averaging: the current HMM parameter set depends on the model at the previous time step, the current observation, and the observations at several past time steps. This kind of parameter recursion is computationally very expensive. If the high-order recursion can be reduced to a first-order recursion with little loss of accuracy, the computational efficiency of the algorithm improves greatly. A sequential HMM algorithm based on first-order recursion has likewise not been mentioned in the earlier literature.
In addition, traditional solutions are based on semi-supervised learning. In the initial stage, a conventional system has to make a "noise-first" assumption, namely that every utterance begins with a segment of non-speech signal. This segment can be regarded as manually labelled background noise samples, from which the initial noise model is built; this is a supervised learning method. Its drawback is that the assumption is hard to satisfy in some applications: if an utterance begins with speech, initialization of the noise model fails, and both speech detection and noise power spectrum estimation then become inaccurate. This initialization method is disclosed in Chinese patent application No. 201010178166.4.
Summary of the Invention
The object of the invention is to provide a method for noise spectrum estimation and voice activity detection based on a sequential hidden Markov model. The method uses a hidden Markov model to model the temporal correlation of the speech signal on a given frequency component. The log power spectrum envelope on a frequency component can be regarded as a Markov chain that jumps between two states, speech "present" and speech "absent". For each state, a Gaussian distribution describes the distribution of its power spectrum, and the speech presence probability at a given time-frequency point can then be derived from the forward factor of the HMM.
To achieve the above object, the invention provides a method for noise power spectrum estimation and voice activity detection. The method describes the temporal correlation of speech on each frequency component with a sequential hidden Markov model (SHMM) based on first-order recursion, and updates the SHMM step by step by incremental learning. Finally, the speech presence probability on the frequency subband and the power spectrum of the noise are inferred, so that the temporal statistics of speech are accurately reflected. The method comprises the following steps:
1) On each frequency component, extract the log-amplitude spectrum envelope of the speech signal and build a corresponding two-state hidden Markov model, in which one component represents the distribution of speech energy and the other the distribution of noise energy, each state being represented by a Gaussian distribution;
2) For a segment of speech data, set up an M-frame buffer, store the first M frames of the input signal in the buffer, extract the log-amplitude spectra of the M buffered frames, and build an initialized model by maximum likelihood estimation;
3) After the initialized model $\lambda_M$ is obtained, starting from frame M+1, update the HMM of each frequency band frame by frame by incremental learning, recursively obtaining the noise value and the speech presence probability.
The specific steps of the method are:
1) On each frequency component, extract the log-amplitude spectrum envelope of the speech signal. For the log-amplitude spectrum time series $\mathbf{x}_l = \{x_1, x_2, \ldots, x_l\}$ on a frequency component, build a hidden Markov model in which $\mathbf{s}_l = \{s_1, s_2, \ldots, s_l\}$, $s_t \in \{0, 1\}$, is the corresponding state sequence; 1 denotes the speech-present state and 0 denotes the noise state. $\lambda_l$ denotes the model parameter estimate obtained from the sequence $\mathbf{x}_l$. For a given parameter set $\lambda_l$, the probability density function of the observation sequence $\mathbf{x}_l$ can then be expressed as:
$$p(\mathbf{x}_l \mid \lambda_l) = \sum_{\mathbf{s}_l} p(\mathbf{s}_l \mid \lambda_l)\, p(\mathbf{x}_l \mid \lambda_l, \mathbf{s}_l);$$
where $p(\mathbf{s}_l \mid \lambda_l)$ is the prior probability of the state sequence $\mathbf{s}_l$, expressed as:
$$p(\mathbf{s}_l \mid \lambda_l) = \prod_{t=1}^{l} a_{s_{t-1}, s_t};$$
Here $a_{s_{t-1}, s_t}$ denotes the state transition probability and, for $t = 1$, the initial state probability. $p(\mathbf{x}_l \mid \lambda_l, \mathbf{s}_l)$ is the likelihood of the observation sequence $\mathbf{x}_l$ given the state sequence $\mathbf{s}_l$ and the parameter set $\lambda_l$:
$$p(\mathbf{x}_l \mid \lambda_l, \mathbf{s}_l) = \prod_{t=1}^{l} b(x_t \mid s_t, \lambda_l);$$
where
$$b(x_t \mid s_t, \lambda_l) = \frac{1}{\sqrt{2\pi \kappa_{s_t,l}}} \exp\left\{ -\frac{1}{2} (x_t - \mu_{s_t,l})^2 / \kappa_{s_t,l} \right\};$$
Here $\kappa_{z,l}$ is the variance of the Gaussian distribution and $\mu_{z,l}$ its mean for state $z$; the initial probabilities $\pi_z$ in the parameter set do not change over time;
In this model, $\mu_{0,l}$ is the noise level to be estimated. At the same time, the probability that speech is present at a given frequency bin of frame $l$ can be derived as $\gamma_{t|\lambda_l}(z) = p(s_t = z \mid x_t, \lambda_l)$.
2) For a segment of speech data, set up an M-frame buffer, store the first M frames of the input signal in the buffer, extract the log-amplitude spectra of the M buffered frames, substitute them into the HMM of step 1), and initialize one hidden Markov model $\lambda_M$ on each frequency bin; the subscript M denotes the length of the initialization time window, $l \ge M$;
3) After the initialized model $\lambda_M$ is obtained, starting from frame M+1, update the SHMM frame by frame by incremental learning, recursively obtaining $\lambda_l$ and yielding the noise value $\mu_{0,l}$ and the speech presence probability at a given frequency bin of frame $l$.
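The density $p(x \mid \lambda)$ defined in step 1) can be illustrated with a minimal Python sketch. It evaluates the explicit sum over all state sequences and compares it with the textbook forward recursion, which gives the same value in linear time. The function names and all numeric parameters are illustrative assumptions, not values from the patent.

```python
import math
from itertools import product

def gaussian_pdf(x, mean, var):
    """Emission density b(x | s, lambda) for one state; var is the variance kappa."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def likelihood_brute_force(xs, pi, a, mu, kappa):
    """p(x | lambda) as the explicit sum over all 2**l state sequences."""
    total = 0.0
    for states in product((0, 1), repeat=len(xs)):
        # The first factor is the initial state probability, the rest are transitions.
        p = pi[states[0]] * gaussian_pdf(xs[0], mu[states[0]], kappa[states[0]])
        for t in range(1, len(xs)):
            prev, cur = states[t - 1], states[t]
            p *= a[prev][cur] * gaussian_pdf(xs[t], mu[cur], kappa[cur])
        total += p
    return total

def likelihood_forward(xs, pi, a, mu, kappa):
    """Same quantity via the standard forward recursion, in O(l) time."""
    f = [pi[z] * gaussian_pdf(xs[0], mu[z], kappa[z]) for z in (0, 1)]
    for x in xs[1:]:
        f = [sum(f[y] * a[y][z] for y in (0, 1)) * gaussian_pdf(x, mu[z], kappa[z])
             for z in (0, 1)]
    return sum(f)

# Made-up parameters for one frequency bin: state 0 = noise, state 1 = speech.
pi = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
mu = [1.0, 4.0]
kappa = [0.5, 1.0]
xs = [1.1, 0.8, 3.9, 4.2, 1.0]

print(likelihood_brute_force(xs, pi, a, mu, kappa))
print(likelihood_forward(xs, pi, a, mu, kappa))
```

The brute-force sum grows as $2^l$ and is only practical for very short sequences; it is shown here solely to make the definition concrete.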
As an improvement of the above technical solution, the extraction of the frame amplitude spectrum in step 1) comprises:
First, pre-process the digitized sound signal of the frame. Let each frame contain F samples; zero-pad it to N points, where $N \ge F$, $N = 2^j$, and $j$ is an integer with $j \ge 8$, then compute the N-point discrete Fourier transform to obtain the discrete spectrum
$$Y_{l,k} = \sum_{n=0}^{N-1} y_{l,n}\, e^{-\mathrm{i} 2\pi nk/N}$$
where $y_{l,n}$ is the n-th sample of frame $l$ in the buffer and $Y_{l,k}$ is the k-th Fourier transform value of frame $l$ in the buffer ($k = 0, 1, \ldots, N-1$). Its amplitude value can then be computed as
[equation image not recoverable from the source]
where $b(r)$ is a window function. The pre-processing comprises windowing and/or pre-emphasis; the window function is a Hanning (Hann) window.
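The extraction described above can be sketched in Python as follows. This is only a sketch under stated assumptions: the amplitude formula of the patent is not recoverable from the source, so the code simply takes the logarithm of the DFT magnitude after Hann windowing, and the function name `log_amplitude_spectrum`, the default `n_fft` and the small floor inside the logarithm are all hypothetical choices. A naive DFT is used so that the example needs only the standard library.

```python
import cmath
import math

def log_amplitude_spectrum(frame, n_fft=256):
    """Window a frame, zero-pad it to n_fft = 2**j (j >= 8), and return log |Y_k|."""
    if n_fft < 256 or n_fft & (n_fft - 1):
        raise ValueError("n_fft must be a power of two with exponent j >= 8")
    f_len = len(frame)
    if f_len > n_fft:
        raise ValueError("frame is longer than n_fft")
    # Symmetric Hann window of length F; a single-sample frame is left unweighted.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (f_len - 1)) if f_len > 1 else 1.0
              for n in range(f_len)]
    padded = [s * w for s, w in zip(frame, window)] + [0.0] * (n_fft - f_len)
    spectrum = []
    for k in range(n_fft):
        # Naive N-point DFT; a real system would use an FFT routine instead.
        y_k = sum(padded[n] * cmath.exp(-2j * cmath.pi * n * k / n_fft)
                  for n in range(n_fft))
        spectrum.append(math.log(abs(y_k) + 1e-12))  # floor avoids log(0)
    return spectrum
```

For example, a 200-sample frame holding a sinusoid at 32 cycles per 256 samples produces its largest log-amplitude at bin $k = 32$.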
As an improvement of the above technical solution, the initialization of the HMM in step 2) comprises, for a given frequency bin, the following steps:
Step 201): divide the M samples into two classes by clustering, containing $M_0$ and $M_1$ samples respectively, where $M_0 + M_1 = M$; the class with the larger mean is denoted by the superscript (1) and the other class by the superscript (0). The clustering in step 201) uses unsupervised LBG clustering or fuzzy clustering;
The means of the two classes are $\bar{\mu}_{0,M}$ and $\bar{\mu}_{1,M}$, where $\bar{\mu}_{0,M}$ is the mean of the class with the smaller energy;
The variances of the two classes are, with $x_{i_j}$ running over the samples of the respective class:
$$\bar{\kappa}_{0,M} = \frac{1}{M_0} \sum_{j=1}^{M_0} (x_{i_j} - \bar{\mu}_{0,M})^2, \qquad \bar{\kappa}_{1,M} = \frac{1}{M_1} \sum_{j=1}^{M_1} (x_{i_j} - \bar{\mu}_{1,M})^2;$$
The initial weight coefficients of the two classes are: $\bar{a}_{00,M} = \bar{a}_{01,M} = \bar{a}_{11,M} = \bar{a}_{10,M} = 0.5$;
Compute the likelihood of the new model,
[equation image not recoverable from the source]
and start the iteration. In the following iterative process, the old model parameter set is denoted $\lambda'_M$, and the new model parameters are:
[equation image not recoverable from the source]
Before the iteration starts,
[equation image not recoverable from the source]
$L'$ is set to a very large negative number. Initialize the forward factor,
[equation image not recoverable from the source]
and initialize the backward factor,
[equation image not recoverable from the source]
Step 202): compute the forward factor: $\bar{F}_l(z) = \sum_{y} \bar{F}_{l-1}(z)\, \bar{a}_{y,z,M}\, b(x_l \mid y, \bar{\lambda}_M), \quad z, y \in \{0, 1\}$;
Step 203): compute the backward factor: $\bar{B}_l(z) = \sum_{y} \bar{B}_{l+1}(y)\, \bar{a}_{z,y,M}\, b(x_{l+1} \mid y, \bar{\lambda}_M), \quad z, y \in \{0, 1\}$;
Step 204): compute the noise and speech presence probabilities: $p(z \mid x_l, \bar{\lambda}_M) = \dfrac{\bar{F}_l(z)\, \bar{B}_l(z)}{\sum_{z} \bar{F}_l(z)\, \bar{B}_l(z)}, \quad z \in \{0, 1\}$;
Step 205): if the average speech presence probability is below the threshold $\zeta$ [equation image not recoverable from the source], stop the iteration, where $\zeta$ is a small number close to zero but greater than zero;
Step 206): compute the transition probability:
$$p(s_{l-1} = y, s_l = z \mid x_l, \bar{\lambda}_M) = \frac{\bar{F}_{l-1}(y)\, \bar{B}_l(z)\, \bar{a}_{yz,M}\, b(x_l \mid z, \bar{\lambda}_M)}{\sum_{z} \bar{F}_{l-1}(y)\, \bar{B}_l(z)\, \bar{a}_{yz,M}\, b(x_l \mid z, \bar{\lambda}_M)};$$
Step 207): compute the new initial probability: $\pi'_z = p(s_1 = z \mid x_1, \bar{\lambda}_M)$;
Step 208): compute the new mean: $\mu'_{z,M} = \dfrac{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)\, x_t}{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)}$;
Step 209): constrain the new mean: $\mu'_{1,M} = \max\{\mu'_{1,M}, \mu'_{0,M} + \delta\}$, where $\delta$ is a constant in the range 0 to 100;
Step 210): compute the new variance: $\kappa'_{z,M} = \dfrac{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)\, (x_t - \bar{\mu}_{z,M})^2}{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)}$;
Step 211): constrain the new variance: $\kappa'_{1,M} = \max\{\kappa'_{0,M}, \kappa'_{1,M}\}$;
Step 212): compute the new transition probability: $a'_{yz,M} = \dfrac{\sum_{t=1}^{M} p(s_{t-1} = y, s_t = z \mid x_t, \bar{\lambda}_M)}{\sum_{t=1}^{M} \sum_{z} p(s_{t-1} = y, s_t = z \mid x_t, \bar{\lambda}_M)}$;
Step 213): compute the likelihood of the new model: $\bar{L} = \log\big(p(\mathbf{x}_M \mid \lambda'_M)\big)$;
Step 214): if the termination condition [equation image not recoverable from the source] is satisfied, terminate the iteration, where $\varepsilon$ is a very small number; if instead [equation image not recoverable from the source], continue iterating and jump to step 202).
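Step 201) can be sketched in Python as follows. The patent names LBG or fuzzy clustering; this sketch swaps in a plain one-dimensional two-centre Lloyd iteration (the refinement loop at the core of LBG) as a simplifying assumption. The function name, the iteration count and the variance floor are hypothetical.

```python
def init_two_state_stats(samples, iterations=20, var_floor=1e-6):
    """Split 1-D log-amplitude samples into a low class (0) and a high class (1).

    Returns (means, variances, transitions); class 1 has the larger mean and all
    four transition probabilities start at 0.5, as in step 201).
    """
    lo, hi = min(samples), max(samples)
    if lo == hi:
        raise ValueError("samples are constant; two classes cannot be formed")
    centers = [lo, hi]
    for _ in range(iterations):
        groups = [[], []]
        for x in samples:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        # Both groups stay non-empty: the minimum always joins class 0 and the
        # maximum always joins class 1.
        centers = [sum(g) / len(g) for g in groups]
    means = centers
    # The floor keeps the Gaussian well defined if a class has identical samples.
    variances = [max(sum((x - m) ** 2 for x in g) / len(g), var_floor)
                 for g, m in zip(groups, means)]
    transitions = [[0.5, 0.5], [0.5, 0.5]]
    return means, variances, transitions
```

The returned values play the role of $\bar{\mu}_{z,M}$, $\bar{\kappa}_{z,M}$ and $\bar{a}_{yz,M}$ before the first iteration of steps 202) to 214).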
In the HMM parameter modelling process described above, the mean, weight, variance and transition probability of the HMM are each constrained. It should be noted that, in the initialization procedure, the transition probability here plays a role comparable to that of the weight coefficient in patent 201010178166.4. Because the weight coefficient in 201010178166.4 appears as a denominator term during initialization, it must be constrained there. The transition probability in the present patent does not have this problem.
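One pass of steps 202) to 213) can be sketched as below. The sketch uses the textbook forward-backward recursions (the previous forward factor is summed over the previous state, and each emission is evaluated at the current state), which is an assumption about how the printed formulas are meant to be read. It is unscaled, so it suits only short buffers, and it omits the stopping tests of steps 205) and 214). The names `gauss` and `em_iteration` and the default `delta` are hypothetical.

```python
import math

def gauss(x, mean, var):
    """Gaussian emission density with variance var."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_iteration(xs, pi, a, mu, kappa, delta=1.0):
    """One constrained re-estimation pass over the M buffered frames of one bin."""
    m = len(xs)
    b = [[gauss(x, mu[z], kappa[z]) for z in (0, 1)] for x in xs]
    # Steps 202-203: forward and backward factors (unscaled).
    fwd = [[pi[z] * b[0][z] for z in (0, 1)]]
    for t in range(1, m):
        fwd.append([sum(fwd[t - 1][y] * a[y][z] for y in (0, 1)) * b[t][z]
                    for z in (0, 1)])
    bwd = [[1.0, 1.0] for _ in range(m)]
    for t in range(m - 2, -1, -1):
        bwd[t] = [sum(a[z][y] * b[t + 1][y] * bwd[t + 1][y] for y in (0, 1))
                  for z in (0, 1)]
    like = sum(fwd[-1])
    # Steps 204 and 206: state and transition posteriors.
    gamma = [[fwd[t][z] * bwd[t][z] / like for z in (0, 1)] for t in range(m)]
    xi = [[[fwd[t - 1][y] * a[y][z] * b[t][z] * bwd[t][z] / like for z in (0, 1)]
           for y in (0, 1)] for t in range(1, m)]
    new_pi = gamma[0]                                           # step 207
    occ = [sum(g[z] for g in gamma) for z in (0, 1)]
    new_mu = [sum(g[z] * x for g, x in zip(gamma, xs)) / occ[z]
              for z in (0, 1)]                                  # step 208
    new_mu[1] = max(new_mu[1], new_mu[0] + delta)               # step 209
    # Step 210 is written with the previous mean, so mu (not new_mu) is used here.
    new_kappa = [sum(g[z] * (x - mu[z]) ** 2 for g, x in zip(gamma, xs)) / occ[z]
                 for z in (0, 1)]
    new_kappa[1] = max(new_kappa[0], new_kappa[1])              # step 211
    new_a = []
    for y in (0, 1):                                            # step 212
        row = [sum(step[y][z] for step in xi) for z in (0, 1)]
        new_a.append([r / sum(row) for r in row])
    return new_pi, new_a, new_mu, new_kappa, math.log(like)    # step 213
```

Calling `em_iteration` repeatedly, feeding each result back in until the returned log-likelihood stops improving, gives the initialization loop; the exact stopping rule of the patent is not recoverable from the source and is left to the caller.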
As an improvement of the above technical solution, the sequential update of the HMM in step 3) proceeds as follows. After the initialized model $\lambda_M$ has been built, starting from frame M+1, the HMM is updated frame by frame by incremental learning. The iteration can be stated as: on each frequency bin, given $\lambda_l$ and the current observation $x_{l+1}$, infer $\lambda_{l+1}$. Apply the Fourier transform to frame $l+1$ to obtain $Y_{l+1,k}$, where $0 \le k < N$, and compute the amplitude value on each frequency bin. For each frequency bin, the parameter update steps for frame $l+1$ are as follows:
Step 301): compute the forward factor: $F_{l+1|\lambda_l}(z) = \sum_{y} F_{l|\lambda_{l-1}}(z)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l), \quad z \in \{0, 1\}$;
Step 302): compute the speech and noise presence probabilities: $\gamma_{l+1|\lambda_l}(z) = \dfrac{F_{l+1|\lambda_l}(z)}{\sum_{z} F_{l+1|\lambda_l}(z)}, \quad z \in \{0, 1\}$;
Step 303): compute the conditional transition probability:
$$\xi_{l+1|\lambda_l}(y, z) = \frac{F_{l+1|\lambda_l}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l)}{\sum_{y,z} F_{l+1|\lambda_l}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l)};$$
Step 304): compute the average noise/speech presence probability: $\tilde{\gamma}_{l+1}(z) = \alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)$;
Step 305): compute the time-dependent smoothing factor: $\tilde{\alpha}_{l+1}(z) = \dfrac{\alpha\, \tilde{\gamma}_l(z)}{\alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)}$;
Step 306): compute the state mean: $\mu_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \mu_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, x_{l+1}$;
Step 307): constrain the new state mean: $\mu_{1,l+1} = \max\{\mu_{1,l+1}, \mu_{0,l+1} + \delta\}, \quad l \ge M$;
Step 308): compute the new state variance: $\kappa_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \kappa_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, (x_{l+1} - \mu_{z,l})^2$;
Step 309): constrain the new state variance: $\kappa_{1,l+1} = \max\{\kappa_{0,l+1}, \kappa_{1,l+1}\}, \quad l \ge M$;
Step 310): compute the average transition probability: $\tilde{\xi}_{l+1}(y, z) = \alpha\, \tilde{\xi}_l(y, z) + (1 - \alpha)\, \xi_{l+1|\lambda_l}(y, z)$;
Step 311): compute the state transition probability:
$$a_{yz,l+1} = a_{yz,l} + \frac{\dfrac{\xi_{l+1|\lambda_l}(y, z)}{a_{yz,l}} - \dfrac{\xi_{l+1|\lambda_l}(y, 1-z)}{1 - a_{yz,l}}}{\dfrac{K}{a_{yz,l}^2}\, \tilde{\xi}_{l+1}(y, z) + \dfrac{K}{(1 - a_{yz,l})^2}\, \tilde{\xi}_{l+1}(y, 1-z)};$$
Step 312): constrain the new transition probabilities: $a_{01,l} = \max\{a_{01,l}, \eta\}$, $a_{00,l} = 1 - a_{01,l}$, $a_{10,l} = \max\{a_{10,l}, \eta\}$, $a_{11,l} = 1 - a_{10,l}$, $l \ge M$;
The above sub-steps yield all parameters in $\lambda_{l+1}$, and hence the speech presence probability $\gamma_{l+1|\lambda_l}(1)$ and the power spectrum estimate $\mu_{0,l+1}$ of the noise signal.
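The per-frame update of steps 301) to 312) can be sketched for a single frequency bin as follows. Several choices here are assumptions rather than the patent's definitive procedure: the forward factor of the previous frame is kept normalized and summed over the previous state (the textbook recursion), the transition update uses the simpler $\beta$-smoothing alternative that the patent also allows instead of the step 311) formula (whose constant $K$ is not defined in the source), and the class `BinModel`, the function `update_frame` and all default constants are hypothetical.

```python
import math
from dataclasses import dataclass

def gauss(x, mean, var):
    """Gaussian emission density b(x | s = z, lambda) with variance var."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

@dataclass
class BinModel:
    """Hypothetical container for the per-bin SHMM state."""
    mu: list         # [mu_0, mu_1]: noise and speech means
    kappa: list      # [kappa_0, kappa_1]: variances
    a: list          # 2x2 transition probabilities a[y][z]
    fwd: list        # normalized forward factor of the previous frame
    gamma_avg: list  # smoothed state probabilities
    xi_avg: list     # smoothed 2x2 transition posteriors

def update_frame(m, x, alpha=0.95, beta=0.99, delta=1.0, eta=0.01):
    """Process one new log-amplitude value x; returns (p_speech, noise_estimate)."""
    b = [gauss(x, m.mu[z], m.kappa[z]) for z in (0, 1)]
    # Steps 301-303: forward factor, state posterior and transition posterior.
    joint = [[m.fwd[y] * m.a[y][z] * b[z] for z in (0, 1)] for y in (0, 1)]
    norm = sum(map(sum, joint))
    if norm <= 0.0:
        raise ValueError("observation has zero likelihood under both states")
    xi = [[v / norm for v in row] for row in joint]
    gamma = [xi[0][z] + xi[1][z] for z in (0, 1)]
    for z in (0, 1):
        # Steps 304-305: smoothed probability and time-dependent smoothing factor.
        new_avg = alpha * m.gamma_avg[z] + (1.0 - alpha) * gamma[z]
        a_t = alpha * m.gamma_avg[z] / new_avg if new_avg > 0.0 else 1.0
        # Steps 306 and 308: the variance update uses the mean of frame l.
        old_mu = m.mu[z]
        m.mu[z] = a_t * old_mu + (1.0 - a_t) * x
        m.kappa[z] = a_t * m.kappa[z] + (1.0 - a_t) * (x - old_mu) ** 2
        m.gamma_avg[z] = new_avg
    m.mu[1] = max(m.mu[1], m.mu[0] + delta)       # step 307
    m.kappa[1] = max(m.kappa[0], m.kappa[1])      # step 309
    # Step 310: smoothed transition posterior (not used by the beta alternative below).
    m.xi_avg = [[alpha * m.xi_avg[y][z] + (1.0 - alpha) * xi[y][z] for z in (0, 1)]
                for y in (0, 1)]
    # Beta-smoothing alternative to step 311, then the floor of step 312.
    a01 = max(beta * m.a[0][1] + (1.0 - beta) * xi[0][1], eta)
    a10 = max(beta * m.a[1][0] + (1.0 - beta) * xi[1][0], eta)
    m.a = [[1.0 - a01, a01], [a10, 1.0 - a10]]
    m.fwd = gamma
    return gamma[1], m.mu[0]
```

In a full system one `BinModel` would be kept per frequency bin, initialized from the M buffered frames, and `update_frame` would be called once per bin for every frame from M+1 onward.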
The incremental learning method used for the HMM in step 3) comprises recursive weight coefficients, recursive means and recursive variances, where:
Recursive mean: $\mu_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \mu_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, x_{l+1}$, in which $\tilde{\alpha}_{l+1}(z)$ is a smoothing factor that depends on the speech presence probability, less than 1 but close to 1;
Recursive variance: $\kappa_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \kappa_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, (x_{l+1} - \mu_{z,l})^2$;
Recursive transition probability: [equation image not recoverable from the source], or $a_{yz,l+1} = \beta\, a_{yz,l} + (1 - \beta)\, \xi_{l+1|\lambda_l}(y, z)$, where $\beta$ is a smoothing factor less than 1 but close to 1, for example $\beta = 0.99$.
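The effect of the time-dependent smoothing factor can be seen in a small Python sketch. The function names and the default $\alpha$ are illustrative assumptions. When a state is judged absent in the new frame, the factor becomes 1 and that state's mean is frozen; when the state is judged present, the factor drops below 1 and the mean moves toward the new observation.

```python
def smoothing_factor(gamma_avg_prev, gamma_now, alpha=0.95):
    """Time-dependent factor for one state, from the smoothed and current posteriors."""
    denom = alpha * gamma_avg_prev + (1.0 - alpha) * gamma_now
    # If both terms vanish the state carries no evidence; keep its parameters fixed.
    return 1.0 if denom == 0.0 else alpha * gamma_avg_prev / denom

def recursive_mean(mu_prev, x, factor):
    """First-order recursive mean update for one state."""
    return factor * mu_prev + (1.0 - factor) * x

# State judged absent in the new frame: the mean is left untouched.
frozen = recursive_mean(2.0, 10.0, smoothing_factor(0.5, 0.0))
# State judged present in the new frame: the mean moves toward the observation.
moved = recursive_mean(2.0, 10.0, smoothing_factor(0.5, 1.0))
```

This is why the noise mean is tracked mainly during frames where speech is unlikely, without a separate hard speech/noise decision.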
The parameter recursion of the sequential hidden Markov model based on first-order recursion is:
Compute the forward factor of the HMM: $F_{l+1|\lambda_l}(z) = \sum_{y} F_{l|\lambda_{l-1}}(z)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l), \quad z \in \{0, 1\}$;
Compute the speech and noise presence probabilities: $\gamma_{l+1|\lambda_l}(z) = \dfrac{F_{l+1|\lambda_l}(z)}{\sum_{z} F_{l+1|\lambda_l}(z)}, \quad z \in \{0, 1\}$;
Compute the conditional transition probability: $\xi_{l+1|\lambda_l}(y, z) = \dfrac{F_{l+1|\lambda_l}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l)}{\sum_{y,z} F_{l+1|\lambda_l}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l)}$;
Compute the average noise/speech presence probability: $\tilde{\gamma}_{l+1}(z) = \alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)$;
Compute the time-dependent smoothing factor: $\tilde{\alpha}_{l+1}(z) = \dfrac{\alpha\, \tilde{\gamma}_l(z)}{\alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)}$;
Compute the mean: $\mu_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \mu_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, x_{l+1}$;
Compute the new variance: $\kappa_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \kappa_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, (x_{l+1} - \mu_{z,l})^2$;
Compute the average transition probability: $\tilde{\xi}_{l+1}(y, z) = \alpha\, \tilde{\xi}_l(y, z) + (1 - \alpha)\, \xi_{l+1|\lambda_l}(y, z)$;
Compute the transition probability: $a_{yz,l+1} = a_{yz,l} + \dfrac{\dfrac{\xi_{l+1|\lambda_l}(y, z)}{a_{yz,l}} - \dfrac{\xi_{l+1|\lambda_l}(y, 1-z)}{1 - a_{yz,l}}}{\dfrac{K}{a_{yz,l}^2}\, \tilde{\xi}_{l+1}(y, z) + \dfrac{K}{(1 - a_{yz,l})^2}\, \tilde{\xi}_{l+1}(y, 1-z)}$.
In the above technical solution, the constraint mechanisms that keep the two-state statistical model running stably comprise:
1) In the initial stage, when the average speech presence probability is below a fixed threshold $\zeta$ [equation image not recoverable from the source], the estimation algorithm stops iterating. This constraint is applied in step 205.
2) To prevent the state transitions of the hidden Markov model from stalling, the transition probabilities of the model are constrained: $a_{01,l} = \max\{a_{01,l}, \eta\}$, $a_{00,l} = 1 - a_{01,l}$, $a_{10,l} = \max\{a_{10,l}, \eta\}$, $a_{11,l} = 1 - a_{10,l}$, $l \ge M$. This constraint is applied in step 312.
3) During tracking, the mean is constrained: $\mu_{1,l+1} = \max\{\mu_{1,l+1}, \mu_{0,l+1} + \delta\}$, $l \ge M$. This constraint is applied in step 307.
4) The variance is constrained: $\kappa_{1,l+1} = \max\{\kappa_{0,l+1}, \kappa_{1,l+1}\}$, $l \ge M$. This constraint is applied in step 309.
The present invention relates to a method for noise power spectrum estimation and voice activity detection based on a sequential hidden Markov model (SHMM), comprising the following steps: 1) for the log-amplitude feature of the speech signal on each frequency component, build an SHMM; 2) for a segment of speech data, set up an M-frame buffer, store the first M frames of the input signal in the buffer, extract the log-amplitude spectra of the M buffered frames, and substitute them into the SHMM of step 1) for initialization, obtaining the initialized model $\lambda_M$; 3) after the initialized model $\lambda_M$ is obtained, starting from frame M+1, update the SHMM frame by frame by incremental learning. The mean of the noise state in the model is the current noise estimate, and the speech presence probability obtained during estimation represents the activity of speech in the time-frequency domain. The recursion works as follows: from the current observation $x_l$ and the model parameter set $\lambda_{l-1}$ of the previous time step, estimate the model parameter set $\lambda_l$ of the current time step. In this way the noise power spectrum on a given frequency component and the speech presence probability are obtained for each time step in turn. The invention is a tightly coupled solution for spectrum estimation and voice activity detection and can strengthen the adaptability of speech application systems to noisy environments. The invention does not rely on the "noise-first" assumption, and it can also describe voice activity in the two-dimensional time-frequency space. This patent was developed on the basis of the previously filed patent application No. 201010178166.4. Because a more accurate modelling method is used, the performance of this patent is better than that of 201010178166.4, but its computational complexity is also higher.
Compared with the prior art, the present invention has the following technical effects:
Because the speech signal exhibits temporal correlation on a given frequency component, the invention uses a hidden Markov model to model this correlation. The log power spectrum envelope on a frequency component can be regarded as a Markov chain that jumps between two states, speech "present" and speech "absent"; for each state, a Gaussian distribution describes the distribution of its power spectrum. To simplify computation, the invention also proposes a sequential HMM tracking method based on first-order recursion, whose parameters change continually with the input signal. The mean of the speech "absent" state of the HMM is the estimate of the noise power spectrum, and the speech presence probability at a given time-frequency point can be derived from the forward factor of the HMM.
The invention is a scheme in which voice activity detection and noise power spectrum estimation are tightly coupled, and it can strengthen the adaptability of speech application systems to noisy environments. The invention can also describe voice activity in the two-dimensional time-frequency space, which helps to process the noise in a more refined way.
Description of the Drawings
Fig. 1 is a flow chart of the noise spectrum estimation and voice activity detection method of the invention;
Fig. 2 compares, by way of example, the SHMM noise estimation algorithm of the invention with the classical minimum statistics (MS) algorithm, the minima controlled recursive averaging (MCRA) algorithm and its improved version, IMCRA.
Detailed Description
The present invention proposes a method for noise power spectrum estimation and voice activity detection based on a sequential hidden Markov model.
As shown in Fig. 1, the method comprises the following steps:
1) For the log-amplitude feature of the speech signal on each frequency bin, build an HMM, with the following mathematical expression:
$$p(\mathbf{x}_l \mid \lambda_l) = \sum_{\mathbf{s}_l} \prod_{t=1}^{l} a_{s_{t-1}, s_t} \prod_{t=1}^{l} b(x_t \mid s_t, \lambda_l)$$
Here $a_{s_{t-1}, s_t}$ denotes the state transition probability and, for $t = 1$, the initial state probability. The Gaussian component, with $s_t = z$, is expressed as:
$$b(x_t \mid s_t, \lambda_l) = \frac{1}{\sqrt{2\pi \kappa_{z,l}}} \exp\left\{ -\frac{1}{2} (x_t - \mu_{z,l})^2 / \kappa_{z,l} \right\}$$
where $x_l$ is the log-amplitude spectrum at a given frequency bin of frame $l$, $z = 0$ denotes the speech-absent state, and $z = 1$ denotes the speech-present state. $\mu_{z,l}$ and $\kappa_{z,l}$ denote the mean and variance respectively, and the parameter set is $\lambda_l = \{\mu_{0,l}, \mu_{1,l}, \kappa_{1,l}, \kappa_{0,l}, a_{01,l}, a_{10,l}, a_{00,l}, a_{11,l}, \pi_0, \pi_1\}$.
2) For a segment of speech data, set up an M-frame buffer, store the first M frames of the input signal in the buffer, extract the log-amplitude spectra of the M buffered frames, and substitute them into the HMM of step 1) for initialization, obtaining the initialized model $\lambda_M$. The initialization procedure uses a constrained EM algorithm; M denotes the length of the initialization window.
3) After the initialized model $\lambda_M$ is obtained, starting from frame M+1, update the HMM frame by frame by incremental learning, recursively obtaining $\lambda_l$ and yielding the noise value $\mu_{0,l}$ and the speech presence probability at a given frequency bin of frame $l$,
where $l = 1, 2, 3, \ldots$
The incremental learning method of the HMM comprises recursive weight coefficients, recursive means and recursive variances;
The forward factor recursion is: $F_{l+1|\lambda_l}(z) = \sum_{y} F_{l|\lambda_{l-1}}(z)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l), \quad z \in \{0, 1\}$.
The speech and noise presence probability recursion is: $\gamma_{l+1|\lambda_l}(z) = \dfrac{F_{l+1|\lambda_l}(z)}{\sum_{z} F_{l+1|\lambda_l}(z)}, \quad z \in \{0, 1\}$
The conditional transition probability recursion is: $\xi_{l+1|\lambda_l}(y, z) = \dfrac{F_{l+1|\lambda_l}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l)}{\sum_{y,z} F_{l+1|\lambda_l}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l)}$
The average noise/speech presence probability recursion is: $\tilde{\gamma}_{l+1}(z) = \alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)$
The time-dependent smoothing factor recursion is: $\tilde{\alpha}_{l+1}(z) = \dfrac{\alpha\, \tilde{\gamma}_l(z)}{\alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)}$
The state mean recursion is: $\mu_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \mu_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, x_{l+1}$
The state variance recursion is: $\kappa_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \kappa_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, (x_{l+1} - \mu_{z,l})^2$
The average transition probability recursion is: $\tilde{\xi}_{l+1}(y, z) = \alpha\, \tilde{\xi}_l(y, z) + (1 - \alpha)\, \xi_{l+1|\lambda_l}(y, z)$
The state transition probability recursion is: $a_{yz,l+1} = a_{yz,l} + \dfrac{\dfrac{\xi_{l+1|\lambda_l}(y, z)}{a_{yz,l}} - \dfrac{\xi_{l+1|\lambda_l}(y, 1-z)}{1 - a_{yz,l}}}{\dfrac{K}{a_{yz,l}^2}\, \tilde{\xi}_{l+1}(y, z) + \dfrac{K}{(1 - a_{yz,l})^2}\, \tilde{\xi}_{l+1}(y, 1-z)}$
The most important feature of the sequential hidden Markov model is that it can track online the temporal correlation of speech presence on each frequency component. It regards the power spectrum envelope on a frequency component as a Markov chain that switches between the speech and non-speech states, and it builds the initial model in an unsupervised way. Specifically, it has the following characteristics:
● Because an HMM is used, Viterbi decoding can be applied to give the optimal estimate, over the time series, of whether speech is present.
● The initialization stage does not rely on the noise-first assumption, so the range of application of the invention is wider than that of conventional solutions.
● Voice activity is described as two-dimensional "time-frequency" information, whereas other voice activity detection algorithms only describe the presence of speech along the time dimension.
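The Viterbi decoding mentioned in the first characteristic can be sketched for one frequency bin as follows. This is the textbook log-domain algorithm for a two-state Gaussian HMM, not a procedure spelled out in the patent; the function names and the parameters in the usage line are illustrative assumptions, and all initial and transition probabilities must be strictly positive so that their logarithms exist.

```python
import math

def log_gauss(x, mean, var):
    """Log of the Gaussian emission density with variance var."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(xs, pi, a, mu, kappa):
    """Most likely noise (0) / speech (1) state sequence for one frequency bin."""
    score = [math.log(pi[z]) + log_gauss(xs[0], mu[z], kappa[z]) for z in (0, 1)]
    back = []
    for x in xs[1:]:
        step, new_score = [], []
        for z in (0, 1):
            # Best predecessor state for arriving in state z at this frame.
            best = max((0, 1), key=lambda y: score[y] + math.log(a[y][z]))
            step.append(best)
            new_score.append(score[best] + math.log(a[best][z])
                             + log_gauss(x, mu[z], kappa[z]))
        back.append(step)
        score = new_score
    # Trace the best path backwards from the better final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    return path[::-1]

path = viterbi([1.0, 1.1, 4.9, 5.2, 5.0, 0.9], [0.5, 0.5],
               [[0.9, 0.1], [0.1, 0.9]], [1.0, 5.0], [0.5, 0.5])
print(path)  # [0, 0, 1, 1, 1, 0]
```

Unlike the frame-by-frame presence probability, this decoding needs the whole sequence, so it suits offline labelling rather than online tracking.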
In one embodiment, the carrier of the unsupervised learning framework is a two-state hidden Markov model (HMM). One component represents the distribution of speech energy and the other the distribution of noise energy. On each frequency component, the log-amplitude spectrum envelope is extracted and a corresponding HMM is built. The HMM is first initialized with the EM algorithm and then updated step by step by incremental learning. From the HMM, the speech presence probability on the subband and the power spectrum of the noise are inferred. In the HMM parameter modelling process, the mean, weight, variance and transition probability of the HMM are each constrained. The sequential estimation of the HMM parameters specifically comprises the recursive computation of the weight coefficients, the means and the variances.
1) Recursive mean: $\mu_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \mu_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, x_{l+1}$, where $\tilde{\alpha}_{l+1}(z)$ is a smoothing factor that depends on the speech presence probability, less than 1 but close to 1.
2) Recursive variance: $\kappa_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \kappa_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, (x_{l+1} - \mu_{z,l})^2$;
3) Recursive transition probability: [equation image not recoverable from the source], or $a_{yz,l+1} = \beta\, a_{yz,l} + (1 - \beta)\, \xi_{l+1|\lambda_l}(y, z)$, where $\beta$ is a smoothing factor less than 1 but close to 1, for example $\beta = 0.99$.
Below in conjunction with a preferred embodiment the present invention is done description further.
Principle of the present invention is following:
Noise estimation runs in parallel on every frequency component, so the frequency index k is omitted in the following description. For the log-amplitude spectrum time series $x_l = \{x_1, x_2, \ldots, x_l\}$ of the speech signal on a frequency component, a hidden Markov model is built. $s_l = \{s_1, s_2, \ldots, s_l\}$, $s_t \in \{0, 1\}$, is the corresponding state sequence, where 1 denotes the speech-present state and 0 denotes the noise state. $\lambda_l$ denotes the model parameter estimate obtained from the sequence $x_l$. For a given parameter set $\lambda_l$, the probability density function of the observation sequence $x_l$ can be written as:
$$p(x_l \mid \lambda_l) = \sum_{s_l} p(s_l \mid \lambda_l)\, p(x_l \mid \lambda_l, s_l)$$
where $p(s_l \mid \lambda_l)$ is the prior probability of the state sequence $s_l$, expressed as:
$$p(s_l \mid \lambda_l) = \prod_{t=1}^{l} a_{s_{t-1}, s_t}$$
Here $a_{s_{t-1}, s_t}$ denotes the state transition probability, and the $t = 1$ factor is the initial state probability $\pi_{s_1}$. $p(x_l \mid \lambda_l, s_l)$ is the likelihood of the observation sequence $x_l$ given the state sequence $s_l$ and the parameter set $\lambda_l$:
$$p(x_l \mid \lambda_l, s_l) = \prod_{t=1}^{l} b(x_t \mid s_t, \lambda_l)$$
where
$$b(x_t \mid s_t, \lambda_l) = \frac{1}{\sqrt{2\pi\kappa_{z,l}}} \exp\left\{ -\frac{1}{2}\,\frac{(x_t - \mu_{z,l})^2}{\kappa_{z,l}} \right\}$$
Here $\kappa_{z,l}$ is the variance of the Gaussian distribution, $\mu_{z,l}$ is its mean, and $s_t = z$. The parameter set is $\lambda_l = \{\mu_{0,l}, \mu_{1,l}, \kappa_{1,l}, \kappa_{0,l}, a_{01,l}, a_{10,l}, a_{00,l}, a_{11,l}, \pi_0, \pi_1\}$; the initial probabilities $\pi_z$ in the parameter set do not change over time.
In this model, $\mu_{0,l}$ is the noise that is to be estimated. At the same time, the probability that speech is present on a given frequency at frame l can be derived as $\gamma_{t|\lambda_l}(z) = p(s_t = z \mid x_t, \lambda_l)$.
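The relationship between the sum over state sequences and the usual forward recursion can be checked numerically. The following is a minimal sketch, assuming a two-state model with hand-picked parameters; the function names and the example values are illustrative and do not come from the patent.

```python
import math
from itertools import product


def gaussian(x, mean, var):
    """Emission density b(x | state) for a Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def likelihood_bruteforce(xs, pi, a, mu, kappa):
    """p(x | lambda) as an explicit sum over all 2**len(xs) state sequences."""
    total = 0.0
    for states in product((0, 1), repeat=len(xs)):
        # Initial state probability times the first emission.
        p = pi[states[0]] * gaussian(xs[0], mu[states[0]], kappa[states[0]])
        for t in range(1, len(xs)):
            y, z = states[t - 1], states[t]
            p *= a[y][z] * gaussian(xs[t], mu[z], kappa[z])
        total += p
    return total


def likelihood_forward(xs, pi, a, mu, kappa):
    """The same quantity computed with the textbook forward recursion."""
    f = [pi[z] * gaussian(xs[0], mu[z], kappa[z]) for z in (0, 1)]
    for x in xs[1:]:
        f = [sum(f[y] * a[y][z] for y in (0, 1)) * gaussian(x, mu[z], kappa[z])
             for z in (0, 1)]
    return sum(f)


# Illustrative parameters: state 0 = noise (low level), state 1 = speech (high level).
xs = [1.0, 4.2, 3.9, 0.8]
pi = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
mu = [1.0, 4.0]
kappa = [0.5, 1.0]
brute = likelihood_bruteforce(xs, pi, a, mu, kappa)
fwd = likelihood_forward(xs, pi, a, mu, kappa)
```

The brute-force sum grows exponentially with the sequence length, which is why the steps that follow work with forward (and backward) factors instead.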
Based on the above principle, according to one embodiment of the invention, the noise power spectrum estimation and voice activity detection method comprises the following steps:
Step 100: Set up an M-frame buffer, store the first M frames of the input signal in the buffer, and extract the amplitude spectrum of each of the M buffered frames. The amplitude spectrum of a frame is extracted as follows:
First, the digitized audio signal of the frame is preprocessed (depending on the system, this may include windowing, pre-emphasis, and so on). Let the frame length be F samples. The frame is zero-padded to N samples (where N ≥ F, $N = 2^j$, j is an integer and j ≥ 8), and an N-point discrete Fourier transform is applied to obtain the discrete spectrum
$$Y_{l,k} = \sum_{n=0}^{N-1} y_{l,n}\, e^{-\mathrm{i}\,2\pi nk/N}$$
where $y_{l,n}$ is the n-th sample of the l-th buffered frame and $Y_{l,k}$ is the k-th Fourier transform value of the l-th buffered frame (k = 0, 1, ..., N-1). Its amplitude value can then be calculated as
Figure BDA0000064390660000122
where b(r) is a window function (for example a Hanning window or a Hamming window). Note that the index k is omitted in the following description.
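The frame processing of Step 100 can be sketched as follows. This is a rough illustration under stated assumptions: the amplitude formula involving b(r) is only available as an equation image, so the sketch simply takes the log magnitude of each DFT bin, applies a Hann window as the preprocessing step, and uses a small floor to avoid the logarithm of zero. The function names and the floor value are hypothetical.

```python
import cmath
import math


def next_pow2_at_least(f, min_exp=8):
    """Smallest N = 2**j with j >= min_exp and N >= f."""
    j = min_exp
    while (1 << j) < f:
        j += 1
    return 1 << j


def hann(f):
    """Hann window of length f (f >= 2), used here as the preprocessing window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (f - 1)) for n in range(f)]


def log_amplitude_spectrum(frame, floor=1e-12):
    """Window a frame, zero-pad it to N points, and return log|Y_k| for k = 0..N-1."""
    f = len(frame)
    n_fft = next_pow2_at_least(f)
    window = hann(f)
    padded = [s * w for s, w in zip(frame, window)] + [0.0] * (n_fft - f)
    log_amp = []
    for k in range(n_fft):
        # Direct N-point DFT; an FFT routine would be used in practice.
        acc = 0j
        for n, s in enumerate(padded):
            acc += s * cmath.exp(-2j * math.pi * n * k / n_fft)
        log_amp.append(math.log(max(abs(acc), floor)))
    return log_amp


# A 200-sample frame holding a sinusoid centred on bin 32 of a 256-point DFT.
frame = [math.cos(2.0 * math.pi * 32 * n / 256) for n in range(200)]
x = log_amplitude_spectrum(frame)
```

Each element of `x` then plays the role of one observation $x_l$ for the HMM of the corresponding frequency component.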
Step 200: Initialization of the HMM. On each frequency, a hidden Markov model $\lambda_M$ is initialized, where the subscript M denotes the length of the initialization time window. Initialization uses a constrained EM algorithm. On a given frequency, the initialization steps are as follows:
Step 201: Divide the M samples into two classes by clustering (for example LBG unsupervised clustering or fuzzy clustering):
Figure BDA0000064390660000123
and
Figure BDA0000064390660000124
where $M_0 + M_1 = M$; the class with the larger mean is denoted by subscript (1) and the other class by subscript (0). The means of the two classes are computed; the mean of the lower-energy class is
Figure BDA0000064390660000126
The variances of the two classes are:
$$\bar{\kappa}_{0,M} = \frac{1}{M_0} \sum_{j=1}^{M_0} (x_{i_j} - \bar{\mu}_{0,M})^2, \qquad \bar{\kappa}_{1,M} = \frac{1}{M_1} \sum_{j=1}^{M_1} (x_{i_j} - \bar{\mu}_{1,M})^2$$
The initial weight coefficients of the two classes are:
$$\bar{a}_{00,M} = \bar{a}_{01,M} = \bar{a}_{11,M} = \bar{a}_{10,M} = 0.5$$
Compute the likelihood of the new model,
Figure BDA00000643906600001210
In the following iteration, the old model parameter set is denoted $\lambda'_M$, and the new model parameter set is:
Figure BDA00000643906600001211
Before the iteration begins,
Figure BDA00000643906600001212
L' is set to a very large negative number, for example L' = -10000. Initialize the forward and backward factors (Figure BDA0000064390660000131, Figure BDA0000064390660000132), then begin the iteration.
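Step 201 can be illustrated with a small sketch. It assumes a plain one-dimensional two-means split as a stand-in for the LBG or fuzzy clustering named in the text, and it sets the initial state probabilities from the class sizes, which the text does not specify. The function name, the variance floor, and the returned dictionary layout are hypothetical.

```python
def init_two_state_model(samples, iters=20, var_floor=1e-6):
    """Split scalar log-amplitudes into a low (0) and a high (1) class and
    derive initial means, variances, and transition weights from them."""
    c0, c1 = min(samples), max(samples)
    if c0 == c1:
        raise ValueError("need at least two distinct sample values")
    class0, class1 = [], []
    for _ in range(iters):
        # Assign every sample to the nearer of the two centres.
        class0 = [x for x in samples if abs(x - c0) <= abs(x - c1)]
        class1 = [x for x in samples if abs(x - c0) > abs(x - c1)]
        new0 = sum(class0) / len(class0)
        new1 = sum(class1) / len(class1)
        if (new0, new1) == (c0, c1):
            break
        c0, c1 = new0, new1
    mu = [sum(class0) / len(class0), sum(class1) / len(class1)]
    # Class variances as in the text: mean squared deviation inside each class.
    kappa = [
        max(sum((x - mu[0]) ** 2 for x in class0) / len(class0), var_floor),
        max(sum((x - mu[1]) ** 2 for x in class1) / len(class1), var_floor),
    ]
    return {
        "mu": mu,
        "kappa": kappa,
        "a": [[0.5, 0.5], [0.5, 0.5]],  # all initial weights 0.5, as in the text
        # Assumption: initial state probabilities taken from the class sizes.
        "pi": [len(class0) / len(samples), len(class1) / len(samples)],
    }


model = init_two_state_model([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
```

With well-separated data such as the example above, the lower class ends up as the noise state (0) and the upper class as the speech state (1), which matches the subscript convention of Step 201.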
Step 202: Compute the forward factor:
$$\bar{F}_t(z) = \sum_{y} \bar{F}_{t-1}(y)\, \bar{a}_{yz,M}\, b(x_t \mid z, \bar{\lambda}_M), \quad z \in \{0, 1\}$$
Step 203: Compute the backward factor:
$$\bar{B}_t(z) = \sum_{y} \bar{B}_{t+1}(y)\, \bar{a}_{zy,M}\, b(x_{t+1} \mid y, \bar{\lambda}_M), \quad z \in \{0, 1\}$$
Step 204: Compute the noise and speech presence probabilities:
$$p(z \mid x_t, \bar{\lambda}_M) = \frac{\bar{F}_t(z)\, \bar{B}_t(z)}{\sum_{z'} \bar{F}_t(z')\, \bar{B}_t(z')}, \quad z \in \{0, 1\}$$
Step 205: If
Figure BDA0000064390660000136
stop the iteration, where ζ is a small number close to zero but greater than zero.
Step 206: Compute the transition probability:
$$p(s_{t-1} = y, s_t = z \mid x_t, \bar{\lambda}_M) = \frac{\bar{F}_{t-1}(y)\, \bar{B}_t(z)\, \bar{a}_{yz,M}\, b(x_t \mid z, \bar{\lambda}_M)}{\sum_{z'} \bar{F}_{t-1}(y)\, \bar{B}_t(z')\, \bar{a}_{yz',M}\, b(x_t \mid z', \bar{\lambda}_M)}$$
Step 207: Compute the new initial probability:
$$\pi'_z = p(s_1 = z \mid x_1, \bar{\lambda}_M)$$
Step 208: Compute the new mean:
$$\mu'_{z,M} = \frac{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)\, x_t}{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)}$$
Step 209: Constrain the new mean: $\mu'_{1,M} = \max\{\mu'_{1,M},\ \mu'_{0,M} + \delta\}$, where δ is a constant in the range 0 to 100.
Step 210: Compute the new variance:
$$\kappa'_{z,M} = \frac{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)\, (x_t - \bar{\mu}_{z,M})^2}{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)}$$
Step 211: Constrain the new variance: $\kappa'_{1,M} = \max\{\kappa'_{0,M},\ \kappa'_{1,M}\}$.
Step 212: Compute the new transition probability:
$$a'_{yz,M} = \frac{\sum_{t=1}^{M} p(s_{t-1} = y, s_t = z \mid x_t, \bar{\lambda}_M)}{\sum_{t=1}^{M} \sum_{z'} p(s_{t-1} = y, s_t = z' \mid x_t, \bar{\lambda}_M)}$$
Step 213: Compute the likelihood of the new model, $\bar{L} = \log p(x_M \mid \lambda'_M)$.
Step 214: If the condition
Figure BDA00000643906600001313
is satisfied, terminate the iteration, where ε is a very small number, for example ε = 0.1. If
Figure BDA00000643906600001314
continue iterating and jump to Step 202.
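The core of one EM pass (Steps 202 to 204, followed by the mean re-estimation and constraint of Steps 208 and 209) can be sketched as below. This is a simplified illustration, not the patent's full procedure: it uses the textbook scaled forward-backward recursions, omits the variance and transition updates and the stopping tests, and the function names and example values are assumptions.

```python
import math


def gaussian(x, mean, var):
    """Gaussian emission density b(x | state)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def state_posteriors(xs, pi, a, mu, kappa):
    """Scaled forward-backward pass; returns gamma[t][z] = p(s_t = z | x, lambda)."""
    n = len(xs)
    b = [[gaussian(x, mu[z], kappa[z]) for z in (0, 1)] for x in xs]
    # Forward factors, renormalised at every frame to avoid underflow.
    fwd = [[pi[z] * b[0][z] for z in (0, 1)]]
    s = sum(fwd[0])
    fwd[0] = [v / s for v in fwd[0]]
    for t in range(1, n):
        cur = [sum(fwd[t - 1][y] * a[y][z] for y in (0, 1)) * b[t][z] for z in (0, 1)]
        s = sum(cur)
        fwd.append([v / s for v in cur])
    # Backward factors, also renormalised.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        cur = [sum(a[z][y] * b[t + 1][y] * bwd[t + 1][y] for y in (0, 1)) for z in (0, 1)]
        s = sum(cur)
        bwd[t] = [v / s for v in cur]
    gamma = []
    for t in range(n):
        g = [fwd[t][z] * bwd[t][z] for z in (0, 1)]
        s = sum(g)
        gamma.append([v / s for v in g])
    return gamma


def reestimate_means(xs, gamma, delta):
    """Posterior-weighted means, then force the speech mean above noise + delta."""
    mu = []
    for z in (0, 1):
        weight = sum(g[z] for g in gamma)
        mu.append(sum(g[z] * x for g, x in zip(gamma, xs)) / weight)
    mu[1] = max(mu[1], mu[0] + delta)
    return mu


xs = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
gamma = state_posteriors(xs, [0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [1.0, 5.0], [0.1, 0.1])
mu_new = reestimate_means(xs, gamma, delta=1.0)
mu_forced = reestimate_means(xs, gamma, delta=10.0)
```

The second call shows the effect of the mean constraint: when δ is larger than the natural gap between the two classes, the speech mean is pushed up to the noise mean plus δ.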
Step 300: Sequential update of the HMM. After the initial model $\lambda_M$ has been built, the HMM is updated frame by frame from frame M+1 onward using incremental learning. The iteration can be expressed as: on each frequency, given $\lambda_l$ and the current observation $x_{l+1}$, infer $\lambda_{l+1}$. A Fourier transform is applied to frame l+1 to obtain $Y_{l+1,k}$, where 0 ≤ k < N. On each frequency, the amplitude value is computed:
Figure BDA0000064390660000141
For each frequency, the parameter update steps at frame l+1 are as follows:
Step 301: Compute the forward factor:
$$F_{l+1|\lambda_l}(z) = \sum_{y} F_{l|\lambda_{l-1}}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l), \quad z \in \{0, 1\}$$
Step 302: Compute the speech and noise presence probabilities:
$$\gamma_{l+1|\lambda_l}(z) = \frac{F_{l+1|\lambda_l}(z)}{\sum_{z'} F_{l+1|\lambda_l}(z')}, \quad z \in \{0, 1\}$$
Step 303: Compute the conditional transition probability:
$$\xi_{l+1|\lambda_l}(y, z) = \frac{F_{l|\lambda_{l-1}}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l)}{\sum_{y', z'} F_{l|\lambda_{l-1}}(y')\, a_{y'z',l}\, b(x_{l+1} \mid s_{l+1} = z', \lambda_l)}$$
Step 304: Compute the average noise and speech presence probabilities:
$$\tilde{\gamma}_{l+1}(z) = \alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)$$
Step 305: Compute the time-dependent smoothing factor:
$$\tilde{\alpha}_{l+1}(z) = \frac{\alpha\, \tilde{\gamma}_l(z)}{\alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)}$$
Step 306: Compute the mean:
$$\mu_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \mu_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, x_{l+1}$$
Step 307: Constrain the new mean: $\mu_{1,l+1} = \max\{\mu_{1,l+1},\ \mu_{0,l+1} + \delta\}$.
Step 308: Compute the new variance:
$$\kappa_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \kappa_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, (x_{l+1} - \mu_{z,l})^2$$
Step 309: Constrain the new variance: $\kappa_{1,l+1} = \max\{\kappa_{0,l+1},\ \kappa_{1,l+1}\}$.
Step 310: Compute the average transition probability:
$$\tilde{\xi}_{l+1}(y, z) = \alpha\, \tilde{\xi}_l(y, z) + (1 - \alpha)\, \xi_{l+1|\lambda_l}(y, z)$$
Step 311: Compute the transition probability:
$$a_{yz,l+1} = a_{yz,l} + \frac{\dfrac{\xi_{l+1|\lambda_l}(y, z)}{a_{yz,l}} - \dfrac{\xi_{l+1|\lambda_l}(y, 1 - z)}{1 - a_{yz,l}}}{\dfrac{K}{a_{yz,l}^2}\, \tilde{\xi}_{l+1}(y, z) + \dfrac{K}{(1 - a_{yz,l})^2}\, \tilde{\xi}_{l+1}(y, 1 - z)}$$
Step 312: Constrain the new transition probabilities: $a_{01,l} = \max\{a_{01,l}, \eta\}$, $a_{00,l} = 1 - a_{01,l}$, $a_{10,l} = \max\{a_{10,l}, \eta\}$, $a_{11,l} = 1 - a_{10,l}$.
The substeps above yield all parameters of $\lambda_{l+1}$, and with them the speech presence probability $\gamma_{l+1|\lambda_l}(1)$ and the noise power spectrum estimate $\mu_{0,l+1}$.
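One frame of the sequential update can be sketched as follows. This is a simplified illustration under explicit assumptions: the transition update of Step 311 is replaced by a plain row normalisation of the smoothed transition statistics (a substitute technique, not the patent's formula), a variance floor is added for numerical safety, and the parameter values for α, δ, and η are arbitrary. The function name and the dictionary layout are hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Gaussian emission density b(x | state)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def sequential_update(model, x, alpha=0.98, delta=1.0, eta=0.01, var_floor=1e-6):
    """Advance a two-state model by one frame with log-amplitude x."""
    mu, kappa, a = model["mu"], model["kappa"], model["a"]
    # Emission likelihoods under the current parameters; the floor avoids 0/0.
    b = [max(gaussian(x, mu[z], kappa[z]), 1e-300) for z in (0, 1)]
    # Steps 301-303: normalised joint weight of (previous state y, current state z).
    joint = [[model["fwd"][y] * a[y][z] * b[z] for z in (0, 1)] for y in (0, 1)]
    total = sum(map(sum, joint))
    xi = [[joint[y][z] / total for z in (0, 1)] for y in (0, 1)]
    gamma = [xi[0][z] + xi[1][z] for z in (0, 1)]  # presence probabilities
    # Step 304: smoothed presence probabilities.
    gamma_avg = [alpha * model["gamma_avg"][z] + (1 - alpha) * gamma[z] for z in (0, 1)]
    # Step 305: state-dependent smoothing factor.
    alpha_t = [alpha * model["gamma_avg"][z] / gamma_avg[z] for z in (0, 1)]
    # Steps 306-307: recursive means with the speech-above-noise constraint.
    new_mu = [alpha_t[z] * mu[z] + (1 - alpha_t[z]) * x for z in (0, 1)]
    new_mu[1] = max(new_mu[1], new_mu[0] + delta)
    # Steps 308-309: recursive variances (using the old means) with constraint.
    new_kappa = [max(alpha_t[z] * kappa[z] + (1 - alpha_t[z]) * (x - mu[z]) ** 2, var_floor)
                 for z in (0, 1)]
    new_kappa[1] = max(new_kappa[0], new_kappa[1])
    # Step 310: smoothed transition statistics.
    xi_avg = [[alpha * model["xi_avg"][y][z] + (1 - alpha) * xi[y][z] for z in (0, 1)]
              for y in (0, 1)]
    # Stand-in for Step 311 (assumption): row-normalise the smoothed statistics.
    # Step 312: floor the cross-state transitions at eta.
    a01 = max(xi_avg[0][1] / (xi_avg[0][0] + xi_avg[0][1]), eta)
    a10 = max(xi_avg[1][0] / (xi_avg[1][0] + xi_avg[1][1]), eta)
    return {"mu": new_mu, "kappa": new_kappa, "a": [[1 - a01, a01], [a10, 1 - a10]],
            "fwd": gamma, "gamma": gamma, "gamma_avg": gamma_avg, "xi_avg": xi_avg}


start = {"mu": [1.0, 5.0], "kappa": [0.5, 0.5], "a": [[0.5, 0.5], [0.5, 0.5]],
         "fwd": [0.5, 0.5], "gamma_avg": [0.5, 0.5],
         "xi_avg": [[0.25, 0.25], [0.25, 0.25]]}
speech_frame = sequential_update(start, 5.0)   # a frame that looks like speech
tracked = start
for _ in range(200):                           # noise floor jumps from 1.0 to 2.0
    tracked = sequential_update(tracked, 2.0)
```

In this sketch `tracked["mu"][0]` is the running noise estimate and `tracked["gamma"][1]` is the speech presence probability for the latest frame; the normalised forward factor is stored as `fwd` for the next frame.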
The noise power spectrum estimation performance of the algorithm of the above embodiment was evaluated. Eight sentences each from male and female speakers in the TIMIT database were mixed with white Gaussian noise, F16 cockpit noise, and babble noise from the NOISEX-92 noise database at signal-to-noise ratios of 0, 5, and 10 dB. The first evaluation metric, the linear segmental error, is defined as:
$$\epsilon_n = \frac{1}{L} \sum_{l=1}^{L} \left\{ 10 \log_{10} \frac{\sum_{k=1}^{N} [D_{k,l} - \hat{D}_{k,l}]^2}{\sum_{k=1}^{N} D_{k,l}^2} \right\}$$
where $D_{k,l}$ is the actual noise amplitude spectrum and $\hat{D}_{k,l}$ is the estimated noise amplitude spectrum. The smaller the error, the closer the estimate is to the true value and the more accurate the estimation. The second evaluation metric, the logarithmic segmental error, is defined as:
$$\epsilon_r = \frac{1}{L} \sum_{l=1}^{L} \left\{ \frac{1}{N} \sum_{k=1}^{N} \left[ 20 \log_{10} |D_{k,l}| - 20 \log_{10} |\hat{D}_{k,l}| \right]^2 \right\}^{1/2}$$
Likewise, the smaller the error, the more accurate the noise estimate.
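The two metrics can be computed directly from their definitions. The following is a small sketch, assuming spectra are stored as lists of frames, each frame a list of per-bin amplitudes, with non-zero amplitudes and a non-zero estimation error in every frame; the function names are illustrative.

```python
import math


def linear_segmental_error(true_spec, est_spec):
    """Mean over frames of 10*log10(squared error energy / true energy), in dB."""
    total = 0.0
    for d, d_hat in zip(true_spec, est_spec):
        num = sum((u - v) ** 2 for u, v in zip(d, d_hat))
        den = sum(u * u for u in d)
        total += 10.0 * math.log10(num / den)
    return total / len(true_spec)


def log_segmental_error(true_spec, est_spec):
    """Mean over frames of the RMS difference between the dB spectra."""
    total = 0.0
    for d, d_hat in zip(true_spec, est_spec):
        sq = [(20.0 * math.log10(abs(u)) - 20.0 * math.log10(abs(v))) ** 2
              for u, v in zip(d, d_hat)]
        total += math.sqrt(sum(sq) / len(sq))
    return total / len(true_spec)


# Two frames with a 10 % amplitude error in every bin: error energy is 1 % of the signal.
eps_n = linear_segmental_error([[1.0, 1.0], [2.0, 2.0]], [[1.1, 0.9], [2.2, 1.8]])
# One frame whose two bins are each off by a factor of 10 (20 dB).
eps_r = log_segmental_error([[1.0, 10.0]], [[10.0, 1.0]])
```

Both values are in decibels; for the linear metric a more negative value means a smaller relative error.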
The algorithm is compared with three mainstream noise power spectrum estimation algorithms: MS denotes the minimum statistics algorithm, MCRA denotes the minima-controlled recursive averaging algorithm, IMCRA denotes the improved minima-controlled recursive averaging algorithm, and SHMM is the algorithm of the invention. Table 1 gives the linear segmental error results.
Table 1: Linear segmental error of noise estimation in various environments
Figure BDA0000064390660000154
Table 2: Logarithmic segmental error of noise estimation in various environments
Figure BDA0000064390660000155
As the tables above show, the algorithm proposed by the invention has a clear advantage over the three mainstream algorithms.
In addition, Fig. 2 compares, by way of an example, the SHMM noise estimation algorithm with the classical minimum statistics algorithm (MS), the minima-controlled recursive averaging algorithm (MCRA), and its improved version IMCRA. In this example, the signal-to-noise ratio of the noisy speech signal drops abruptly from 10 dB to 4 dB at 3.75 seconds and rises from 4 dB back to 10 dB at 13.1 seconds. (a) is the power spectrum of the noisy speech on one subband; (b) is the noise power spectrum envelope estimated by the MS algorithm, where the dotted line represents the true noise power spectrum envelope; (c) shows the estimate of the MCRA algorithm; (d) shows the estimate of the IMCRA algorithm; (e) shows the estimate of the present algorithm. As the figure shows, the other three algorithms react slowly to the sudden rise in noise and cannot follow the jump quickly, whereas the SHMM algorithm performs better.
Over the past decades, many algorithms have been devised for estimating voice activity and the noise power spectrum. The temporal correlation of the speech signal in the frequency domain is an important cue. Because speech is a non-stationary signal, this temporal correlation also changes over time. Earlier algorithms, however, did not pay enough attention to this temporal correlation: they used it only in a simple way and did not describe the time-varying correlation adaptively. The invention uses a sequential hidden Markov model (SHMM) to describe the temporal correlation of speech on each frequency component. The sequential estimation method is built on first-order regression, and the temporal correlation and its parameter set change together with the input signal. This statistical model accurately reflects the temporal statistical characteristics of speech, so the estimation algorithm proposed by the invention outperforms current mainstream algorithms (for example minimum statistics and minima-controlled recursive averaging).
In addition, previous SHMM models were all built on high-order regression. Compared with a high-order regression SHMM, the first-order regression SHMM proposed in the invention greatly reduces computation and storage. This is another novel aspect of the invention.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the invention.

Claims (10)

1. A noise power spectrum estimation and voice activity detection method, in which a sequential hidden Markov model (SHMM) based on first-order regression describes the temporal correlation of speech on each frequency component, the SHMM is updated progressively by incremental learning, and the speech presence probability on each frequency subband and the noise power spectrum are finally inferred, so as to accurately reflect the temporal statistical characteristics of speech, the method comprising the following steps:
1) extracting the log-amplitude spectral envelope of the speech signal on each frequency component and building a corresponding two-state hidden Markov model, in which one state represents the distribution of speech energy and the other the distribution of noise energy, each state being represented by a Gaussian distribution;
2) for a segment of speech data, setting up an M-frame buffer, storing the first M frames of the input signal in the buffer, extracting the log-amplitude spectra of the M buffered frames, and building an initial model with a maximum likelihood estimation algorithm;
3) after the initial model $\lambda_M$ is obtained, updating the HMM of each frequency band frame by frame from frame M+1 onward using incremental learning, and recursively obtaining the noise value and the speech presence probability.
2. The noise power spectrum estimation and voice activity detection method according to claim 1, the specific steps of which comprise:
1) extracting the log-amplitude spectral envelope of the speech signal on each frequency component; for the log-amplitude spectrum time series $x_l = \{x_1, x_2, \ldots, x_l\}$ on a frequency component, building a hidden Markov model, where $s_l = \{s_1, s_2, \ldots, s_l\}$, $s_t \in \{0, 1\}$, is the corresponding state sequence, 1 denotes the speech-present state, 0 denotes the noise state, and $\lambda_l$ denotes the model parameter estimate obtained from the sequence $x_l$; for a given parameter set $\lambda_l$, the probability density function of the observation sequence $x_l$ can be expressed as:
$$p(x_l \mid \lambda_l) = \sum_{s_l} p(s_l \mid \lambda_l)\, p(x_l \mid \lambda_l, s_l);$$
where $p(s_l \mid \lambda_l)$ is the prior probability of the state sequence $s_l$, expressed as:
$$p(s_l \mid \lambda_l) = \prod_{t=1}^{l} a_{s_{t-1}, s_t};$$
here
Figure FDA0000064390650000013
denotes the state transition probability,
Figure FDA0000064390650000014
denotes the initial state probability, and $p(x_l \mid \lambda_l, s_l)$ is the likelihood of the observation sequence $x_l$ given the state sequence $s_l$ and the parameter set $\lambda_l$:
$$p(x_l \mid \lambda_l, s_l) = \prod_{t=1}^{l} b(x_t \mid s_t, \lambda_l);$$
where
$$b(x_t \mid s_t, \lambda_l) = \frac{1}{\sqrt{2\pi\kappa_{s_t,l}}} \exp\left\{ -\frac{1}{2}\,\frac{(x_t - \mu_{s_t,l})^2}{\kappa_{s_t,l}} \right\};$$
here $\kappa_{z,l}$ is the variance of the Gaussian distribution, $\mu_{z,l}$ is its mean, $s_t = z$, $\lambda_l = \{\mu_{0,l}, \mu_{1,l}, \kappa_{1,l}, \kappa_{0,l}, a_{01,l}, a_{10,l}, a_{00,l}, a_{11,l}, \pi_0, \pi_1\}$, and the initial probabilities $\pi_z$ in the parameter set do not change over time;
in this model $\mu_{0,l}$ is the noise to be estimated, and the probability that speech is present on a given frequency at frame l can be derived as
Figure FDA0000064390650000022
2) for a segment of speech data, setting up an M-frame buffer, storing the first M frames of the input signal in the buffer, extracting the log-amplitude spectra of the M buffered frames, and substituting them into the HMM of step 1) to initialize a hidden Markov model $\lambda_M$ on each frequency, where the subscript M denotes the length of the initialization time window and l ≥ M;
3) after the initial model $\lambda_M$ is obtained, updating the SHMM frame by frame from frame M+1 onward using incremental learning, recursively obtaining $\lambda_l$, and deriving the noise value $\mu_{0,l}$ and the speech presence probability on a given frequency at frame l.
3. The noise power spectrum estimation and voice activity detection method according to claim 1 or 2, characterized in that the step of extracting the amplitude spectrum of a frame in step 1) comprises:
first, preprocessing the digitized audio signal of the frame; letting the frame length be F samples, zero-padding the frame to N samples, where N ≥ F, $N = 2^j$, j is an integer and j ≥ 8, and applying an N-point discrete Fourier transform to obtain the discrete spectrum
Figure FDA0000064390650000023
where $y_{l,n}$ is the n-th sample of the l-th buffered frame and $Y_{l,k}$ is the k-th Fourier transform value of the l-th buffered frame (k = 0, 1, ..., N-1); its amplitude value can then be calculated as
Figure FDA0000064390650000024
where b(r) is a window function.
4. The noise power spectrum estimation and voice activity detection method according to claim 3, characterized in that the preprocessing comprises windowing and/or pre-emphasis.
5. The noise power spectrum estimation and voice activity detection method according to claim 3, characterized in that the window function is a Hanning window or a Hamming window.
6. The noise power spectrum estimation and voice activity detection method according to claim 1 or 2, characterized in that, in the initialization of the HMM in step 2), the initialization on a given frequency comprises:
Step 201): dividing the M samples into two classes by clustering:
Figure FDA0000064390650000026
where $M_0 + M_1 = M$, the class with the larger mean is denoted by subscript (1) and the other class by subscript (0);
the means of the two classes are
Figure FDA0000064390650000027
the mean of the lower-energy class is
Figure FDA0000064390650000028
where
Figure FDA0000064390650000031
the variances of the two classes are:
$$\bar{\kappa}_{0,M} = \frac{1}{M_0} \sum_{j=1}^{M_0} (x_{i_j} - \bar{\mu}_{0,M})^2, \qquad \bar{\kappa}_{1,M} = \frac{1}{M_1} \sum_{j=1}^{M_1} (x_{i_j} - \bar{\mu}_{1,M})^2;$$
the initial weight coefficients of the two classes are:
$$\bar{a}_{00,M} = \bar{a}_{01,M} = \bar{a}_{11,M} = \bar{a}_{10,M} = 0.5;$$
computing the likelihood of the new model,
Figure FDA0000064390650000034
and beginning the iteration; in the following iteration, the old model parameter set is denoted $\lambda'_M$; before the iteration begins,
Figure FDA0000064390650000036
L' is set to a very large negative number, and the forward and backward factors are initialized:
Figure FDA0000064390650000038
Step 202): computing the forward factor:
$$\bar{F}_t(z) = \sum_{y} \bar{F}_{t-1}(y)\, \bar{a}_{yz,M}\, b(x_t \mid z, \bar{\lambda}_M), \quad z, y \in \{0, 1\};$$
Step 203): computing the backward factor:
$$\bar{B}_t(z) = \sum_{y} \bar{B}_{t+1}(y)\, \bar{a}_{zy,M}\, b(x_{t+1} \mid y, \bar{\lambda}_M), \quad z, y \in \{0, 1\};$$
Step 204): computing the noise and speech presence probabilities:
$$p(z \mid x_t, \bar{\lambda}_M) = \frac{\bar{F}_t(z)\, \bar{B}_t(z)}{\sum_{z'} \bar{F}_t(z')\, \bar{B}_t(z')}, \quad z \in \{0, 1\};$$
Step 205): stopping the iteration if the stopping condition on ζ is met, where ζ is a small number close to zero but greater than zero;
Step 206): computing the transition probability:
$$p(s_{t-1} = y, s_t = z \mid x_t, \bar{\lambda}_M) = \frac{\bar{F}_{t-1}(y)\, \bar{B}_t(z)\, \bar{a}_{yz,M}\, b(x_t \mid z, \bar{\lambda}_M)}{\sum_{z'} \bar{F}_{t-1}(y)\, \bar{B}_t(z')\, \bar{a}_{yz',M}\, b(x_t \mid z', \bar{\lambda}_M)};$$
Step 207): computing the new initial probability $\pi'_z = p(s_1 = z \mid x_1, \bar{\lambda}_M)$;
Step 208): computing the new mean:
$$\mu'_{z,M} = \frac{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)\, x_t}{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)};$$
Step 209): constraining the new mean: $\mu'_{1,M} = \max\{\mu'_{1,M},\ \mu'_{0,M} + \delta\}$, where δ is a constant in the range 0 to 100;
Step 210): computing the new variance:
$$\kappa'_{z,M} = \frac{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)\, (x_t - \bar{\mu}_{z,M})^2}{\sum_{t=1}^{M} p(s_t = z \mid x_t, \bar{\lambda}_M)};$$
Step 211): constraining the new variance: $\kappa'_{1,M} = \max\{\kappa'_{0,M},\ \kappa'_{1,M}\}$;
Step 212): computing the new transition probability:
$$a'_{yz,M} = \frac{\sum_{t=1}^{M} p(s_{t-1} = y, s_t = z \mid x_t, \bar{\lambda}_M)}{\sum_{t=1}^{M} \sum_{z'} p(s_{t-1} = y, s_t = z' \mid x_t, \bar{\lambda}_M)};$$
Step 213): computing the likelihood of the new model, $\bar{L} = \log p(x_M \mid \lambda'_M)$;
Step 214): if the condition
Figure FDA0000064390650000043
is satisfied, terminating the iteration, where ε is a very small number; if
Figure FDA0000064390650000044
continuing the iteration and jumping to step 202).
7. The noise power spectrum estimation and voice activity detection method according to claim 6, characterized in that the clustering in step 201) uses LBG unsupervised clustering or fuzzy clustering.
8. The noise power spectrum estimation and voice activity detection method according to claim 1 or 2, characterized in that, in the sequential update of the HMM in step 3), after the initial model $\lambda_M$ has been built, the HMM is updated frame by frame from frame M+1 onward using incremental learning; the iteration can be expressed as: on each frequency, given $\lambda_l$ and the current observation $x_{l+1}$, infer $\lambda_{l+1}$; a Fourier transform is applied to frame l+1 to obtain $Y_{l+1,k}$, where 0 ≤ k < N; on each frequency, the amplitude value is computed:
Figure FDA0000064390650000045
for each frequency, the parameter update steps at frame l+1 are as follows:
Step 301): computing the forward factor:
$$F_{l+1|\lambda_l}(z) = \sum_{y} F_{l|\lambda_{l-1}}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l), \quad z \in \{0, 1\};$$
Step 302): computing the speech and noise presence probabilities:
$$\gamma_{l+1|\lambda_l}(z) = \frac{F_{l+1|\lambda_l}(z)}{\sum_{z'} F_{l+1|\lambda_l}(z')}, \quad z \in \{0, 1\};$$
Step 303): computing the conditional transition probability:
$$\xi_{l+1|\lambda_l}(y, z) = \frac{F_{l|\lambda_{l-1}}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l)}{\sum_{y', z'} F_{l|\lambda_{l-1}}(y')\, a_{y'z',l}\, b(x_{l+1} \mid s_{l+1} = z', \lambda_l)};$$
Step 304): computing the average noise and speech presence probabilities:
$$\tilde{\gamma}_{l+1}(z) = \alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z);$$
Step 305): computing the time-dependent smoothing factor:
$$\tilde{\alpha}_{l+1}(z) = \frac{\alpha\, \tilde{\gamma}_l(z)}{\alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)};$$
Step 306): computing the state mean:
$$\mu_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \mu_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, x_{l+1};$$
Step 307): constraining the new state mean: $\mu_{1,l+1} = \max\{\mu_{1,l+1},\ \mu_{0,l+1} + \delta\}$, l ≥ M;
Step 308): computing the new state variance:
$$\kappa_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \kappa_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, (x_{l+1} - \mu_{z,l})^2;$$
Step 309): constraining the new state variance: $\kappa_{1,l+1} = \max\{\kappa_{0,l+1},\ \kappa_{1,l+1}\}$, l ≥ M;
Step 310): computing the average transition probability:
$$\tilde{\xi}_{l+1}(y, z) = \alpha\, \tilde{\xi}_l(y, z) + (1 - \alpha)\, \xi_{l+1|\lambda_l}(y, z);$$
Step 311): computing the state transition probability:
$$a_{yz,l+1} = a_{yz,l} + \frac{\dfrac{\xi_{l+1|\lambda_l}(y, z)}{a_{yz,l}} - \dfrac{\xi_{l+1|\lambda_l}(y, 1 - z)}{1 - a_{yz,l}}}{\dfrac{K}{a_{yz,l}^2}\, \tilde{\xi}_{l+1}(y, z) + \dfrac{K}{(1 - a_{yz,l})^2}\, \tilde{\xi}_{l+1}(y, 1 - z)};$$
Step 312): constraining the new transition probabilities: $a_{01,l} = \max\{a_{01,l}, \eta\}$, $a_{00,l} = 1 - a_{01,l}$, $a_{10,l} = \max\{a_{10,l}, \eta\}$, $a_{11,l} = 1 - a_{10,l}$, l ≥ M;
the substeps above yield all parameters of $\lambda_{l+1}$, and with them the speech presence probability $\gamma_{l+1|\lambda_l}(1)$ and the noise power spectrum estimate $\mu_{0,l+1}$.
9. The noise power spectrum estimation and voice activity detection method according to claim 8, characterized in that the incremental learning method used for the HMM in step 3) comprises a recursive weight coefficient, a recursive mean, and a recursive variance;
wherein the recursive mean is given by the formula in which
Figure FDA0000064390650000055
is a smoothing factor that depends on the speech presence probability, less than 1 but close to 1;
the recursive variance is:
$$\kappa_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \kappa_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, (x_{l+1} - \mu_{z,l})^2;$$
the recursive transition probability is:
Figure FDA0000064390650000057
or
Figure FDA0000064390650000058
where β is a smoothing factor less than 1 but close to 1.
10. The noise power spectrum estimation and voice activity detection method according to claim 1, characterized in that the parameter recursion of the sequential hidden Markov model based on first-order regression is:
computing the forward factor of the HMM:
$$F_{l+1|\lambda_l}(z) = \sum_{y} F_{l|\lambda_{l-1}}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l), \quad z \in \{0, 1\};$$
computing the speech and noise presence probabilities:
$$\gamma_{l+1|\lambda_l}(z) = \frac{F_{l+1|\lambda_l}(z)}{\sum_{z'} F_{l+1|\lambda_l}(z')}, \quad z \in \{0, 1\};$$
computing the conditional transition probability:
$$\xi_{l+1|\lambda_l}(y, z) = \frac{F_{l|\lambda_{l-1}}(y)\, a_{yz,l}\, b(x_{l+1} \mid s_{l+1} = z, \lambda_l)}{\sum_{y', z'} F_{l|\lambda_{l-1}}(y')\, a_{y'z',l}\, b(x_{l+1} \mid s_{l+1} = z', \lambda_l)};$$
computing the average noise and speech presence probabilities:
$$\tilde{\gamma}_{l+1}(z) = \alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z);$$
computing the time-dependent smoothing factor:
$$\tilde{\alpha}_{l+1}(z) = \frac{\alpha\, \tilde{\gamma}_l(z)}{\alpha\, \tilde{\gamma}_l(z) + (1 - \alpha)\, \gamma_{l+1|\lambda_l}(z)};$$
computing the mean:
$$\mu_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \mu_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, x_{l+1};$$
computing the new variance:
$$\kappa_{z,l+1} = \tilde{\alpha}_{l+1}(z)\, \kappa_{z,l} + [1 - \tilde{\alpha}_{l+1}(z)]\, (x_{l+1} - \mu_{z,l})^2;$$
computing the average transition probability:
$$\tilde{\xi}_{l+1}(y, z) = \alpha\, \tilde{\xi}_l(y, z) + (1 - \alpha)\, \xi_{l+1|\lambda_l}(y, z);$$
computing the transition probability:
$$a_{yz,l+1} = a_{yz,l} + \frac{\dfrac{\xi_{l+1|\lambda_l}(y, z)}{a_{yz,l}} - \dfrac{\xi_{l+1|\lambda_l}(y, 1 - z)}{1 - a_{yz,l}}}{\dfrac{K}{a_{yz,l}^2}\, \tilde{\xi}_{l+1}(y, z) + \dfrac{K}{(1 - a_{yz,l})^2}\, \tilde{\xi}_{l+1}(y, 1 - z)}.$$
CN201110141137.5A 2011-05-27 2011-05-27 Method for estimating noise power spectrum and voice activity Expired - Fee Related CN102800322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110141137.5A CN102800322B (en) 2011-05-27 2011-05-27 Method for estimating noise power spectrum and voice activity

Publications (2)

Publication Number Publication Date
CN102800322A true CN102800322A (en) 2012-11-28
CN102800322B CN102800322B (en) 2014-03-26

Family

ID=47199411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110141137.5A Expired - Fee Related CN102800322B (en) 2011-05-27 2011-05-27 Method for estimating noise power spectrum and voice activity

Country Status (1)

Country Link
CN (1) CN102800322B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103489454A (en) * 2013-09-22 2014-01-01 浙江大学 Voice endpoint detection method based on waveform morphological characteristic clustering
CN104575513A (en) * 2013-10-24 2015-04-29 展讯通信(上海)有限公司 Burst noise processing system and burst noise detection and suppression method and device
WO2015078268A1 (en) * 2013-11-27 2015-06-04 Tencent Technology (Shenzhen) Company Limited Method, apparatus and server for processing noisy speech
CN105355199A (en) * 2015-10-20 2016-02-24 河海大学 Model combination type speech recognition method based on GMM (Gaussian mixture model) noise estimation
CN105810201A (en) * 2014-12-31 2016-07-27 展讯通信(上海)有限公司 Voice activity detection method and system
CN106331969A (en) * 2015-07-01 2017-01-11 奥迪康有限公司 Enhancement of noisy speech based on statistical speech and noise models
WO2017063516A1 (en) * 2015-10-13 2017-04-20 阿里巴巴集团控股有限公司 Method of determining noise signal, and method and device for audio noise removal
CN106653047A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Automatic gain control method and device for audio data
CN107731230A (en) * 2017-11-10 2018-02-23 北京联华博创科技有限公司 A kind of court's trial writing-record system and method
CN104269180B (en) * 2014-09-29 2018-04-13 华南理工大学 A kind of quasi- clean speech building method for speech quality objective assessment
CN108113646A (en) * 2016-11-28 2018-06-05 中国科学院声学研究所 A kind of detection in cardiechema signals cycle and the state dividing method of heart sound
CN108986832A (en) * 2018-07-12 2018-12-11 北京大学深圳研究生院 Ears speech dereverberation method and device based on voice probability of occurrence and consistency
CN109616139A (en) * 2018-12-25 2019-04-12 平安科技(深圳)有限公司 Pronunciation signal noise power spectral density estimation method and device
CN109741759A (en) * 2018-12-21 2019-05-10 南京理工大学 A kind of acoustics automatic testing method towards specific birds species
CN110136738A (en) * 2019-06-13 2019-08-16 苏州思必驰信息科技有限公司 Noise estimation method and device
WO2021057239A1 (en) * 2019-09-23 2021-04-01 腾讯科技(深圳)有限公司 Speech data processing method and apparatus, electronic device and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1242553A (en) * 1998-03-24 2000-01-26 松下电器产业株式会社 Speech detection system for noisy conditions
US20020188445A1 (en) * 2001-06-01 2002-12-12 Dunling Li Background noise estimation method for an improved G.729 annex B compliant voice activity detection circuit
CN101853661A (en) * 2010-05-14 2010-10-06 中国科学院声学研究所 Noise spectrum estimation and voice mobility detection method based on unsupervised learning
CN102044243A (en) * 2009-10-15 2011-05-04 华为技术有限公司 Method and device for voice activity detection (VAD) and encoder

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1242553A (en) * 1998-03-24 2000-01-26 松下电器产业株式会社 Speech detection system for noisy conditions
US20020188445A1 (en) * 2001-06-01 2002-12-12 Dunling Li Background noise estimation method for an improved G.729 annex B compliant voice activity detection circuit
CN102044243A (en) * 2009-10-15 2011-05-04 华为技术有限公司 Method and device for voice activity detection (VAD) and encoder
CN101853661A (en) * 2010-05-14 2010-10-06 中国科学院声学研究所 Noise spectrum estimation and voice mobility detection method based on unsupervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵力: "《语音信号处理》", 31 March 2003, 机械工业出版社 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103489454B (en) * 2013-09-22 2016-01-20 浙江大学 Based on the sound end detecting method of wave configuration feature cluster
CN103489454A (en) * 2013-09-22 2014-01-01 浙江大学 Voice endpoint detection method based on waveform morphological characteristic clustering
CN104575513B (en) * 2013-10-24 2017-11-21 展讯通信(上海)有限公司 The processing system of burst noise, the detection of burst noise and suppressing method and device
CN104575513A (en) * 2013-10-24 2015-04-29 展讯通信(上海)有限公司 Burst noise processing system and burst noise detection and suppression method and device
WO2015078268A1 (en) * 2013-11-27 2015-06-04 Tencent Technology (Shenzhen) Company Limited Method, apparatus and server for processing noisy speech
US9978391B2 (en) 2013-11-27 2018-05-22 Tencent Technology (Shenzhen) Company Limited Method, apparatus and server for processing noisy speech
CN104269180B (en) * 2014-09-29 2018-04-13 华南理工大学 A kind of quasi- clean speech building method for speech quality objective assessment
CN105810201B (en) * 2014-12-31 2019-07-02 展讯通信(上海)有限公司 Voice activity detection method and its system
CN105810201A (en) * 2014-12-31 2016-07-27 展讯通信(上海)有限公司 Voice activity detection method and system
CN106331969A (en) * 2015-07-01 2017-01-11 奥迪康有限公司 Enhancement of noisy speech based on statistical speech and noise models
WO2017063516A1 (en) * 2015-10-13 2017-04-20 阿里巴巴集团控股有限公司 Method of determining noise signal, and method and device for audio noise removal
US10796713B2 (en) 2015-10-13 2020-10-06 Alibaba Group Holding Limited Identification of noise signal for voice denoising device
CN105355199B (en) * 2015-10-20 2019-03-12 河海大学 Model combination type speech recognition method based on GMM (Gaussian mixture model) noise estimation
CN105355199A (en) * 2015-10-20 2016-02-24 河海大学 Model combination type speech recognition method based on GMM (Gaussian mixture model) noise estimation
CN108113646A (en) * 2016-11-28 2018-06-05 中国科学院声学研究所 Method for detecting the heart sound signal cycle and segmenting heart sound states
CN106653047A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Automatic gain control method and device for audio data
CN107731230A (en) * 2017-11-10 2018-02-23 北京联华博创科技有限公司 Court trial record system and method
CN108986832A (en) * 2018-07-12 2018-12-11 北京大学深圳研究生院 Binaural voice dereverberation method and device based on voice occurrence probability and consistency
CN108986832B (en) * 2018-07-12 2020-12-15 北京大学深圳研究生院 Binaural voice dereverberation method and device based on voice occurrence probability and consistency
CN109741759B (en) * 2018-12-21 2020-07-31 南京理工大学 Acoustic automatic detection method for specific bird species
CN109741759A (en) * 2018-12-21 2019-05-10 南京理工大学 Acoustic automatic detection method for specific bird species
CN109616139A (en) * 2018-12-25 2019-04-12 平安科技(深圳)有限公司 Speech signal noise power spectral density estimation method and device
CN109616139B (en) * 2018-12-25 2023-11-03 平安科技(深圳)有限公司 Speech signal noise power spectral density estimation method and device
CN110136738A (en) * 2019-06-13 2019-08-16 苏州思必驰信息科技有限公司 Noise estimation method and device
WO2021057239A1 (en) * 2019-09-23 2021-04-01 腾讯科技(深圳)有限公司 Speech data processing method and apparatus, electronic device and readable storage medium

Also Published As

Publication number Publication date
CN102800322B (en) 2014-03-26

Similar Documents

Publication Publication Date Title
CN102800322B (en) Method for estimating noise power spectrum and voice activity
CN100543842C (en) Method for background noise suppression based on multiple statistical models and minimum mean-square error
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on neural network
CN109272990A (en) Speech recognition method based on convolutional neural networks
Cui et al. Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR
CN109192200B (en) Speech recognition method
CN101853661A (en) Noise spectrum estimation and voice activity detection method based on unsupervised learning
JPS62231996A (en) Allowance evaluation of word corresponding to voice input
CN103345923A (en) Sparse representation based short-voice speaker recognition method
CN105139864A (en) Voice recognition method and voice recognition device
Frey et al. Algonquin-learning dynamic noise models from noisy speech for robust speech recognition
Todkar et al. Speaker recognition techniques: A review
Jung et al. Linear-scale filterbank for deep neural network-based voice activity detection
CN103345920A (en) Self-adaptation interpolation weighted spectrum model voice conversion and reconstructing method based on Mel-KSVD sparse representation
CN111081273A (en) Voice emotion recognition method based on glottal wave signal feature extraction
Liang et al. An improved noise-robust voice activity detector based on hidden semi-Markov models
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
Wan et al. Variational bayesian learning for removal of sparse impulsive noise from speech signals
Pham et al. Using artificial neural network for robust voice activity detection under adverse conditions
CN116189671A (en) Data mining method and system for language teaching
Arslan et al. Noise robust voice activity detection based on multi-layer feed-forward neural network
Van Dalen Statistical models for noise-robust speech recognition
Xiao et al. Single-channel speech separation method based on attention mechanism
Kathania et al. Soft-weighting technique for robust children speech recognition under mismatched condition
Chen et al. Machine Learning for Predictive Analytics in the Improvement of English Speech Feature Recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140326