Summary of the invention
The object of the invention is to overcome the above-mentioned defects of the prior art by providing a text-independent speaker recognition method based on a weighted Bayesian mixture model.
The technical scheme adopted by the present invention to solve its technical problem is a text-independent speaker recognition method based on a weighted Bayesian mixture model, the method comprising the following steps:
Step 1: pre-process the voice signal, comprising sampling and quantization, pre-emphasis, framing, and windowing;
Step 2: feature extraction on the speech frames: for each speech frame, calculate the D-order linear prediction cepstral coefficients and use them as the D-dimensional feature vector of that frame;
Step 3: for each speaker's corresponding training set X = {x_n}, n = 1, ..., N, where N is the number of D-dimensional feature vectors x_n used for training that speaker, model X with the weighted Bayesian mixture model (WBMM) and estimate, through training, the parameter values and the distributions of the random variables in the WBMM; if G speakers need to be identified in the recognition system, repeat the training process G times to obtain WBMM_1, ..., WBMM_g, ..., WBMM_G respectively;
Step 4: for the voice to be identified, first perform pre-processing and feature extraction to obtain the corresponding D-dimensional feature vector x'; calculate the marginal likelihood values {MLIK_g(x')}, g = 1, ..., G, of x' with respect to each speaker's corresponding model WBMM_1, ..., WBMM_g, ..., WBMM_G; the final recognition result is the speaker corresponding to the maximum MLIK_g(x'), that is:
In the text-independent speaker recognition method based on the weighted Bayesian mixture model of the present invention, the step of estimating, through training, the parameter values and the distributions of the random variables in the WBMM described in step 3 is as follows:
Step 3-1: set the values of the hyperparameters {λ_0, m_0, β_0, ν_0, V_0} in the WBMM, where λ_0 = 0.01, m_0 = 0 (0 is the D-dimensional zero vector), β_0 = 1, ν_0 = D, and V_0 = 400I (I is the D × D identity matrix);
Step 3-2: set the value of α; α may take any integer between -8 and -1;
Step 3-3: generate N random integers uniformly distributed on the interval [1, K], where K is the number of mixture components of the WBMM and may take any integer in 16 to 32, and count the frequency of each integer on this interval; that is, if the integer i is generated N_i times, then θ_i = N_i / N; for each x_n, the initial distribution of the corresponding hidden variable z_n is
In addition, set the iteration counter t = 1 and start the iterative loop;
Step 3-4: calculate three intermediate variables:
Step 3-5: update the distributions of the random variables {π_i}, i = 1, ..., K, in the WBMM; π_i represents the mixing proportion of the i-th mixture component and follows a Dirichlet distribution, that is, q(π_i) = Dir(π_i | λ_i); the update formula for the corresponding hyperparameters {λ_i}, i = 1, ..., K, is as follows:
Step 3-6: update the distributions of the random variables {μ_i, Τ_i}, i = 1, ..., K, in the WBMM; they represent the mean and the inverse covariance matrix of the i-th component, respectively, and follow a joint Gaussian-Wishart distribution; the corresponding hyperparameters {m_i, β_i, ν_i, V_i}, i = 1, ..., K, are updated as follows:
Step 3-7: update the distributions of the hidden variables {z_n}, n = 1, ..., N, as follows:
Wherein,
In the above formula, the computing formula for each expectation <·> is as follows:
In the formula above, ψ(·) is the standard digamma function (the derivative of the logarithm of the standard Gamma function Γ(·), i.e., ψ(·) = (ln Γ(·))');
Step 3-8: calculate the marginal likelihood value MLIK_t after the current iteration, where t is the current iteration count:
Step 3-9: calculate the difference ΔMLIK = MLIK_t − MLIK_{t−1} between the marginal likelihood values after the current iteration and the previous iteration; if ΔMLIK ≤ δ, the process of estimating through training the parameter values and the distributions of the random variables in the WBMM ends; otherwise, return to step 3-4 above, increase the value of t by 1, and carry out the next iteration; the threshold δ may take any value in the range 10^-5 to 10^-4.
In the text-independent speaker recognition method based on the weighted Bayesian mixture model of the present invention, the formula for calculating, in the identification process described in step 4, the marginal likelihood values {MLIK_g(x')}, g = 1, ..., G, of x' with respect to each speaker's model WBMM_1, ..., WBMM_g, ..., WBMM_G is as follows:
Wherein <·> and q(z_ni = 1) are the expectation and probability in WBMM_g after training.
The text-independent speaker recognition method based on the weighted Bayesian mixture model adopted in the present invention effectively fuses prior information with the training data under a Bayesian framework, thereby solving the model over-fitting and under-fitting problems that easily occur in traditional GMM-based speaker recognition under the maximum-likelihood criterion; the model has higher flexibility.
The text-independent speaker recognition method based on the weighted Bayesian mixture model of the present invention introduces an additional parameter α to control the weight of the data in training, so that the relative weights of the prior information and the training data can be controlled more easily and flexibly.
The text-independent speaker recognition method based on the weighted Bayesian mixture model adopted in the present invention can accurately obtain optimal parameter estimates and the posterior distributions of the relevant parameters according to the distribution of the data; after this method is adopted, the recognition rate of the text-independent speaker recognition system is greatly improved.
Beneficial effects:
1. The model of the present invention has higher flexibility.
2. The invention makes the relative weights of prior information and training data easier and more flexible to control.
3. The recognition rate of the present invention is greatly improved.
Embodiment
The technical solutions of the invention are further elaborated below in conjunction with the drawings and embodiments.
As shown in Figure 1, the invention provides a text-independent speaker recognition method based on a weighted Bayesian mixture model, the method comprising the following steps:
The first step: pre-processing of the voice signal
(1) Sampling and quantization
Each segment of voice signal y_a(t) in the data set for training and the data set for identification is sampled to obtain the amplitude sequence y(n) of the digital voice signal. y(n) is quantized and encoded by pulse code modulation (PCM) to obtain the quantized representation y'(n) of the amplitude sequence. The sampling and quantization precision here is decided according to the requirements of the speaker recognition system applied under different environments. For most voice signals, the sampling frequency F is 8 kHz to 16 kHz, and the quantization bit depth is 16 or 24.
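The uniform PCM quantization described above can be sketched as follows; `quantize_pcm` is an illustrative helper (not part of the patent), assuming amplitudes already normalized to [-1, 1):

```python
def quantize_pcm(y, bits=16):
    """Uniform PCM quantization of amplitudes assumed normalized to [-1, 1).

    Maps each sample to a signed integer with the given bit depth,
    clipping to the representable range.
    """
    levels = 2 ** (bits - 1)
    return [max(-levels, min(levels - 1, int(round(v * levels)))) for v in y]
```

For example, with 16-bit quantization an amplitude of 0.5 maps to 16384 and -1.0 maps to -32768.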
(2) Pre-emphasis
y'(n) is passed through a digital filter Z to obtain an amplitude sequence s''(n) in which the high-, medium-, and low-frequency amplitudes of the voice signal are balanced. The transfer function of the digital filter here is H(z) = 1 − az^{-1}, where the pre-emphasis factor a ranges from 0.8 to 0.97.
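The filter H(z) = 1 − az^{-1} corresponds in the time domain to s''(n) = y'(n) − a·y'(n−1); a minimal sketch (the function name is illustrative):

```python
def pre_emphasis(y, a=0.95):
    """Apply H(z) = 1 - a*z^(-1): s[n] = y[n] - a*y[n-1].

    The pre-emphasis factor a lies in [0.8, 0.97]; the first sample is
    passed through unchanged since it has no predecessor.
    """
    return [y[0]] + [y[n] - a * y[n - 1] for n in range(1, len(y))]
```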
(3) Framing and windowing
With a frame length of τ (unit: milliseconds) and a frame shift of τ/4, s''(n) is divided into a series of speech frames F_t; that is, each speech frame contains τ × F voice signal samples. Then, the values of the Hamming window function are calculated:
Finally, the Hamming window is applied to each speech frame F_t to obtain the windowed frame, thereby completing the pre-processing of the voice signal:
n = 1, ..., τ × F
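The framing and Hamming windowing above can be sketched as follows; `frame_and_window` is an illustrative helper in which `frame_len` plays the role of τ × F samples and `hop` is the frame shift (frame_len // 4 in the text):

```python
import math

def frame_and_window(s, frame_len, hop):
    """Split s into frames of frame_len samples with the given hop,
    then multiply each frame by a Hamming window:
        w[n] = 0.54 - 0.46 * cos(2*pi*n / (frame_len - 1)).
    """
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
         for n in range(frame_len)]
    frames = []
    for start in range(0, len(s) - frame_len + 1, hop):
        frames.append([s[start + n] * w[n] for n in range(frame_len)])
    return frames
```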
The second step: feature extraction on the speech frames
In this method, for each frame, the D-order linear prediction cepstral coefficients (LPCC) are obtained by calculation and used as the corresponding D-dimensional feature vector; D here is decided according to the requirements of the speaker recognition system applied under different environments, and ranges from 10 to 20. The calculation of the LPCC comprises the following processes:
(1) Calculate the D-order linear prediction coefficients; the computing formula is as follows:
d = 1, ..., D
wherein the above represents a system of D equations with D unknowns; solving this system of equations yields the D-order linear prediction coefficients corresponding to the current frame.
(2) From the D-order linear prediction coefficients, calculate the D-order linear prediction cepstral coefficients x_1, ..., x_D by the following formula:
d = 1, ..., D
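The patent's conversion formula itself appears only as an equation in the original; as a hedged sketch under that caveat, the textbook LPC-to-cepstrum recursion commonly used for this step is:

```python
def lpc_to_lpcc(a):
    """Convert D-order LPC coefficients to cepstral coefficients.

    a[1..D] holds the LPC coefficients (a[0] is unused). This follows the
    standard recursion (an assumption; the patent's exact formula is not
    reproduced here):
        c_d = a_d + sum_{k=1}^{d-1} (k/d) * c_k * a_{d-k}
    Returns [c_1, ..., c_D].
    """
    D = len(a) - 1
    c = [0.0] * (D + 1)
    for d in range(1, D + 1):
        c[d] = a[d] + sum((k / d) * c[k] * a[d - k] for k in range(1, d))
    return c[1:]
```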
With the above method, the D-dimensional feature vectors of all speakers, both those used for training and those used for identification, are calculated. Suppose that in the training set a certain speaker (assumed to be the g-th speaker) has N corresponding feature vectors; then the training set corresponding to that speaker can be expressed as X^(g) (G is the number of speakers). Since each training uses the training data set of a single speaker, for convenience of notation the present invention omits the superscript "(g)" of X^(g); that is, X = {x_n}, n = 1, ..., N, wherein x_n = (x_n1, ..., x_nD) is the n-th D-dimensional feature vector of the g-th speaker obtained through the pre-processing and feature extraction steps.
The third step: training
Since the speaker recognition here is text-independent, a mixture model is adopted to model the extracted speech feature vectors. The present invention designs a weighted Bayesian mixture model (WBMM for short). Compared with the traditional Gaussian mixture model (GMM) for text-independent speaker recognition, the WBMM has two significant differences. First, the WBMM introduces an additional parameter α, that is, a weighted likelihood function is adopted to describe the training data; the advantage is that the weight of the training data in the whole model can be adjusted more flexibly, and the relative weights of the prior information and the training data can be controlled better. Second, the parameters in the WBMM are regarded as random variables and their posterior distributions are computed under a Bayesian framework instead of estimating the parameter values directly; such an approach achieves good results when the training data are insufficient.
Specifically, the likelihood function of the observation data set X is established with the following formula:
In the above formula, K is the number of mixture components, which usually takes any integer in 16 to 32 in speaker recognition. In order to introduce prior information and fuse it with the training data, the parameters π, μ, T in the model are treated as random variables and corresponding prior distributions are specified for them. Specifically, π = {π_i}, i = 1, ..., K, follows a Dirichlet prior distribution,
wherein C(λ_0) is the normalizing factor of this distribution; {μ, T} = {μ_i, T_i}, i = 1, ..., K, follows a joint Gaussian-Wishart distribution (i.e., the product of a Gaussian distribution and a Wishart distribution, N(·)W(·)), that is:
wherein {m_0, β_0, ν_0, V_0} are the hyperparameters of this joint Gaussian-Wishart distribution; m_0 is a D-dimensional column vector, β_0 and ν_0 are scalars, and V_0 is a (D × D) matrix. In addition, a hidden variable Z = {z_n}, n = 1, ..., N, needs to be introduced, wherein exactly one element of z_n = (z_n1, ..., z_ni, ..., z_nK) is 1 and the rest are 0. The role of z_n is to indicate and mark which mixture component of the WBMM produced x_n; for example, when x_n is produced by the i-th mixture component, z_ni = 1.
Under the WBMM defined above, the steps of the training process are as follows:
(1) Set the values of the hyperparameters {λ_0, m_0, β_0, ν_0, V_0} of the WBMM; specifically, λ_0 = 0.01, m_0 = 0 (0 is the D-dimensional zero vector), β_0 = 1, ν_0 = D, and V_0 = 400I (I is the identity matrix).
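The hyperparameter settings of step (1) can be collected in one illustrative helper (the function and key names are ours, not the patent's; the values are those given in the text):

```python
def init_hyperparameters(D):
    """Step (1): hyperparameter values stated in the text.

    lambda_0 = 0.01, m_0 = D-dimensional zero vector, beta_0 = 1,
    nu_0 = D, V_0 = 400 * I (D x D identity matrix).
    """
    return {
        "lambda0": 0.01,
        "m0": [0.0] * D,
        "beta0": 1.0,
        "nu0": float(D),
        "V0": [[400.0 if i == j else 0.0 for j in range(D)]
               for i in range(D)],
    }
```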
(2) Set the value of α; α may take any integer between -8 and -1.
(3) Generate N random integers uniformly distributed on the interval [1, K] and count the frequency of each integer on this interval; that is, if the integer i is generated N_i times, then θ_i = N_i / N. For each x_n, the initial distribution of the corresponding hidden variable z_n is
In addition, set the iteration counter t = 1 and start the iterative loop.
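The random initialization of step (3) can be sketched as follows; `init_responsibilities` is an illustrative name for a helper that returns the initial q(z_n) for every x_n:

```python
import random

def init_responsibilities(N, K, seed=0):
    """Step (3): draw N uniform integers on [1, K], set theta_i = N_i / N,
    and use theta as the initial distribution q(z_n) for every x_n."""
    rng = random.Random(seed)
    counts = [0] * K
    for _ in range(N):
        counts[rng.randint(1, K) - 1] += 1  # randint is inclusive on both ends
    theta = [c / N for c in counts]
    return [theta[:] for _ in range(N)]  # q(z_n = i) = theta_i for all n
```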
(4) Calculate three intermediate variables:
(5) Update the distributions of the random variables {π_i}, i = 1, ..., K; they still follow a Dirichlet distribution, that is, q(π_i) = Dir(π_i | λ_i); the update formula for the corresponding hyperparameters {λ_i}, i = 1, ..., K, is as follows:
(6) Update the distributions of the random variables {μ_i, Τ_i}, i = 1, ..., K; they still follow a joint Gaussian-Wishart distribution; the update formula for the corresponding hyperparameters {m_i, β_i, ν_i, V_i}, i = 1, ..., K, is as follows:
(7) Update the distributions of the hidden variables {z_n}, n = 1, ..., N, as follows:
Wherein,
In the above formula, the computing formula for each expectation <·> is as follows:
ψ(·) in the two formulas above is the standard digamma function (the derivative of the logarithm of the Gamma function Γ(·), i.e., ψ(·) = (ln Γ(·))'). Thus,
(8) Calculate the marginal likelihood value MLIK_t after the current iteration, where t is the current iteration count:
wherein the computing formula for each expectation <·> is the same as in step (7).
(9) Calculate the difference ΔMLIK = MLIK_t − MLIK_{t−1} between the marginal likelihood values after the current iteration and the previous iteration; if ΔMLIK ≤ δ, the parameter estimation process ends; otherwise, return to step (4), increase the value of t by 1, and proceed to the next iteration; the threshold δ ranges from 10^-5 to 10^-4.
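Steps (4) to (9) form an iterative loop with the stopping rule ΔMLIK ≤ δ; a hedged skeleton of that loop, with the update formulas abstracted behind callables since they are given as equations in the original:

```python
def train_until_converged(update_step, marginal_likelihood, delta=1e-4,
                          max_iters=200):
    """Skeleton of steps (4)-(9): repeat the updates until the
    marginal-likelihood improvement drops to delta (delta in [1e-5, 1e-4]).

    `update_step` stands in for steps (4)-(7) and `marginal_likelihood`
    for step (8); both are placeholders for the patent's formulas.
    Returns (iterations used, final marginal likelihood).
    """
    prev = float("-inf")
    for t in range(1, max_iters + 1):
        update_step()
        mlik = marginal_likelihood()
        if mlik - prev <= delta:  # step (9): DELTA MLIK <= delta -> stop
            return t, mlik
        prev = mlik
    return max_iters, prev
```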
The training process of estimating the parameters and random variable distributions in the WBMM described above is shown in the left dashed box in Fig. 1. It should be noted that the Dirichlet distribution Dir(·), the Gaussian distribution N(·), the Wishart distribution W(·), and the Gamma function Γ(·) mentioned in the above steps are all functions with canonical forms; their expressions appear in most probability and statistics books and reference materials, and they are functions that persons skilled in the art know and often use. When implementing the present invention, their concrete forms can easily be obtained by consulting a corresponding probability and statistics textbook or a relevant encyclopedia, and are therefore not given one by one herein.
For each speaker's corresponding training set X^(1), ..., X^(g), ..., X^(G), training is carried out in this way to obtain the corresponding weighted Bayesian mixture models WBMM_1, ..., WBMM_g, ..., WBMM_G respectively (G is the number of speakers).
The fourth step: identification
In the identification process, the voice of the current speaker to be identified first passes through the pre-processing of the first step and the feature extraction of the second step to obtain the corresponding D-dimensional feature vector x'. Its marginal likelihood values {MLIK_g(x')}, g = 1, ..., G, with respect to each speaker's model WBMM_1, ..., WBMM_g, ..., WBMM_G are then calculated respectively. For example, the marginal likelihood of x' with respect to the g-th speaker's model WBMM_g is
wherein each expectation <·> and q(z_ni = 1) are the expectation and probability obtained for the g-th speaker after the third-step training (they are obtained in step (7) of the third-step training; the only difference is that when calculating <(x' − μ_i)^T Τ_i (x' − μ_i)>, x_n in the formula is replaced with x').
The final recognition result is the speaker corresponding to the maximum MLIK_g(x'), that is:
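The final decision rule can be sketched as an argmax over the G marginal likelihood values; here `models` and `marginal_likelihood` are placeholders for the trained WBMM_1, ..., WBMM_G and the marginal-likelihood formula of the patent:

```python
def identify_speaker(xp, models, marginal_likelihood):
    """Fourth step: return the index g maximizing MLIK_g(x').

    xp is the feature vector x' of the voice to be identified;
    models holds the G trained speaker models.
    """
    return max(range(len(models)),
               key=lambda g: marginal_likelihood(models[g], xp))
```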
Performance evaluation of the present invention:
In order to verify the system performance of the text-independent speaker recognition method based on the weighted Bayesian mixture model (WBMM) of the present invention, it is contrasted with the system performance of the traditional text-independent speaker recognition method based on the Gaussian mixture model (GMM). The TIMIT data set (speech sampling frequency 16 kHz, quantization bit depth 16) is selected for the tests. The recognition effect of the speaker recognition method proposed by the invention is tested in two scenes: clean speech and telephone speech. To generate the telephone speech environment, the present invention passes the clean speech through a band-pass filter with an effective bandwidth of 0.3 kHz to 3.4 kHz and then adds white Gaussian noise with a signal-to-noise ratio (SNR) of 20 dB, thereby obtaining the telephone speech. Fig. 2 gives the spectrograms of a segment of original clean TIMIT speech and the corresponding telephone speech. In the pre-processing, the frame length is τ = 20 ms, the pre-emphasis factor is 0.95, and the dimension of the feature vector is D = 20.
The TIMIT database and the generated telephone speech database each contain 250 speakers, and each speaker has 10 segments of speech. Here, 5 of these segments are used for identification, and the remaining 5 segments are used for training as required. The number of mixture components K of the WBMM and the GMM is fixed at 16. The recognition result is measured by the recognition rate, defined as the ratio of the number of speech frames whose speaker is correctly identified to the total number of speech frames.
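The recognition rate defined above (correctly identified speech frames over total speech frames) can be computed as:

```python
def recognition_rate(predicted, actual):
    """Fraction of speech frames whose speaker is correctly identified.

    predicted and actual are equal-length sequences of speaker labels,
    one per speech frame.
    """
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```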
First, the performance of the WBMM under different α is compared. Two situations are considered: sufficient training speech data (number of speech segments used for training TU = 5) and insufficient training speech data (TU = 2). In addition, in order to better analyze the performance of the WBMM, it is contrasted with the WGMM, a weighted Gaussian mixture model whose difference from the WBMM is that no prior distributions are given to the parameters, the relevant parameters being estimated under the maximum-likelihood criterion; when α = -1, the WGMM degenerates into the GMM. Fig. 3 gives the recognition rates in the clean speech case. It can be seen that, whether TU = 2 or TU = 5, for any integer α between -8 and -1 the recognition rate of the WBMM is higher than that of the WGMM and also higher than that of the GMM. Fig. 4 gives the recognition rates in the telephone speech case; although the presence of noise causes the overall recognition rate to decline compared with clean speech, the WBMM is still better than the corresponding WGMM (which is the GMM when α = -1). The reason is that the WBMM proposed by the present invention adopts a weighted likelihood function, which better highlights the effect of the observation data; in addition, prior information is introduced and a training mode based on the Bayesian criterion is adopted, so that prior information and observation data can be fully utilized and the recognition rate is greatly improved. Furthermore, under both speech environments there exists an optimal α: for clean speech with TU = 5, α = -2; for clean speech with TU = 2, α = -4; for telephone speech with TU = 5, α = -3; for telephone speech with TU = 2, α = -7. For other databases, the optimal α can likewise be determined by experimental results in this way.
Then, the overall recognition rates of the systems of the various methods are compared for different numbers of speakers. Fig. 5 gives, for telephone speech with TU = 5, the recognition rates of the WBMM at α = -1, -3, -6 and of the WGMM at α = -1 (GMM), -3, -6 when the number of speakers is 50, 100, 150, 200, and 250 respectively. It can be seen that the recognition rate of the WBMM proposed by the present invention is higher than that of the corresponding WGMM and GMM. In addition, Fig. 6 gives the recognition rates for clean speech with TU = 2; the WBMM proposed by the present invention likewise achieves better performance than the WGMM and the GMM.
The scope of protection sought by the present invention is not limited to the description of this embodiment; the specific contents shall be governed by the claims.