Summary of the invention
The object of the invention is to overcome the above-mentioned defects of the prior art by providing a text-independent speaker recognition method based on a weighted Bayesian mixture model.
The technical scheme adopted by the present invention to solve its technical problem is a text-independent speaker recognition method based on a weighted Bayesian mixture model, the method comprising the following steps:
Step 1: pre-process the voice signal, comprising sampling and quantization, pre-emphasis, framing, and windowing;
Step 2: feature extraction on the speech frames: for each speech frame, calculate the D-order linear prediction cepstral coefficients and use them as the D-dimensional feature vector of that frame;
Step 3: for each speaker's corresponding training set X = {x_n}, n = 1, ..., N, where N is the number of D-dimensional feature vectors x_n used for training that speaker, model X with the weighted Bayesian mixture model (WBMM) and estimate, through training, the parameter values and the distributions of the random variables in the WBMM; if G speakers need to be identified in the recognition system, repeat the training process G times to obtain WBMM_1, ..., WBMM_g, ..., WBMM_G respectively;
Step 4: for the voice to be identified, first perform pre-processing and feature extraction to obtain the corresponding D-dimensional feature vector x'; calculate the marginal likelihood values {MLIK_g(x')}, g = 1, ..., G, of x' with respect to each speaker's corresponding model WBMM_1, ..., WBMM_g, ..., WBMM_G; the final recognition result is the speaker corresponding to the maximum MLIK_g(x'), that is:
In the text-independent speaker recognition method based on the weighted Bayesian mixture model of the present invention, the step of estimating, through training, the parameter values and the distributions of the random variables in the WBMM described in step 3 is as follows:
Step 3-1: set the values of the hyperparameters {λ_0, m_0, β_0, ν_0, V_0} in the WBMM, where λ_0 = 0.01, m_0 = 0 (0 is the D-dimensional zero vector), β_0 = 1, ν_0 = D, and V_0 = 400I (I is the D × D identity matrix);
Step 3-2: set the value of α; α may take any integer between -8 and -1;
Step 3-3: generate N random integers uniformly distributed on the interval [1, K], where K is the number of mixture components of the WBMM and may take any integer in 16 to 32, and count the frequency of each integer on this interval; that is, if the integer i is generated N_i times, then θ_i = N_i / N; for each x_n, the initial distribution of the corresponding hidden variable z_n is
In addition, set the iteration counter t = 1 and start the iterative loop;
Step 3-4: calculate three intermediate variables:
Step 3-5: update the distributions of the random variables {π_i}, i = 1, ..., K, in the WBMM; π_i represents the mixing proportion of the i-th mixture component and follows a Dirichlet distribution, that is, q(π_i) = Dir(π_i | λ_i); the update formula for the corresponding hyperparameters {λ_i}, i = 1, ..., K, is as follows:
Step 3-6: update the distributions of the random variables {μ_i, Τ_i}, i = 1, ..., K, in the WBMM; they represent the mean and the inverse covariance matrix of the i-th component, respectively, and follow a joint Gaussian-Wishart distribution; the corresponding hyperparameters {m_i, β_i, ν_i, V_i}, i = 1, ..., K, are updated as follows:
Step 3-7: update the distributions of the hidden variables {z_n}, n = 1, ..., N, as follows:
Wherein,
In the above formula, the computing formula for each expectation <·> is as follows:
In the formula above, ψ(·) is the standard digamma function (the derivative of the logarithm of the standard Gamma function Γ(·), i.e., ψ(·) = (ln Γ(·))');
Step 3-8: calculate the marginal likelihood value MLIK_t after the current iteration, where t is the current iteration count:
Step 3-9: calculate the difference ΔMLIK = MLIK_t − MLIK_{t−1} between the marginal likelihood values after the current iteration and the previous iteration; if ΔMLIK ≤ δ, the process of estimating through training the parameter values and the distributions of the random variables in the WBMM ends; otherwise, return to step 3-4 above, increase the value of t by 1, and carry out the next iteration; the threshold δ may take any value in the range 10^-5 to 10^-4.
In the text-independent speaker recognition method based on the weighted Bayesian mixture model of the present invention, the formula for calculating, in the identification process described in step 4, the marginal likelihood values {MLIK_g(x')}, g = 1, ..., G, of x' with respect to each speaker's model WBMM_1, ..., WBMM_g, ..., WBMM_G is as follows:
Wherein <·> and q(z_ni = 1) are the expectation and probability in WBMM_g after training.
The text-independent speaker recognition method based on the weighted Bayesian mixture model adopted in the present invention effectively fuses prior information with the training data under a Bayesian framework, thereby solving the model over-fitting and under-fitting problems that easily occur in traditional GMM-based speaker recognition under the maximum-likelihood criterion; the model has higher flexibility.
The text-independent speaker recognition method based on the weighted Bayesian mixture model of the present invention introduces an additional parameter α to control the weight of the data in training, so that the relative weights of the prior information and the training data can be controlled more easily and flexibly.
The text-independent speaker recognition method based on the weighted Bayesian mixture model adopted in the present invention can accurately obtain optimal parameter estimates and the posterior distributions of the relevant parameters according to the distribution of the data; after this method is adopted, the recognition rate of the text-independent speaker recognition system is greatly improved.
Beneficial effects:
1. The model of the present invention has higher flexibility.
2. The invention makes the relative weights of prior information and training data easier and more flexible to control.
3. The recognition rate of the present invention is greatly improved.
Embodiment
The technical solutions of the invention are further elaborated below in conjunction with the drawings and embodiments.
As shown in Figure 1, the invention provides a text-independent speaker recognition method based on a weighted Bayesian mixture model, the method comprising the following steps:
The first step: pre-processing of the voice signal
(1) Sampling and quantization
Each segment of voice signal y_a(t) in the data set for training and the data set for identification is sampled to obtain the amplitude sequence y(n) of the digital voice signal. y(n) is quantized and encoded by pulse code modulation (PCM) to obtain the quantized representation y'(n) of the amplitude sequence. The sampling and quantization precision here is decided according to the requirements of the speaker recognition system applied under different environments. For most voice signals, the sampling frequency F is 8 kHz to 16 kHz, and the quantization bit depth is 16 or 24.
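The uniform PCM quantization described above can be sketched as follows; `quantize_pcm` is an illustrative helper (not part of the patent), assuming amplitudes already normalized to [-1, 1):

```python
def quantize_pcm(y, bits=16):
    """Uniform PCM quantization of amplitudes assumed normalized to [-1, 1).

    Maps each sample to a signed integer with the given bit depth,
    clipping to the representable range.
    """
    levels = 2 ** (bits - 1)
    return [max(-levels, min(levels - 1, int(round(v * levels)))) for v in y]
```

For example, with 16-bit quantization an amplitude of 0.5 maps to 16384 and -1.0 maps to -32768.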
(2) Pre-emphasis
y'(n) is passed through a digital filter Z to obtain an amplitude sequence s''(n) in which the high-, medium-, and low-frequency amplitudes of the voice signal are balanced. The transfer function of the digital filter here is H(z) = 1 − az^{-1}, where the pre-emphasis factor a ranges from 0.8 to 0.97.
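The filter H(z) = 1 − az^{-1} corresponds in the time domain to s''(n) = y'(n) − a·y'(n−1); a minimal sketch (the function name is illustrative):

```python
def pre_emphasis(y, a=0.95):
    """Apply H(z) = 1 - a*z^(-1): s[n] = y[n] - a*y[n-1].

    The pre-emphasis factor a lies in [0.8, 0.97]; the first sample is
    passed through unchanged since it has no predecessor.
    """
    return [y[0]] + [y[n] - a * y[n - 1] for n in range(1, len(y))]
```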
(3) Framing and windowing
With a frame length of τ (unit: milliseconds) and a frame shift of τ/4, s''(n) is divided into a series of speech frames F_t; that is, each speech frame contains τ × F voice signal samples. Then, the values of the Hamming window function are calculated:
Finally, the Hamming window is applied to each speech frame F_t to obtain the windowed frame, thereby completing the pre-processing of the voice signal:
n = 1, ..., τ × F
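The framing and Hamming windowing above can be sketched as follows; `frame_and_window` is an illustrative helper in which `frame_len` plays the role of τ × F samples and `hop` is the frame shift (frame_len // 4 in the text):

```python
import math

def frame_and_window(s, frame_len, hop):
    """Split s into frames of frame_len samples with the given hop,
    then multiply each frame by a Hamming window:
        w[n] = 0.54 - 0.46 * cos(2*pi*n / (frame_len - 1)).
    """
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
         for n in range(frame_len)]
    frames = []
    for start in range(0, len(s) - frame_len + 1, hop):
        frames.append([s[start + n] * w[n] for n in range(frame_len)])
    return frames
```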
The second step: feature extraction on the speech frames
In this method, for each frame, the D-order linear prediction cepstral coefficients (LPCC) are obtained by calculation and used as the corresponding D-dimensional feature vector; D here is decided according to the requirements of the speaker recognition system applied under different environments, and ranges from 10 to 20. The calculation of the LPCC comprises the following processes:
(1) Calculate the D-order linear prediction coefficients; the computing formula is as follows:
d = 1, ..., D
wherein the above represents a system of D equations with D unknowns; solving this system of equations yields the D-order linear prediction coefficients corresponding to the current frame.
(2) From the D-order linear prediction coefficients, calculate the D-order linear prediction cepstral coefficients x_1, ..., x_D by the following formula:
d = 1, ..., D
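The patent's conversion formula itself appears only as an equation in the original; as a hedged sketch under that caveat, the textbook LPC-to-cepstrum recursion commonly used for this step is:

```python
def lpc_to_lpcc(a):
    """Convert D-order LPC coefficients to cepstral coefficients.

    a[1..D] holds the LPC coefficients (a[0] is unused). This follows the
    standard recursion (an assumption; the patent's exact formula is not
    reproduced here):
        c_d = a_d + sum_{k=1}^{d-1} (k/d) * c_k * a_{d-k}
    Returns [c_1, ..., c_D].
    """
    D = len(a) - 1
    c = [0.0] * (D + 1)
    for d in range(1, D + 1):
        c[d] = a[d] + sum((k / d) * c[k] * a[d - k] for k in range(1, d))
    return c[1:]
```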
With the above method, the D-dimensional feature vectors of all speakers, both those used for training and those used for identification, are calculated. Suppose that in the training set a certain speaker (assumed to be the g-th speaker) has N corresponding feature vectors; then the training set corresponding to that speaker can be expressed as X^(g) (G is the number of speakers). Since each training uses the training data set of a single speaker, for convenience of notation the present invention omits the superscript "(g)" of X^(g); that is, X = {x_n}, n = 1, ..., N, wherein x_n = (x_n1, ..., x_nD) is the n-th D-dimensional feature vector of the g-th speaker obtained through the pre-processing and feature extraction steps.
The third step: training
Since the speaker recognition here is text-independent, a mixture model is adopted to model the extracted speech feature vectors. The present invention designs a weighted Bayesian mixture model (WBMM for short). Compared with the traditional Gaussian mixture model (GMM) for text-independent speaker recognition, the WBMM has two significant differences. First, the WBMM introduces an additional parameter α, that is, a weighted likelihood function is adopted to describe the training data; the advantage is that the weight of the training data in the whole model can be adjusted more flexibly, and the relative weights of the prior information and the training data can be controlled better. Second, the parameters in the WBMM are regarded as random variables and their posterior distributions are computed under a Bayesian framework instead of estimating the parameter values directly; such an approach achieves good results when the training data are insufficient.
Specifically, the likelihood function of the observation data set X is established with the following formula:
In the above formula, K is the number of mixture components, which usually takes any integer in 16 to 32 in speaker recognition. In order to introduce prior information and fuse it with the training data, the parameters π, μ, T in the model are treated as random variables and corresponding prior distributions are specified for them. Specifically, π = {π_i}, i = 1, ..., K, follows a Dirichlet prior distribution,
wherein C(λ_0) is the normalizing factor of this distribution; {μ, T} = {μ_i, T_i}, i = 1, ..., K, follows a joint Gaussian-Wishart distribution (i.e., the product of a Gaussian distribution and a Wishart distribution, N(·)W(·)), that is:
wherein {m_0, β_0, ν_0, V_0} are the hyperparameters of this joint Gaussian-Wishart distribution; m_0 is a D-dimensional column vector, β_0 and ν_0 are scalars, and V_0 is a (D × D) matrix. In addition, a hidden variable Z = {z_n}, n = 1, ..., N, needs to be introduced, wherein exactly one element of z_n = (z_n1, ..., z_ni, ..., z_nK) is 1 and the rest are 0. The role of z_n is to indicate and mark which mixture component of the WBMM produced x_n; for example, when x_n is produced by the i-th mixture component, z_ni = 1.
Under the WBMM defined above, the steps of the training process are as follows:
(1) Set the values of the hyperparameters {λ_0, m_0, β_0, ν_0, V_0} of the WBMM; specifically, λ_0 = 0.01, m_0 = 0 (0 is the D-dimensional zero vector), β_0 = 1, ν_0 = D, and V_0 = 400I (I is the identity matrix).
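The hyperparameter settings of step (1) can be collected in one illustrative helper (the function and key names are ours, not the patent's; the values are those given in the text):

```python
def init_hyperparameters(D):
    """Step (1): hyperparameter values stated in the text.

    lambda_0 = 0.01, m_0 = D-dimensional zero vector, beta_0 = 1,
    nu_0 = D, V_0 = 400 * I (D x D identity matrix).
    """
    return {
        "lambda0": 0.01,
        "m0": [0.0] * D,
        "beta0": 1.0,
        "nu0": float(D),
        "V0": [[400.0 if i == j else 0.0 for j in range(D)]
               for i in range(D)],
    }
```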
(2) Set the value of α; α may take any integer between -8 and -1.
(3) Generate N random integers uniformly distributed on the interval [1, K] and count the frequency of each integer on this interval; that is, if the integer i is generated N_i times, then θ_i = N_i / N. For each x_n, the initial distribution of the corresponding hidden variable z_n is
In addition, set the iteration counter t = 1 and start the iterative loop.
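The random initialization of step (3) can be sketched as follows; `init_responsibilities` is an illustrative name for a helper that returns the initial q(z_n) for every x_n:

```python
import random

def init_responsibilities(N, K, seed=0):
    """Step (3): draw N uniform integers on [1, K], set theta_i = N_i / N,
    and use theta as the initial distribution q(z_n) for every x_n."""
    rng = random.Random(seed)
    counts = [0] * K
    for _ in range(N):
        counts[rng.randint(1, K) - 1] += 1  # randint is inclusive on both ends
    theta = [c / N for c in counts]
    return [theta[:] for _ in range(N)]  # q(z_n = i) = theta_i for all n
```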
(4) Calculate three intermediate variables:
(5) Update the distributions of the random variables {π_i}, i = 1, ..., K; they still follow a Dirichlet distribution, that is, q(π_i) = Dir(π_i | λ_i); the update formula for the corresponding hyperparameters {λ_i}, i = 1, ..., K, is as follows:
(6) Update the distributions of the random variables {μ_i, Τ_i}, i = 1, ..., K; they still follow a joint Gaussian-Wishart distribution; the update formula for the corresponding hyperparameters {m_i, β_i, ν_i, V_i}, i = 1, ..., K, is as follows:
(7) Update the distributions of the hidden variables {z_n}, n = 1, ..., N, as follows:
Wherein,
In the above formula, the computing formula for each expectation <·> is as follows:
ψ(·) in the two formulas above is the standard digamma function (the derivative of the logarithm of the Gamma function Γ(·), i.e., ψ(·) = (ln Γ(·))'). Thus,
(8) Calculate the marginal likelihood value MLIK_t after the current iteration, where t is the current iteration count:
wherein the computing formula for each expectation <·> is the same as in step (7).
(9) Calculate the difference ΔMLIK = MLIK_t − MLIK_{t−1} between the marginal likelihood values after the current iteration and the previous iteration; if ΔMLIK ≤ δ, the parameter estimation process ends; otherwise, return to step (4), increase the value of t by 1, and proceed to the next iteration; the threshold δ ranges from 10^-5 to 10^-4.
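Steps (4) to (9) form an iterative loop with the stopping rule ΔMLIK ≤ δ; a hedged skeleton of that loop, with the update formulas abstracted behind callables since they are given as equations in the original:

```python
def train_until_converged(update_step, marginal_likelihood, delta=1e-4,
                          max_iters=200):
    """Skeleton of steps (4)-(9): repeat the updates until the
    marginal-likelihood improvement drops to delta (delta in [1e-5, 1e-4]).

    `update_step` stands in for steps (4)-(7) and `marginal_likelihood`
    for step (8); both are placeholders for the patent's formulas.
    Returns (iterations used, final marginal likelihood).
    """
    prev = float("-inf")
    for t in range(1, max_iters + 1):
        update_step()
        mlik = marginal_likelihood()
        if mlik - prev <= delta:  # step (9): DELTA MLIK <= delta -> stop
            return t, mlik
        prev = mlik
    return max_iters, prev
```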
The training process of estimating the parameters and random variable distributions in the WBMM described above is shown in the left dashed box in Fig. 1. It should be noted that the Dirichlet distribution Dir(·), the Gaussian distribution N(·), the Wishart distribution W(·), and the Gamma function Γ(·) mentioned in the above steps are all functions with canonical forms; their expressions appear in most probability and statistics books and reference materials, and they are functions that persons skilled in the art know and often use. When implementing the present invention, their concrete forms can easily be obtained by consulting a corresponding probability and statistics textbook or a relevant encyclopedia, and are therefore not given one by one herein.
For each speaker's corresponding training set X^(1), ..., X^(g), ..., X^(G), training is carried out in this way to obtain the corresponding weighted Bayesian mixture models WBMM_1, ..., WBMM_g, ..., WBMM_G respectively (G is the number of speakers).
The fourth step: identification
In the identification process, the voice of the current speaker to be identified first passes through the pre-processing of the first step and the feature extraction of the second step to obtain the corresponding D-dimensional feature vector x'. Its marginal likelihood values {MLIK_g(x')}, g = 1, ..., G, with respect to each speaker's model WBMM_1, ..., WBMM_g, ..., WBMM_G are then calculated respectively. For example, the marginal likelihood of x' with respect to the g-th speaker's model WBMM_g is
wherein each expectation <·> and q(z_ni = 1) are the expectation and probability obtained for the g-th speaker after the third-step training (they are obtained in step (7) of the third-step training; the only difference is that when calculating <(x' − μ_i)^T Τ_i (x' − μ_i)>, x_n in the formula is replaced with x').
The final recognition result is the speaker corresponding to the maximum MLIK_g(x'), that is:
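The final decision rule can be sketched as an argmax over the G marginal likelihood values; here `models` and `marginal_likelihood` are placeholders for the trained WBMM_1, ..., WBMM_G and the marginal-likelihood formula of the patent:

```python
def identify_speaker(xp, models, marginal_likelihood):
    """Fourth step: return the index g maximizing MLIK_g(x').

    xp is the feature vector x' of the voice to be identified;
    models holds the G trained speaker models.
    """
    return max(range(len(models)),
               key=lambda g: marginal_likelihood(models[g], xp))
```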
Performance evaluation of the present invention:
In order to verify the system performance of the text-independent speaker recognition method based on the weighted Bayesian mixture model (WBMM) of the present invention, it is contrasted with the system performance of the traditional text-independent speaker recognition method based on the Gaussian mixture model (GMM). The TIMIT data set (speech sampling frequency 16 kHz, quantization bit depth 16) is selected for the tests. The recognition effect of the speaker recognition method proposed by the invention is tested in two scenes: clean speech and telephone speech. To generate the telephone speech environment, the present invention passes the clean speech through a band-pass filter with an effective bandwidth of 0.3 kHz to 3.4 kHz and then adds white Gaussian noise with a signal-to-noise ratio (SNR) of 20 dB, thereby obtaining the telephone speech. Fig. 2 gives the spectrograms of a segment of original clean TIMIT speech and the corresponding telephone speech. In the pre-processing, the frame length is τ = 20 ms, the pre-emphasis factor is 0.95, and the dimension of the feature vector is D = 20.
The TIMIT database and the generated telephone speech database each contain 250 speakers, and each speaker has 10 segments of speech. Here, 5 of these segments are used for identification, and the remaining 5 segments are used for training as required. The number of mixture components K of the WBMM and the GMM is fixed at 16. The recognition result is measured by the recognition rate, defined as the ratio of the number of speech frames whose speaker is correctly identified to the total number of speech frames.
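The recognition rate defined above (correctly identified speech frames over total speech frames) can be computed as:

```python
def recognition_rate(predicted, actual):
    """Fraction of speech frames whose speaker is correctly identified.

    predicted and actual are equal-length sequences of speaker labels,
    one per speech frame.
    """
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```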
First, the performance of the WBMM under different α is compared. Two situations are considered: sufficient training speech data (number of speech segments used for training TU = 5) and insufficient training speech data (TU = 2). In addition, in order to better analyze the performance of the WBMM, it is contrasted with the WGMM, a weighted Gaussian mixture model whose difference from the WBMM is that no prior distributions are given to the parameters, the relevant parameters being estimated under the maximum-likelihood criterion; when α = -1, the WGMM degenerates into the GMM. Fig. 3 gives the recognition rates in the clean speech case. It can be seen that, whether TU = 2 or TU = 5, for any integer α between -8 and -1 the recognition rate of the WBMM is higher than that of the WGMM and also higher than that of the GMM. Fig. 4 gives the recognition rates in the telephone speech case; although the presence of noise causes the overall recognition rate to decline compared with clean speech, the WBMM is still better than the corresponding WGMM (which is the GMM when α = -1). The reason is that the WBMM proposed by the present invention adopts a weighted likelihood function, which better highlights the effect of the observation data; in addition, prior information is introduced and a training mode based on the Bayesian criterion is adopted, so that prior information and observation data can be fully utilized and the recognition rate is greatly improved. Furthermore, under both speech environments there exists an optimal α: for clean speech with TU = 5, α = -2; for clean speech with TU = 2, α = -4; for telephone speech with TU = 5, α = -3; for telephone speech with TU = 2, α = -7. For other databases, the optimal α can likewise be determined by experimental results in this way.
Then, the overall recognition rates of the systems of the various methods are compared for different numbers of speakers. Fig. 5 gives, for telephone speech with TU = 5, the recognition rates of the WBMM at α = -1, -3, -6 and of the WGMM at α = -1 (GMM), -3, -6 when the number of speakers is 50, 100, 150, 200, and 250 respectively. It can be seen that the recognition rate of the WBMM proposed by the present invention is higher than that of the corresponding WGMM and GMM. In addition, Fig. 6 gives the recognition rates for clean speech with TU = 2; the WBMM proposed by the present invention likewise achieves better performance than the WGMM and the GMM.
The scope of protection sought by the present invention is not limited to the description of this embodiment; the specific contents shall be governed by the claims.