CN104183239A - Method for identifying speaker unrelated to text based on weighted Bayes mixture model - Google Patents


Publication number
CN104183239A
Authority
CN
China
Legal status
Granted
Application number
CN201410361706.0A
Other languages
Chinese (zh)
Other versions
CN104183239B (en)
Inventor
魏昕
周亮
赵力
陈建新
Current Assignee
Nanjing Tian Gu Information Technology Co ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Nanjing Post and Telecommunication University
Application filed by Nanjing Post and Telecommunication University
Priority to CN201410361706.0A
Publication of CN104183239A
Application granted
Publication of CN104183239B
Status: Expired - Fee Related


Abstract

The invention discloses a text-independent speaker identification method based on a weighted Bayesian mixture model (WBMM). A set of speech signals used for training is pre-processed and its features are extracted; during training, the training set is modeled with a WBMM, and the parameter values and the distributions of the random variables in the WBMM are estimated by training, so that one WBMM is obtained for each speaker. During identification, the speech to be identified is likewise pre-processed and its features extracted, the marginal likelihood values of the resulting feature vector under each speaker's trained WBMM are computed, and the speaker corresponding to the maximal marginal likelihood is returned as the identification result. The method effectively improves the correct identification rate of a text-independent speaker identification system, avoids the over-fitting and under-fitting problems to which traditional methods are prone, and makes the relative weight of the prior information and the training data easier and more flexible to control.

Description

Text-independent speaker identification method based on a weighted Bayesian mixture model
Technical field
The present invention relates to a text-independent speaker identification method based on a weighted Bayesian mixture model, and belongs to the field of speech processing technology.
Background art
In applications such as access control, credit-card transactions and court evidence, speaker identification plays an increasingly important role. Its goal is to correctly attribute the speech to be identified to one of several reference speakers in a voice library.
At present, among text-independent speaker identification methods, the approach based on the Gaussian mixture model (GMM) is the most widely used. Because of its high recognition rate, simple training and small training-data requirement, it has become the mainstream approach to text-independent speaker identification. A GMM represents the distribution of data well: with enough mixture components and sufficient training data, a GMM can approximate any distribution relevant to the time series. However, applying a GMM in practice to text-independent speaker identification raises several problems. First, the traditional GMM training process is based on the maximum-likelihood criterion and easily over-fits or under-fits the training data. Secondly, traditional GMM-based text-independent speaker identification considers only the observed data and introduces no prior information. These problems often keep the correct identification rate of a traditional GMM-based text-independent speaker identification system low. It is therefore very important to introduce prior information and fuse it effectively with the training data. In addition, once prior information has been fused in, how to further balance the weight of the prior information against the training data, controlling the relative weight of the observed data in a reasonably simple way, is also an important and still unsolved problem. The present invention solves the above problems well.
Summary of the invention
The object of the invention is to overcome the above defects of the prior art by designing a text-independent speaker identification method based on a weighted Bayesian mixture model.
The technical scheme adopted by the present invention to solve its technical problem is a text-independent speaker identification method based on a weighted Bayesian mixture model, comprising the following steps:
Step 1: pre-process the speech signal: sampling and quantization, pre-emphasis, framing and windowing;
Step 2: feature extraction on each speech frame: compute the order-D linear prediction cepstrum coefficients and use them as the frame's D-dimensional feature vector;
Step 3: for each speaker's training set $X = \{x_n\}_{n=1,\dots,N}$, where N is the number of that speaker's D-dimensional training feature vectors $x_n$, model X with a weighted Bayesian mixture model (WBMM) and estimate, by training, the parameter values and the distributions of the random variables in the WBMM; if G speakers are to be identified by the system, repeat the training process G times to obtain $\mathrm{WBMM}_1, \dots, \mathrm{WBMM}_g, \dots, \mathrm{WBMM}_G$;
Step 4: for the speech to be identified, first perform pre-processing and feature extraction to obtain the corresponding D-dimensional feature vector $x'$; compute the marginal likelihood values $\{\mathrm{MLIK}_g(x')\}_{g=1,\dots,G}$ of $x'$ under each speaker's model $\mathrm{WBMM}_1, \dots, \mathrm{WBMM}_g, \dots, \mathrm{WBMM}_G$; the final identification result is the speaker corresponding to the maximal $\mathrm{MLIK}_g(x')$, that is:
$$\mathrm{speaker}(x') = \arg\max_{g=1,\dots,G} \mathrm{MLIK}_g(x').$$
In the text-independent speaker identification method based on the weighted Bayesian mixture model of the present invention, the procedure in Step 3 for estimating, by training, the parameter values and the distributions of the random variables in the WBMM is as follows:
Step 3-1: set the values of the hyper-parameters $\{\lambda_0, m_0, \beta_0, \nu_0, V_0\}$ in the WBMM; specifically, $\lambda_0 = 0.01$, $m_0 = 0$ (the D-dimensional zero vector), $\beta_0 = 1$, $\nu_0 = D$, $V_0 = 400I$ (I the $D \times D$ identity matrix);
Step 3-2: set the value of $\alpha$; $\alpha$ may be any integer between $-8$ and $-1$;
Step 3-3: generate N random integers uniformly distributed on the interval $[1, K]$, where K is the mixture number of the WBMM and may be any integer from 16 to 32, and count the frequency with which each integer occurs on this interval; that is, if the integer i was produced $N_i$ times, then $\theta_i = N_i/N$; for each $x_n$, the initial distribution of the corresponding hidden variable $z_n$ is
$$q(z_n) = \prod_{i=1}^{K}\theta_i^{z_{ni}}, \qquad q(z_{ni}=1)=\theta_i;$$
in addition, set the iteration counter $t = 1$ and start the iterative loop;
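As an illustrative sketch, the random initialization of Step 3-3 can be written in Python; the function name and the concrete N, K values below are illustrative choices, not taken from the patent:

```python
import random

def init_responsibilities(N, K, seed=0):
    """Draw N uniform integers on [1, K] and use each integer's empirical
    frequency theta_i = N_i / N as the initial posterior q(z_ni = 1)."""
    rng = random.Random(seed)
    counts = [0] * K
    for _ in range(N):
        counts[rng.randint(1, K) - 1] += 1  # randint bounds are inclusive
    theta = [c / N for c in counts]
    # every feature vector x_n starts from the same initial distribution
    return [theta[:] for _ in range(N)]

q = init_responsibilities(N=200, K=16)
```

Note that a component whose integer never occurs starts with zero responsibility; the patent's description does not guard against this, so neither does the sketch.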
Step 3-4: compute three intermediate variables:
$$s_i = \sum_{n=1}^{N} q(z_{ni}=1)$$
$$\bar{x}_i = \frac{1}{s_i}\sum_{n=1}^{N} q(z_{ni}=1)\,x_n$$
$$C_i = \frac{1}{s_i}\sum_{n=1}^{N} q(z_{ni}=1)\,(x_n-\bar{x}_i)(x_n-\bar{x}_i)^{T}$$
Step 3-5: update the distribution of the random variables $\{\pi_i\}_{i=1,\dots,K}$ in the WBMM; $\pi_i$ represents the proportion of the i-th mixture component and follows a Dirichlet distribution, i.e. $q(\pi_i) = \mathrm{Dir}(\pi_i\mid\lambda_i)$; the corresponding hyper-parameters $\{\lambda_i\}_{i=1,\dots,K}$ are updated as
$$\lambda_i = \lambda_0 + \frac{1-\alpha}{2}\,s_i$$
Step 3-6: update the distribution of the random variables $\{\mu_i, T_i\}_{i=1,\dots,K}$ in the WBMM; $\mu_i$ and $T_i$ are respectively the mean and the inverse covariance matrix of the i-th component, and they jointly follow a Gaussian-Wishart distribution; the corresponding hyper-parameters $\{m_i, \beta_i, \nu_i, V_i\}_{i=1,\dots,K}$ are updated as
$$\beta_i = \beta_0 + \frac{1-\alpha}{2}\,s_i,$$
$$m_i = \frac{1}{\beta_i}\Bigl(\beta_0 m_0 + \frac{1-\alpha}{2}\,s_i\,\bar{x}_i\Bigr),$$
$$\nu_i = \nu_0 + \frac{1-\alpha}{2}\,s_i,$$
$$V_i^{-1} = V_0^{-1} + \frac{1-\alpha}{2}\,s_i C_i + \frac{\beta_0 s_i(1-\alpha)}{2\beta_0 + s_i(1-\alpha)}\,(\bar{x}_i-m_0)(\bar{x}_i-m_0)^{T};$$
Step 3-7: update the distribution of the hidden variables $\{z_n\}_{n=1,\dots,N}$ as follows:
$$q(z_n) = \prod_{i=1}^{K}\Bigl(\frac{\gamma_{ni}}{\sum_{j=1}^{K}\gamma_{nj}}\Bigr)^{z_{ni}}$$
where
$$\gamma_{ni} = \exp\Bigl\{\frac{1-\alpha}{2}\,\langle\ln\pi_i\rangle + \frac{1-\alpha}{4}\Bigl[\langle\ln|T_i|\rangle - D\ln(2\pi) - \langle(x_n-\mu_i)^{T} T_i (x_n-\mu_i)\rangle\Bigr]\Bigr\}$$
In the above formula, each expectation $\langle\cdot\rangle$ is computed as
$$\langle\ln\pi_i\rangle = \psi(\lambda_i) - \psi\Bigl(\sum_{j=1}^{K}\lambda_j\Bigr),$$
$$\langle\ln|T_i|\rangle = \sum_{d=1}^{D}\psi\Bigl(\frac{\nu_i+1-d}{2}\Bigr) + D\ln 2 + \ln|V_i|,$$
$$\langle(x_n-\mu_i)^{T} T_i (x_n-\mu_i)\rangle = D\beta_i^{-1} + \nu_i\,(x_n-m_i)^{T} V_i (x_n-m_i)$$
where $\psi(\cdot)$ is the standard digamma function (the derivative of the logarithm of the standard Gamma function $\Gamma(\cdot)$, i.e. $\psi(\cdot) = (\ln\Gamma(\cdot))'$); thus $q(z_{ni}=1) = \gamma_{ni}/\sum_{j=1}^{K}\gamma_{nj}$;
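The expectation $\langle\ln\pi_i\rangle$ above can be sketched with a purely numerical digamma built from the standard library; the finite-difference digamma and the symmetric test values are illustrative assumptions, not part of the patent:

```python
import math

def digamma(x, h=1e-5):
    """Numerical psi(x) = (ln Gamma(x))', via a central difference of
    math.lgamma; accurate enough to illustrate the expectation formulas."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def e_ln_pi(lam):
    """<ln pi_i> = psi(lambda_i) - psi(sum_j lambda_j) for every component."""
    total = sum(lam)
    return [digamma(l) - digamma(total) for l in lam]

# symmetric case lambda = (1, 1, 1): psi(1) - psi(3) = -(1 + 1/2) = -1.5
e = e_ln_pi([1.0, 1.0, 1.0])
```

In a production implementation one would use a library digamma (e.g. `scipy.special.digamma`) instead of the finite difference.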
Step 3-8: compute the marginal likelihood value $\mathrm{MLIK}_t$ after the current iteration, t being the current iteration count:
$$\mathrm{MLIK}_t = \sum_{n=1}^{N}\sum_{i=1}^{K} q(z_{ni}=1)\cdot\frac{1-\alpha}{2}\cdot\Bigl\{\langle\ln\pi_i\rangle + 0.5\times\bigl[\langle\ln|T_i|\rangle - D\ln(2\pi) - \langle(x_n-\mu_i)^{T} T_i (x_n-\mu_i)\rangle\bigr]\Bigr\};$$
Step 3-9: compute the difference between the marginal likelihood values after the current and the previous iterations, $\Delta\mathrm{MLIK} = \mathrm{MLIK}_t - \mathrm{MLIK}_{t-1}$; if $\Delta\mathrm{MLIK} \le \delta$, the process of estimating by training the parameter values and the distributions of the random variables in the WBMM ends; otherwise return to Step 3-4, increase t by 1, and perform the next iteration; the threshold $\delta$ may take any value in the range $10^{-5}$ to $10^{-4}$.
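The iterate-until-converged structure of Steps 3-4 to 3-9 can be sketched as a generic Python driver; the update and likelihood functions passed in below are toy surrogates (a counter whose score rises like $-1/t$), used only to exercise the stopping rule:

```python
def train_until_converged(update_step, mlik_of, state, delta=1e-4, max_iter=500):
    """Apply the variational updates repeatedly and stop once the
    marginal-likelihood gain Delta MLIK drops to delta or below."""
    prev = float("-inf")
    t = 0
    while t < max_iter:
        t += 1
        state = update_step(state)
        cur = mlik_of(state)
        if t > 1 and cur - prev <= delta:
            break
        prev = cur
    return state, cur, t

# toy surrogate: MLIK_t = -1/t, whose increments shrink like 1/(t(t-1));
# with delta = 1e-4 the first t with 1/(t(t-1)) <= 1e-4 is t = 101
final_state, mlik, t = train_until_converged(lambda s: s + 1, lambda s: -1.0 / s, 0)
```

In the real method, `state` would hold the responsibilities and hyper-parameters, `update_step` would perform Steps 3-4 to 3-7, and `mlik_of` would evaluate the Step 3-8 formula.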
In the text-independent speaker identification method based on the weighted Bayesian mixture model of the present invention, the formula in Step 4 for computing, during identification, the marginal likelihood values $\{\mathrm{MLIK}_g(x')\}_{g=1,\dots,G}$ of $x'$ under each speaker's model $\mathrm{WBMM}_1, \dots, \mathrm{WBMM}_g, \dots, \mathrm{WBMM}_G$ is
$$\mathrm{MLIK}_g(x') = \sum_{i=1}^{K} q(z_{ni}=1)\cdot\frac{1-\alpha}{2}\cdot\Bigl\{\langle\ln\pi_i\rangle + 0.5\times\bigl[\langle\ln|T_i|\rangle - D\ln(2\pi) - \langle(x'-\mu_i)^{T} T_i (x'-\mu_i)\rangle\bigr]\Bigr\}$$
where the expectations $\langle\cdot\rangle$ and the probabilities $q(z_{ni}=1)$ are those of the trained model $\mathrm{WBMM}_g$.
The text-independent speaker identification method based on the weighted Bayesian mixture model adopted in the present invention introduces prior information under a Bayesian framework and fuses it effectively with the training data. It solves the model over-fitting and under-fitting problems that readily occur in traditional maximum-likelihood GMM-based speaker identification, and its model has higher flexibility.
The text-independent speaker identification method based on the weighted Bayesian mixture model of the present invention introduces an additional parameter $\alpha$ to control the weight of the data in training, making the relative weight of the prior information and the training data easier and more flexible to control.
The text-independent speaker identification method based on the weighted Bayesian mixture model adopted in the present invention can obtain optimal parameter estimates and the posterior distributions of the relevant parameters according to the distribution of the data; after the method is adopted, the recognition rate of a text-independent speaker identification system improves greatly.
Beneficial effects:
1. The model of the present invention has higher flexibility.
2. The invention makes the relative weight of the prior information and the training data easier and more flexible to control.
3. The recognition rate of the present invention improves greatly.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the present invention.
Fig. 2 shows the spectrograms of a segment of clean speech and of its corresponding telephone speech.
Fig. 3 compares, for clean speech and different $\alpha$, the correct identification rate of the method of the present invention with that of traditional text-independent speaker identification methods.
Fig. 4 compares, for telephone speech and different $\alpha$, the correct identification rate of the method of the present invention with that of traditional text-independent speaker identification methods.
Fig. 5 compares, for telephone speech with TU = 5 and different numbers of speakers, the correct identification rate of the method of the present invention with that of traditional text-independent speaker identification methods.
Fig. 6 compares, for clean speech with TU = 2 and different numbers of speakers, the correct identification rate of the method of the present invention with that of traditional text-independent speaker identification methods.
Embodiment
The technical solutions of the invention are further elaborated below in conjunction with the drawings and embodiments.
As shown in Fig. 1, the invention provides a text-independent speaker identification method based on a weighted Bayesian mixture model, comprising the following steps:
First step: pre-processing of the speech signal
(1) Sampling and quantization
Each speech signal $y_a(t)$ in the training data set and in the identification data set is sampled, giving the amplitude sequence $y(n)$ of the digital speech signal. $y(n)$ is quantized and encoded by pulse code modulation (PCM), giving the quantized representation $y'(n)$ of the amplitude sequence. The sampling and quantization precision is decided according to the requirements of the speaker identification system in its operating environment. For most speech signals, the sampling frequency F is 8 kHz to 16 kHz, and the quantization depth is 16 or 24 bits.
(2) Pre-emphasis
Passing $y'(n)$ through a digital filter with transfer function $H(z) = 1 - az^{-1}$ yields an amplitude sequence $s''(n)$ in which the high-, mid- and low-frequency amplitudes of the speech signal are comparable. The pre-emphasis factor a takes values in the range 0.8 to 0.97.
(3) Framing and windowing
With frame length $\tau$ (in milliseconds) and frame shift $\tau/4$, $s''(n)$ is divided into a series of speech frames $F_t$; each speech frame contains $\tau\times F$ signal samples. Then the Hamming window function is computed:
$$w_H(n) = \begin{cases} 0.54 - 0.46\cos\Bigl(\dfrac{2\pi n}{\tau F - 1}\Bigr), & 1 \le n \le \tau\times F \\ 0, & \text{otherwise} \end{cases}$$
Finally, the Hamming window is applied to each speech frame $F_t$, which completes the pre-processing of the speech signal:
$$F_t^{*}(n) = w_H(n)\times F_t(n), \qquad n = 1,\dots,\tau\times F$$
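The pre-processing pipeline above (pre-emphasis, framing with a shift of a quarter frame, Hamming windowing) can be sketched in Python; the function name, the synthetic sine input and the 20 ms / 8 kHz frame size are illustrative assumptions:

```python
import math

def preprocess(y, a=0.95, frame_len=160, shift=40):
    """Pre-emphasis H(z) = 1 - a z^-1, framing with shift frame_len/4,
    and Hamming windowing; frame_len = tau * F samples
    (e.g. tau = 20 ms at F = 8 kHz gives 160)."""
    # pre-emphasis: s(n) = y(n) - a * y(n - 1)
    s = [y[0]] + [y[n] - a * y[n - 1] for n in range(1, len(y))]
    # Hamming window over the frame length
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
         for n in range(frame_len)]
    frames = []
    for start in range(0, len(s) - frame_len + 1, shift):
        frames.append([w[n] * s[start + n] for n in range(frame_len)])
    return frames

frames = preprocess([math.sin(0.1 * n) for n in range(800)])
```

With 800 samples, a frame of 160 and a shift of 40, this yields (800 - 160) / 40 + 1 = 17 windowed frames.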
Second step: feature extraction on the speech frames
In this method, the order-D linear prediction cepstrum coefficients (LPCC) are computed for each frame and used as the corresponding D-dimensional feature vector. D is decided according to the requirements of the speaker identification system in its operating environment; its range is 10 to 20. The computation of the LPCC comprises the following process:
(1) Compute the order-D linear prediction coefficients $\hat{x}_1,\dots,\hat{x}_D$ from the equations
$$\phi_m(i,0) = \sum_{d=1}^{D}\hat{x}_d\,\phi_m(i,d), \qquad i = 1,\dots,D$$
where
$$\phi_m(i,d) = \sum_{n} F_m^{*}(n-i)\,F_m^{*}(n-d).$$
This is a system of D equations in D unknowns; solving it gives the order-D linear prediction coefficients $\hat{x}_1,\dots,\hat{x}_D$ of the current frame.
(2) From the order-D linear prediction coefficients, compute the order-D linear prediction cepstrum coefficients $x_1,\dots,x_D$ by
$$x_d = \hat{x}_d + \sum_{k=1}^{d-1}\frac{k}{d}\,x_k\,\hat{x}_{d-k}, \qquad d = 1,\dots,D$$
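The LPC-to-LPCC recursion above translates directly into a short Python sketch; the input coefficients below are arbitrary illustrative values, not from the patent:

```python
def lpc_to_lpcc(a):
    """Convert order-D linear prediction coefficients a = [a_1, ..., a_D]
    into the D cepstral coefficients via
    x_d = a_d + sum_{k=1}^{d-1} (k/d) * x_k * a_{d-k}."""
    D = len(a)
    x = []
    for d in range(1, D + 1):
        c = a[d - 1]
        for k in range(1, d):
            c += (k / d) * x[k - 1] * a[d - 1 - k]
        x.append(c)
    return x

# x1 = a1 = 0.5; x2 = a2 + (1/2) x1 a1 = 0.375
c = lpc_to_lpcc([0.5, 0.25, 0.125])
```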
With the above method, the D-dimensional feature vectors of all speakers, both those for training and those for identification, are computed. Suppose that in the training set a given speaker (say the g-th speaker) has N corresponding feature vectors; that speaker's training set can then be written $X^{(g)}$ (G being the number of speakers). Since each training run uses the training data of a single speaker, for notational convenience the superscript "(g)" of $X^{(g)}$ is omitted in the present invention, i.e. $X = \{x_n\}_{n=1,\dots,N}$, where $x_n = (x_{n1},\dots,x_{nD})$ is the n-th D-dimensional feature vector obtained for the g-th speaker through the pre-processing and feature-extraction steps.
Third step: training
Since the speaker identification here is text-independent, a mixture model is adopted to model the extracted speech feature vectors. The present invention designs a weighted Bayesian mixture model (WBMM). Compared with the traditional Gaussian mixture model (GMM) for text-independent speaker identification, the WBMM has two significant differences. First, the WBMM introduces an additional parameter $\alpha$, i.e. it adopts a weighted likelihood function to describe the training data; the advantage is that the weight of the training data in the whole model can be adjusted more flexibly, giving better control over the relative weight of the prior information and the training data. Secondly, the parameters of the WBMM are treated as random variables, and their posterior distributions are computed under a Bayesian framework rather than estimating the parameter values directly; this approach achieves good results when the training data are insufficient.
Specifically, the likelihood function of the observed data set X is established as
$$p(X\mid\pi,\mu,T,\alpha) = \prod_{n=1}^{N}\Bigl[\sum_{i=1}^{K}\pi_i\,\mathcal{N}\bigl(x_n\mid\mu_i,T_i^{-1}\bigr)\Bigr]^{\frac{1-\alpha}{2}}$$
In the above formula, K is the number of mixture components, usually any integer from 16 to 32 in speaker identification. To introduce prior information and fuse it with the training data, the model parameters $\pi$, $\mu$, T are treated as random variables and given corresponding prior distributions. Specifically, $\pi = \{\pi_i\}_{i=1,\dots,K}$ follows a Dirichlet prior distribution, where $c(\lambda_0)$ is the normalizing factor of this distribution; $\{\mu, T\} = \{\mu_i, T_i\}_{i=1,\dots,K}$ follows a joint Gaussian-Wishart distribution (the product of a Gaussian distribution and a Wishart distribution, $\mathcal{N}(\cdot)\mathcal{W}(\cdot)$), that is:
$$p(\mu,T) = p(\mu\mid T)\,p(T) = \prod_{i=1}^{K}\mathcal{N}\bigl(\mu_i\mid m_0,(\beta_0 T_i)^{-1}\bigr)\,\mathcal{W}\bigl(T_i\mid\nu_0,V_0\bigr),$$
where $\{m_0, \beta_0, \nu_0, V_0\}$ are the hyper-parameters of this joint Gaussian-Wishart distribution: $m_0$ is a D-dimensional column vector, $\beta_0$ and $\nu_0$ are scalars, and $V_0$ is a $D\times D$ matrix. In addition, a hidden variable $Z = \{z_n\}_{n=1,\dots,N}$ is introduced, where in $z_n = (z_{n1},\dots,z_{ni},\dots,z_{nK})$ exactly one element is 1 and the rest are 0. The role of $z_n$ is to indicate which mixture component of the WBMM produced $x_n$; for example, when $x_n$ is produced by the i-th mixture component, $z_{ni} = 1$.
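The weighted likelihood above can be sketched numerically for the scalar case D = 1 (a simplifying assumption; the patent works with D-dimensional vectors and precision matrices):

```python
import math

def weighted_log_likelihood(X, pi, mu, tau, alpha):
    """log of p(X | pi, mu, T, alpha)
       = prod_n [ sum_i pi_i N(x_n | mu_i, T_i^-1) ]^{(1-alpha)/2},
    scalar case D = 1, where tau_i is the precision of component i."""
    w = (1.0 - alpha) / 2.0
    total = 0.0
    for x in X:
        mix = sum(p * math.sqrt(t / (2.0 * math.pi)) * math.exp(-0.5 * t * (x - m) ** 2)
                  for p, m, t in zip(pi, mu, tau))
        total += w * math.log(mix)
    return total

# with alpha = -1 the exponent (1 - alpha)/2 equals 1: the ordinary log-likelihood
ll = weighted_log_likelihood([0.0], [1.0], [0.0], [1.0], alpha=-1.0)
```

More negative $\alpha$ scales the data term up relative to the prior, which is exactly the weighting role $\alpha$ plays in the model.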
Under the WBMM defined above, the steps of the training process are as follows:
(1) Set the values of the hyper-parameters $\{\lambda_0, m_0, \beta_0, \nu_0, V_0\}$ of the WBMM; specifically, $\lambda_0 = 0.01$, $m_0 = 0$ (the D-dimensional zero vector), $\beta_0 = 1$, $\nu_0 = D$, $V_0 = 400I$ (I the identity matrix).
(2) Set the value of $\alpha$; $\alpha$ may be any integer between $-8$ and $-1$.
(3) Generate N random integers uniformly distributed on the interval $[1, K]$ and count the frequency with which each integer occurs on this interval; that is, if the integer i was produced $N_i$ times, then $\theta_i = N_i/N$. For each $x_n$, the initial distribution of the corresponding hidden variable $z_n$ is
$$q(z_n) = \prod_{i=1}^{K}\theta_i^{z_{ni}}, \qquad q(z_{ni}=1)=\theta_i.$$
In addition, set the iteration counter $t = 1$ and start the iterative loop.
(4) Compute three intermediate variables:
$$s_i = \sum_{n=1}^{N} q(z_{ni}=1)$$
$$\bar{x}_i = \frac{1}{s_i}\sum_{n=1}^{N} q(z_{ni}=1)\,x_n$$
$$C_i = \frac{1}{s_i}\sum_{n=1}^{N} q(z_{ni}=1)\,(x_n-\bar{x}_i)(x_n-\bar{x}_i)^{T}$$
(5) Update the distribution of the random variables $\{\pi_i\}_{i=1,\dots,K}$; it still follows a Dirichlet distribution, i.e. $q(\pi_i) = \mathrm{Dir}(\pi_i\mid\lambda_i)$, and the corresponding hyper-parameters $\{\lambda_i\}_{i=1,\dots,K}$ are updated as
$$\lambda_i = \lambda_0 + \frac{1-\alpha}{2}\,s_i$$
(6) Update the distribution of the random variables $\{\mu_i, T_i\}_{i=1,\dots,K}$; it still follows a joint Gaussian-Wishart distribution, and the corresponding hyper-parameters $\{m_i, \beta_i, \nu_i, V_i\}_{i=1,\dots,K}$ are updated as
$$\beta_i = \beta_0 + \frac{1-\alpha}{2}\,s_i,$$
$$m_i = \frac{1}{\beta_i}\Bigl(\beta_0 m_0 + \frac{1-\alpha}{2}\,s_i\,\bar{x}_i\Bigr),$$
$$\nu_i = \nu_0 + \frac{1-\alpha}{2}\,s_i,$$
$$V_i^{-1} = V_0^{-1} + \frac{1-\alpha}{2}\,s_i C_i + \frac{\beta_0 s_i(1-\alpha)}{2\beta_0 + s_i(1-\alpha)}\,(\bar{x}_i-m_0)(\bar{x}_i-m_0)^{T};$$
(7) Update the distribution of the hidden variables $\{z_n\}_{n=1,\dots,N}$ as follows:
$$q(z_n) = \prod_{i=1}^{K}\Bigl(\frac{\gamma_{ni}}{\sum_{j=1}^{K}\gamma_{nj}}\Bigr)^{z_{ni}}$$
where
$$\gamma_{ni} = \exp\Bigl\{\frac{1-\alpha}{2}\,\langle\ln\pi_i\rangle + \frac{1-\alpha}{4}\Bigl[\langle\ln|T_i|\rangle - D\ln(2\pi) - \langle(x_n-\mu_i)^{T} T_i (x_n-\mu_i)\rangle\Bigr]\Bigr\}$$
In the above formula, each expectation $\langle\cdot\rangle$ is computed as
$$\langle\ln\pi_i\rangle = \psi(\lambda_i) - \psi\Bigl(\sum_{j=1}^{K}\lambda_j\Bigr),$$
$$\langle\ln|T_i|\rangle = \sum_{d=1}^{D}\psi\Bigl(\frac{\nu_i+1-d}{2}\Bigr) + D\ln 2 + \ln|V_i|,$$
$$\langle(x_n-\mu_i)^{T} T_i (x_n-\mu_i)\rangle = D\beta_i^{-1} + \nu_i\,(x_n-m_i)^{T} V_i (x_n-m_i)$$
where $\psi(\cdot)$ is the standard digamma function (the derivative of the logarithm of the Gamma function $\Gamma(\cdot)$, i.e. $\psi(\cdot) = (\ln\Gamma(\cdot))'$). Thus, $q(z_{ni}=1) = \gamma_{ni}/\sum_{j=1}^{K}\gamma_{nj}$.
(8) Compute the marginal likelihood value $\mathrm{MLIK}_t$ after the current iteration, t being the current iteration count:
$$\mathrm{MLIK}_t = \sum_{n=1}^{N}\sum_{i=1}^{K} q(z_{ni}=1)\cdot\frac{1-\alpha}{2}\cdot\Bigl\{\langle\ln\pi_i\rangle + 0.5\times\bigl[\langle\ln|T_i|\rangle - D\ln(2\pi) - \langle(x_n-\mu_i)^{T} T_i (x_n-\mu_i)\rangle\bigr]\Bigr\}$$
where each expectation $\langle\cdot\rangle$ is computed as in step (7).
(9) Compute the difference between the marginal likelihood values after the current and the previous iterations, $\Delta\mathrm{MLIK} = \mathrm{MLIK}_t - \mathrm{MLIK}_{t-1}$; if $\Delta\mathrm{MLIK} \le \delta$, the parameter estimation process ends; otherwise return to step (4), increase t by 1, and proceed to the next iteration; the threshold $\delta$ takes values in the range $10^{-5}$ to $10^{-4}$.
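The sufficient statistics of step (4) and the hyper-parameter updates of steps (5) and (6) can be sketched for the scalar case D = 1 (a simplifying assumption; the real model uses vectors and matrices, and the toy data and hard assignments below are illustrative):

```python
def wbmm_update(x, q, alpha=-2.0, lam0=0.01, m0=0.0, beta0=1.0, nu0=1.0, Vinv0=1.0 / 400):
    """One pass of steps (4)-(6) for D = 1: accumulate s_i, xbar_i, C_i and
    update each component's hyper-parameters; q[n][i] is q(z_ni = 1).
    Assumes every s_i > 0 (as in the patent's description)."""
    w = (1.0 - alpha) / 2.0          # the weighting factor (1 - alpha) / 2
    N, K = len(x), len(q[0])
    out = []
    for i in range(K):
        s = sum(q[n][i] for n in range(N))
        xbar = sum(q[n][i] * x[n] for n in range(N)) / s
        C = sum(q[n][i] * (x[n] - xbar) ** 2 for n in range(N)) / s
        out.append({
            "lam": lam0 + w * s,
            "beta": beta0 + w * s,
            "m": (beta0 * m0 + w * s * xbar) / (beta0 + w * s),
            "nu": nu0 + w * s,
            "Vinv": Vinv0 + w * s * C
                    + (beta0 * s * (1 - alpha)) / (2 * beta0 + s * (1 - alpha))
                    * (xbar - m0) ** 2,
        })
    return out

# two points hard-assigned to two components; alpha = -1 gives weight w = 1
p = wbmm_update([0.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], alpha=-1.0)
```

For component 2 this gives $\beta = \beta_0 + s = 2$, $m = 1$, and $V^{-1} = 1/400 + (2/4)\cdot 4 = 2.0025$, matching the update formulas term by term.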
The training process that estimates the parameters and the distributions of the random variables in the WBMM is shown in the left dashed box of Fig. 1. It should be noted that the Dirichlet distribution $\mathrm{Dir}(\cdot)$, the Gaussian distribution $\mathcal{N}(\cdot)$, the Wishart distribution $\mathcal{W}(\cdot)$ and the Gamma function $\Gamma(\cdot)$ mentioned in the above steps are all functions with canonical forms; their expressions appear in most probability and statistics books and reference materials, they are well known to and frequently used by technical personnel in this field, and when implementing the present invention one need only consult a corresponding probability and statistics textbook or encyclopedia entry to obtain them easily, so their concrete forms are not given one by one here.
For each speaker's training set $X^{(1)}, \dots, X^{(g)}, \dots, X^{(G)}$, training in this way yields the corresponding weighted Bayesian mixture models $\mathrm{WBMM}_1, \dots, \mathrm{WBMM}_g, \dots, \mathrm{WBMM}_G$ (G being the number of speakers).
Fourth step: identification
In the identification process, the speech of the current speaker to be identified first passes through the pre-processing of the first step and the feature extraction of the second step, giving the corresponding D-dimensional feature vector $x'$. Its marginal likelihood values $\{\mathrm{MLIK}_g(x')\}_{g=1,\dots,G}$ under each speaker's model $\mathrm{WBMM}_1, \dots, \mathrm{WBMM}_g, \dots, \mathrm{WBMM}_G$ are computed. For example, the marginal likelihood of $x'$ under the g-th speaker model $\mathrm{WBMM}_g$ is
$$\mathrm{MLIK}_g(x') = \sum_{i=1}^{K} q(z_{ni}=1)\cdot\frac{1-\alpha}{2}\cdot\Bigl\{\langle\ln\pi_i\rangle + 0.5\times\bigl[\langle\ln|T_i|\rangle - D\ln(2\pi) - \langle(x'-\mu_i)^{T} T_i (x'-\mu_i)\rangle\bigr]\Bigr\}$$
where each expectation $\langle\cdot\rangle$ and $q(z_{ni}=1)$ are the expectations and probabilities obtained for the g-th speaker after the third-step training (from step (7) of the third step; the only difference is that when computing $\langle(x'-\mu_i)^{T} T_i (x'-\mu_i)\rangle$, $x_n$ is replaced by $x'$).
The final identification result is the speaker corresponding to the maximal $\mathrm{MLIK}_g(x')$, that is:
$$\mathrm{speaker}(x') = \arg\max_{g=1,\dots,G} \mathrm{MLIK}_g(x').$$
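The identification rule above can be sketched for the scalar case D = 1; the trained quantities are supplied as ready-made dictionaries (an illustrative assumption; in the full method they come from the third-step training):

```python
import math

def mlik(xp, comps, alpha=-2.0):
    """Marginal-likelihood score of test vector xp under one trained model,
    scalar case D = 1; each entry of comps holds q = q(z_i = 1),
    ln_pi = <ln pi_i>, ln_T = <ln|T_i|>, and the trained (m, beta, nu, V)."""
    w = (1.0 - alpha) / 2.0
    total = 0.0
    for c in comps:
        # <(x' - mu)^T T (x' - mu)> = 1/beta + nu * V * (x' - m)^2 for D = 1
        quad = 1.0 / c["beta"] + c["nu"] * c["V"] * (xp - c["m"]) ** 2
        total += c["q"] * w * (c["ln_pi"] + 0.5 * (c["ln_T"] - math.log(2 * math.pi) - quad))
    return total

def identify(xp, models):
    """Return the 1-based index g of the model with the largest MLIK_g(x')."""
    scores = [mlik(xp, m) for m in models]
    return 1 + max(range(len(scores)), key=lambda g: scores[g])

# two single-component speaker models that differ only in their mean
model_a = [dict(q=1.0, ln_pi=0.0, ln_T=0.0, beta=1.0, nu=1.0, V=1.0, m=0.0)]
model_b = [dict(q=1.0, ln_pi=0.0, ln_T=0.0, beta=1.0, nu=1.0, V=1.0, m=3.0)]
g = identify(0.2, [model_a, model_b])
```

The test vector 0.2 lies near the first model's mean, so the first speaker wins the argmax.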
Performance evaluation of the present invention:
To verify the performance of a system adopting the text-independent speaker identification method based on the weighted Bayesian mixture model (WBMM) of the present invention, it is contrasted with the performance of a traditional text-independent speaker identification system based on the Gaussian mixture model (GMM). The TIMIT data set (speech sampling frequency 16 kHz, quantization depth 16 bits) is selected for the tests. The proposed text-independent speaker identification method is tested in two scenarios: clean speech and telephone speech. To generate the telephone-speech environment, the clean speech is passed through a band-pass filter with an effective bandwidth of 0.3 kHz to 3.4 kHz, and white Gaussian noise with a signal-to-noise ratio (SNR) of 20 dB is then added, yielding the telephone speech. Fig. 2 shows the spectrograms of a segment of original TIMIT clean speech and of the corresponding telephone speech. In pre-processing, the frame length is $\tau$ = 20 ms, the pre-emphasis factor is 0.95, and the feature-vector dimension is D = 20.
The TIMIT database and the generated telephone-speech database each contain 250 speakers, and each speaker has 10 speech segments. Here, 5 of those segments are used for identification, and the remaining 5 segments are used, as the case may be, for training. The mixture number K of both the WBMM and the GMM is fixed at 16. The identification result is measured by the recognition rate, defined as the ratio of the number of speech frames whose speaker is correctly identified to the total number of speech frames.
First, the performance of the WBMM under different $\alpha$ is compared; two situations are considered, namely sufficient training speech data (TU = 5 speech segments used for training) and insufficient training speech data (TU = 2). In addition, to better analyze the performance of the WBMM, it is contrasted with the WGMM, a weighting-based Gaussian mixture model; the difference from the WBMM is that the WGMM gives the parameters no corresponding prior distributions and estimates the relevant parameters by the maximum-likelihood criterion, and when $\alpha = -1$ the WGMM degenerates to the GMM. Fig. 3 gives the recognition rates in the clean-speech case. It can be seen that, whether TU = 2 or TU = 5, for $\alpha$ taking any integer between $-8$ and $-1$ the recognition rate of the WBMM is higher than that of the WGMM, and also higher than that of the GMM. Fig. 4 gives the recognition rates in the telephone-speech case: although the presence of noise lowers the recognition rate overall compared with clean speech, the WBMM is still better than the corresponding WGMM (which at $\alpha = -1$ is the GMM). The reason is that the WBMM proposed by the present invention adopts a weighted likelihood function, which better highlights the effect of the observed data; in addition, it introduces prior information and adopts a training mode based on the Bayesian criterion, so that prior information and observed data can be fully exploited and the recognition rate improves greatly. Moreover, under both speech environments there exists an optimal $\alpha$: for clean speech with TU = 5, $\alpha = -2$; for clean speech with TU = 2, $\alpha = -4$; for telephone speech with TU = 5, $\alpha = -3$; for telephone speech with TU = 2, $\alpha = -7$. For other databases, the optimal $\alpha$ can likewise be determined from experimental results in this way.
Then, the overall recognition rates of the various systems are compared for different numbers of speakers. Fig. 5 gives, for telephone speech with TU = 5, the recognition rates of the WBMM at $\alpha = -1, -3, -6$ and of the WGMM at $\alpha = -1$ (GMM), $-3$, $-6$ when the number of speakers is 50, 100, 150, 200 and 250. It can be seen that the WBMM proposed by the present invention achieves a higher recognition rate than the corresponding WGMM and GMM. In addition, Fig. 6 gives the recognition rates for clean speech with TU = 2; the WBMM proposed by the present invention likewise obtains better performance than the WGMM and the GMM.
The scope of protection claimed by the present invention is not limited to the description of this embodiment; the specific content shall be governed by the claims.

Claims (5)

1. A text-independent speaker identification method based on a weighted Bayesian mixture model, characterized in that the method comprises the following steps:
Step 1: pre-process the speech signal: sampling and quantization, pre-emphasis, framing and windowing;
Step 2: feature extraction on each speech frame: compute the order-D linear prediction cepstrum coefficients and use them as the frame's D-dimensional feature vector;
Step 3: for each speaker's training set $X = \{x_n\}_{n=1,\dots,N}$, where N is the number of that speaker's D-dimensional training feature vectors $x_n$, model X with a weighted Bayesian mixture model, WBMM, and estimate, by training, the parameter values and the distributions of the random variables in the WBMM; if G speakers are to be identified by the system, repeat the training process G times to obtain $\mathrm{WBMM}_1, \dots, \mathrm{WBMM}_g, \dots, \mathrm{WBMM}_G$;
Step 4: for the speech to be identified, first perform pre-processing and feature extraction to obtain the corresponding D-dimensional feature vector $x'$; compute the marginal likelihood values $\{\mathrm{MLIK}_g(x')\}_{g=1,\dots,G}$ of $x'$ under each speaker's model $\mathrm{WBMM}_1, \dots, \mathrm{WBMM}_g, \dots, \mathrm{WBMM}_G$; the final identification result is the speaker corresponding to the maximal $\mathrm{MLIK}_g(x')$, that is:
$$\mathrm{speaker}(x') = \arg\max_{g=1,\dots,G} \mathrm{MLIK}_g(x').$$
  2. According to claim 1 a kind of based on weighting Bayes mixture model with method for distinguishing speek person text-independent, it is characterized in that, described in described method step 3 by training, to estimate the step of distribution of parameter value in WBMM and stochastic variable as follows:
    Step 3-1: set the super parameter { λ in WBMM 0, m 0, β 0, ν 0, V 0value, wherein, λ 0=0.01, m 0=0 (0 is D dimension zero vector), β 0=1, ν 0=D, V 0=400I (I is the unit matrix of (D * D));
    Step 3-2: set the value of $\alpha$; $\alpha$ may be any integer between $-8$ and $-1$;
    Step 3-3: draw $N$ random integers uniformly distributed on the interval $[1, K]$, where $K$ is the number of mixture components of the WBMM and may be any integer from 16 to 32, and count the frequency with which each integer on this interval occurs; that is, if the integer $i$ was drawn $N_i$ times, then $\theta_i = N_i / N$; for each $\{x_n\}_{n=1,\dots,N}$, the initial distribution of the corresponding hidden variable $\{z_n\}_{n=1,\dots,N}$ is
    $$q(z_n) = \prod_{i=1}^{K} q(z_{ni} = 1) = \prod_{i=1}^{K} \theta_i;$$
    In addition, set the iteration counter $t = 1$ and start the iterative loop;
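Steps 3-1 to 3-3 can be sketched as follows; the function name and the dictionary layout are illustrative only, not part of the claim, and NumPy is assumed:

```python
import numpy as np

def init_wbmm(X, K=16, seed=0):
    """Steps 3-1 to 3-3: set the prior hyperparameters and build the
    initial responsibilities q(z_ni = 1) from N uniform draws on [1, K]."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Step 3-1: prior hyperparameters {lambda_0, m_0, beta_0, nu_0, V_0}
    prior = dict(lam0=0.01, m0=np.zeros(D), beta0=1.0, nu0=float(D),
                 V0=400.0 * np.eye(D))
    # Step 3-3: theta_i = N_i / N from N uniform integers on [1, K]
    labels = rng.integers(1, K + 1, size=N)
    theta = np.bincount(labels, minlength=K + 1)[1:] / N
    # every frame starts from the same initial weights theta
    resp = np.tile(theta, (N, 1))
    return prior, resp
```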
    Step 3-4: compute three intermediate quantities:
    $$s_i = \sum_{n=1}^{N} q(z_{ni} = 1)$$
    $$\bar{x}_i = \frac{1}{s_i} \sum_{n=1}^{N} q(z_{ni} = 1) \cdot x_n$$
    $$C_i = \frac{1}{s_i} \sum_{n=1}^{N} q(z_{ni} = 1) \cdot (x_n - \bar{x}_i)(x_n - \bar{x}_i)^T$$
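In NumPy terms, the three statistics of step 3-4 might be computed as below, with `resp[n, i]` standing for $q(z_{ni} = 1)$; a sketch, not part of the claim:

```python
import numpy as np

def statistics(X, resp):
    """Step 3-4: weighted counts s_i, weighted means x_bar_i, and weighted
    scatter matrices C_i of the K mixture components."""
    K = resp.shape[1]
    s = resp.sum(axis=0)                              # s_i
    xbar = (resp.T @ X) / s[:, None]                  # x_bar_i
    C = np.empty((K, X.shape[1], X.shape[1]))
    for i in range(K):
        d = X - xbar[i]                               # x_n - x_bar_i
        C[i] = (resp[:, i, None] * d).T @ d / s[i]    # C_i
    return s, xbar, C
```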
    Step 3-5: update the distribution of the random variables $\{\pi_i\}_{i=1,\dots,K}$ in the WBMM, where $\pi_i$ denotes the mixing proportion of the $i$-th component and follows a Dirichlet distribution, i.e. $q(\pi_i) = \mathrm{Dir}(\pi_i \mid \lambda_i)$; the corresponding hyperparameters $\{\lambda_i\}_{i=1,\dots,K}$ are updated as follows:
    $$\lambda_i = \lambda_0 + \frac{1 - \alpha}{2} \cdot s_i$$
    Step 3-6: update the distribution of the random variables $\{\mu_i, T_i\}_{i=1,\dots,K}$ in the WBMM, where $\mu_i$ and $T_i$ denote the mean and the inverse covariance matrix of the $i$-th component; they follow a joint Gaussian-Wishart distribution, i.e. $q(\mu_i, T_i) = \mathcal{N}(\mu_i \mid m_i, (\beta_i T_i)^{-1}) \, \mathcal{W}(T_i \mid \nu_i, V_i)$; the corresponding hyperparameters $\{m_i, \beta_i, \nu_i, V_i\}_{i=1,\dots,K}$ are updated as follows:
    $$\beta_i = \beta_0 + \frac{1 - \alpha}{2} \cdot s_i,$$
    $$m_i = \frac{1}{\beta_i} \left( \beta_0 m_0 + \frac{1 - \alpha}{2} \cdot s_i \cdot \bar{x}_i \right),$$
    $$\nu_i = \nu_0 + \frac{1 - \alpha}{2} \cdot s_i,$$
    $$V_i^{-1} = V_0^{-1} + \frac{1 - \alpha}{2} \cdot s_i \cdot C_i + \frac{\beta_0 s_i (1 - \alpha)}{2\beta_0 + s_i (1 - \alpha)} \cdot (\bar{x}_i - m_0)(\bar{x}_i - m_0)^T;$$
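Under the same illustrative conventions, the updates of steps 3-5 and 3-6 could look like this; `prior` is a hypothetical dictionary holding $\{\lambda_0, m_0, \beta_0, \nu_0, V_0\}$, and NumPy is assumed:

```python
import numpy as np

def update_hyperparams(s, xbar, C, prior, alpha):
    """Steps 3-5 and 3-6: update the Dirichlet hyperparameters lam_i and
    the Gaussian-Wishart hyperparameters (m_i, beta_i, nu_i, V_i^{-1});
    w = (1 - alpha)/2 is the data weight introduced by alpha."""
    w = (1.0 - alpha) / 2.0
    lam = prior['lam0'] + w * s                          # lambda_i
    beta = prior['beta0'] + w * s                        # beta_i
    nu = prior['nu0'] + w * s                            # nu_i
    m = (prior['beta0'] * prior['m0'] + w * s[:, None] * xbar) / beta[:, None]
    Vinv = np.empty_like(C)
    V0inv = np.linalg.inv(prior['V0'])
    for i in range(len(s)):
        d = (xbar[i] - prior['m0'])[:, None]             # x_bar_i - m_0
        coef = (prior['beta0'] * s[i] * (1 - alpha)
                / (2 * prior['beta0'] + s[i] * (1 - alpha)))
        Vinv[i] = V0inv + w * s[i] * C[i] + coef * (d @ d.T)
    return lam, m, beta, nu, Vinv
```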
    Step 3-7: update the distribution of the hidden variables $\{z_n\}_{n=1,\dots,N}$ as follows:
    $$q(z_n) = \prod_{i=1}^{K} \left( \frac{\gamma_{ni}}{\sum_{j=1}^{K} \gamma_{nj}} \right)^{z_{ni}}$$
    where
    $$\gamma_{ni} = \exp\left\{ \frac{1 - \alpha}{2} \cdot \langle \ln \pi_i \rangle + \frac{1 - \alpha}{4} \cdot \left[ \langle \ln |T_i| \rangle - D \ln(2\pi) - \langle (x_n - \mu_i)^T T_i (x_n - \mu_i) \rangle \right] \right\}$$
    In the formula above, each expectation $\langle \cdot \rangle$ is computed as follows:
    $$\langle \ln \pi_i \rangle = \psi(\lambda_i) - \psi\!\left( \sum_{j=1}^{K} \lambda_j \right),$$
    $$\langle \ln |T_i| \rangle = \sum_{d=1}^{D} \psi\!\left( \frac{\nu_i + 1 - d}{2} \right) + D \ln 2 + \ln |V_i|,$$
    $$\langle (x_n - \mu_i)^T T_i (x_n - \mu_i) \rangle = D \beta_i^{-1} + \nu_i (x_n - m_i)^T V_i (x_n - m_i)$$
    where $\psi(\cdot)$ is the standard digamma function (the derivative of the logarithm of the Gamma function $\Gamma(\cdot)$, i.e. $\psi(\cdot) = (\ln \Gamma(\cdot))'$), and $q(z_{ni} = 1) = \gamma_{ni} / \sum_{j=1}^{K} \gamma_{nj}$;
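The responsibility update of step 3-7 can be sketched as follows. The digamma function $\psi$ is approximated here by a central difference of `math.lgamma` to keep the sketch dependency-free; in practice `scipy.special.digamma` would be used. The log-domain shift (subtracting each row maximum) leaves the normalized $q(z_{ni} = 1)$ unchanged while avoiding overflow in the exponential.

```python
import numpy as np
from math import lgamma

def digamma(x, h=1e-5):
    """Crude numerical psi(x) = (ln Gamma(x))' via central difference."""
    return (lgamma(x + h) - lgamma(x - h)) / (2.0 * h)

def responsibilities(X, lam, m, beta, nu, Vinv, alpha):
    """Step 3-7: recompute q(z_ni = 1) = gamma_ni / sum_j gamma_nj."""
    N, D = X.shape
    K = len(lam)
    log_g = np.empty((N, K))
    psi_total = digamma(lam.sum())
    for i in range(K):
        V = np.linalg.inv(Vinv[i])
        ln_pi = digamma(lam[i]) - psi_total                        # <ln pi_i>
        ln_detT = (sum(digamma((nu[i] + 1 - d) / 2.0) for d in range(1, D + 1))
                   + D * np.log(2.0) + np.log(np.linalg.det(V)))   # <ln|T_i|>
        diff = X - m[i]
        quad = D / beta[i] + nu[i] * np.einsum('nd,de,ne->n', diff, V, diff)
        log_g[:, i] = ((1 - alpha) / 2.0 * ln_pi
                       + (1 - alpha) / 4.0
                       * (ln_detT - D * np.log(2 * np.pi) - quad))
    g = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    return g / g.sum(axis=1, keepdims=True)
```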
    Step 3-8: compute the marginal likelihood value $\mathrm{MLIK}_t$ after the current iteration, $t$ being the current iteration number:
    $$\mathrm{MLIK}_t = \sum_{n=1}^{N} \sum_{i=1}^{K} q(z_{ni} = 1) \cdot \frac{1 - \alpha}{2} \cdot \left\{ \langle \ln \pi_i \rangle + 0.5 \times \left[ \langle \ln |T_i| \rangle - D \ln(2\pi) - \langle (x_n - \mu_i)^T T_i (x_n - \mu_i) \rangle \right] \right\};$$
    Step 3-9: compute the difference $\Delta \mathrm{MLIK} = \mathrm{MLIK}_t - \mathrm{MLIK}_{t-1}$ between the marginal likelihood values after the current and the previous iteration; if $\Delta \mathrm{MLIK} \le \delta$, the process of estimating, by training, the parameter values and the distributions of the random variables in the WBMM terminates; otherwise, return to step 3-4 above, increase $t$ by 1, and perform the next iteration; the threshold $\delta$ takes a value in the range $10^{-5}$ to $10^{-4}$.
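The stopping rule of step 3-9 is an improvement check on the marginal likelihood. A generic sketch; the `step_fn` and `mlik_fn` callables are hypothetical placeholders for one pass of steps 3-4 to 3-7 and the $\mathrm{MLIK}_t$ computation of step 3-8, and `max_iter` is an added safety cap not present in the claim:

```python
def run_until_converged(step_fn, mlik_fn, state, delta=1e-4, max_iter=200):
    """Step 3-9: iterate until Delta MLIK = MLIK_t - MLIK_{t-1} <= delta
    (the claim puts delta in the range 1e-5 .. 1e-4)."""
    prev = float('-inf')
    cur = prev
    for t in range(1, max_iter + 1):
        state = step_fn(state)       # one pass of steps 3-4 .. 3-7
        cur = mlik_fn(state)         # MLIK_t (step 3-8)
        if cur - prev <= delta:      # converged: stop training
            break
        prev = cur
    return state, cur, t
```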
  3. The text-independent speaker recognition method based on a weighted Bayesian mixture model according to claim 1, characterized in that in the recognition process of step 4 of the method, the marginal likelihood values $\{\mathrm{MLIK}_g(x')\}_{g=1,\dots,G}$ of $x'$ under the models $\mathrm{WBMM}_1, \dots, \mathrm{WBMM}_g, \dots, \mathrm{WBMM}_G$ of the respective speakers are computed by the following formula:
    $$\mathrm{MLIK}_g(x') = \sum_{i=1}^{K} q(z_{ni} = 1) \cdot \frac{1 - \alpha}{2} \cdot \left\{ \langle \ln \pi_i \rangle + 0.5 \times \left[ \langle \ln |T_i| \rangle - D \ln(2\pi) - \langle (x' - \mu_i)^T T_i (x' - \mu_i) \rangle \right] \right\}$$
    where $\langle \cdot \rangle$ and $q(z_{ni} = 1)$ are the expectations and probabilities in the trained model $\mathrm{WBMM}_g$.
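For a single test vector $x'$, the scoring formula of this claim can be sketched as below; `resp`, `ln_pi`, `ln_detT`, and `quad_fn` are hypothetical containers for the probabilities $q(z_{ni} = 1)$ and the expectations $\langle \ln \pi_i \rangle$, $\langle \ln |T_i| \rangle$, $\langle (x' - \mu_i)^T T_i (x' - \mu_i) \rangle$ of a trained model:

```python
import numpy as np

def mlik_score(x, resp, ln_pi, ln_detT, quad_fn, alpha):
    """Claim 3: marginal likelihood MLIK_g(x') of one D-dimensional test
    vector under a trained WBMM_g, summed over its K components."""
    D = x.shape[0]
    total = 0.0
    for i in range(len(ln_pi)):
        total += resp[i] * (1 - alpha) / 2.0 * (
            ln_pi[i]
            + 0.5 * (ln_detT[i] - D * np.log(2 * np.pi) - quad_fn(x, i)))
    return total
```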
  4. The text-independent speaker recognition method based on a weighted Bayesian mixture model according to claim 1, characterized in that the method, within the Bayesian framework, introduces prior information and fuses it with the training data.
  5. The text-independent speaker recognition method based on a weighted Bayesian mixture model according to claim 1, characterized in that the method controls the weight of the data in training with an additional parameter $\alpha$.
CN201410361706.0A 2014-07-25 2014-07-25 Method for identifying speaker unrelated to text based on weighted Bayes mixture model Expired - Fee Related CN104183239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410361706.0A CN104183239B (en) 2014-07-25 2014-07-25 Method for identifying speaker unrelated to text based on weighted Bayes mixture model

Publications (2)

Publication Number Publication Date
CN104183239A true CN104183239A (en) 2014-12-03
CN104183239B CN104183239B (en) 2017-04-19

Family

ID=51964229



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417205A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Semantic understanding training method and system
CN108630207A (en) * 2017-03-23 2018-10-09 富士通株式会社 Method for identifying speaker and speaker verification's equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020093591A1 (en) * 2000-12-12 2002-07-18 Nec Usa, Inc. Creating audio-centric, imagecentric, and integrated audio visual summaries
CN102034472A (en) * 2009-09-28 2011-04-27 戴红霞 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
CN102543063A (en) * 2011-12-07 2012-07-04 华南理工大学 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RONG ZHENG et al.: "Variational Bayes based I-vector for speaker diarization of telephone conversations", Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on *
WAN Hongjie et al.: "Research on speaker recognition based on Bayesian networks", Journal of Computer Applications *


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20201209

Address after: Room 214, building D5, No. 9, Kechuang Avenue, Zhongshan Science and Technology Park, Jiangbei new district, Nanjing, Jiangsu Province

Patentee after: Nanjing Tian Gu Information Technology Co.,Ltd.

Address before: 210003, 66 new model street, Gulou District, Jiangsu, Nanjing

Patentee before: NANJING University OF POSTS AND TELECOMMUNICATIONS

Effective date of registration: 20201209

Address after: 210024 No. 20 Beijing West Road, Gulou District, Nanjing City, Jiangsu Province

Patentee after: STATE GRID JIANGSU ELECTRIC POWER Co.,Ltd. INFORMATION & TELECOMMUNICATION BRANCH

Address before: Room 214, building D5, No. 9, Kechuang Avenue, Zhongshan Science and Technology Park, Jiangbei new district, Nanjing, Jiangsu Province

Patentee before: Nanjing Tian Gu Information Technology Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170419