CN102034472A - Speaker recognition method based on Gaussian mixture model embedded with time delay neural network

Speaker recognition method based on Gaussian mixture model embedded with time delay neural network

Info

Publication number
CN102034472A
CN102034472A (application CN2009100354240A)
Authority
CN
China
Prior art keywords
tdnn
gmm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2009100354240A
Other languages
Chinese (zh)
Inventor
戴红霞
王吉林
余华
魏昕
赵力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN2009100354240A priority Critical patent/CN102034472A/en
Publication of CN102034472A publication Critical patent/CN102034472A/en
Legal status: Pending

Abstract

The invention discloses a speaker recognition method based on a Gaussian mixture model (GMM) with an embedded time delay neural network (TDNN). The method combines the respective strengths of the TDNN and the GMM: the TDNN is embedded in the GMM and, through its delay-line transformation, fully exploits the temporal ordering of the input feature vectors; the residual between the TDNN's input and output vectors is used to train the GMM by the expectation-maximization method; the updated GMM parameters and the residuals in turn yield a likelihood probability that is used to adjust the TDNN parameters by back-propagation with an inertia term, so that the GMM and TDNN parameters are updated alternately. Experiments show that the recognition rate of the method improves to a certain extent over a baseline GMM at various signal-to-noise ratios.

Description

A speaker recognition method based on a Gaussian mixture model with an embedded time delay neural network
Technical field
The present invention relates to a speaker recognition method, and in particular to a speaker recognition method based on a Gaussian mixture model with an embedded time delay neural network.
Background technology
In applications such as access control, credit-card transactions and court evidence, automatic speaker recognition, and in particular text-independent speaker identification, plays an increasingly important role; its goal is to correctly assign the speech to be identified to one of several reference speakers in a voice database.
Among speaker recognition methods, the approach based on the Gaussian mixture model (GMM) has received more and more attention: it offers high recognition rates, simple training and small training-data requirements, and it has become the mainstream recognition method. Because the GMM represents data distributions well, with enough mixture components and enough training data it can approximate any distribution. In practice, however, the GMM has several problems. First, it does not use the temporal information of the speaker's speech: the results of training and recognition are independent of the order of the feature vectors. Second, GMM training always assumes that the feature vectors are mutually independent, which is clearly unreasonable. Third, there is no good guiding principle for choosing the number of mixture components of the GMM; the only recourse is to use a sufficiently large number of them.
Neural networks also occupy an important position in speaker recognition: multilayer perceptrons, radial basis function networks and auto-associative neural networks, among others, have been applied to it successfully. The time delay neural network (TDNN) in particular is widely used in signal processing, speech recognition and speaker recognition. It makes full use of the temporal information of the feature vector sequence, learning and transforming the feature vectors so that the transformed vectors approach a target vector in some sense (usually least squares). At present, however, GMM and TDNN are only used separately for speaker recognition; no method has yet appeared that combines their respective advantages to improve the speaker recognition effect.
Summary of the invention
The object of the present invention is to remedy the deficiencies of the prior art by proposing a speaker recognition method based on a Gaussian mixture model with an embedded time delay neural network. The technical scheme of the invention is as follows:
A speaker recognition method based on a Gaussian mixture model with an embedded time delay neural network, comprising the following steps:
(1) Preprocessing and feature extraction:
First, silence is removed using a method based on energy and zero-crossing rate and noise is removed by spectral subtraction; the speech signal is then pre-emphasized, divided into frames and analyzed by linear prediction (LPC), and cepstral coefficients obtained from the resulting LPC coefficients are used as the feature vectors for speaker recognition.
(2) Training:
During training, the extracted feature vectors are delayed and fed into the TDNN; the TDNN learns the structure of the feature vectors and extracts the temporal information of the feature vector sequence. The learning result is then passed to the GMM in the form of residual feature vectors; the GMM is trained under the expectation-maximization (EM) criterion, and the weight coefficients of the TDNN network are updated by back-propagation with inertia. The training procedure is as follows:
(2-1) Determine the GMM and TDNN structure:
The probability density function of an M-component GMM is a weighted sum of M Gaussian probability density functions:

p(x_t \mid \lambda) = \sum_{i=1}^{M} p_i b_i(x_t)

where x_t is a D-dimensional feature vector (here D = 13), and b_i(x_t) is the component density function, a Gaussian with mean vector u_i and covariance matrix Σ_i:

b_i(x_t) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x_t - u_i)^T \Sigma_i^{-1} (x_t - u_i) \right\}

The mixture weights p_i satisfy the condition

\sum_{i=1}^{M} p_i = 1

The complete GMM parameter set is

\lambda = \{(p_i, u_i, \Sigma_i),\ i = 1, 2, \ldots, M\}
Here a TDNN without feedback is used. After passing through the linear delay block, the feature vector x(n) forms the input of the TDNN; the TDNN applies a nonlinear transformation to the input and then a linear weighting to obtain the output vector, which is compared with the feature vector, normally under the minimum mean square error (MMSE) criterion. Specifically, the ratio of hidden-layer to input-layer neurons of the TDNN is 3:2, and the nonlinear S-shaped activation (the "∫" symbol in Fig. 2) is the sigmoid

f(y) = \frac{1}{1 + e^{-y}}

where y is the input after weighted summation. During training, the inertia coefficient of the neural network is γ = 0.8.
(2-2) Set the convergence condition and the maximum number of iterations. Specifically, convergence is declared when the Euclidean distance between two successive sets of GMM coefficients and TDNN weight coefficients is less than 0.0001; the maximum number of iterations is normally at most 100.
(2-3) Randomly determine the initial TDNN and GMM parameters. The initial TDNN coefficients are set to computer-generated pseudorandom numbers; the initial mixture weights of the GMM can be taken as 1/M, where M is the number of mixture components; the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M clusters with the LBG (Linde, Buzo, Gray) method and computing the mean and variance of each cluster.
(2-4) Feed the feature vectors x(n) into the TDNN and subtract the TDNN output vectors o(n) from the corresponding input vectors x(n) to obtain all residual vectors.
(2-5) Update the GMM parameters by the EM method:
Let the residual vector be r_t. First compute the classification posterior probability

p(i \mid r_t, \lambda) = \frac{p_i b_i(r_t)}{\sum_{k=1}^{M} p_k b_k(r_t)}

then update the mixture weights \bar{p}_i, mean vectors \bar{u}_i and covariances \bar{\Sigma}_i:

\bar{p}_i = \frac{1}{N} \sum_{t=1}^{N} p(i \mid r_t, \lambda)

\bar{u}_i = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)}

\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t^2}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)} - \bar{u}_i^2
(2-6) Using the updated weight, mean vector and variance of each Gaussian component of the GMM, evaluate the likelihood of the residuals and update the TDNN parameters by back-propagation with inertia:
The TDNN network parameters are obtained by maximizing

L(X) = \arg\max_{\omega_{ij}} \prod_{t=1}^{N} p((x_t - o_t) \mid \lambda)

where o_t is the neural network output and x_t the input feature vector. Taking the negative logarithm gives

G(X) = \arg\min_{\omega_{ij}} \left( -\sum_{t=1}^{N} \ln p((x_t - o_t) \mid \lambda) \right)

G(X) is minimized by back-propagation with inertia, whose iteration formula is

\Delta\omega_{ij}^{k}(m+1) = \gamma\, \Delta\omega_{ij}^{k}(m) - (1-\gamma)\, \alpha \left. \frac{\partial F(x)}{\partial \omega_{ij}^{k}} \right|_{\omega_{ij}^{k} = \omega_{ij}^{k}(m)}

where ω_{ij}^k(m) is the weight connecting input x_i and output y_j in the m-th iteration, k is the layer index of the neural network, α is the iteration step size, F(x) = -ln p((x_t - o_t) | λ), and γ is the inertia coefficient.
(2-7) Check whether the convergence condition set in step (2-2) is satisfied or the maximum number of iterations has been reached; if so, stop training; otherwise go back to step (2-4).
(3) Speaker recognition
During recognition, the feature vector sequence X is delayed and fed into the TDNN. The residual sequence R, obtained by subtracting the TDNN output sequence O from X, is passed to the GMM. For a sequence of T residual vectors R = R_1, R_2, ..., R_T, the GMM probability can be written as

P(R \mid \lambda) = \prod_{t=1}^{T} p(R_t \mid \lambda)

or, in the log domain,

L(R \mid \lambda) = \log P(R \mid \lambda) = \sum_{t=1}^{T} \log p(R_t \mid \lambda)

By Bayes' theorem, among the models of the N unknown speakers, the speaker whose model gives the maximum likelihood probability is identified as the target speaker:

i^* = \arg\max_{1 \le i \le N} L(R \mid \lambda_i)
In the speaker recognition method described above, \partial F(x) / \partial \omega_{ij}^{k} is computed as

\frac{\partial F(x)}{\partial \omega_{ij}^{k}} = \frac{\partial F(x)}{\partial y_i^{k}} \frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}}

In the TDNN network, y_i^{k} = \sum_j \omega_{ij}^{k} o_j^{k-1} and o_i^{k} = f(y_i^{k}), where o_i^k is the output and y_i^k the input of the i-th neuron of layer k for sample x, and f is the activation function. Hence

\frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}} = o_j^{k-1}
In the speaker recognition method described above, the computation of \partial F(x) / \partial y_i^{k} splits into two cases, the output layer and the hidden layers of the TDNN.
For the output layer:

\frac{\partial F(x)}{\partial y_i^{k}} = -\frac{1}{p((x-o) \mid \lambda)} \frac{\partial p((x-o) \mid \lambda)}{\partial o_i^{k}} \frac{\partial o_i^{k}}{\partial y_i^{k}}
= -\frac{f'(y_i^{k})}{p((x-o) \mid \lambda)}\; \frac{\partial}{\partial o_i^{k}} \left( \sum_{n=1}^{M} p_n c_n e^{-\frac{1}{2}(x-o-u_n)^{T} \Sigma_n^{-1} (x-o-u_n)} \right)
= -\frac{f'(y_i^{k})}{p((x-o) \mid \lambda)} \sum_{n=1}^{M} p_n c_n\, a_n(x-o-u_n)\, \frac{x_i - o_i - u_{n,i}}{\sigma_{n,i}^{2}}

where a_n(x-o-u_n) = e^{-\frac{1}{2}(x-o-u_n)^{T} \Sigma_n^{-1} (x-o-u_n)} and c_n = \frac{1}{(2\pi)^{D/2} |\Sigma_n|^{1/2}}.
For the hidden layers:

\frac{\partial F(x)}{\partial y_i^{k}} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial y_j^{k+1}}{\partial y_i^{k}} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial \left( \sum_n \omega_{jn}^{k+1} o_n^{k} \right)}{\partial y_i^{k}} = f'(y_i^{k}) \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}}\, \omega_{ji}^{k+1}
In the speaker recognition method described above, the pre-emphasis uses the filter f(Z) = 1 - 0.97 Z^{-1}; framing uses a Hamming window of length 20 ms with a 10 ms shift; the order of the linear prediction analysis is 20; and the feature vector consists of 13 cepstral coefficients.
The advantages and effects of the invention are:
1. The respective advantages of the TDNN and the GMM are fully exploited: the TDNN learns the temporal information of the feature vectors and maps the feature set into a subspace that increases the likelihood probability, which reduces the influence of the unreasonable assumption that the feature vectors are independent, strengthens the likelihood of the target model and weakens the likelihood of non-target models; the GMM contributes high recognition rates, simple training and small training-data requirements. Together they greatly improve the recognition rate of the whole speaker recognition system.
2. The speaker recognition effect of the proposed method, both on clean speech and on speech in noisy environments, improves over using the GMM alone.
Other advantages and effects of the invention are described further below.
Description of drawings
Fig. 1: speaker training and recognition model.
Fig. 2: time delay neural network model.
Fig. 3: comparison on the noise-free 1conv4w-1conv4w task.
Fig. 4: comparison under in-car noise.
Fig. 5: comparison under exhibition-booth noise.
Embodiment
The technical solution of the invention is elaborated below with reference to the drawings and embodiments.
Fig. 1 shows the training and recognition model of speaker recognition with an embedded TDNN network; it differs from the baseline GMM model (which uses only a GMM for speaker recognition) in both training and recognition.
1. Preprocessing and feature extraction
First, silence is removed using a method based on energy and zero-crossing rate and noise is removed by spectral subtraction; the signal is then pre-emphasized by the filter f(Z) = 1 - 0.97 Z^{-1}, divided into frames with a 20 ms Hamming window shifted by 10 ms, and analyzed by order-20 linear prediction (LPC); 13 cepstral coefficients derived from the 20 LPC coefficients form the feature vector for speaker recognition.
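As a concrete illustration, the following Python sketch implements this front end under stated assumptions: 8 kHz sampling (so a 20 ms frame is 160 samples), the Levinson-Durbin recursion for the LPC analysis, and the standard LPC-to-cepstrum recursion. The function names and the small stabilizing constant are ours, not the patent's; silence detection and spectral subtraction are omitted.

```python
import numpy as np

def lpc_cepstrum(frame, lpc_order=20, n_ceps=13):
    """Order-20 LPC via Levinson-Durbin, then the LPC-to-cepstrum recursion."""
    # autocorrelation of the windowed frame, lags 0..lpc_order
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:lpc_order + 1]
    a = np.zeros(lpc_order + 1)          # prediction coefficients a[1..p]
    err = r[0] + 1e-8                    # small floor avoids division by zero
    for i in range(1, lpc_order + 1):    # Levinson-Durbin recursion
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, err = a_new, err * (1.0 - k * k)
    # cepstrum from LPC coefficients (valid because n_ceps <= lpc_order)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c[1:]                         # 13 cepstral coefficients

def extract_features(signal, frame_len=160, hop=80):  # 20 ms / 10 ms at 8 kHz
    # pre-emphasis filter f(Z) = 1 - 0.97 Z^-1
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    window = np.hamming(frame_len)
    feats = [lpc_cepstrum(signal[s:s + frame_len] * window)
             for s in range(0, len(signal) - frame_len + 1, hop)]
    return np.array(feats)               # one 13-dim cepstral vector per frame
```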
2. Speaker model training
During training, the training of the TDNN and the training of the GMM alternate. The TDNN is a kind of multilayer perceptron (MLP), as shown in Fig. 2. The feature vectors, delayed by the linear delay block, form the input of the TDNN; the TDNN learns the structure of the feature vectors and extracts the temporal information of the feature vector sequence. The learning result is then passed to the GMM in the form of residual feature vectors (the difference between the input vector and the TDNN output); the GMM is trained by the expectation-maximization (EM) method, and the weight coefficients of the TDNN network are updated by back-propagation with inertia. Both the TDNN and the GMM are trained under the maximum-likelihood criterion, so that through learning the residual distribution can evolve in the direction of increasing likelihood. The training procedure is as follows:
(1) Determine the GMM and TDNN structure:
The probability density function of an M-component GMM is a weighted sum of M Gaussian probability density functions:

p(x_t \mid \lambda) = \sum_{i=1}^{M} p_i b_i(x_t) \quad (1)

Here x_t is a D-dimensional random vector; in speaker recognition applications x_t is the feature vector. b_i(x_t), i = 1, 2, ..., M are the component densities and p_i, i = 1, 2, ..., M the mixture weights. Each component density is a D-variate Gaussian with mean vector u_i and covariance matrix Σ_i:

b_i(x_t) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x_t - u_i)^T \Sigma_i^{-1} (x_t - u_i) \right\} \quad (2)

where the mixture weights satisfy \sum_{i=1}^{M} p_i = 1.
The complete GMM is given by the mean vectors, covariance matrices and mixture weights of all component densities, collected together as

\lambda = \{(p_i, u_i, \Sigma_i),\ i = 1, 2, \ldots, M\} \quad (3)
Because the training and recognition data in speaker recognition are generally limited, the covariance matrix of each Gaussian component is usually taken to be diagonal in practice.
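A minimal sketch of evaluating the mixture density of equation (1) with diagonal covariances might look as follows. The log-domain computation (log-sum-exp) is our addition for numerical stability, not something the patent specifies; array shapes are assumptions.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | lambda) for one D-dim vector x under an M-component
    diagonal-covariance GMM (weights: (M,), means/variances: (M, D))."""
    D = x.shape[0]
    # per-component log Gaussian density b_i(x), computed in the log domain
    log_det = np.sum(np.log(variances), axis=1)           # log |Sigma_i|
    maha = np.sum((x - means) ** 2 / variances, axis=1)   # Mahalanobis term
    log_b = -0.5 * (D * np.log(2 * np.pi) + log_det + maha)
    # log-sum-exp over the weighted mixture
    log_terms = np.log(weights) + log_b
    m = np.max(log_terms)
    return m + np.log(np.sum(np.exp(log_terms - m)))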
Here a TDNN without feedback is used, as shown in Fig. 2. After passing through the linear delay block, the feature vector x(n) forms the input of the TDNN; the TDNN applies a nonlinear transformation to the input and then a linear weighting to obtain the output vector, which is compared with the feature vector, normally under the minimum mean square error (MMSE) criterion. Specifically, the ratio of hidden-layer to input-layer neurons of the TDNN is 3:2, and the nonlinear S-shaped activation (the "∫" symbol in Fig. 2) is the sigmoid

f(y) = \frac{1}{1 + e^{-y}}

where y is the input after weighted summation. During training, the inertia coefficient of the neural network is γ = 0.8.
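The following sketch shows one plausible forward pass for such a feedback-free TDNN, assuming the delay block stacks the current frame with two previous frames and omitting bias terms; the delay depth and the weight initialization are illustrative, not taken from the patent.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

class TDNN:
    def __init__(self, dim=13, n_delay=2, rng=np.random.default_rng(0)):
        n_in = dim * (n_delay + 1)       # delayed frames stacked together
        n_hid = (3 * n_in) // 2          # hidden:input neuron ratio of 3:2
        self.n_delay = n_delay
        self.W1 = rng.standard_normal((n_hid, n_in)) * 0.1
        self.W2 = rng.standard_normal((dim, n_hid)) * 0.1

    def forward(self, frames, t):
        """Output o(t) for frame t (requires t >= n_delay)."""
        x = frames[t - self.n_delay:t + 1].ravel()  # linear delay block
        h = sigmoid(self.W1 @ x)                    # nonlinear hidden layer
        return self.W2 @ h                          # linear weighting at output
```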
(2) Set the convergence condition and the maximum number of iterations. Specifically, convergence is declared when the Euclidean distance between two successive sets of GMM coefficients and TDNN weight coefficients is less than 0.0001; the maximum number of iterations is normally at most 100.
(3) Randomly determine the initial TDNN and GMM parameters. The initial TDNN coefficients are set to computer-generated pseudorandom numbers; the initial mixture weights of the GMM can be taken as 1/M, where M is the number of mixture components; the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M clusters with the LBG (Linde, Buzo, Gray) method and computing the mean and variance of each cluster.
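A sketch of this LBG initialization might proceed by repeated centroid splitting followed by nearest-neighbour refinement. The perturbation size, the refinement count and the variance floor below are our assumptions.

```python
import numpy as np

def lbg_init(residuals, M, eps=0.01, n_iter=10):
    """Split centroids until M clusters exist, then return per-cluster
    means and variances to seed the GMM (residuals: (N, D))."""
    centroids = residuals.mean(axis=0, keepdims=True)
    while len(centroids) < M:
        # split every centroid by a small +/- perturbation
        centroids = np.vstack([centroids * (1 + eps), centroids * (1 - eps)])
        for _ in range(n_iter):  # k-means refinement of the split codebook
            d = ((residuals[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            for i in range(len(centroids)):
                if np.any(labels == i):
                    centroids[i] = residuals[labels == i].mean(axis=0)
    centroids = centroids[:M]
    labels = ((residuals[:, None, :] - centroids[None, :, :]) ** 2
              ).sum(-1).argmin(axis=1)
    variances = np.array([residuals[labels == i].var(axis=0)
                          if np.any(labels == i) else residuals.var(axis=0)
                          for i in range(M)])
    return centroids, variances + 1e-6   # variance floor for safety
```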
(4) Feed the feature vectors x(n) into the TDNN and subtract the TDNN output vectors o(n) from the corresponding input vectors x(n) to obtain all residual vectors.
(5) Update the GMM parameters by the EM method:
Let the residual vector be r_t. First compute the classification posterior probability with equation (4):

p(i \mid r_t, \lambda) = \frac{p_i b_i(r_t)}{\sum_{k=1}^{M} p_k b_k(r_t)} \quad (4)

then obtain the updated mixture weights \bar{p}_i, mean vectors \bar{u}_i and covariances \bar{\Sigma}_i from equations (5), (6) and (7):

\bar{p}_i = \frac{1}{N} \sum_{t=1}^{N} p(i \mid r_t, \lambda) \quad (5)

\bar{u}_i = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)} \quad (6)

\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t^2}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)} - \bar{u}_i^2 \quad (7)
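One EM pass over the residual vectors, implementing equations (4)-(7) with diagonal covariances, could be sketched as follows; the log-domain posterior computation and the variance floor are our additions for numerical safety, and the array shapes are assumptions.

```python
import numpy as np

def em_step(residuals, weights, means, variances):
    """One EM update of (weights, means, variances) on residuals (N, D)."""
    N, D = residuals.shape
    # E-step: classification posteriors p(i | r_t, lambda), equation (4)
    diff = residuals[:, None, :] - means[None, :, :]
    log_b = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)[None, :]
                    + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_post = np.log(weights)[None, :] + log_b
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)         # (N, M)
    # M-step: equations (5)-(7)
    nk = post.sum(axis=0)                           # effective counts per component
    new_weights = nk / N
    new_means = (post.T @ residuals) / nk[:, None]
    new_vars = (post.T @ residuals ** 2) / nk[:, None] - new_means ** 2
    return new_weights, new_means, np.maximum(new_vars, 1e-6)
```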
(6) Using the updated weight, mean vector and variance of each Gaussian component of the GMM, evaluate the likelihood of the residuals and update the TDNN parameters by back-propagation with inertia:
Specifically, the TDNN parameters are obtained by maximizing

L(X) = \arg\max_{\omega_{ij}} \prod_{t=1}^{N} p((x_t - o_t) \mid \lambda) \quad (8)

where p(x | λ) is given by equation (1), o_t is the output vector of the TDNN, and x_t is the feature vector input to the TDNN. Because neural network iteration conventionally minimizes, and a sum is more convenient than a product, take the negative logarithm:

G(X) = \arg\min_{\omega_{ij}} \left( -\sum_{t=1}^{N} \ln p((x_t - o_t) \mid \lambda) \right) \quad (9)

Here the parameters enter through the weights ω_{ij}^k(m), the weight connecting input x_i and output y_j in the m-th iteration, where k is the layer index of the neural network. G(X) is minimized by back-propagation with inertia, because this method accelerates the convergence process and copes better with the local-minimum problem; its iteration formula is

\Delta\omega_{ij}^{k}(m+1) = \gamma\, \Delta\omega_{ij}^{k}(m) - (1-\gamma)\, \alpha \left. \frac{\partial F(x)}{\partial \omega_{ij}^{k}} \right|_{\omega_{ij}^{k} = \omega_{ij}^{k}(m)} \quad (10)

where α is the iteration step size, F(x) = -ln p((x_t - o_t) | λ), and γ is the inertia coefficient, here taken as γ = 0.8.
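Equation (10) is the classical momentum ("inertia") update; a one-line sketch, where `grad` stands for ∂F/∂ω evaluated at the current weights and the step size is merely illustrative:

```python
import numpy as np

def momentum_update(w, delta_prev, grad, gamma=0.8, alpha=0.01):
    """Delta_w(m+1) = gamma*Delta_w(m) - (1-gamma)*alpha*grad; returns the
    updated weights and the new Delta_w to carry into the next iteration."""
    delta = gamma * delta_prev - (1.0 - gamma) * alpha * grad
    return w + delta, delta
```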
In equation (10), \partial F(x) / \partial \omega_{ij}^{k} is computed as

\frac{\partial F(x)}{\partial \omega_{ij}^{k}} = \frac{\partial F(x)}{\partial y_i^{k}} \frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}} \quad (11)

The two factors of the product in equation (11) are now derived separately. In the neural network,

y_i^{k} = \sum_j \omega_{ij}^{k} o_j^{k-1} \quad (12)

o_i^{k} = f(y_i^{k}) \quad (13)

where o_i^k is the output and y_i^k the input of the i-th neuron of layer k for sample x, and f is the activation function. Hence

\frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}} = o_j^{k-1} \quad (14)
The solution of \partial F(x) / \partial y_i^{k} splits into two cases, the output layer and the hidden layers:
(a) Output layer:

\frac{\partial F(x)}{\partial y_i^{k}} = -\frac{1}{p((x-o) \mid \lambda)} \frac{\partial p((x-o) \mid \lambda)}{\partial o_i^{k}} \frac{\partial o_i^{k}}{\partial y_i^{k}}
= -\frac{f'(y_i^{k})}{p((x-o) \mid \lambda)}\; \frac{\partial}{\partial o_i^{k}} \left( \sum_{n=1}^{M} p_n c_n e^{-\frac{1}{2}(x-o-u_n)^{T} \Sigma_n^{-1} (x-o-u_n)} \right)
= -\frac{f'(y_i^{k})}{p((x-o) \mid \lambda)} \sum_{n=1}^{M} p_n c_n\, a_n(x-o-u_n)\, \frac{x_i - o_i - u_{n,i}}{\sigma_{n,i}^{2}} \quad (15)

where

a_n(x-o-u_n) = e^{-\frac{1}{2}(x-o-u_n)^{T} \Sigma_n^{-1} (x-o-u_n)}

c_n = \frac{1}{(2\pi)^{D/2} |\Sigma_n|^{1/2}}

(b) Hidden layers:

\frac{\partial F(x)}{\partial y_i^{k}} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial y_j^{k+1}}{\partial y_i^{k}}
= \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial \left( \sum_n \omega_{jn}^{k+1} o_n^{k} \right)}{\partial y_i^{k}}
= \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial o_i^{k}}{\partial y_i^{k}}\, \omega_{ji}^{k+1}
= f'(y_i^{k}) \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}}\, \omega_{ji}^{k+1} \quad (16)

Because back-propagation proceeds backwards through the network, when \partial F(x)/\partial y_i^{k} is computed the quantities \partial F(x)/\partial y_j^{k+1} are already known; substituting them into equation (16) yields \partial F(x)/\partial y_i^{k}.
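For concreteness, a sketch of the output-layer gradient of equation (15) for one frame, with residual e = x - o and diagonal covariances. `fprime_y` is f'(y) at the output layer (a vector of ones if that layer is linear); shapes and names are our assumptions.

```python
import numpy as np

def output_layer_grad(e, fprime_y, weights, means, variances):
    """dF/dy at the output layer, equation (15).
    e = x - o: (D,); fprime_y: (D,); weights: (M,); means/variances: (M, D)."""
    M, D = means.shape
    c = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1)))
    diff = e[None, :] - means                                  # x - o - u_n, (M, D)
    a = np.exp(-0.5 * np.sum(diff ** 2 / variances, axis=1))   # a_n, (M,)
    p = np.sum(weights * c * a)                                # p((x - o) | lambda)
    # sum_n p_n c_n a_n (x_i - o_i - u_{n,i}) / sigma_{n,i}^2, per output i
    s = np.sum((weights * c * a)[:, None] * diff / variances, axis=0)
    return -fprime_y * s / p
```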
(7) Check whether the convergence condition set in step (2) is satisfied or the maximum number of iterations has been reached; if so, stop training; otherwise go back to step (4).
3. Speaker recognition
During recognition, the feature vectors enter the TDNN after the delay. Having learned the structure and temporal information of the feature space, the TDNN applies the corresponding transformation to the feature vector sequence X to be identified; because of the training, this transformation strengthens the likelihood of the target model and weakens the likelihood of non-target models. The residual sequence R, obtained by subtracting the TDNN output sequence O from X, is passed to the GMM. For a sequence of T residual vectors R = R_1, R_2, ..., R_T, the GMM probability can be written as

P(R \mid \lambda) = \prod_{t=1}^{T} p(R_t \mid \lambda) \quad (17)

or, in the log domain,

L(R \mid \lambda) = \log P(R \mid \lambda) = \sum_{t=1}^{T} \log p(R_t \mid \lambda) \quad (18)

By Bayes' theorem, among the models of the N unknown speakers, the speaker whose model gives the maximum likelihood probability is identified as the target speaker:

i^* = \arg\max_{1 \le i \le N} L(R \mid \lambda_i) \quad (19)
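Putting equations (17)-(19) together, the recognition stage could be sketched as below, reusing the hypothetical `TDNN` and `gmm_log_likelihood` sketches above; the per-speaker model tuple is our assumption.

```python
import numpy as np

def identify(frames, speaker_models):
    """frames: (T, 13) test features;
    speaker_models: list of (tdnn, weights, means, variances) per speaker."""
    scores = []
    for tdnn, w, mu, var in speaker_models:
        logL = 0.0
        for t in range(tdnn.n_delay, len(frames)):
            r = frames[t] - tdnn.forward(frames, t)    # residual R_t
            logL += gmm_log_likelihood(r, w, mu, var)  # sum of log p(R_t|lambda)
        scores.append(logL)
    return int(np.argmax(scores))                      # i* = argmax_i L(R|lambda_i)
```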
The experiments use the 1conv4w-1conv4w condition of the 2006 NIST evaluation, from which 107 target speakers were selected, 63 male and 44 female. About 2 minutes of speech per speaker were used for training, and the remaining speech was used for testing, giving about 23000 test trials.
To assess the improvement of the method under noise, the noise data chosen are the in-car driving noise (2000cc class, ordinary road; a stationary noise) and the exhibition-booth noise (a nonstationary noise) from the NEC Association standard noise database. These noises are added to the 1conv4w-1conv4w speech at given signal-to-noise ratios (SNR) to generate noisy speech.
The correct recognition rate is adopted as the criterion for judging the speaker recognition effect: correct_ratio = N_v / N_t, where correct_ratio is the correct recognition rate, N_v is the number of correctly identified trials and N_t is the total number of trials.
The method of the invention (denoted TDNN-GMM) is compared with speaker training and recognition using only the GMM (denoted baseline GMM). The experimental results are shown in Figs. 3-5. Fig. 3 compares the recognition results on noise-free 1conv4w-1conv4w as the number M of Gaussian mixture components of the GMM is varied; it shows that embedding the TDNN does improve the recognition of the GMM, and that the smaller the number of mixture components M, the more marked the improvement, because the neural network learns better when each class contains fewer sub-clusters.
Figs. 4 and 5 compare the results under different noises and different signal-to-noise ratio (SNR) conditions with M = 80. As can be seen from Figs. 4 and 5, under all SNRs the method proposed by the invention clearly improves the speaker recognition effect relative to the baseline GMM.

Claims (4)

1. A speaker recognition method based on a Gaussian mixture model with an embedded time delay neural network, characterized by comprising the following steps:
(1) preprocessing and feature extraction:
first, silence is removed using a method based on energy and zero-crossing rate and noise is removed by spectral subtraction; the speech signal is pre-emphasized, divided into frames and analyzed by linear prediction (LPC), and cepstral coefficients obtained from the resulting LPC coefficients are used as the feature vectors for speaker recognition;
(2) training:
during training, the extracted feature vectors are delayed and fed into the time delay neural network (TDNN); the TDNN learns the structure of the feature vectors and extracts the temporal information of the feature vector sequence; the learning result is then passed to the Gaussian mixture model (GMM) in the form of residual feature vectors; the GMM is trained by the expectation-maximization method, and the weight coefficients of the TDNN are updated by back-propagation with inertia; the training procedure is as follows:
(2-1) determine the GMM and TDNN structure:
the probability density function of an M-component GMM is a weighted sum of M Gaussian probability density functions:

p(x_t \mid \lambda) = \sum_{i=1}^{M} p_i b_i(x_t)

where x_t is a D-dimensional feature vector (here D = 13), and b_i(x_t) is the component density function, a Gaussian with mean vector u_i and covariance matrix Σ_i:

b_i(x_t) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x_t - u_i)^T \Sigma_i^{-1} (x_t - u_i) \right\}

the mixture weights p_i satisfy \sum_{i=1}^{M} p_i = 1, and the complete GMM parameter set is

\lambda = \{(p_i, u_i, \Sigma_i),\ i = 1, 2, \ldots, M\}

a TDNN without feedback is used: after passing through the linear delay block, the feature vector x(n) forms the input of the TDNN; the TDNN applies a nonlinear transformation to the input and then a linear weighting to obtain the output vector, which is compared with the feature vector, normally under the minimum mean square error (MMSE) criterion; the ratio of hidden-layer to input-layer neurons of the TDNN is 3:2, the nonlinear S-shaped activation is the sigmoid f(y) = 1/(1 + e^{-y}), where y is the input after weighted summation; during training, the inertia coefficient of the neural network is γ = 0.8;
(2-2) set the convergence condition and the maximum number of iterations; specifically, convergence is declared when the Euclidean distance between two successive sets of GMM coefficients and TDNN weight coefficients is less than 0.0001, and the maximum number of iterations is normally at most 100;
(2-3) randomly determine the initial TDNN and GMM parameters: the initial TDNN coefficients are set to computer-generated pseudorandom numbers; the initial mixture weights of the GMM can be taken as 1/M, where M is the number of mixture components; the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M clusters with the LBG (Linde, Buzo, Gray) method and computing the mean and variance of each cluster;
(2-4) feed the feature vectors x(n) into the TDNN and subtract the TDNN output vectors o(n) from the corresponding input vectors x(n) to obtain all residual vectors;
(2-5) update the GMM parameters by the expectation-maximization method:
let the residual vector be r_t; first compute the classification posterior probability

p(i \mid r_t, \lambda) = \frac{p_i b_i(r_t)}{\sum_{k=1}^{M} p_k b_k(r_t)}

then update the mixture weights \bar{p}_i, mean vectors \bar{u}_i and covariances \bar{\Sigma}_i:

\bar{p}_i = \frac{1}{N} \sum_{t=1}^{N} p(i \mid r_t, \lambda)

\bar{u}_i = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)}

\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t^2}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)} - \bar{u}_i^2

(2-6) using the updated weight, mean vector and variance of each Gaussian component of the GMM, evaluate the likelihood of the residuals and update the TDNN parameters by back-propagation with inertia:
the TDNN parameters are obtained by maximizing

L(X) = \arg\max_{\omega_{ij}} \prod_{t=1}^{N} p((x_t - o_t) \mid \lambda)

where o_t is the neural network output and x_t is the input feature vector; taking the negative logarithm gives

G(X) = \arg\min_{\omega_{ij}} \left( -\sum_{t=1}^{N} \ln p((x_t - o_t) \mid \lambda) \right)

G(X) is minimized by back-propagation with inertia, whose iteration formula is

\Delta\omega_{ij}^{k}(m+1) = \gamma\, \Delta\omega_{ij}^{k}(m) - (1-\gamma)\, \alpha \left. \frac{\partial F(x)}{\partial \omega_{ij}^{k}} \right|_{\omega_{ij}^{k} = \omega_{ij}^{k}(m)}

where ω_{ij}^k(m) is the weight connecting input x_i and output y_j in the m-th iteration, k is the layer index of the neural network, α is the iteration step size, F(x) = -ln p((x_t - o_t) | λ), and γ is the inertia coefficient;
(2-7) check whether the convergence condition set in step (2-2) is satisfied or the maximum number of iterations has been reached; if so, stop training; otherwise go back to step (2-4);
(3) recognition:
during recognition, the feature vector sequence X is delayed and fed into the TDNN; the residual sequence R, obtained by subtracting the TDNN output sequence O from X, is passed to the GMM; for a sequence of T residual vectors R = R_1, R_2, ..., R_T, the GMM probability can be written as

P(R \mid \lambda) = \prod_{t=1}^{T} p(R_t \mid \lambda)

or, in the log domain,

L(R \mid \lambda) = \log P(R \mid \lambda) = \sum_{t=1}^{T} \log p(R_t \mid \lambda)

by Bayes' theorem, among the models of the N unknown speakers, the speaker whose model gives the maximum likelihood probability is identified as the target speaker:

i^* = \arg\max_{1 \le i \le N} L(R \mid \lambda_i).
2. The speaker recognition method based on a Gaussian mixture model with an embedded time delay neural network according to claim 1, characterized in that \partial F(x) / \partial \omega_{ij}^{k} is computed as

\frac{\partial F(x)}{\partial \omega_{ij}^{k}} = \frac{\partial F(x)}{\partial y_i^{k}} \frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}}

where, in the TDNN network, o_i^k is the output and y_i^k the input of the i-th neuron of layer k for sample x and f is the activation function, so that

\frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}} = o_j^{k-1}.
3. The speaker recognition method based on a Gaussian mixture model with an embedded time delay neural network according to claim 2, characterized in that the computation of \partial F(x) / \partial y_i^{k} splits into two cases, the output layer and the hidden layers of the TDNN;
for the output layer:

\frac{\partial F(x)}{\partial y_i^{k}} = -\frac{1}{p((x-o) \mid \lambda)} \frac{\partial p((x-o) \mid \lambda)}{\partial o_i^{k}} \frac{\partial o_i^{k}}{\partial y_i^{k}}
= -\frac{f'(y_i^{k})}{p((x-o) \mid \lambda)}\; \frac{\partial}{\partial o_i^{k}} \left( \sum_{n=1}^{M} p_n c_n e^{-\frac{1}{2}(x-o-u_n)^{T} \Sigma_n^{-1} (x-o-u_n)} \right)
= -\frac{f'(y_i^{k})}{p((x-o) \mid \lambda)} \sum_{n=1}^{M} p_n c_n\, a_n(x-o-u_n)\, \frac{x_i - o_i - u_{n,i}}{\sigma_{n,i}^{2}}

where a_n(x-o-u_n) = e^{-\frac{1}{2}(x-o-u_n)^{T} \Sigma_n^{-1} (x-o-u_n)} and c_n = \frac{1}{(2\pi)^{D/2} |\Sigma_n|^{1/2}};
for the hidden layers:

\frac{\partial F(x)}{\partial y_i^{k}} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial y_j^{k+1}}{\partial y_i^{k}} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial \left( \sum_n \omega_{jn}^{k+1} o_n^{k} \right)}{\partial y_i^{k}} = f'(y_i^{k}) \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}}\, \omega_{ji}^{k+1}.
4. The speaker recognition method based on a Gaussian mixture model with an embedded time delay neural network according to claim 1, characterized in that the pre-emphasis uses the filter f(Z) = 1 - 0.97 Z^{-1}, framing uses a Hamming window of length 20 ms with a 10 ms shift, the order of the linear prediction analysis is 20, and the feature vector consists of 13 cepstral coefficients.
CN2009100354240A 2009-09-28 2009-09-28 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network Pending CN102034472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100354240A CN102034472A (en) 2009-09-28 2009-09-28 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100354240A CN102034472A (en) 2009-09-28 2009-09-28 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network

Publications (1)

Publication Number Publication Date
CN102034472A true CN102034472A (en) 2011-04-27

Family

ID=43887277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100354240A Pending CN102034472A (en) 2009-09-28 2009-09-28 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network

Country Status (1)

Country Link
CN (1) CN102034472A (en)


Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436809A (en) * 2011-10-21 2012-05-02 东南大学 Network speech recognition method in English oral language machine examination system
US9142210B2 (en) 2011-12-16 2015-09-22 Huawei Technologies Co., Ltd. Method and device for speaker recognition
CN103562993A (en) * 2011-12-16 2014-02-05 华为技术有限公司 Speaker recognition method and device
WO2013086736A1 (en) * 2011-12-16 2013-06-20 华为技术有限公司 Speaker recognition method and device
CN103562993B (en) * 2011-12-16 2015-05-27 华为技术有限公司 Speaker recognition method and device
CN102708871A (en) * 2012-05-08 2012-10-03 哈尔滨工程大学 Line spectrum-to-parameter dimensional reduction quantizing method based on conditional Gaussian mixture model
CN105765562B (en) * 2013-12-03 2022-01-11 罗伯特·博世有限公司 Method and device for obtaining a data-based function model
CN105765562A (en) * 2013-12-03 2016-07-13 罗伯特·博世有限公司 Method and device for determining a data-based functional model
CN103680496A (en) * 2013-12-19 2014-03-26 百度在线网络技术(北京)有限公司 Deep-neural-network-based acoustic model training method, hosts and system
CN103680496B (en) * 2013-12-19 2016-08-10 百度在线网络技术(北京)有限公司 Acoustic training model method based on deep-neural-network, main frame and system
CN104183239B (en) * 2014-07-25 2017-04-19 南京邮电大学 Method for identifying speaker unrelated to text based on weighted Bayes mixture model
CN104183239A (en) * 2014-07-25 2014-12-03 南京邮电大学 Method for identifying speaker unrelated to text based on weighted Bayes mixture model
CN104112445A (en) * 2014-07-30 2014-10-22 宇龙计算机通信科技(深圳)有限公司 Terminal and voice identification method
CN104882141A (en) * 2015-03-03 2015-09-02 盐城工学院 Serial port voice control projection system based on time delay neural network and hidden Markov model
CN106779050A (en) * 2016-11-24 2017-05-31 厦门中控生物识别信息技术有限公司 The optimization method and device of a kind of convolutional neural networks
CN109326278A (en) * 2017-07-31 2019-02-12 科大讯飞股份有限公司 A kind of acoustic model construction method and device, electronic equipment
CN109326278B (en) * 2017-07-31 2022-06-07 科大讯飞股份有限公司 Acoustic model construction method and device and electronic equipment
CN108417224B (en) * 2018-01-19 2020-09-01 苏州思必驰信息科技有限公司 Training and recognition method and system of bidirectional neural network model
CN108417224A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 The training and recognition methods of two way blocks model and system
CN109119089A (en) * 2018-06-05 2019-01-01 安克创新科技股份有限公司 The method and apparatus of penetrating processing is carried out to music
CN108877823A (en) * 2018-07-27 2018-11-23 三星电子(中国)研发中心 Sound enhancement method and device
CN108877823B (en) * 2018-07-27 2020-12-18 三星电子(中国)研发中心 Speech enhancement method and device
CN109166571A (en) * 2018-08-06 2019-01-08 广东美的厨房电器制造有限公司 Wake-up word training method, device and the household appliance of household appliance
CN109214444A (en) * 2018-08-24 2019-01-15 小沃科技有限公司 Game Anti-addiction decision-making system and method based on twin neural network and GMM
CN109214444B (en) * 2018-08-24 2022-01-07 小沃科技有限公司 Game anti-addiction determination system and method based on twin neural network and GMM
CN109271482A (en) * 2018-09-05 2019-01-25 东南大学 A kind of implementation method of the automatic Evaluation Platform of postgraduates'english oral teaching voice
CN109036386A (en) * 2018-09-14 2018-12-18 北京网众共创科技有限公司 A kind of method of speech processing and device
WO2020224114A1 (en) * 2019-05-09 2020-11-12 平安科技(深圳)有限公司 Residual delay network-based speaker confirmation method and apparatus, device and medium
CN110232932A (en) * 2019-05-09 2019-09-13 平安科技(深圳)有限公司 Method for identifying speaker, device, equipment and medium based on residual error time-delay network
CN110232932B (en) * 2019-05-09 2023-11-03 平安科技(深圳)有限公司 Speaker confirmation method, device, equipment and medium based on residual delay network

Similar Documents

Publication Publication Date Title
CN102034472A (en) Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
CN102693724A (en) Noise classification method of Gaussian Mixture Model based on neural network
CN110706692B (en) Training method and system of child voice recognition model
US11776548B2 (en) Convolutional neural network with phonetic attention for speaker verification
Hermansky et al. Tandem connectionist feature extraction for conventional HMM systems
CN101814159B (en) Speaker verification method based on combination of auto-associative neural network and Gaussian mixture background model
CN102129860B (en) Text-related speaker recognition method based on infinite-state hidden Markov model
US20080208581A1 (en) Model Adaptation System and Method for Speaker Recognition
CN102737633A (en) Method and device for recognizing speaker based on tensor subspace analysis
CN103824557A (en) Audio detecting and classifying method with customization function
CN109346084A (en) Speaker recognition method based on deep stacked autoencoder network
US11217265B2 (en) Condition-invariant feature extraction network
US10283112B2 (en) System and method for neural network based feature extraction for acoustic model development
Mallidi et al. Uncertainty estimation of DNN classifiers
Marchi et al. Generalised discriminative transform via curriculum learning for speaker recognition
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
Chen et al. Towards understanding and mitigating audio adversarial examples for speaker recognition
CN113284513B (en) Method and device for detecting false voice based on phoneme duration characteristics
Todkar et al. Speaker recognition techniques: A review
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
Bai et al. Speaker verification by partial AUC optimization with mahalanobis distance metric learning
Wöllmer et al. Multi-stream LSTM-HMM decoding and histogram equalization for noise robust keyword spotting
Reshma et al. A survey on speech emotion recognition
CN113239809A (en) Underwater sound target identification method based on multi-scale sparse SRU classification model
CN104183239B (en) Method for identifying speaker unrelated to text based on weighted Bayes mixture model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20110427