CN102034472A - Speaker recognition method based on Gaussian mixture model embedded with time delay neural network - Google Patents

Speaker recognition method based on Gaussian mixture model embedded with time delay neural network

Publication number
CN102034472A
Authority
CN
China
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN2009100354240A
Other languages
Chinese (zh)
Inventor
戴红霞
王吉林
余华
魏昕
赵力
Current Assignee: Individual
Original Assignee: Individual
Application filed by Individual
Priority application: CN2009100354240A
Publication: CN102034472A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a speaker recognition method based on a Gaussian mixture model (GMM) embedded with a time delay neural network (TDNN). The method exploits the complementary advantages of the TDNN and the GMM: the TDNN is embedded into the GMM and, by fully using the time sequence of the input feature vectors through the delay network, produces a residual between its input and output vectors; this residual drives the training of the GMM by the expectation-maximization method. In turn, a likelihood probability is computed from the updated GMM parameters and the residual, and the TDNN weights are corrected by back-propagation with momentum (inertia), so that the GMM and TDNN parameters are updated alternately. Experiments show that the recognition rate of the method improves over a baseline GMM at various signal-to-noise ratios.

Description

Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
Technical Field
The invention relates to a speaker recognition method, in particular to a speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network.
Background
In applications such as access control, credit-card transactions, and court evidence, automatic speaker recognition, particularly text-independent speaker recognition, plays an increasingly important role. Its aim is to correctly decide which of several reference speakers in a voice library the speech to be recognized belongs to.
Among speaker recognition methods, those based on the Gaussian mixture model (GMM) have attracted increasing attention and have become the mainstream because of their high recognition rate, simple training, and low demand for training data. Since the GMM represents data distributions well, it can approximate any distribution given enough mixture components and enough training data. However, several problems arise in practice. First, the GMM does not use the temporal information of the speaker's voice: the training and recognition results are independent of the input order of the feature vectors. Second, during GMM training the feature vectors are always assumed to be mutually independent, which is clearly unreasonable. In addition, since there is no good guiding principle for choosing the number of mixture components, enough Gaussian components are required to obtain a good result.
Neural networks also hold an important position in speaker recognition; the multilayer perceptron, the radial basis function network, the auto-associative neural network, and others have been applied successfully, and the time delay neural network (TDNN) in particular is widely used in signal processing, speech recognition, and speaker recognition. At present, however, the GMM and the TDNN are used for speaker recognition only separately, and no method combines their advantages to further improve the speaker recognition effect.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network. The technical scheme of the invention is as follows:
a speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network comprises the following steps:
(1) preprocessing and feature extraction;
First, silence detection is performed using a method based on energy and zero-crossing rate, noise is removed by spectral subtraction, and the speech signal is pre-emphasized, framed, and subjected to linear prediction (LPC) analysis; cepstral coefficients are then computed from the obtained LPC coefficients as the feature vectors for speaker recognition.
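As a concrete illustration, the preprocessing and feature-extraction chain above can be sketched in NumPy as follows. The sampling rate, the frame sizes in samples, and the toy input signal are assumptions made for the example; silence detection and spectral subtraction are omitted for brevity.

```python
import numpy as np

def preemphasis(x, a=0.97):
    """Pre-emphasis filter F(z) = 1 - 0.97 z^-1 from the patent."""
    return np.append(x[0], x[1:] - a * x[:-1])

def hamming_frames(x, flen, hop):
    """Hamming-windowed frames (patent: 20 ms window, 10 ms shift)."""
    n = 1 + (len(x) - flen) // hop
    w = np.hamming(flen)
    return np.stack([x[i * hop:i * hop + flen] * w for i in range(n)])

def lpc(frame, order=20):
    """Levinson-Durbin recursion; returns prediction coefficients a, a[0] = 1."""
    r = np.correlate(frame, frame, 'full')[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / e   # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        e *= 1.0 - k * k                           # residual energy update
    return a

def lpc_cepstrum(a, ncep=13):
    """LPC -> cepstrum recursion: c_n = -a_n - sum_{k<n} (k/n) c_k a_{n-k}."""
    c = np.zeros(ncep + 1)
    for n in range(1, ncep + 1):
        acc = a[n] if n < len(a) else 0.0
        for k in range(1, n):
            if n - k < len(a):
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

# assumed 16 kHz audio: 20 ms frame = 320 samples, 10 ms hop = 160 samples
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t) + 0.1 * np.random.randn(fs)
feats = np.stack([lpc_cepstrum(lpc(f))
                  for f in hamming_frames(preemphasis(speech), 320, 160)])
print(feats.shape)  # (99, 13): one 13-dimensional cepstral vector per frame
```

At 8 kHz the same 20 ms/10 ms windows would be 160 and 80 samples; only the analysis order (20) and the cepstral dimension (13) are fixed by the text.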
(2) Training;
During training, the extracted feature vectors are delayed and then used as the input of the TDNN; the TDNN learns the structure of the feature vectors and extracts the temporal information of the feature-vector sequence. The learning result is then provided to the GMM in the form of residual feature vectors, the GMM model is trained under the expectation-maximization (EM) criterion, and the TDNN weight coefficients are updated by back-propagation with momentum. The specific training process is as follows:
(2-1) determining the GMM model and the TDNN structure:
the probability density function of an M-order GMM is obtained by weighted summation of M gaussian probability density functions and can be expressed as follows:
$$p(x_t \mid \lambda) = \sum_{i=1}^{M} p_i\, b_i(x_t)$$
In the above formula, $x_t$ is a D-dimensional feature vector (here D = 13), and $b_i(x_t)$ is the member density function: a Gaussian with mean vector $u_i$ and covariance matrix $\Sigma_i$,
$$b_i(x_t) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x_t-u_i)^T \Sigma_i^{-1} (x_t-u_i)\right\}$$
and the mixture weights $p_i$ satisfy the condition $\sum_{i=1}^{M} p_i = 1$.
the complete GMM model parameters are as follows:
λ = {(p_i, u_i, Σ_i), i = 1, 2, ..., M}
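For concreteness, the density above can be evaluated as follows, assuming the diagonal covariance matrices that the patent adopts in practice; the log-sum-exp stabilization is an addition for numerical safety, not part of the original text.

```python
import numpy as np

def log_gmm_density(x, weights, means, variances):
    """log p(x|λ) for a diagonal-covariance GMM.

    x: (D,) feature vector; weights: (M,); means, variances: (M, D).
    """
    D = x.shape[0]
    # log b_i(x) = -0.5 * (D log 2π + Σ_d log σ²_id + Σ_d (x_d - u_id)² / σ²_id)
    log_b = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)
                    + np.sum((x - means) ** 2 / variances, axis=1))
    # log Σ_i p_i b_i(x), computed stably with the log-sum-exp trick
    a = np.log(weights) + log_b
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# toy check: a single unit Gaussian at the origin in D = 2
val = log_gmm_density(np.zeros(2), np.array([1.0]),
                      np.zeros((1, 2)), np.ones((1, 2)))
print(round(val, 4))  # -log(2π) ≈ -1.8379
```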
Here, a TDNN without feedback is used. The feature vector x(n) is delayed by the linear delay block and then used as the input of the TDNN; the TDNN applies a nonlinear transformation to the input, then a linear weighting to obtain the output vector, which is compared with the feature vector, commonly under the minimum mean-square error (MMSE) criterion. Specifically, the ratio of the number of hidden-layer neurons to input-layer neurons in the TDNN is 3:2, and the nonlinear activation (the S-shaped function marked in FIG. 2) is
$$f(y) = \frac{1}{1+e^{-y}}$$

where y is the weighted-sum input to the neuron. During training, the inertia (momentum) coefficient γ of the neural network is 0.8.
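The TDNN forward pass described here can be sketched as below. Only the 3:2 hidden-to-input neuron ratio and the sigmoid activation come from the text; the number of delay taps, the weight scale, and the wrap-around handling of the sequence edges are assumptions made for illustration.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

class TDNN:
    """Minimal feed-forward TDNN sketch: delayed frames concatenated at the
    input, one sigmoid hidden layer at a 3:2 ratio to the input (as in the
    patent), and a linear output layer. The exact topology is assumed."""

    def __init__(self, dim=13, delays=2, seed=0):
        rng = np.random.default_rng(seed)
        n_in = dim * (delays + 1)      # current frame + delayed copies
        n_hid = (3 * n_in) // 2        # hidden:input = 3:2 (rounded)
        self.delays = delays
        self.W1 = 0.1 * rng.standard_normal((n_hid, n_in))
        self.W2 = 0.1 * rng.standard_normal((dim, n_hid))

    def forward(self, X):
        """X: (T, dim) feature sequence -> (T, dim) outputs o(n)."""
        # linear delay block: stack x(n), x(n-1), ..., x(n-delays)
        # (np.roll wraps at the edges; a real system would zero-pad)
        Z = np.concatenate([np.roll(X, d, axis=0)
                            for d in range(self.delays + 1)], axis=1)
        H = sigmoid(Z @ self.W1.T)     # nonlinear transformation
        return H @ self.W2.T           # linear weighting -> output vector

X = np.random.randn(50, 13)
net = TDNN()
residual = X - net.forward(X)          # r(n) = x(n) - o(n), fed to the GMM
print(residual.shape)  # (50, 13)
```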
(2-2) Set the convergence condition and the maximum number of iterations. Specifically, the convergence condition is that the Euclidean distance between the GMM parameters and the TDNN weight coefficients of two successive iterations is less than 0.0001, and the maximum number of iterations usually does not exceed 100.
(2-3) Randomly determine the TDNN and GMM model parameters for the initial iteration. The initial TDNN coefficients are set to computer-generated pseudo-random numbers; the initial GMM mixture weights can be set to 1/M, where M is the number of GMM mixture components; and the initial GMM means and variances are obtained by clustering the TDNN residual vectors into M clusters with the LBG (Linde-Buzo-Gray) method and computing the mean and variance of each cluster.
(2-4) Input the feature vectors x(n) into the TDNN network and subtract the TDNN output feature vectors o(n) from the input feature vectors x(n) to obtain all residual vectors r(n) = x(n) - o(n);
(2-5) correcting parameters of the GMM by adopting an EM method;
Let the residual vector be $r_t$. First, the class posterior probability is calculated:
$$p(i \mid r_t, \lambda) = \frac{p_i\, b_i(r_t)}{\sum_{k=1}^{M} p_k\, b_k(r_t)}$$
Then the mixture weights $\bar{p}_i$, mean vectors $\bar{u}_i$, and covariance matrices $\bar{\Sigma}_i$ are updated:
$$\bar{p}_i = \frac{1}{N}\sum_{t=1}^{N} p(i \mid r_t, \lambda)$$

$$\bar{u}_i = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)}$$

$$\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t^2}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)} - \bar{u}_i^2$$
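One full EM correction step, combining the posterior of step (2-5) with the three update formulas, can be sketched as follows. Diagonal covariances are assumed, as elsewhere in the patent; the variance floor is an added safeguard, not part of the original text.

```python
import numpy as np

def em_step(R, weights, means, variances):
    """One EM update of a diagonal-covariance GMM on residual vectors R.

    Implements the posterior p(i|r_t, λ) and the updates of the mixture
    weights, means, and variances. R: (N, D); weights: (M,); means,
    variances: (M, D).
    """
    N, D = R.shape
    # log b_i(r_t) for every component i and frame t
    log_b = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)[None, :]
                    + np.sum((R[:, None, :] - means[None]) ** 2
                             / variances[None], axis=2))
    a = np.log(weights)[None, :] + log_b
    post = np.exp(a - a.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)          # p(i|r_t, λ), (N, M)

    Ni = post.sum(axis=0)                            # Σ_t p(i|r_t, λ)
    new_w = Ni / N
    new_mu = (post.T @ R) / Ni[:, None]
    new_var = (post.T @ (R ** 2)) / Ni[:, None] - new_mu ** 2
    return new_w, new_mu, np.maximum(new_var, 1e-6)  # floor the variances

rng = np.random.default_rng(1)
R = rng.standard_normal((200, 13))
w, mu, var = em_step(R, np.full(4, 0.25),
                     rng.standard_normal((4, 13)), np.ones((4, 13)))
print(w.sum())  # the updated weights still sum to 1
```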
(2-6) Substitute the residuals into the corrected GMM model (the mixture weight, mean vector, and variance of each Gaussian distribution) to obtain the likelihood probability, and correct the TDNN parameters by back-propagation with momentum;
the TDNN network parameters are obtained by maximizing the function in the following equation:
$$L(X) = \arg\max_{\omega_{ij}} \prod_{t=1}^{N} p\big((x_t - o_t) \mid \lambda\big)$$
where $o_t$ is the output of the neural network and $x_t$ is the input feature vector.
Taking the logarithm of this expression and negating it gives:
$$G(X) = \arg\min_{\omega_{ij}} \left(-\sum_{t=1}^{N} \ln p\big((x_t - o_t) \mid \lambda\big)\right)$$
G(X) is solved by back-propagation with momentum, with the iterative formula:
$$\Delta\omega_{ij}^{k}(m+1) = \gamma\,\Delta\omega_{ij}^{k}(m) - (1-\gamma)\,\alpha\,\left.\frac{\partial F(x)}{\partial \omega_{ij}^{k}}\right|_{\omega_{ij}^{k} = \omega_{ij}^{k}(m)}$$
where $\omega_{ij}^{k}(m)$ is the weight connecting input $x_i$ and output $y_j$ in layer k of the neural network at the m-th iteration, α is the iteration step size, $F(x) = -\ln p((x_t - o_t) \mid \lambda)$, and γ is the inertia (momentum) coefficient.
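The iteration above is ordinary gradient descent with a momentum ("inertia") term. A minimal sketch on an assumed quadratic toy objective, using the patent's γ = 0.8 (the step size α and the objective are assumptions for the example):

```python
import numpy as np

def momentum_step(w, dw_prev, grad, gamma=0.8, alpha=0.01):
    """One update of the rule
    Δω(m+1) = γ Δω(m) - (1-γ) α ∂F/∂ω, evaluated at ω(m).
    gamma = 0.8 is the inertia coefficient given in the patent."""
    dw = gamma * dw_prev - (1.0 - gamma) * alpha * grad
    return w + dw, dw

# minimise F(w) = ||w||² / 2, whose gradient is simply w
w = np.array([1.0, -2.0])
dw = np.zeros_like(w)
for _ in range(500):
    w, dw = momentum_step(w, dw, grad=w, alpha=0.5)
print(np.linalg.norm(w) < 1e-3)  # True: converges toward the minimiser w = 0
```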
And (2-7) judging whether the convergence condition set in the step (2-2) is met or whether the maximum iteration number is reached, if so, stopping training, otherwise, jumping to the step (2-4).
(3) Speaker recognition
During identification, the feature vector sequence X is delayed and input to the TDNN. The residual sequence R, obtained by subtracting the TDNN output sequence O from X, is then provided to the GMM model. For the sequence of T residual vectors R = R_1, R_2, ..., R_T, its GMM probability can be written as:
$$P(R \mid \lambda) = \prod_{t=1}^{T} p(R_t \mid \lambda)$$
expressed in the logarithmic domain as:
$$L(R \mid \lambda) = \log P(R \mid \lambda) = \sum_{t=1}^{T} \log p(R_t \mid \lambda)$$
Bayes' theorem is applied during identification: among the models of the N enrolled speakers, the speaker whose model yields the maximum likelihood probability is the target speaker:
$$i^{*} = \arg\max_{1 \le i \le N} L(R \mid \lambda_i)$$
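The scoring rule can be sketched as follows; the single-Gaussian toy "speaker" models and the synthetic residual sequence are assumptions made for the example.

```python
import numpy as np

def identify(R, models):
    """Pick the speaker i* = argmax_i L(R|λ_i), where
    L(R|λ) = Σ_t log p(R_t|λ) over the residual sequence R.
    Each model λ is (weights, means, variances) with diagonal covariances."""
    def loglik(R, w, mu, var):
        D = R.shape[1]
        log_b = -0.5 * (D * np.log(2 * np.pi)
                        + np.sum(np.log(var), axis=1)[None]
                        + np.sum((R[:, None] - mu[None]) ** 2
                                 / var[None], axis=2))
        a = np.log(w)[None] + log_b
        m = a.max(axis=1, keepdims=True)          # log-sum-exp per frame
        return np.sum(m.ravel() + np.log(np.exp(a - m).sum(axis=1)))
    scores = [loglik(R, *lam) for lam in models]
    return int(np.argmax(scores))

# two toy single-Gaussian 'speakers'; residuals drawn near speaker 1's mean
rng = np.random.default_rng(0)
models = [(np.ones(1), np.zeros((1, 2)), np.ones((1, 2))),
          (np.ones(1), np.full((1, 2), 5.0), np.ones((1, 2)))]
R = 5.0 + 0.3 * rng.standard_normal((40, 2))
print(identify(R, models))  # 1: the model whose mean matches the data wins
```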
In the speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network, $\partial F(x)/\partial \omega_{ij}^{k}$ is calculated as follows:
$$\frac{\partial F(x)}{\partial \omega_{ij}^{k}} = \frac{\partial F(x)}{\partial y_i^{k}} \frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}}$$
For the TDNN network, $o_i^{k} = f(y_i^{k})$ and $y_i^{k} = \sum_j \omega_{ij}^{k}\, o_j^{k-1}$, where $o_i^{k}$ is the output of the i-th neuron of layer k when sample x is input, $y_i^{k}$ is the corresponding input (the weighted sum), and $f(\cdot)$ is the activation function. Then:
$$\frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}} = o_j^{k-1}$$
In the speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network, the calculation of $\partial F(x)/\partial y_i^{k}$ is divided into two cases: the output layer and the hidden layer of the TDNN.
for the output layer:
$$\frac{\partial F(x)}{\partial y_i^{k}} = -\frac{1}{p((x-o)\mid\lambda)}\, \frac{\partial p((x-o)\mid\lambda)}{\partial o_i^{k}}\, \frac{\partial o_i^{k}}{\partial y_i^{k}}$$

$$= -\frac{f'(y_i^{k})}{p((x-o)\mid\lambda)}\; \partial\!\left(\sum_{n=1}^{M} p_n c_n\, e^{-\frac{1}{2}(x-o-u_n)^T \Sigma_n^{-1}(x-o-u_n)}\right) \Big/ \partial o_i^{k}$$

$$= -\frac{f'(y_i^{k})}{p((x-o)\mid\lambda)} \sum_{n=1}^{M} p_n c_n\, \frac{a_n(x-o-u_n)}{\sigma_{n,i}^{2}}\,(x_i - o_i - u_{n,i}) \qquad (15)$$

where

$$a_n(x-o-u_n) = e^{-\frac{1}{2}(x-o-u_n)^T \Sigma_n^{-1}(x-o-u_n)}, \qquad c_n = \frac{1}{(2\pi)^{D/2}\,|\Sigma_n|^{1/2}}$$
for the hidden layer:
$$\frac{\partial F(x)}{\partial y_i^{k}} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial y_j^{k+1}}{\partial y_i^{k}} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial\left(\sum_n \omega_{jn}^{k+1}\, o_n^{k}\right)}{\partial y_i^{k}}$$

$$= \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial o_i^{k}}{\partial y_i^{k}}\,\omega_{ji}^{k+1} = f'(y_i^{k}) \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}}\,\omega_{ji}^{k+1}$$
In the speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network, pre-emphasis uses the filter F(z) = 1 - 0.97 z^{-1}, framing uses a Hamming window of 20 ms length with a 10 ms window shift, the order of the linear prediction analysis is 20, and the feature vectors are 13th-order cepstral coefficients.
The invention has the advantages and effects that:
1. The advantages of the TDNN and the GMM are fully exploited: the TDNN learns the temporal information of the feature vectors and maps the feature-vector set to a subspace that increases the likelihood probability, which reduces the influence of the unreasonable independence assumption on the feature vectors, enhances the likelihood probability of the target model, and lowers that of non-target models. The GMM contributes its high recognition rate, simple training, and low demand for training data. The recognition rate of the whole speaker recognition system is therefore greatly improved.
2. Compared with the GMM used alone, the method improves speaker recognition on speech in both noise-free and noisy environments.
Other advantages and effects of the present invention will be described further below.
Drawings
FIG. 1-speaker training and recognition model.
FIG. 2-time-lapse neural network model.
FIG. 3-comparison data at 1conv4w-1conv4w without noise.
FIG. 4-comparative data on noise in a car.
FIG. 5-comparative data showing noise within a compartment.
Detailed Description
The technical solution of the present invention is further explained below with reference to the drawings and the embodiments.
Fig. 1 shows the training and recognition model for TDNN-embedded speaker recognition, which differs from the baseline GMM model (which uses only the GMM for speaker recognition) in both training and recognition.
1. Preprocessing and feature extraction
First, silence detection is performed using a method based on energy and zero-crossing rate, and noise is removed by spectral subtraction. The filter F(z) = 1 - 0.97 z^{-1} then performs pre-emphasis, 20th-order linear prediction (LPC) analysis is carried out on frames obtained with a Hamming window of 20 ms length and 10 ms window shift, and 13th-order cepstral coefficients are computed from the 20th-order LPC coefficients as the feature vector for speaker recognition.
2. Speaker model training
During training, the TDNN training process and the GMM training process are performed alternately. The TDNN is a multilayer perceptron (MLP) network, as shown in FIG. 2. The feature vector, after being delayed by the linear delay block, is used as the input of the TDNN, which learns the structure of the feature vectors and extracts the temporal information of the feature-vector sequence. The learning result is then provided to the GMM in the form of residual feature vectors (i.e., the difference between the input vector and the TDNN output), the GMM model is trained with the expectation-maximization (EM) method, and the TDNN weight coefficients are updated by back-propagation with momentum. Here, the criterion for both TDNN learning and GMM training is maximum likelihood probability; thus, through learning, the residual distribution tends to evolve in the direction of increasing likelihood probability. The specific training process is described as follows:
(1) determining a GMM model and a TDNN structure:
the probability density function of an M-order GMM is obtained by weighted summation of M gaussian probability density functions and can be expressed as follows:
$$p(x_t \mid \lambda) = \sum_{i=1}^{M} p_i\, b_i(x_t) \qquad (1)$$
where $x_t$ is a D-dimensional random vector (in speaker recognition applications, $x_t$ is a feature vector); $b_i(x_t)$, i = 1, 2, ..., M are the member densities; and $p_i$, i = 1, 2, ..., M are the mixture weights. Each member density is a D-dimensional Gaussian with mean vector $u_i$ and covariance matrix $\Sigma_i$, of the form:
$$b_i(x_t) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x_t-u_i)^T \Sigma_i^{-1} (x_t-u_i)\right\} \qquad (2)$$
wherein the mixture weights satisfy the condition $\sum_{i=1}^{M} p_i = 1$.
the complete GMM model consists of the mean vector of all member densities, the covariance matrix, and the mixture weight parameters. These parameters are collectively represented as:
λ = {(p_i, u_i, Σ_i), i = 1, 2, ..., M}    (3)
since speaker recognition generally has less training and recognition data, it is common in practice to set the covariance matrix for each gaussian mixture density to be a diagonal matrix.
Here, a TDNN without feedback is used, as shown in FIG. 2. The feature vector x(n) is delayed by the linear delay block and then used as the input of the TDNN; the TDNN applies a nonlinear transformation to the input, then a linear weighting to obtain the output vector, which is compared with the feature vector, commonly under the minimum mean-square error (MMSE) criterion. Specifically, the ratio of the number of hidden-layer neurons to input-layer neurons in the TDNN is 3:2, and the nonlinear activation (the S-shaped function marked in FIG. 2) is
$$f(y) = \frac{1}{1+e^{-y}}$$

where y is the weighted-sum input to the neuron. During training, the inertia (momentum) coefficient γ of the neural network is 0.8.
(2) Set the convergence condition and the maximum number of iterations. Specifically, the convergence condition is that the Euclidean distance between the GMM parameters and the TDNN weight coefficients of two successive iterations is less than 0.0001, and the maximum number of iterations usually does not exceed 100.
(3) Randomly determining the TDNN and GMM model parameters for the initial iteration; the initial TDNN coefficients are set to computer-generated pseudo-random numbers, the initial mixing coefficients of the GMM can be set to 1/M, where M is the number of mixture terms of the GMM, and the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M aggregation classes with the LBG (Linde, Buzo, Gray) method and computing the mean and variance of each class.
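Step (3)'s initialization can be sketched as below. A plain k-means pass stands in for the LBG procedure here (an assumption made for brevity; LBG reaches M classes by repeated codebook splitting rather than random seeding), but the outputs match the step: mixing weights of 1/M and per-class means and variances of the residual vectors. The function name `init_gmm` is illustrative:

```python
import numpy as np

def init_gmm(residuals, M, iters=20, seed=0):
    """Initialize GMM parameters from TDNN residual vectors.

    residuals: (N, D) residual vectors; M: number of mixture terms.
    Returns (weights, means, variances) with weights = 1/M and the
    per-cluster means and diagonal variances.
    """
    rng = np.random.default_rng(seed)
    centers = residuals[rng.choice(len(residuals), M, replace=False)].copy()
    labels = np.zeros(len(residuals), dtype=int)
    for _ in range(iters):
        # assign each residual to its nearest center, then re-estimate centers
        d = np.linalg.norm(residuals[:, None, :] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for i in range(M):
            if np.any(labels == i):
                centers[i] = residuals[labels == i].mean(axis=0)
    weights = np.full(M, 1.0 / M)                    # initial mixing weight 1/M
    D = residuals.shape[1]
    variances = np.array([residuals[labels == i].var(axis=0) + 1e-6
                          if np.any(labels == i) else np.ones(D)
                          for i in range(M)])        # floor avoids zero variance
    return weights, centers, variances
```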
(4) Inputting the feature vector x(n) into the TDNN, and subtracting the output feature vector o(n) of the TDNN from the feature vector x(n) that entered it to obtain all the residual vectors;
(5) Correcting the parameters of the GMM model by the expectation-maximization (EM) method;
Let the residual vector be r_t. First, the class posterior probability is calculated using equation (4):
<math><mrow><mi>p</mi><mrow><mo>(</mo><mi>i</mi><mo>|</mo><msub><mi>r</mi><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><mfrac><mrow><msub><mi>p</mi><mi>i</mi></msub><msub><mi>b</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>r</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><msub><mi>p</mi><mi>k</mi></msub><msub><mi>b</mi><mi>k</mi></msub><mrow><mo>(</mo><msub><mi>r</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>4</mn><mo>)</mo></mrow></mrow></math>
Then, the updated mixing weights, mean vectors, and covariance matrices are obtained by formulas (5), (6), and (7), respectively:
<math><mrow><mover><msub><mi>p</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>5</mn><mo>)</mo></mrow></mrow></math>
<math><mrow><mover><msub><mi>u</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mo>=</mo><mfrac><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><msub><mi>x</mi><mi>t</mi></msub></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>6</mn><mo>)</mo></mrow></mrow></math>
<math><mrow><msubsup><mover><mi>&Sigma;</mi><mo>&OverBar;</mo></mover><mi>i</mi><mn>2</mn></msubsup><mo>=</mo><mfrac><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><msubsup><mi>x</mi><mi>t</mi><mn>2</mn></msubsup></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mo>-</mo><msup><mover><msub><mi>u</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mn>2</mn></msup><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>7</mn><mo>)</mo></mrow></mrow></math>
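One EM correction pass over the residuals, following equations (4)-(7) with diagonal covariances, might look like the sketch below. The function name and array shapes are illustrative assumptions; note the patent's formulas (6) and (7) write x_t, which is applied here to the residual vectors r_t being modeled:

```python
import numpy as np

def em_step(r, weights, means, variances):
    """One EM correction of the GMM parameters (equations (4)-(7)).

    r: (N, D) residual vectors r_t; weights: (M,); means, variances: (M, D)
    with diagonal covariances kept as per-dimension variances.
    """
    N, D = r.shape
    diff = r[:, None, :] - means[None]                       # (N, M, D)
    log_b = (-0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
             - 0.5 * (diff ** 2 / variances).sum(axis=2))    # log b_i(r_t)
    post = weights * np.exp(log_b)
    post /= post.sum(axis=1, keepdims=True)                  # equation (4)
    Ni = post.sum(axis=0)                                    # effective counts
    new_w = Ni / N                                           # equation (5)
    new_u = (post.T @ r) / Ni[:, None]                       # equation (6)
    new_var = (post.T @ (r ** 2)) / Ni[:, None] - new_u ** 2 # equation (7)
    return new_w, new_u, new_var
```

With a single component the posteriors are all 1, so the update must return the sample mean and sample variance, which is a useful correctness check.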
(6) Substituting the residuals into the modified GMM model, using the weight coefficient, mean vector, and variance of each Gaussian distribution, to obtain a likelihood probability, and modifying the TDNN parameters by the backward inversion method with inertia;
Specifically, the TDNN parameters are obtained by maximizing the function in the following equation:
<math><mrow><mi>L</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mo>=</mo><munder><mrow><mi>arg</mi><mi>max</mi></mrow><msub><mi>&omega;</mi><mi>ij</mi></msub></munder><munderover><mi>&Pi;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>o</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>8</mn><mo>)</mo></mrow></mrow></math>
where p(x | λ) is given by formula (1), o_t is the output vector of the TDNN, and x_t is the feature vector input to the TDNN.
Because neural-network iteration generally seeks a minimum, and a sum is more convenient to handle than a product, the logarithm of the above formula is taken and then negated, giving:
<math><mrow><mi>G</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mo>=</mo><munder><mrow><mi>arg</mi><mi>min</mi></mrow><msub><mi>&omega;</mi><mi>ij</mi></msub></munder><mrow><mo>(</mo><mo>-</mo><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>ln</mi><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>o</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>9</mn><mo>)</mo></mrow></mrow></math>
Here, the parameter λ is embodied through the weights ω_ij^k(m), where ω_ij^k(m) is the weight at the m-th iteration connecting input x_i and output y_j, and k is the layer number of the neural network. Solving G(X) by the backward inversion method with inertia accelerates the iterative convergence and better handles the local-minimum problem; the iterative formula of the backward inversion method with inertia is:
<math><mrow><msubsup><mi>&Delta;&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>+</mo><mn>1</mn><mo>)</mo></mrow><mo>=</mo><msubsup><mi>&gamma;&Delta;&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow><mo>-</mo><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mi>&gamma;</mi><mo>)</mo></mrow><mi>&alpha;</mi><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup></mrow></mfrac><msub><mo>|</mo><mrow><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mo>=</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></mrow></msub><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>10</mn><mo>)</mo></mrow></mrow></math>
where Δω_ij^k(m) is the weight increment at the m-th iteration, α is the iteration step, F(x) = -ln p((x_t - o_t) | λ), and γ is the inertia coefficient, here γ = 0.8.
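The inertial update of equation (10) is a momentum-style gradient step. A minimal sketch follows, with an illustrative step size α (the patent fixes only γ = 0.8); the function name is an assumption:

```python
def momentum_step(w, delta_prev, grad, gamma=0.8, alpha=0.01):
    """Weight update of equation (10), the backward inversion step with inertia.

    delta_w(m+1) = gamma * delta_w(m) - (1 - gamma) * alpha * dF/dw,
    with the gradient evaluated at the current weight w(m).
    Returns the new weight w(m+1) = w(m) + delta_w(m+1) and the increment,
    which is carried to the next iteration as the inertia term.
    """
    delta = gamma * delta_prev - (1.0 - gamma) * alpha * grad
    return w + delta, delta
```

The inertia term γΔω(m) smooths successive updates, which is what lets the method escape shallow local minima and converge faster than a plain gradient step.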
In formula (10), the partial derivative ∂F(x)/∂ω_ij^k is calculated as follows:
<math><mrow><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup></mrow></mfrac><mo>=</mo><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><mfrac><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup><msubsup><mrow><mo>&PartialD;</mo><mi>&omega;</mi></mrow><mi>ij</mi><mi>k</mi></msubsup></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>11</mn><mo>)</mo></mrow></mrow></math>
the following calculation formulas for the two product terms in equation (11) are respectively found, since in a neural network:
<math><mrow><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup><mo>=</mo><munder><mi>&Sigma;</mi><mi>j</mi></munder><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><msubsup><mi>o</mi><mi>j</mi><mrow><mi>k</mi><mo>-</mo><mn>1</mn></mrow></msubsup><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>12</mn><mo>)</mo></mrow></mrow></math>
o_i^k = f(y_i^k) (13)
In the above two formulas, o_i^k is the output of the i-th neuron in layer k when sample x is input, y_i^k is the corresponding input, and f(·) is the activation function. Then:
<math><mrow><mfrac><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>&omega;</mi></mrow><mi>ij</mi><mi>k</mi></msubsup></mfrac><mo>=</mo><msubsup><mi>o</mi><mi>j</mi><mrow><mi>k</mi><mo>-</mo><mn>1</mn></mrow></msubsup><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>14</mn><mo>)</mo></mrow></mrow></math>
The solution of ∂F(x)/∂y_i^k is divided into the two cases of the output layer and the hidden layer:
(a) For the output layer:
<math><mrow><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><mi>x</mi><mo>-</mo><mi>o</mi><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mfrac><mrow><mo>&PartialD;</mo><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><mi>x</mi><mo>-</mo><mi>o</mi><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>o</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><mfrac><msubsup><mrow><mo>&PartialD;</mo><mi>o</mi></mrow><mi>i</mi><mi>k</mi></msubsup><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup></mfrac></mrow></math>
<math><mrow><mo>=</mo><mo>-</mo><mfrac><mrow><msup><mi>f</mi><mo>&prime;</mo></msup><mrow><mo>(</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup><mo>)</mo></mrow></mrow><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><mi>x</mi><mo>-</mo><mi>o</mi><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mo>&PartialD;</mo><mrow><mo>(</mo><munderover><mi>&Sigma;</mi><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munderover><msub><mi>p</mi><mi>n</mi></msub><msub><mi>c</mi><mi>n</mi></msub><msup><mi>e</mi><mrow><mo>-</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow><mi>T</mi></msup><msubsup><mi>&Sigma;</mi><mi>n</mi><mrow><mo>-</mo><mn>1</mn></mrow></msubsup><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow></mrow></msup><mo>)</mo></mrow><mo>/</mo><msubsup><mrow><mo>&PartialD;</mo><mi>o</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mrow></math>
<math><mrow><mo>=</mo><mo>-</mo><mfrac><mrow><msup><mi>f</mi><mo>&prime;</mo></msup><mrow><mo>(</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup><mo>)</mo></mrow></mrow><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><mi>x</mi><mo>-</mo><mi>o</mi><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><munderover><mi>&Sigma;</mi><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munderover><msub><mi>p</mi><mi>n</mi></msub><msub><mi>c</mi><mi>n</mi></msub><mrow><mo>(</mo><mfrac><mrow><msub><mi>a</mi><mi>n</mi></msub><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow></mrow><msubsup><mi>&sigma;</mi><mrow><mi>n</mi><mo>,</mo><mi>i</mi></mrow><mn>2</mn></msubsup></mfrac><mrow><mo>(</mo><msub><mi>x</mi><mi>i</mi></msub><mo>-</mo><msub><mi>o</mi><mi>i</mi></msub><mo>-</mo><msub><mi>u</mi><mrow><mi>n</mi><mo>,</mo><mi>i</mi></mrow></msub><mo>)</mo></mrow><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>15</mn><mo>)</mo></mrow></mrow></math>
Wherein:
<math><mrow><msub><mi>a</mi><mi>n</mi></msub><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow><mo>=</mo><msup><mi>e</mi><mrow><mo>-</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow><mi>T</mi></msup><msubsup><mi>&Sigma;</mi><mi>n</mi><mrow><mo>-</mo><mn>1</mn></mrow></msubsup><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow></mrow></msup></mrow></math>
<math><mrow><msub><mi>c</mi><mi>n</mi></msub><mo>=</mo><mfrac><mn>1</mn><mrow><msup><mrow><mo>(</mo><mn>2</mn><mi>&pi;</mi><mo>)</mo></mrow><mrow><mi>D</mi><mo>/</mo><mn>2</mn></mrow></msup><msup><mrow><mo>|</mo><msub><mi>&Sigma;</mi><mi>n</mi></msub><mo>|</mo></mrow><mrow><mn>1</mn><mo>/</mo><mn>2</mn></mrow></msup></mrow></mfrac></mrow></math>
(b) For the hidden layer:
<math><mrow><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><mo>=</mo><munder><mi>&Sigma;</mi><mi>j</mi></munder><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>j</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></mfrac><mfrac><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>j</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup></mrow></mfrac></mrow></math>
<math><mrow><mo>=</mo><munder><mi>&Sigma;</mi><mi>j</mi></munder><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>j</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></mfrac><mfrac><mrow><mo>&PartialD;</mo><mrow><mo>(</mo><munder><mi>&Sigma;</mi><mi>n</mi></munder><msubsup><mi>&omega;</mi><mi>jn</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup><msubsup><mi>o</mi><mi>n</mi><mi>k</mi></msubsup><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup></mrow></mfrac></mrow></math>
<math><mrow><mo>=</mo><munder><mi>&Sigma;</mi><mi>j</mi></munder><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>j</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></mfrac><mfrac><msubsup><mrow><mo>&PartialD;</mo><mi>o</mi></mrow><mi>i</mi><mi>k</mi></msubsup><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><msubsup><mi>&omega;</mi><mi>ji</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></math>
<math><mrow><msup><mrow><mo>=</mo><mi>f</mi></mrow><mo>&prime;</mo></msup><mrow><mo>(</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup><mo>)</mo></mrow><munder><mi>&Sigma;</mi><mi>j</mi></munder><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>i</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></mfrac><msubsup><mi>&omega;</mi><mi>ji</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>16</mn><mo>)</mo></mrow></mrow></math>
Since backward inversion proceeds layer by layer, ∂F(x)/∂y_j^(k+1) is already known when ∂F(x)/∂y_i^k is calculated, so it can be substituted directly into equation (16).
(7) Judging whether the convergence condition set in step (2) is met or the maximum number of iterations is reached; if so, stop training, otherwise jump to step (4).
3. Speaker recognition
During identification, the feature vectors enter the TDNN after being delayed. Since the TDNN has learned the structure and timing information of the feature space, inputting the sequence X of feature vectors to be recognized transforms the feature vectors accordingly. Because of the training, the TDNN transformation enhances the likelihood probability of the target model and reduces that of non-target models. Then, the residual sequence R obtained by subtracting the sequence O output after the TDNN transformation from X is provided to the GMM model; for the sequence of T residual vectors R = R_1, R_2, ..., R_T, its GMM probability can be written as:
<math><mrow><mi>P</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Pi;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></munderover><mi>p</mi><mrow><mo>(</mo><msub><mi>R</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>17</mn><mo>)</mo></mrow></mrow></math>
expressed in the logarithmic domain as:
<math><mrow><mi>L</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><mi>log</mi><mi>P</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></munderover><mi>log</mi><mi>p</mi><mrow><mo>(</mo><msub><mi>R</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>18</mn><mo>)</mo></mrow></mrow></math>
Applying Bayes' theorem during identification, among the models of the N unknown speakers, the speaker corresponding to the model with the maximum likelihood probability is the target speaker:
<math><mrow><msup><mi>i</mi><mo>*</mo></msup><mo>=</mo><munder><mrow><mi>arg</mi><mi>max</mi></mrow><mrow><mn>1</mn><mo>&le;</mo><mi>i</mi><mo>&le;</mo><mi>N</mi></mrow></munder><mi>L</mi><mrow><mo>(</mo><msub><mrow><mi>R</mi><mo>|</mo><mi>&lambda;</mi></mrow><mi>i</mi></msub><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>19</mn><mo>)</mo></mrow></mrow></math>
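The scoring of equations (17)-(19) can be sketched as follows for diagonal-covariance models. The function names `log_likelihood` and `identify` and the (weights, means, variances) model-tuple layout are illustrative assumptions:

```python
import numpy as np

def log_likelihood(R, weights, means, variances):
    """L(R | lambda) of equation (18): summed log mixture density over the
    residual sequence R, for a diagonal-covariance GMM."""
    D = R.shape[1]
    diff = R[:, None, :] - means[None]                       # (T, M, D)
    log_b = (-0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
             - 0.5 * (diff ** 2 / variances).sum(axis=2))    # (T, M)
    return float(np.log((weights * np.exp(log_b)).sum(axis=1)).sum())

def identify(R, models):
    """Equation (19): index of the speaker model with maximum L(R | lambda_i)."""
    scores = [log_likelihood(R, *m) for m in models]
    return int(np.argmax(scores))
```

As a sanity check, residuals drawn near one model's mean should score highest under that model.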
here we selected 107 target speakers among them, 63 male and 44 female, using 1conv4w-1conv4w tested in NIST 2006 as an experiment. Each person selected about 2 minutes of speech as the training speech and the remaining speech as the test speech, thus forming about 23000 tests.
In order to test the improvement of the method of the present invention in noisy environments, two noises from the Japan Electronics Association standard noise database were selected: noise inside a running car (2000cc class, general road; stationary noise) and noise in an exhibition display booth (non-stationary noise). These noises were superimposed on the 1conv4w-1conv4w speech at given signal-to-noise ratios (SNR) to generate noisy speech.
The correct recognition rate is adopted as the criterion for judging the speaker recognition effect: correct_ratio = N_v / N_t, where correct_ratio is the correct recognition rate, N_v is the number of correctly identified tests, and N_t is the total number of tests.
The method of the present invention (denoted TDNN-GMM) and the speaker training and recognition method using only the GMM (denoted baseline GMM) are compared here. The results of the experiments are shown in FIGS. 3-5. FIG. 3 compares the recognition effect for varying numbers of mixture terms M of the Gaussian probability density function in the GMM on 1conv4w-1conv4w without noise; it shows that the recognition effect of the GMM improves after the TDNN is embedded, and the improvement is more obvious when the number of mixture terms M is smaller, because the neural network learns better when there are fewer subclasses within a class.
Figs. 4 and 5 compare the results under different noises and signal-to-noise ratios (SNR), with M = 80. As can be seen from figs. 4 and 5, the method of the present invention improves the speaker recognition effect considerably over the baseline GMM at all signal-to-noise ratios.

Claims (4)

1. A speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network is characterized by comprising the following steps:
(1) preprocessing and feature extraction;
firstly, carrying out silence detection by using a method based on energy and a zero crossing rate, removing noise by using a spectral subtraction method, carrying out pre-emphasis and framing on a speech signal, carrying out Linear Prediction (LPC) analysis, and then obtaining a cepstrum coefficient from the obtained LPC coefficient to be used as a feature vector for speaker identification;
(2) training;
during training, delaying the extracted feature vector, then entering a Time Delay Neural Network (TDNN), wherein the TDNN learns the structure of the feature vector, and extracting time information of a feature vector sequence; then, providing the learning result to a Gaussian Mixture Model (GMM) in the form of residual error characteristic vector, training the GMM by adopting a maximum expectation method, and updating the weight coefficient of the TDNN by utilizing a backward inversion method with inertia; the specific training process is as follows:
(2-1) determining the GMM model and the TDNN structure:
the probability density function of an M-order GMM is obtained by weighted summation of M gaussian probability density functions and can be expressed as follows:
<math><mrow><mi>p</mi><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Sigma;</mi><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munderover><msub><mi>p</mi><mi>i</mi></msub><msub><mi>b</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow></math>
in the above formula, x_t is a D-dimensional feature vector, where D = 13; b_i(x_t) is the member density function, a Gaussian function with mean vector u_i and covariance matrix Σ_i:
<math><mrow><msub><mi>b</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>=</mo><mfrac><mn>1</mn><mrow><msup><mrow><mo>(</mo><mn>2</mn><mi>&pi;</mi><mo>)</mo></mrow><mrow><mi>D</mi><mo>/</mo><mn>2</mn></mrow></msup><msup><mrow><mo>|</mo><msub><mi>&Sigma;</mi><mi>i</mi></msub><mo>|</mo></mrow><mrow><mn>1</mn><mo>/</mo><mn>2</mn></mrow></msup></mrow></mfrac><mi>exp</mi><mo>{</mo><mo>-</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>u</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>T</mi></msup><msubsup><mi>&Sigma;</mi><mi>i</mi><mrow><mo>-</mo><mn>1</mn></mrow></msubsup><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>u</mi><mi>i</mi></msub><mo>)</mo></mrow><mo>}</mo></mrow></math>
p_i is the mixing weight, satisfying the condition p_1 + p_2 + ... + p_M = 1;
the complete GMM model parameters are as follows:
λ={(pi,ui,∑i),i=1,2,...,M}
the TDNN characteristic vector x (n) without feedback is used as the input of the TDNN after being delayed by a linear delay block, the TDNN performs nonlinear transformation on the input, then performs linear weighting to obtain an output vector, and then compares the output vector with the characteristic vector, wherein the commonly used criterion is minimum mean square criterion (MMSE); the ratio of the number of neurons in the hidden layer to the number of neurons in the input layer of TDNN is 3: 2, and the nonlinear activation S function isy is the input after weighted summation; during training, the inertia coefficient gamma of the neural network is 0.8;
(2-2) setting a convergence condition and a maximum iteration number; specifically, the convergence condition is that the euclidean distance between the GMM coefficients and the TDNN weight coefficients of two adjacent times is less than 0.0001, and the maximum iteration number is usually not more than 100;
(2-3) randomly determining the TDNN and GMM model parameters for the initial iteration; the initial TDNN coefficients are set to computer-generated pseudo-random numbers, the initial mixing coefficients of the GMM can be set to 1/M, where M is the number of mixture terms of the GMM, and the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M aggregation classes with the LBG (Linde, Buzo, Gray) method and computing the mean and variance of each class;
(2-4) inputting the feature vector x(n) into the TDNN, and subtracting the output feature vector o(n) of the TDNN from the feature vector x(n) that entered it to obtain all the residual vectors;
(2-5) correcting parameters of the GMM model by adopting a maximum expectation method;
let the residual vector be r_t; first, the class posterior probability is calculated:
<math><mrow><mi>p</mi><mrow><mo>(</mo><mi>i</mi><mo>|</mo><msub><mi>r</mi><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><mfrac><mrow><msub><mi>p</mi><mi>i</mi></msub><msub><mi>b</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>r</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><msub><mi>p</mi><mi>k</mi></msub><msub><mi>b</mi><mi>k</mi></msub><mrow><mo>(</mo><msub><mi>r</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow></mfrac></mrow></math>
then the updated mixing weights, mean vectors, and covariance matrices are computed:
<math><mrow><mover><msub><mi>p</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></math>
<math><mrow><mover><msub><mi>u</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mo>=</mo><mfrac><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><msub><mi>x</mi><mi>t</mi></msub></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac></mrow></math>
<math><mrow><msubsup><mover><mi>&Sigma;</mi><mo>&OverBar;</mo></mover><mi>i</mi><mn>2</mn></msubsup><mo>=</mo><mfrac><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><msubsup><mi>x</mi><mi>t</mi><mn>2</mn></msubsup></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mo>-</mo><msup><mover><msub><mi>u</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mn>2</mn></msup></mrow></math>
(2-6) substituting residual errors by using the weight coefficient, mean vector and variance of each Gaussian distribution of the modified GMM model to obtain a likelihood probability, and modifying TDNN parameters by using a backward inversion method with inertia;
the TDNN parameter is obtained by maximizing the function in the following equation:
<math><mrow><mi>L</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mo>=</mo><munder><mrow><mi>arg</mi><mi>max</mi></mrow><msub><mi>&omega;</mi><mi>ij</mi></msub></munder><munderover><mi>&Pi;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>o</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></math>
wherein o_t is the output of the neural network and x_t is the input feature vector;
taking the logarithm of the above formula and then negating it gives:
<math><mrow><mi>G</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mo>=</mo><munder><mrow><mi>arg</mi><mi>min</mi></mrow><msub><mi>&omega;</mi><mi>ij</mi></msub></munder><mrow><mo>(</mo><mo>-</mo><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>ln</mi><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>o</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>)</mo></mrow></mrow></math>
G(X) is solved by the backward inversion method with inertia, whose iterative formula is as follows:
<math><mrow><msubsup><mi>&Delta;&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>+</mo><mn>1</mn><mo>)</mo></mrow><mo>=</mo><msubsup><mi>&gamma;&Delta;&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow><mo>-</mo><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mi>&gamma;</mi><mo>)</mo></mrow><mi>&alpha;</mi><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup></mrow></mfrac><msub><mo>|</mo><mrow><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mo>=</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></mrow></msub></mrow></math>
where ω_ij^k(m) is the weight at the m-th iteration connecting input x_i and output y_j, k is the layer number of the neural network, α is the iteration step size, F(x) = -ln p((x_t - o_t) | λ), and γ is the inertia coefficient;
(2-7) judging whether the convergence condition set in the step (2-2) is met or whether the maximum iteration number is reached, if so, stopping training, otherwise, jumping to the step (2-4);
(3) identification
During identification, the feature vector sequence X is input into the TDNN after being delayed; then the residual sequence R obtained by subtracting the TDNN output sequence O from X is provided to the GMM model, and for the sequence of T residual vectors R = R_1, R_2, ..., R_T, its GMM probability can be written as:
<math><mrow><mi>P</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Pi;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></munderover><mi>p</mi><mrow><mo>(</mo><msub><mi>R</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></math>
expressed in the logarithmic domain as:
<math><mrow><mi>L</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><mi>log</mi><mi>P</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></munderover><mi>log</mi><mi>p</mi><mrow><mo>(</mo><msub><mi>R</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></math>
Applying Bayes' theorem during identification, among the models of the N unknown speakers, the speaker corresponding to the model with the maximum likelihood probability is the target speaker:
<math><mrow><msup><mi>i</mi><mo>*</mo></msup><mo>=</mo><munder><mrow><mi>arg</mi><mi>max</mi></mrow><mrow><mn>1</mn><mo>&le;</mo><mi>i</mi><mo>&le;</mo><mi>N</mi></mrow></munder><mi>L</mi><mrow><mo>(</mo><msub><mrow><mi>R</mi><mo>|</mo><mi>&lambda;</mi></mrow><mi>i</mi></msub><mo>)</mo></mrow><mo>.</mo></mrow></math>
2. the speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network according to claim 1, wherein the calculation process of $\frac{\partial F(x)}{\partial \omega_{ij}^k}$ is as follows:
$$\frac{\partial F(x)}{\partial \omega_{ij}^k} = \frac{\partial F(x)}{\partial y_i^k} \frac{\partial y_i^k}{\partial \omega_{ij}^k}$$
in the TDNN network, $o_i^k$ is the output of the $i$-th neuron in the $k$-th layer when sample $x$ is input, $y_i^k$ is the corresponding input of that neuron, and $f(\cdot)$ is the activation function; then:
$$\frac{\partial y_i^k}{\partial \omega_{ij}^k} = o_j^{k-1}.$$
3. the speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network according to claim 2, wherein the calculation process of $\frac{\partial F(x)}{\partial y_i^k}$ is divided into two cases, the output layer and the hidden layer of the TDNN;
for the output layer:
$$\frac{\partial F(x)}{\partial y_i^k} = -\frac{1}{p((x-o)|\lambda)} \frac{\partial p((x-o)|\lambda)}{\partial o_i^k} \frac{\partial o_i^k}{\partial y_i^k}$$

$$= -\frac{f'(y_i^k)}{p((x-o)|\lambda)} \; \partial\!\left( \sum_{n=1}^{M} p_n c_n e^{-\frac{1}{2}(x-o-u_n)^T \Sigma_n^{-1} (x-o-u_n)} \right) \Big/ \partial o_i^k$$

$$= -\frac{f'(y_i^k)}{p((x-o)|\lambda)} \sum_{n=1}^{M} p_n c_n \left( \frac{a_n(x-o-u_n)}{\sigma_{n,i}^2} (x_i - o_i - u_{n,i}) \right) \qquad (15)$$
wherein: $a_n(x-o-u_n) = e^{-\frac{1}{2}(x-o-u_n)^T \Sigma_n^{-1} (x-o-u_n)}$, $\quad c_n = \frac{1}{(2\pi)^{D/2} |\Sigma_n|^{1/2}}$
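The output-layer gradient of equation (15), stripped of the $f'(y_i^k)$ factor, is the gradient of $F = -\ln p((x-o)|\lambda)$ with respect to the output $o$. A sketch with diagonal covariances, which can be verified against finite differences (all parameter values here are illustrative):

```python
import numpy as np

def F(o, x, weights, means, variances):
    """F(o) = -log p((x - o) | lambda) for a diagonal-covariance GMM."""
    r = x - o
    D = r.shape[0]
    c = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1)))
    a = np.exp(-0.5 * np.sum((r - means) ** 2 / variances, axis=1))
    return -np.log(np.sum(weights * c * a))

def grad_F(o, x, weights, means, variances):
    """dF/do_i = -(1/p) * sum_n p_n c_n a_n (r_i - u_{n,i}) / sigma_{n,i}^2, r = x - o."""
    r = x - o
    D = r.shape[0]
    c = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1)))
    a = np.exp(-0.5 * np.sum((r - means) ** 2 / variances, axis=1))
    wca = weights * c * a
    return -np.sum(wca[:, None] * (r - means) / variances, axis=0) / np.sum(wca)
```

Multiplying this gradient by $f'(y_i^k)$ recovers the output-layer sensitivity used in the backpropagation.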
for the hidden layer:
$$\frac{\partial F(x)}{\partial y_i^k} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial y_j^{k+1}}{\partial y_i^k} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial \left( \sum_n \omega_{jn}^{k+1} o_n^k \right)}{\partial y_i^k}$$

$$= \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial o_i^k}{\partial y_i^k} \omega_{ji}^{k+1} = f'(y_i^k) \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \omega_{ji}^{k+1}.$$
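The hidden-layer recursion above is the standard backpropagation pattern $\delta^k = f'(y^k)\,(W^{k+1})^{\mathsf T}\delta^{k+1}$. A generic sketch, with the output-layer sensitivity of equation (15) abbreviated as a supplied `delta_out` (the layer shapes and the activation in the usage are illustrative):

```python
import numpy as np

def backprop_deltas(weights, ys, fprime, delta_out):
    """Propagate sensitivities dF/dy back from the output layer.

    weights[k] maps layer-k outputs to layer-(k+1) inputs (its (j, i) entry is
    omega_{ji}^{k+1} in the patent's notation); ys[k] are the pre-activation
    inputs y^k; fprime is the activation derivative f'."""
    deltas = [None] * len(ys)
    deltas[-1] = delta_out                       # output layer: from eq. (15)
    for k in range(len(ys) - 2, -1, -1):         # hidden layers, back to front
        deltas[k] = fprime(ys[k]) * (weights[k].T @ deltas[k + 1])
    return deltas
```

For a linear activation ($f' \equiv 1$), the hidden-layer sensitivity reduces to $(W^{k+1})^{\mathsf T}\delta^{k+1}$, which is easy to check by hand on a 2-by-2 weight matrix.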
4. the speaker recognition method according to claim 1, wherein pre-emphasis is performed by the filter $F(z) = 1 - 0.97z^{-1}$; framing adopts a Hamming window with a length of 20 ms and a window shift of 10 ms; the order of the linear prediction analysis is 20; and the feature vector is composed of 13th-order cepstral coefficients.
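The front end of claim 4 (pre-emphasis $F(z)=1-0.97z^{-1}$, 20 ms Hamming frames with 10 ms shift) can be sketched as follows; the sampling rate is an assumption, and the 20th-order LPC and 13th-order cepstrum steps are omitted since the claim fixes only their orders:

```python
import numpy as np

def frontend(signal, fs=8000, pre=0.97, frame_ms=20, shift_ms=10):
    """Pre-emphasize the signal, then split it into Hamming-windowed frames."""
    # pre-emphasis filter F(z) = 1 - 0.97 z^{-1}
    emphasized = np.append(signal[0], signal[1:] - pre * signal[:-1])
    flen = int(fs * frame_ms / 1000)        # samples per 20 ms frame
    shift = int(fs * shift_ms / 1000)       # samples per 10 ms shift
    n_frames = 1 + max(0, (len(emphasized) - flen) // shift)
    window = np.hamming(flen)
    frames = np.stack([emphasized[i * shift : i * shift + flen] * window
                       for i in range(n_frames)])
    return frames
```

At an assumed 8 kHz sampling rate this yields 160-sample frames every 80 samples; LPC-cepstrum coefficients would then be computed per frame to form the 13-dimensional feature vectors.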
CN2009100354240A 2009-09-28 2009-09-28 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network Pending CN102034472A (en)

Publications (1)

Publication Number Publication Date
CN102034472A true CN102034472A (en) 2011-04-27



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication
