CN102034472A - Speaker recognition method based on Gaussian mixture model embedded with time delay neural network - Google Patents
Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
- Publication number: CN102034472A
- Application number: CN2009100354240A
- Authority: CN (China)
- Prior art keywords
- mrow
- msub
- msubsup
- mfrac
- math
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a speaker recognition method based on a Gaussian mixture model (GMM) embedded with a time delay neural network (TDNN). The method fully exploits the advantages of the TDNN and the GMM: the TDNN is embedded into the GMM and, by fully utilizing the time sequence of the input feature vectors through the time-delay transformation, the residual between the input and output vectors of the TDNN is obtained; this residual corrects the training of the GMM through the expectation-maximization method. In addition, a likelihood probability is computed from the corrected GMM parameters and the residual, and the TDNN parameters are corrected by the backward inversion method with inertia, so that the parameters of the GMM and the TDNN are updated alternately. Experiments show that the recognition rate of the method is improved to a certain extent over that of the baseline GMM under various signal-to-noise ratios.
Description
Technical Field
The invention relates to a speaker recognition method, in particular to a speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network.
Background
In applications such as access control, credit card transactions and court evidence, automatic speaker recognition, particularly text-independent speaker recognition, plays an increasingly important role. Its aim is to correctly decide to which of several reference speakers in a voice library the speech to be recognized belongs.
Among speaker recognition methods, the Gaussian mixture model (GMM) based method has received more and more attention and has become the mainstream recognition method, owing to its high recognition rate, simple training and low requirement on the amount of training data. Because the GMM represents data distributions well, it can approximate any distribution provided there are enough mixture terms and enough training data. However, several problems arise in practical use. First, the GMM does not use the temporal information of the speaker's voice: the training and recognition results are independent of the input order of the feature vectors. Second, during GMM training the feature vectors are always assumed to be mutually independent, which is obviously unreasonable. In addition, since there is no good guiding principle for choosing the number of mixture terms when selecting a GMM, enough Gaussian mixture terms are required to obtain a good result.
Neural networks also hold an important position in speaker recognition: the multilayer perceptron, the radial basis function network, the self-associative neural network and the like have been successfully applied to speaker recognition, and the time delay neural network (TDNN) in particular is widely used in signal processing, speech recognition and speaker recognition. However, at present the GMM and the TDNN are used only separately for speaker recognition, and there is no method that combines their advantages to further improve the speaker recognition effect.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network. The technical scheme of the invention is as follows:
a speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network comprises the following steps:
(1) preprocessing and feature extraction;
first, silence detection is performed using a method based on energy and a zero-crossing rate, noise is removed by spectral subtraction, a speech signal is pre-emphasized, framed, and subjected to Linear Prediction (LPC) analysis, and then cepstrum coefficients are found from the obtained LPC coefficients as feature vectors for speaker recognition.
(2) Training;
during training, the extracted feature vectors are delayed and then used as the input of the TDNN; the TDNN learns the structure of the feature vectors and extracts the temporal information of the feature vector sequence. The learning result is then provided to the GMM in the form of residual feature vectors, the GMM model is trained with the expectation-maximization (EM) criterion, and the weight coefficients of the TDNN are updated by the backward inversion method with inertia. The specific training process is as follows:
(2-1) determining the GMM model and the TDNN structure:
the probability density function of an M-order GMM is a weighted sum of M Gaussian probability density functions and can be expressed as:

p(x_t | λ) = \sum_{i=1}^{M} p_i b_i(x_t)

In the above formula, x_t is a D-dimensional feature vector, where D = 13; b_i(x_t) is a member density function, namely a Gaussian function with mean vector u_i and covariance matrix Σ_i; the model parameters are collectively denoted

λ = {(p_i, u_i, Σ_i), i = 1, 2, ..., M}

Here a TDNN without feedback is used. The feature vector x(n) is delayed by the linear delay block and then used as the input of the TDNN; the TDNN performs a nonlinear transformation on the input and then a linear weighting to obtain the output vector, which is compared with the feature vector, the commonly used criterion being the minimum mean square error (MMSE) criterion. Specifically, the ratio of the number of neurons in the hidden layer to the number of neurons in the input layer of the TDNN is 3:2, and the nonlinear activation S function (the ∫ symbol in FIG. 2) is f(y) = 1/(1 + e^{-y}), where y is the input after weighted summation. During training, the inertia coefficient γ of the neural network is 0.8.
(2-2) setting a convergence condition and a maximum number of iterations; specifically, the convergence condition is that the Euclidean distance between the GMM parameters and TDNN weight coefficients of two successive iterations is less than 0.0001, and the maximum number of iterations usually does not exceed 100.
(2-3) randomly determining the TDNN and GMM model parameters of the initial iteration; the initial coefficients of the TDNN are set to computer-generated pseudo-random numbers; the initial mixing coefficients of the GMM may all be taken as 1/M, where M is the number of GMM mixture terms; the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M aggregation classes with the LBG (Linde, Buzo, Gray) method and calculating the mean and variance of each of the M classes.
(2-4) inputting the feature vector x(n) into the TDNN network, and subtracting the output feature vector o(n) of the TDNN from the feature vector x(n) to obtain all residual vectors;
(2-5) correcting parameters of the GMM by adopting an EM method;
let the residual vector be r_t; first, the class posterior probability is calculated:

p(i | r_t, λ) = \frac{p_i b_i(r_t)}{\sum_{k=1}^{M} p_k b_k(r_t)}

(2-6) substituting the residuals into the modified GMM, using the weight coefficient, mean vector and variance of each of its Gaussian distributions, to obtain the likelihood probability, and modifying the TDNN parameters by the backward inversion method with inertia;
the TDNN network parameters are obtained by maximizing the function in the following equation:

p(X | λ) = \prod_{t=1}^{T} p(x_t − o_t | λ)

wherein o_t is the output of the neural network and x_t is the input feature vector.
Taking the logarithm of the above formula and then negating it gives:

G(X) = −\sum_{t=1}^{T} \ln p(x_t − o_t | λ)

G(X) is solved by the backward inversion method with inertia, whose iterative formula is:

Δw_{ij}^k(m+1) = −α \frac{∂f}{∂w_{ij}^k} + γ Δw_{ij}^k(m),   w_{ij}^k(m+1) = w_{ij}^k(m) + Δw_{ij}^k(m+1)

wherein w_{ij}^k(m) is, at the m-th iteration, the weight connecting input x_i and output y_j, k is the layer number of the neural network, α is the iteration step size, f(x) = −\ln p((x_t − o_t) | λ), and γ is the inertia coefficient.
(2-7) judging whether the convergence condition set in step (2-2) is met or whether the maximum number of iterations is reached; if so, training is stopped, otherwise jump to step (2-4).
(3) Speaker recognition
During identification, the feature vector sequence X is delayed and then input into the TDNN. Then a residual sequence R, obtained by subtracting the output sequence O of the TDNN from X, is provided to the GMM model; for the sequence of T residual vectors R = R_1, R_2, ..., R_T, its GMM probability can be written as:

p(R | λ) = \prod_{t=1}^{T} p(R_t | λ)

Expressed in the logarithmic domain:

\log p(R | λ) = \sum_{t=1}^{T} \ln p(R_t | λ)

Bayes' theorem is applied during identification: among the models of the N unknown speakers, the speaker corresponding to the model with the maximum likelihood probability is taken as the target speaker:

\hat{n} = \arg\max_{1 ≤ n ≤ N} \sum_{t=1}^{T} \ln p(R_t | λ_n)
In the speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network, the calculation process of ∂f/∂w_{ij}^k is as follows:
in the TDNN network, o_i^k(x) denotes the output when sample x is input to the i-th neuron of layer k, net_i^k(x) denotes the input when sample x is input to the i-th neuron of layer k, and f_k(·) is the activation function. Then:
In the speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network, the calculation of ∂f/∂net_j^k is divided into the two cases of the output layer and the hidden layer of the TDNN;
for the output layer:
wherein: a_n(x − o − u_n) = \exp\{ −\tfrac{1}{2}(x − o − u_n)^T Σ_n^{-1} (x − o − u_n) \},   c_n = \frac{1}{(2π)^{D/2} |Σ_n|^{1/2}}
for the hidden layer:
In the speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network, the pre-emphasis adopts the filter F(z) = 1 − 0.97z^{-1}, framing adopts a Hamming window with a length of 20 ms and a window shift of 10 ms, the order of the linear prediction analysis is 20, and the feature vector consists of 13th-order cepstral coefficients.
The invention has the advantages and effects that:
1. The advantages of the TDNN and the GMM are fully exploited. The TDNN learns the temporal information of the feature vectors and maps the feature vector set into a subspace that increases the likelihood probability, which reduces the influence of the unreasonable assumption that the feature vectors are mutually independent, enhances the likelihood probability of the target model and reduces that of non-target models. The GMM contributes its high recognition rate, simple training and low requirement on the amount of training data. Therefore, the recognition rate of the whole speaker recognition system is greatly improved.
2. Compared with using the GMM alone, the method of the invention improves the speaker recognition effect for speech in both noise-free and noisy environments.
Other advantages and effects of the present invention will be described further below.
Drawings
FIG. 1-speaker training and recognition model.
FIG. 2-time-lapse neural network model.
FIG. 3 - comparison data on the noise-free 1conv4w-1conv4w condition.
FIG. 4 - comparison data under in-car noise.
FIG. 5 - comparison data under exhibition-booth noise.
Detailed Description
The technical solution of the present invention is further explained below with reference to the drawings and the embodiments.
FIG. 1 shows the training and recognition model of the speaker recognition method with an embedded TDNN network; it differs from the baseline GMM model (which uses only the GMM for speaker recognition) in both training and recognition.
1. Preprocessing and feature extraction
First, silence detection is performed using a method based on energy and zero-crossing rate, and noise is removed by spectral subtraction. The speech is then pre-emphasized with the filter F(z) = 1 − 0.97z^{-1}, framed with a Hamming window of 20 ms length and 10 ms window shift, and subjected to 20th-order linear prediction (LPC) analysis; 13th-order cepstral coefficients are then obtained from the 20th-order LPC coefficients as the feature vector for speaker recognition.
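As an illustration of this feature-extraction chain, a minimal sketch using only numpy is given below. It assumes the autocorrelation method with a Levinson-Durbin recursion for the LPC analysis and the standard LPC-to-cepstrum recursion; the function and variable names (e.g. lpc_cepstrum_features) are chosen for illustration and are not taken from the patent, and silence detection and spectral subtraction are omitted.

```python
import numpy as np

def lpc_cepstrum_features(signal, fs=8000, lpc_order=20, n_cep=13,
                          frame_ms=20, shift_ms=10, pre_emph=0.97):
    """Pre-emphasis, Hamming-window framing, LPC analysis and LPC-cepstrum per frame."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis filter F(z) = 1 - 0.97 z^-1
    s = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)    # 20 ms frames
    frame_shift = int(fs * shift_ms / 1000)  # 10 ms shift
    window = np.hamming(frame_len)

    feats = []
    for start in range(0, len(s) - frame_len + 1, frame_shift):
        frame = s[start:start + frame_len] * window

        # Autocorrelation-method LPC solved by the Levinson-Durbin recursion
        r = np.correlate(frame, frame, mode="full")[frame_len - 1:frame_len + lpc_order]
        a = np.zeros(lpc_order + 1)
        a[0] = 1.0
        err = r[0] + 1e-10
        for i in range(1, lpc_order + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
            err *= (1.0 - k * k)

        # LPC coefficients -> cepstral coefficients (standard recursion)
        c = np.zeros(n_cep + 1)
        for n in range(1, n_cep + 1):
            acc = -a[n]
            for j in range(1, n):
                acc -= (j / n) * c[j] * a[n - j]
            c[n] = acc
        feats.append(c[1:])                   # 13-dimensional feature vector
    return np.array(feats)
```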
2. Speaker model training
During training, the TDNN and the GMM are trained alternately. The TDNN is a multi-layer perceptron (MLP) network, as shown in FIG. 2. The feature vector is delayed by the linear delay block and then used as the input of the TDNN; the TDNN learns the structure of the feature vectors and extracts the temporal information of the feature vector sequence. The learning result is then provided to the GMM in the form of residual feature vectors (i.e., the difference between the input vector and the output of the TDNN), the GMM model is trained with the expectation-maximization (EM) method, and the weight coefficients of the TDNN are updated with the backward inversion method with inertia (i.e., back-propagation with a momentum term). The criterion for both TDNN learning and GMM training is maximum likelihood, so through learning the residual distribution tends to move in the direction that enhances the likelihood probability. The specific training process is described as follows:
(1) determining a GMM model and a TDNN structure:
The probability density function of an M-order GMM is a weighted sum of M Gaussian probability density functions and can be expressed as:

p(x_t | λ) = \sum_{i=1}^{M} p_i b_i(x_t)    (1)

where x_t is a D-dimensional random vector (in speaker recognition applications, a feature vector); b_i(x_t), i = 1, 2, ..., M are the member densities; and p_i, i = 1, 2, ..., M are the mixing weights. Each member density is a D-dimensional Gaussian with mean vector u_i and covariance matrix Σ_i, of the form:

b_i(x_t) = \frac{1}{(2π)^{D/2} |Σ_i|^{1/2}} \exp\{ −\tfrac{1}{2}(x_t − u_i)^T Σ_i^{-1} (x_t − u_i) \}    (2)

wherein the mixing weights satisfy the condition \sum_{i=1}^{M} p_i = 1.
the complete GMM model consists of the mean vector of all member densities, the covariance matrix, and the mixture weight parameters. These parameters are collectively represented as:
λ={(pi,ui,∑i),i=1,2,...,M} (3)
since speaker recognition generally has less training and recognition data, it is common in practice to set the covariance matrix for each gaussian mixture density to be a diagonal matrix.
Here, a TDNN without feedback is utilized, as shown in FIG. 2. The feature vector x(n) is delayed by the linear delay block and then used as the input of the TDNN; the TDNN performs a nonlinear transformation on the input and then a linear weighting to obtain the output vector, which is compared with the feature vector, the commonly used criterion being the minimum mean square error (MMSE) criterion. Specifically, the ratio of the number of neurons in the hidden layer to the number of neurons in the input layer of the TDNN is 3:2, and the nonlinear activation S function (the ∫ symbol in FIG. 2) is f(y) = 1/(1 + e^{-y}), where y is the input after weighted summation.
During training, the inertia coefficient γ of the neural network is 0.8.
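A sketch of such a feed-forward TDNN with a delay block is given below. The patent fixes only the 3:2 hidden-to-input neuron ratio, the sigmoid activation and the residual x(n) − o(n); the number of delayed frames, the weight initialization and the class and function names used here are assumptions made for illustration.

```python
import numpy as np

def sigmoid(y):
    # nonlinear activation S function f(y) = 1 / (1 + e^{-y})
    return 1.0 / (1.0 + np.exp(-y))

class SimpleTDNN:
    """Feed-forward TDNN sketch: delayed input -> sigmoid hidden layer -> linear output."""

    def __init__(self, dim=13, n_delays=2, seed=0):
        rng = np.random.default_rng(seed)
        n_in = dim * n_delays               # current frame stacked with delayed frames
        n_hidden = (3 * n_in) // 2          # hidden : input neurons = 3 : 2
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = 0.1 * rng.standard_normal((dim, n_hidden))
        self.b2 = np.zeros(dim)
        self.n_delays = n_delays

    def parameters(self):
        return [self.W1, self.b1, self.W2, self.b2]

    def set_parameters(self, params):
        self.W1, self.b1, self.W2, self.b2 = params

    def forward(self, X):
        """X: (T, dim) feature sequence.  Returns outputs o(n) and residuals x(n) - o(n)."""
        T, dim = X.shape
        O = np.zeros_like(X, dtype=float)
        for n in range(T):
            # linear delay block: stack the current frame with the previous frames
            v = np.concatenate([X[max(n - d, 0)] for d in range(self.n_delays)])
            h = sigmoid(self.W1 @ v + self.b1)   # nonlinear transformation
            O[n] = self.W2 @ h + self.b2         # linear weighting to the output vector
        return O, X - O                           # residuals are provided to the GMM
```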
(2) Setting the convergence condition and the maximum number of iterations; specifically, the convergence condition is that the Euclidean distance between the GMM parameters and TDNN weight coefficients of two successive iterations is less than 0.0001, and the maximum number of iterations usually does not exceed 100.
(3) Randomly determining the TDNN and GMM model parameters of the initial iteration; the initial coefficients of the TDNN are set to computer-generated pseudo-random numbers; the initial mixing coefficients of the GMM may all be taken as 1/M, where M is the number of GMM mixture terms; the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M aggregation classes with the LBG (Linde, Buzo, Gray) method and calculating the mean and variance of each of the M classes.
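The LBG-based initialization can be sketched with a binary-splitting procedure such as the one below; the split offset, the number of refinement passes and the variance floor are assumptions made for the example and are not values given by the patent.

```python
import numpy as np

def lbg_init(residuals, M, n_iter=20, eps=0.01):
    """LBG (Linde-Buzo-Gray) style clustering of residual vectors into M aggregation classes.

    residuals: (T, D).  Returns initial GMM means (M, D) and diagonal variances (M, D);
    the initial mixing weights are simply 1/M each.
    """
    codebook = residuals.mean(axis=0, keepdims=True)          # start from one centroid

    while codebook.shape[0] < M:
        # split every centroid by a small perturbation, then refine by nearest-centroid passes
        codebook = np.vstack([codebook + eps, codebook - eps])[:M]
        for _ in range(n_iter):
            d = np.linalg.norm(residuals[:, None, :] - codebook[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            for i in range(codebook.shape[0]):
                members = residuals[labels == i]
                if len(members) > 0:
                    codebook[i] = members.mean(axis=0)

    # per-class mean and variance become the initial GMM parameters
    d = np.linalg.norm(residuals[:, None, :] - codebook[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    means = codebook.copy()
    variances = np.ones_like(codebook)
    for i in range(M):
        members = residuals[labels == i]
        if len(members) > 1:
            variances[i] = members.var(axis=0) + 1e-6
    return means, variances
```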
(4) Inputting the feature vector x(n) into the TDNN network, and subtracting the output feature vector o(n) of the TDNN from the feature vector x(n) to obtain all residual vectors;
(5) Correcting the parameters of the GMM model by the EM (expectation-maximization) method;
Let the residual vector be r_t. First, the class posterior probability is calculated using equation (4):

p(i | r_t, λ) = \frac{p_i b_i(r_t)}{\sum_{k=1}^{M} p_k b_k(r_t)}    (4)

Then the updated mixing weights \hat{p}_i, mean vectors \hat{u}_i and covariance matrices \hat{Σ}_i are obtained from formulas (5), (6) and (7):

\hat{p}_i = \frac{1}{T} \sum_{t=1}^{T} p(i | r_t, λ)    (5)

\hat{u}_i = \frac{\sum_{t=1}^{T} p(i | r_t, λ) r_t}{\sum_{t=1}^{T} p(i | r_t, λ)}    (6)

\hat{Σ}_i = \frac{\sum_{t=1}^{T} p(i | r_t, λ)(r_t − \hat{u}_i)(r_t − \hat{u}_i)^T}{\sum_{t=1}^{T} p(i | r_t, λ)}    (7)
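One such EM iteration on the residual vectors, with diagonal covariances, might be sketched as follows; the array names and the small variance floor are assumptions added for the example, the floor being a numerical safeguard not mentioned in the patent.

```python
import numpy as np

def em_step(residuals, weights, means, variances):
    """One EM iteration for a diagonal-covariance GMM on the residual vectors r_t.

    residuals: (T, D); weights: (M,); means: (M, D); variances: (M, D).
    Returns the updated (weights, means, variances).
    """
    T, D = residuals.shape
    # E step: class posterior p(i | r_t, lambda), formula (4)
    diff = residuals[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_b = (-0.5 * np.sum(diff ** 2 / variances[None], axis=2)
             - 0.5 * np.sum(np.log(variances), axis=1)[None]
             - 0.5 * D * np.log(2.0 * np.pi))                        # (T, M)
    log_post = np.log(weights)[None] + log_b
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                          # (T, M)

    # M step: updated mixing weights, mean vectors and variances, formulas (5)-(7)
    Ni = post.sum(axis=0)                                            # (M,)
    new_weights = Ni / T
    new_means = (post.T @ residuals) / Ni[:, None]
    new_vars = (post.T @ residuals ** 2) / Ni[:, None] - new_means ** 2
    new_vars = np.maximum(new_vars, 1e-6)                            # variance floor (added safeguard)
    return new_weights, new_means, new_vars
```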
(6) Substituting the residuals into the modified GMM, using the weight coefficient, mean vector and variance of each of its Gaussian distributions, to obtain the likelihood probability, and modifying the parameters of the TDNN by the backward inversion method with inertia;
specifically, the TDNN parameters are obtained by maximizing the function in the following equation:

p(X | λ) = \prod_{t=1}^{T} p(x_t − o_t | λ)

wherein p(x | λ) is given by formula (1), o_t is the output vector of the TDNN, and x_t is the feature vector input to the TDNN.
Because the neural network iteration generally solves for a minimum, and a sum is more convenient to handle than a product, the logarithm of the above formula is taken and then negated, which gives:

G(X) = −\sum_{t=1}^{T} \ln p(x_t − o_t | λ)

Here the TDNN parameters are the weights w_{ij}^k, where w_{ij}^k(m) denotes, at the m-th iteration, the weight connecting input x_i and output y_j, and k is the layer number of the neural network. Solving G(X) by the backward inversion method with inertia accelerates the convergence of the iteration and better handles the local-minimum problem; its iterative formula is (a sketch of this update is given after the derivative calculation below):

Δw_{ij}^k(m+1) = −α \frac{∂f}{∂w_{ij}^k} + γ Δw_{ij}^k(m),   w_{ij}^k(m+1) = w_{ij}^k(m) + Δw_{ij}^k(m+1)    (10)

wherein α is the iteration step size, f(x) = −\ln p((x_t − o_t) | λ), and γ is the inertia coefficient, here γ = 0.8.
In formula (10), the calculation process of ∂f/∂w_{ij}^k is as follows:
the following calculation formulas for the two product terms in equation (11) are respectively found, since in a neural network:
in the above two formulas, the first and second groups,the output when sample x is input to the ith neuron of the k layer,the input when sample x is input for the ith neuron in the kth layer,is an activation function. Then:
Wherein:
Since the backward inversion proceeds from the output layer backwards, when the error term of layer k is calculated, that of layer k+1 is already known and can be substituted into expression (16).
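The weight update of formula (10) (gradient descent with an inertia, i.e. momentum, term) can be sketched as follows, assuming the gradients of G(X) with respect to the weights have already been obtained from the derivative calculation above; the step size alpha used in the example is an assumption, only the inertia coefficient gamma = 0.8 being given by the patent.

```python
import numpy as np

def inertia_update(params, grads, prev_deltas, alpha=0.01, gamma=0.8):
    """One weight update of the backward inversion method with inertia (momentum).

    params, grads, prev_deltas: lists of arrays of identical shapes, e.g. [W1, b1, W2, b2],
    together with the gradients dG/dw of the criterion G(X) and the previous deltas.
    """
    new_params, new_deltas = [], []
    for w, g, d_prev in zip(params, grads, prev_deltas):
        d = -alpha * g + gamma * d_prev      # delta_w(m+1) = -alpha * dG/dw + gamma * delta_w(m)
        new_params.append(w + d)
        new_deltas.append(d)
    return new_params, new_deltas
```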
(7) Judging whether the convergence condition set in step (2) is met or whether the maximum number of iterations is reached; if so, training is stopped, otherwise jump to step (4).
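Putting steps (2)-(7) together, the alternating update of the GMM and TDNN parameters with the convergence test can be sketched as a training loop like the one below. It builds on the em_step, inertia_update and SimpleTDNN sketches above; tdnn_backprop_grads stands for the gradient computation of formulas (10)-(16) and is only assumed here, not implemented.

```python
import numpy as np

def train(X, tdnn, weights, means, variances, tol=1e-4, max_iter=100, alpha=0.01, gamma=0.8):
    """Alternate EM updates of the GMM and inertia back-propagation updates of the TDNN."""
    prev_params = None
    deltas = [np.zeros_like(w) for w in tdnn.parameters()]    # inertia terms, initially zero

    for _ in range(max_iter):
        # step (4): residual vectors from the current TDNN
        _, residuals = tdnn.forward(X)
        # step (5): EM correction of the GMM parameters (see the em_step sketch above)
        weights, means, variances = em_step(residuals, weights, means, variances)
        # step (6): correct the TDNN weights with the inertia method;
        # tdnn_backprop_grads is an assumed helper for the gradient of G(X)
        grads = tdnn_backprop_grads(tdnn, X, weights, means, variances)
        new_w, deltas = inertia_update(tdnn.parameters(), grads, deltas, alpha, gamma)
        tdnn.set_parameters(new_w)

        # step (7): convergence test - Euclidean distance between successive parameter vectors
        params = np.concatenate([weights.ravel(), means.ravel(), variances.ravel()]
                                + [w.ravel() for w in tdnn.parameters()])
        if prev_params is not None and np.linalg.norm(params - prev_params) < tol:
            break
        prev_params = params
    return tdnn, weights, means, variances
```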
3. Speaker recognition
During identification, the feature vectors are delayed and then enter the TDNN. Since the TDNN has learned the structure and timing information of the feature space, inputting the sequence X of feature vectors to be recognized transforms the feature vectors accordingly. Owing to the training, the TDNN transformation tends to enhance the likelihood probability of the target model and to reduce that of non-target models. Then a residual sequence R, obtained by subtracting the sequence O output after the TDNN transformation from X, is provided to the GMM model; for the sequence of T residual vectors R = R_1, R_2, ..., R_T, its GMM probability can be written as:

p(R | λ) = \prod_{t=1}^{T} p(R_t | λ)

Expressed in the logarithmic domain:

\log p(R | λ) = \sum_{t=1}^{T} \ln p(R_t | λ)

Bayes' theorem is applied during identification: among the models of the N unknown speakers, the speaker corresponding to the model with the maximum likelihood probability is taken as the target speaker:

\hat{n} = \arg\max_{1 ≤ n ≤ N} \sum_{t=1}^{T} \ln p(R_t | λ_n)
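This scoring step can be sketched as below, reusing the gmm_log_density and SimpleTDNN sketches above; speaker_models is an illustrative container holding one trained (TDNN, GMM) pair per reference speaker and is not notation from the patent.

```python
import numpy as np

def identify_speaker(X, speaker_models):
    """Return the index of the speaker model with the maximum log-likelihood.

    X: (T, D) feature vector sequence of the utterance to be recognized.
    speaker_models: list of (tdnn, weights, means, variances) tuples, one per reference speaker.
    """
    scores = []
    for tdnn, weights, means, variances in speaker_models:
        _, residuals = tdnn.forward(X)               # residual sequence R = X - O
        # log p(R | lambda) = sum_t log p(R_t | lambda)
        score = sum(gmm_log_density(r, weights, means, variances) for r in residuals)
        scores.append(score)
    return int(np.argmax(scores))                    # speaker with the maximum likelihood probability
```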
The 1conv4w-1conv4w test condition of the NIST 2006 evaluation is used for the experiments. 107 target speakers were selected, 63 male and 44 female. For each speaker, about 2 minutes of speech served as training speech and the remaining speech as test speech, forming about 23000 tests.
To test the improvement brought by the method of the present invention in noisy environments, noise data were selected from the Japan Electronics Association standard noise database: noise inside a running car (2000 cc class, ordinary road; stationary noise) and noise at a display booth in an exhibition hall (non-stationary noise). These noises were superimposed on the 1conv4w-1conv4w speech at given signal-to-noise ratios (SNR) to generate noisy speech.
The correct recognition rate is adopted as the criterion for judging the speaker recognition effect: correct_ratio = N_v / N_t, where correct_ratio is the correct recognition rate, N_v is the number of correctly identified tests, and N_t is the total number of tests.
The method of the present invention (denoted TDNN-GMM) is compared with the speaker training and recognition method that uses the GMM alone (denoted baseline GMM). The results of the experiments are shown in FIGS. 3-5. FIG. 3 compares the recognition effect on the noise-free 1conv4w-1conv4w data while varying the number M of mixture terms of the Gaussian probability density function in the GMM; it shows that embedding the TDNN improves the recognition effect of the GMM, and that the improvement is more obvious when the number of mixture terms M is small, because the learning effect of the neural network is better when there are fewer subclasses within a class.
FIGS. 4 and 5 compare the results under different noises and signal-to-noise ratios (SNR), with M = 80. As can be seen from FIGS. 4 and 5, the method of the present invention achieves a considerable improvement in speaker recognition effect over the baseline GMM at the different signal-to-noise ratios.
Claims (4)
1. A speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network is characterized by comprising the following steps:
(1) preprocessing and feature extraction;
firstly, carrying out silence detection by using a method based on energy and a zero crossing rate, removing noise by using a spectral subtraction method, carrying out pre-emphasis and framing on a speech signal, carrying out Linear Prediction (LPC) analysis, and then obtaining a cepstrum coefficient from the obtained LPC coefficient to be used as a feature vector for speaker identification;
(2) training;
during training, delaying the extracted feature vector, then entering a Time Delay Neural Network (TDNN), wherein the TDNN learns the structure of the feature vector, and extracting time information of a feature vector sequence; then, providing the learning result to a Gaussian Mixture Model (GMM) in the form of residual error characteristic vector, training the GMM by adopting a maximum expectation method, and updating the weight coefficient of the TDNN by utilizing a backward inversion method with inertia; the specific training process is as follows:
(2-1) determining the GMM model and the TDNN structure:
the probability density function of an M-order GMM is a weighted sum of M Gaussian probability density functions and can be expressed as:

p(x_t | λ) = \sum_{i=1}^{M} p_i b_i(x_t)

in the above formula, x_t is a D-dimensional feature vector, where D = 13; b_i(x_t) is a member density function, namely a Gaussian function with mean vector u_i and covariance matrix Σ_i; the model parameters are collectively denoted

λ = {(p_i, u_i, Σ_i), i = 1, 2, ..., M}

a TDNN without feedback is used; the feature vector x(n) is delayed by the linear delay block and then used as the input of the TDNN, the TDNN performs a nonlinear transformation on the input and then a linear weighting to obtain the output vector, which is compared with the feature vector, the commonly used criterion being the minimum mean square error (MMSE) criterion; the ratio of the number of neurons in the hidden layer to the number of neurons in the input layer of the TDNN is 3:2, and the nonlinear activation S function is f(y) = 1/(1 + e^{-y}), where y is the input after weighted summation; during training, the inertia coefficient γ of the neural network is 0.8;
(2-2) setting a convergence condition and a maximum number of iterations; specifically, the convergence condition is that the Euclidean distance between the GMM parameters and TDNN weight coefficients of two successive iterations is less than 0.0001, and the maximum number of iterations usually does not exceed 100;
(2-3) randomly determining the TDNN and GMM model parameters of the initial iteration; setting the initial coefficients of the TDNN to computer-generated pseudo-random numbers; the initial mixing coefficients of the GMM may all be taken as 1/M, where M is the number of GMM mixture terms; the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M aggregation classes with the LBG (Linde, Buzo, Gray) method and calculating the mean and variance of each of the M classes;
(2-4) inputting the feature vector x(n) into the TDNN network, and subtracting the output feature vector o(n) of the TDNN from the feature vector x(n) to obtain all residual vectors;
(2-5) correcting parameters of the GMM model by adopting a maximum expectation method;
let the residual vector be r_t; first, the class posterior probability is calculated:

p(i | r_t, λ) = \frac{p_i b_i(r_t)}{\sum_{k=1}^{M} p_k b_k(r_t)}

(2-6) substituting the residuals into the modified GMM, using the weight coefficient, mean vector and variance of each of its Gaussian distributions, to obtain the likelihood probability, and modifying the TDNN parameters by the backward inversion method with inertia;
the TDNN parameter is obtained by maximizing the function in the following equation:
wherein o_t is the output of the neural network and x_t is the input feature vector;
taking the logarithm of the above formula and then negating it gives:
and (3) solving G (X) by adopting a backward inversion method with inertia, wherein the iterative formula is as follows:
wherein w_{ij}^k(m) is, at the m-th iteration, the weight connecting input x_i and output y_j, k is the layer number of the neural network, α is the iteration step size, f(x) = −\ln p((x_t − o_t) | λ), and γ is the inertia coefficient;
(2-7) judging whether the convergence condition set in the step (2-2) is met or whether the maximum iteration number is reached, if so, stopping training, otherwise, jumping to the step (2-4);
(3) identification
during identification, the feature vector sequence X is delayed and then input into the TDNN; then a residual sequence R, obtained by subtracting the output sequence O of the TDNN from X, is provided to the GMM model; for the sequence of T residual vectors R = R_1, R_2, ..., R_T, its GMM probability can be written as:
expressed in the logarithmic domain as:
and (3) applying Bayes theorem during identification, wherein in the models of N unknown speakers, the speaker corresponding to the model with the maximum likelihood probability is the target speaker:
2. The speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network according to claim 1, wherein the calculation process of ∂f/∂w_{ij}^k is as follows:
in the TDNN network, o_i^k(x) is the output when sample x is input to the i-th neuron of layer k, net_i^k(x) is the input when sample x is input to the i-th neuron of layer k, and f_k(·) is the activation function; then:
3. The speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network according to claim 2, wherein the calculation of ∂f/∂net_j^k is divided into the two cases of the output layer and the hidden layer of the TDNN;
for the output layer:
wherein: a_n(x − o − u_n) = \exp\{ −\tfrac{1}{2}(x − o − u_n)^T Σ_n^{-1} (x − o − u_n) \},   c_n = \frac{1}{(2π)^{D/2} |Σ_n|^{1/2}}
for the hidden layer:
4. The speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network according to claim 1, wherein the pre-emphasis adopts the filter F(z) = 1 − 0.97z^{-1}, framing adopts a Hamming window with a length of 20 ms and a window shift of 10 ms, the order of the linear prediction analysis is 20, and the feature vector consists of 13th-order cepstral coefficients.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009100354240A CN102034472A (en) | 2009-09-28 | 2009-09-28 | Speaker recognition method based on Gaussian mixture model embedded with time delay neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102034472A true CN102034472A (en) | 2011-04-27 |
Family
ID=43887277
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2009100354240A Pending CN102034472A (en) | 2009-09-28 | 2009-09-28 | Speaker recognition method based on Gaussian mixture model embedded with time delay neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102034472A (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102436809A (en) * | 2011-10-21 | 2012-05-02 | 东南大学 | Network speech recognition method in English oral language machine examination system |
CN102708871A (en) * | 2012-05-08 | 2012-10-03 | 哈尔滨工程大学 | Line spectrum-to-parameter dimensional reduction quantizing method based on conditional Gaussian mixture model |
WO2013086736A1 (en) * | 2011-12-16 | 2013-06-20 | 华为技术有限公司 | Speaker recognition method and device |
CN103680496A (en) * | 2013-12-19 | 2014-03-26 | 百度在线网络技术(北京)有限公司 | Deep-neural-network-based acoustic model training method, hosts and system |
CN104112445A (en) * | 2014-07-30 | 2014-10-22 | 宇龙计算机通信科技(深圳)有限公司 | Terminal and voice identification method |
CN104183239A (en) * | 2014-07-25 | 2014-12-03 | 南京邮电大学 | Method for identifying speaker unrelated to text based on weighted Bayes mixture model |
CN104882141A (en) * | 2015-03-03 | 2015-09-02 | 盐城工学院 | Serial port voice control projection system based on time delay neural network and hidden Markov model |
CN105765562A (en) * | 2013-12-03 | 2016-07-13 | 罗伯特·博世有限公司 | Method and device for determining a data-based functional model |
CN106779050A (en) * | 2016-11-24 | 2017-05-31 | 厦门中控生物识别信息技术有限公司 | The optimization method and device of a kind of convolutional neural networks |
CN108417224A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | The training and recognition methods of two way blocks model and system |
CN108877823A (en) * | 2018-07-27 | 2018-11-23 | 三星电子(中国)研发中心 | Sound enhancement method and device |
CN109036386A (en) * | 2018-09-14 | 2018-12-18 | 北京网众共创科技有限公司 | A kind of method of speech processing and device |
CN109119089A (en) * | 2018-06-05 | 2019-01-01 | 安克创新科技股份有限公司 | The method and apparatus of penetrating processing is carried out to music |
CN109166571A (en) * | 2018-08-06 | 2019-01-08 | 广东美的厨房电器制造有限公司 | Wake-up word training method, device and the household appliance of household appliance |
CN109214444A (en) * | 2018-08-24 | 2019-01-15 | 小沃科技有限公司 | Game Anti-addiction decision-making system and method based on twin neural network and GMM |
CN109271482A (en) * | 2018-09-05 | 2019-01-25 | 东南大学 | A kind of implementation method of the automatic Evaluation Platform of postgraduates'english oral teaching voice |
CN109326278A (en) * | 2017-07-31 | 2019-02-12 | 科大讯飞股份有限公司 | Acoustic model construction method and device and electronic equipment |
CN110232932A (en) * | 2019-05-09 | 2019-09-13 | 平安科技(深圳)有限公司 | Method for identifying speaker, device, equipment and medium based on residual error time-delay network |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102436809A (en) * | 2011-10-21 | 2012-05-02 | 东南大学 | Network speech recognition method in English oral language machine examination system |
US9142210B2 (en) | 2011-12-16 | 2015-09-22 | Huawei Technologies Co., Ltd. | Method and device for speaker recognition |
CN103562993A (en) * | 2011-12-16 | 2014-02-05 | 华为技术有限公司 | Speaker recognition method and device |
WO2013086736A1 (en) * | 2011-12-16 | 2013-06-20 | 华为技术有限公司 | Speaker recognition method and device |
CN103562993B (en) * | 2011-12-16 | 2015-05-27 | 华为技术有限公司 | Speaker recognition method and device |
CN102708871A (en) * | 2012-05-08 | 2012-10-03 | 哈尔滨工程大学 | Line spectrum-to-parameter dimensional reduction quantizing method based on conditional Gaussian mixture model |
CN105765562B (en) * | 2013-12-03 | 2022-01-11 | 罗伯特·博世有限公司 | Method and device for obtaining a data-based function model |
CN105765562A (en) * | 2013-12-03 | 2016-07-13 | 罗伯特·博世有限公司 | Method and device for determining a data-based functional model |
CN103680496A (en) * | 2013-12-19 | 2014-03-26 | 百度在线网络技术(北京)有限公司 | Deep-neural-network-based acoustic model training method, hosts and system |
CN103680496B (en) * | 2013-12-19 | 2016-08-10 | 百度在线网络技术(北京)有限公司 | Acoustic training model method based on deep-neural-network, main frame and system |
CN104183239B (en) * | 2014-07-25 | 2017-04-19 | 南京邮电大学 | Method for identifying speaker unrelated to text based on weighted Bayes mixture model |
CN104183239A (en) * | 2014-07-25 | 2014-12-03 | 南京邮电大学 | Method for identifying speaker unrelated to text based on weighted Bayes mixture model |
CN104112445A (en) * | 2014-07-30 | 2014-10-22 | 宇龙计算机通信科技(深圳)有限公司 | Terminal and voice identification method |
CN104882141A (en) * | 2015-03-03 | 2015-09-02 | 盐城工学院 | Serial port voice control projection system based on time delay neural network and hidden Markov model |
CN106779050A (en) * | 2016-11-24 | 2017-05-31 | 厦门中控生物识别信息技术有限公司 | The optimization method and device of a kind of convolutional neural networks |
CN109326278A (en) * | 2017-07-31 | 2019-02-12 | 科大讯飞股份有限公司 | Acoustic model construction method and device and electronic equipment |
CN109326278B (en) * | 2017-07-31 | 2022-06-07 | 科大讯飞股份有限公司 | Acoustic model construction method and device and electronic equipment |
CN108417224B (en) * | 2018-01-19 | 2020-09-01 | 苏州思必驰信息科技有限公司 | Training and recognition method and system of bidirectional neural network model |
CN108417224A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | The training and recognition methods of two way blocks model and system |
CN109119089A (en) * | 2018-06-05 | 2019-01-01 | 安克创新科技股份有限公司 | The method and apparatus of penetrating processing is carried out to music |
CN108877823A (en) * | 2018-07-27 | 2018-11-23 | 三星电子(中国)研发中心 | Sound enhancement method and device |
CN108877823B (en) * | 2018-07-27 | 2020-12-18 | 三星电子(中国)研发中心 | Speech enhancement method and device |
CN109166571A (en) * | 2018-08-06 | 2019-01-08 | 广东美的厨房电器制造有限公司 | Wake-up word training method, device and the household appliance of household appliance |
CN109214444A (en) * | 2018-08-24 | 2019-01-15 | 小沃科技有限公司 | Game Anti-addiction decision-making system and method based on twin neural network and GMM |
CN109214444B (en) * | 2018-08-24 | 2022-01-07 | 小沃科技有限公司 | Game anti-addiction determination system and method based on twin neural network and GMM |
CN109271482A (en) * | 2018-09-05 | 2019-01-25 | 东南大学 | A kind of implementation method of the automatic Evaluation Platform of postgraduates'english oral teaching voice |
CN109036386A (en) * | 2018-09-14 | 2018-12-18 | 北京网众共创科技有限公司 | A kind of method of speech processing and device |
WO2020224114A1 (en) * | 2019-05-09 | 2020-11-12 | 平安科技(深圳)有限公司 | Residual delay network-based speaker confirmation method and apparatus, device and medium |
CN110232932A (en) * | 2019-05-09 | 2019-09-13 | 平安科技(深圳)有限公司 | Method for identifying speaker, device, equipment and medium based on residual error time-delay network |
CN110232932B (en) * | 2019-05-09 | 2023-11-03 | 平安科技(深圳)有限公司 | Speaker confirmation method, device, equipment and medium based on residual delay network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102034472A (en) | Speaker recognition method based on Gaussian mixture model embedded with time delay neural network | |
CN102693724A (en) | Noise classification method of Gaussian Mixture Model based on neural network | |
Hermansky et al. | Tandem connectionist feature extraction for conventional HMM systems | |
Weninger et al. | Single-channel speech separation with memory-enhanced recurrent neural networks | |
US11776548B2 (en) | Convolutional neural network with phonetic attention for speaker verification | |
Prasad et al. | Improved cepstral mean and variance normalization using Bayesian framework | |
CN110706692B (en) | Training method and system of child voice recognition model | |
US8838446B2 (en) | Method and apparatus of transforming speech feature vectors using an auto-associative neural network | |
CN101814159B (en) | Speaker verification method based on combination of auto-associative neural network and Gaussian mixture background model | |
CN109346084A (en) | Method for distinguishing speek person based on depth storehouse autoencoder network | |
US10283112B2 (en) | System and method for neural network based feature extraction for acoustic model development | |
Mallidi et al. | Autoencoder based multi-stream combination for noise robust speech recognition. | |
Pandharipande et al. | An unsupervised frame selection technique for robust emotion recognition in noisy speech | |
Sivaram et al. | Data-driven and feedback based spectro-temporal features for speech recognition | |
Reshma et al. | A survey on speech emotion recognition | |
Dolfing et al. | Combination of confidence measures in isolated word recognition. | |
Mohammadi et al. | Weighted X-vectors for robust text-independent speaker verification with multiple enrollment utterances | |
CN104183239B (en) | Method for identifying speaker unrelated to text based on weighted Bayes mixture model | |
CN104036777A (en) | Method and device for voice activity detection | |
Mekonnen et al. | Noise robust speaker verification using GMM-UBM multi-condition training | |
Yee et al. | Malay language text-independent speaker verification using NN-MLP classifier with MFCC | |
Nathwani et al. | Consistent DNN uncertainty training and decoding for robust ASR | |
Avila et al. | On the use of blind channel response estimation and a residual neural network to detect physical access attacks to speaker verification systems | |
Tzagkarakis et al. | Sparsity based robust speaker identification using a discriminative dictionary learning approach | |
Nehra et al. | Speaker identification system using CNN approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20110427