CN102034472A - Speaker recognition method based on Gaussian mixture model embedded with time delay neural network - Google Patents

Speaker recognition method based on Gaussian mixture model embedded with time delay neural network

Publication number
CN102034472A
Authority
CN
China
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN2009100354240A
Other languages
Chinese (zh)
Inventor
戴红霞
王吉林
余华
魏昕
赵力
Current Assignee: Individual
Original Assignee: Individual
Application filed by Individual
Priority application: CN2009100354240A
Publication: CN102034472A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a speaker recognition method based on a Gaussian mixture model (GMM) embedded with a time delay neural network (TDNN). The method exploits the complementary advantages of the TDNN and the GMM: the TDNN is embedded into the GMM and, by fully using the time sequence of the input feature vectors through the delay network, produces a residual between its input and output vectors; this residual drives the training of the GMM by the expectation-maximization method. In turn, a likelihood probability is computed from the updated GMM parameters and the residual, and the TDNN weights are corrected by back-propagation with momentum (inertia), so that the GMM and TDNN parameters are updated alternately. Experiments show that the recognition rate of the method improves over a baseline GMM at various signal-to-noise ratios.

Description

Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
Technical Field
The invention relates to a speaker recognition method, in particular to a speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network.
Background
In applications such as access control, credit-card transactions, and court evidence, automatic speaker recognition, particularly text-independent speaker recognition, plays an increasingly important role. Its aim is to correctly decide which of several reference speakers in a voice library the speech to be recognized belongs to.
Among speaker recognition methods, those based on the Gaussian mixture model (GMM) have attracted increasing attention and have become the mainstream because of their high recognition rate, simple training, and low demand for training data. Since the GMM represents data distributions well, it can approximate any distribution given enough mixture components and enough training data. However, several problems arise in practice. First, the GMM does not use the temporal information of the speaker's voice: the training and recognition results are independent of the input order of the feature vectors. Second, during GMM training the feature vectors are always assumed to be mutually independent, which is clearly unreasonable. In addition, since there is no good guiding principle for choosing the number of mixture components, enough Gaussian components are required to obtain a good result.
Neural networks also hold an important position in speaker recognition; the multilayer perceptron, the radial basis function network, the auto-associative neural network, and others have been applied successfully, and the time delay neural network (TDNN) in particular is widely used in signal processing, speech recognition, and speaker recognition. At present, however, the GMM and the TDNN are used for speaker recognition only separately, and no method combines their advantages to further improve the speaker recognition effect.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network. The technical scheme of the invention is as follows:
a speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network comprises the following steps:
(1) preprocessing and feature extraction;
First, silence detection is performed using a method based on energy and zero-crossing rate, noise is removed by spectral subtraction, and the speech signal is pre-emphasized, framed, and subjected to linear prediction (LPC) analysis; cepstral coefficients are then computed from the obtained LPC coefficients as the feature vectors for speaker recognition.
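As a concrete illustration, the preprocessing and feature-extraction chain above can be sketched in NumPy as follows. The sampling rate, the frame sizes in samples, and the toy input signal are assumptions made for the example; silence detection and spectral subtraction are omitted for brevity.

```python
import numpy as np

def preemphasis(x, a=0.97):
    """Pre-emphasis filter F(z) = 1 - 0.97 z^-1 from the patent."""
    return np.append(x[0], x[1:] - a * x[:-1])

def hamming_frames(x, flen, hop):
    """Hamming-windowed frames (patent: 20 ms window, 10 ms shift)."""
    n = 1 + (len(x) - flen) // hop
    w = np.hamming(flen)
    return np.stack([x[i * hop:i * hop + flen] * w for i in range(n)])

def lpc(frame, order=20):
    """Levinson-Durbin recursion; returns prediction coefficients a, a[0] = 1."""
    r = np.correlate(frame, frame, 'full')[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / e   # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        e *= 1.0 - k * k                           # residual energy update
    return a

def lpc_cepstrum(a, ncep=13):
    """LPC -> cepstrum recursion: c_n = -a_n - sum_{k<n} (k/n) c_k a_{n-k}."""
    c = np.zeros(ncep + 1)
    for n in range(1, ncep + 1):
        acc = a[n] if n < len(a) else 0.0
        for k in range(1, n):
            if n - k < len(a):
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

# assumed 16 kHz audio: 20 ms frame = 320 samples, 10 ms hop = 160 samples
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t) + 0.1 * np.random.randn(fs)
feats = np.stack([lpc_cepstrum(lpc(f))
                  for f in hamming_frames(preemphasis(speech), 320, 160)])
print(feats.shape)  # (99, 13): one 13-dimensional cepstral vector per frame
```

At 8 kHz the same 20 ms/10 ms windows would be 160 and 80 samples; only the analysis order (20) and the cepstral dimension (13) are fixed by the text.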
(2) Training;
During training, the extracted feature vectors are delayed and then used as the input of the TDNN; the TDNN learns the structure of the feature vectors and extracts the temporal information of the feature-vector sequence. The learning result is then provided to the GMM in the form of residual feature vectors, the GMM model is trained under the expectation-maximization (EM) criterion, and the TDNN weight coefficients are updated by back-propagation with momentum. The specific training process is as follows:
(2-1) determining the GMM model and the TDNN structure:
the probability density function of an M-order GMM is obtained by weighted summation of M gaussian probability density functions and can be expressed as follows:
$$p(x_t \mid \lambda) = \sum_{i=1}^{M} p_i\, b_i(x_t)$$
In the above formula, $x_t$ is a D-dimensional feature vector (here D = 13), and $b_i(x_t)$ is the member density function: a Gaussian with mean vector $u_i$ and covariance matrix $\Sigma_i$,
$$b_i(x_t) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x_t-u_i)^T \Sigma_i^{-1} (x_t-u_i)\right\}$$
and the mixture weights $p_i$ satisfy the condition $\sum_{i=1}^{M} p_i = 1$.
the complete GMM model parameters are as follows:
λ = {(p_i, u_i, Σ_i), i = 1, 2, ..., M}
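For concreteness, the density above can be evaluated as follows, assuming the diagonal covariance matrices that the patent adopts in practice; the log-sum-exp stabilization is an addition for numerical safety, not part of the original text.

```python
import numpy as np

def log_gmm_density(x, weights, means, variances):
    """log p(x|λ) for a diagonal-covariance GMM.

    x: (D,) feature vector; weights: (M,); means, variances: (M, D).
    """
    D = x.shape[0]
    # log b_i(x) = -0.5 * (D log 2π + Σ_d log σ²_id + Σ_d (x_d - u_id)² / σ²_id)
    log_b = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)
                    + np.sum((x - means) ** 2 / variances, axis=1))
    # log Σ_i p_i b_i(x), computed stably with the log-sum-exp trick
    a = np.log(weights) + log_b
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# toy check: a single unit Gaussian at the origin in D = 2
val = log_gmm_density(np.zeros(2), np.array([1.0]),
                      np.zeros((1, 2)), np.ones((1, 2)))
print(round(val, 4))  # -log(2π) ≈ -1.8379
```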
Here, a TDNN without feedback is used. The feature vector x(n) is delayed by the linear delay block and then used as the input of the TDNN; the TDNN applies a nonlinear transformation to the input, then a linear weighting to obtain the output vector, which is compared with the feature vector, commonly under the minimum mean-square error (MMSE) criterion. Specifically, the ratio of the number of hidden-layer neurons to input-layer neurons in the TDNN is 3:2, and the nonlinear activation (the S-shaped function marked in FIG. 2) is
$$f(y) = \frac{1}{1+e^{-y}}$$

where y is the weighted-sum input to the neuron. During training, the inertia (momentum) coefficient γ of the neural network is 0.8.
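The TDNN forward pass described here can be sketched as below. Only the 3:2 hidden-to-input neuron ratio and the sigmoid activation come from the text; the number of delay taps, the weight scale, and the wrap-around handling of the sequence edges are assumptions made for illustration.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

class TDNN:
    """Minimal feed-forward TDNN sketch: delayed frames concatenated at the
    input, one sigmoid hidden layer at a 3:2 ratio to the input (as in the
    patent), and a linear output layer. The exact topology is assumed."""

    def __init__(self, dim=13, delays=2, seed=0):
        rng = np.random.default_rng(seed)
        n_in = dim * (delays + 1)      # current frame + delayed copies
        n_hid = (3 * n_in) // 2        # hidden:input = 3:2 (rounded)
        self.delays = delays
        self.W1 = 0.1 * rng.standard_normal((n_hid, n_in))
        self.W2 = 0.1 * rng.standard_normal((dim, n_hid))

    def forward(self, X):
        """X: (T, dim) feature sequence -> (T, dim) outputs o(n)."""
        # linear delay block: stack x(n), x(n-1), ..., x(n-delays)
        # (np.roll wraps at the edges; a real system would zero-pad)
        Z = np.concatenate([np.roll(X, d, axis=0)
                            for d in range(self.delays + 1)], axis=1)
        H = sigmoid(Z @ self.W1.T)     # nonlinear transformation
        return H @ self.W2.T           # linear weighting -> output vector

X = np.random.randn(50, 13)
net = TDNN()
residual = X - net.forward(X)          # r(n) = x(n) - o(n), fed to the GMM
print(residual.shape)  # (50, 13)
```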
(2-2) Set the convergence condition and the maximum number of iterations. Specifically, the convergence condition is that the Euclidean distance between the GMM parameters and the TDNN weight coefficients of two successive iterations is less than 0.0001, and the maximum number of iterations usually does not exceed 100.
(2-3) Randomly determine the TDNN and GMM model parameters for the initial iteration. The initial TDNN coefficients are set to computer-generated pseudo-random numbers; the initial GMM mixture weights can be set to 1/M, where M is the number of GMM mixture components; and the initial GMM means and variances are obtained by clustering the TDNN residual vectors into M clusters with the LBG (Linde-Buzo-Gray) method and computing the mean and variance of each cluster.
(2-4) Input the feature vectors x(n) into the TDNN network and subtract the TDNN output feature vectors o(n) from the input feature vectors x(n) to obtain all residual vectors r(n) = x(n) - o(n);
(2-5) correcting parameters of the GMM by adopting an EM method;
Let the residual vector be $r_t$. First, the class posterior probability is calculated:
$$p(i \mid r_t, \lambda) = \frac{p_i\, b_i(r_t)}{\sum_{k=1}^{M} p_k\, b_k(r_t)}$$
Then the mixture weights $\bar{p}_i$, mean vectors $\bar{u}_i$, and covariance matrices $\bar{\Sigma}_i$ are updated:
$$\bar{p}_i = \frac{1}{N}\sum_{t=1}^{N} p(i \mid r_t, \lambda)$$

$$\bar{u}_i = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)}$$

$$\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{N} p(i \mid r_t, \lambda)\, r_t^2}{\sum_{t=1}^{N} p(i \mid r_t, \lambda)} - \bar{u}_i^2$$
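One full EM correction step, combining the posterior of step (2-5) with the three update formulas, can be sketched as follows. Diagonal covariances are assumed, as elsewhere in the patent; the variance floor is an added safeguard, not part of the original text.

```python
import numpy as np

def em_step(R, weights, means, variances):
    """One EM update of a diagonal-covariance GMM on residual vectors R.

    Implements the posterior p(i|r_t, λ) and the updates of the mixture
    weights, means, and variances. R: (N, D); weights: (M,); means,
    variances: (M, D).
    """
    N, D = R.shape
    # log b_i(r_t) for every component i and frame t
    log_b = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)[None, :]
                    + np.sum((R[:, None, :] - means[None]) ** 2
                             / variances[None], axis=2))
    a = np.log(weights)[None, :] + log_b
    post = np.exp(a - a.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)          # p(i|r_t, λ), (N, M)

    Ni = post.sum(axis=0)                            # Σ_t p(i|r_t, λ)
    new_w = Ni / N
    new_mu = (post.T @ R) / Ni[:, None]
    new_var = (post.T @ (R ** 2)) / Ni[:, None] - new_mu ** 2
    return new_w, new_mu, np.maximum(new_var, 1e-6)  # floor the variances

rng = np.random.default_rng(1)
R = rng.standard_normal((200, 13))
w, mu, var = em_step(R, np.full(4, 0.25),
                     rng.standard_normal((4, 13)), np.ones((4, 13)))
print(w.sum())  # the updated weights still sum to 1
```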
(2-6) Substitute the residuals into the corrected GMM model (the mixture weight, mean vector, and variance of each Gaussian distribution) to obtain the likelihood probability, and correct the TDNN parameters by back-propagation with momentum;
the TDNN network parameters are obtained by maximizing the function in the following equation:
$$L(X) = \arg\max_{\omega_{ij}} \prod_{t=1}^{N} p\big((x_t - o_t) \mid \lambda\big)$$
where $o_t$ is the output of the neural network and $x_t$ is the input feature vector.
Taking the logarithm of this expression and negating it gives:
$$G(X) = \arg\min_{\omega_{ij}} \left(-\sum_{t=1}^{N} \ln p\big((x_t - o_t) \mid \lambda\big)\right)$$
G(X) is solved by back-propagation with momentum, with the iterative formula:
$$\Delta\omega_{ij}^{k}(m+1) = \gamma\,\Delta\omega_{ij}^{k}(m) - (1-\gamma)\,\alpha\,\left.\frac{\partial F(x)}{\partial \omega_{ij}^{k}}\right|_{\omega_{ij}^{k} = \omega_{ij}^{k}(m)}$$
where $\omega_{ij}^{k}(m)$ is the weight connecting input $x_i$ and output $y_j$ in layer k of the neural network at the m-th iteration, α is the iteration step size, $F(x) = -\ln p((x_t - o_t) \mid \lambda)$, and γ is the inertia (momentum) coefficient.
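The iteration above is ordinary gradient descent with a momentum ("inertia") term. A minimal sketch on an assumed quadratic toy objective, using the patent's γ = 0.8 (the step size α and the objective are assumptions for the example):

```python
import numpy as np

def momentum_step(w, dw_prev, grad, gamma=0.8, alpha=0.01):
    """One update of the rule
    Δω(m+1) = γ Δω(m) - (1-γ) α ∂F/∂ω, evaluated at ω(m).
    gamma = 0.8 is the inertia coefficient given in the patent."""
    dw = gamma * dw_prev - (1.0 - gamma) * alpha * grad
    return w + dw, dw

# minimise F(w) = ||w||² / 2, whose gradient is simply w
w = np.array([1.0, -2.0])
dw = np.zeros_like(w)
for _ in range(500):
    w, dw = momentum_step(w, dw, grad=w, alpha=0.5)
print(np.linalg.norm(w) < 1e-3)  # True: converges toward the minimiser w = 0
```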
And (2-7) judging whether the convergence condition set in the step (2-2) is met or whether the maximum iteration number is reached, if so, stopping training, otherwise, jumping to the step (2-4).
(3) Speaker recognition
During identification, the feature vector sequence X is delayed and input to the TDNN. The residual sequence R, obtained by subtracting the TDNN output sequence O from X, is then provided to the GMM model. For the sequence of T residual vectors R = R_1, R_2, ..., R_T, its GMM probability can be written as:
$$P(R \mid \lambda) = \prod_{t=1}^{T} p(R_t \mid \lambda)$$
expressed in the logarithmic domain as:
$$L(R \mid \lambda) = \log P(R \mid \lambda) = \sum_{t=1}^{T} \log p(R_t \mid \lambda)$$
Bayes' theorem is applied during identification: among the models of the N enrolled speakers, the speaker whose model yields the maximum likelihood probability is the target speaker:
$$i^{*} = \arg\max_{1 \le i \le N} L(R \mid \lambda_i)$$
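The scoring rule can be sketched as follows; the single-Gaussian toy "speaker" models and the synthetic residual sequence are assumptions made for the example.

```python
import numpy as np

def identify(R, models):
    """Pick the speaker i* = argmax_i L(R|λ_i), where
    L(R|λ) = Σ_t log p(R_t|λ) over the residual sequence R.
    Each model λ is (weights, means, variances) with diagonal covariances."""
    def loglik(R, w, mu, var):
        D = R.shape[1]
        log_b = -0.5 * (D * np.log(2 * np.pi)
                        + np.sum(np.log(var), axis=1)[None]
                        + np.sum((R[:, None] - mu[None]) ** 2
                                 / var[None], axis=2))
        a = np.log(w)[None] + log_b
        m = a.max(axis=1, keepdims=True)          # log-sum-exp per frame
        return np.sum(m.ravel() + np.log(np.exp(a - m).sum(axis=1)))
    scores = [loglik(R, *lam) for lam in models]
    return int(np.argmax(scores))

# two toy single-Gaussian 'speakers'; residuals drawn near speaker 1's mean
rng = np.random.default_rng(0)
models = [(np.ones(1), np.zeros((1, 2)), np.ones((1, 2))),
          (np.ones(1), np.full((1, 2), 5.0), np.ones((1, 2)))]
R = 5.0 + 0.3 * rng.standard_normal((40, 2))
print(identify(R, models))  # 1: the model whose mean matches the data wins
```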
In the speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network, $\partial F(x)/\partial \omega_{ij}^{k}$ is calculated as follows:
$$\frac{\partial F(x)}{\partial \omega_{ij}^{k}} = \frac{\partial F(x)}{\partial y_i^{k}} \frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}}$$
For the TDNN network, $o_i^{k} = f(y_i^{k})$ and $y_i^{k} = \sum_j \omega_{ij}^{k}\, o_j^{k-1}$, where $o_i^{k}$ is the output of the i-th neuron of layer k when sample x is input, $y_i^{k}$ is the corresponding input (the weighted sum), and $f(\cdot)$ is the activation function. Then:
$$\frac{\partial y_i^{k}}{\partial \omega_{ij}^{k}} = o_j^{k-1}$$
In the speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network, the calculation of $\partial F(x)/\partial y_i^{k}$ is divided into two cases: the output layer and the hidden layer of the TDNN.
for the output layer:
$$\frac{\partial F(x)}{\partial y_i^{k}} = -\frac{1}{p((x-o)\mid\lambda)}\, \frac{\partial p((x-o)\mid\lambda)}{\partial o_i^{k}}\, \frac{\partial o_i^{k}}{\partial y_i^{k}}$$

$$= -\frac{f'(y_i^{k})}{p((x-o)\mid\lambda)}\; \partial\!\left(\sum_{n=1}^{M} p_n c_n\, e^{-\frac{1}{2}(x-o-u_n)^T \Sigma_n^{-1}(x-o-u_n)}\right) \Big/ \partial o_i^{k}$$

$$= -\frac{f'(y_i^{k})}{p((x-o)\mid\lambda)} \sum_{n=1}^{M} p_n c_n\, \frac{a_n(x-o-u_n)}{\sigma_{n,i}^{2}}\,(x_i - o_i - u_{n,i}) \qquad (15)$$

where

$$a_n(x-o-u_n) = e^{-\frac{1}{2}(x-o-u_n)^T \Sigma_n^{-1}(x-o-u_n)}, \qquad c_n = \frac{1}{(2\pi)^{D/2}\,|\Sigma_n|^{1/2}}$$
for the hidden layer:
$$\frac{\partial F(x)}{\partial y_i^{k}} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial y_j^{k+1}}{\partial y_i^{k}} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial\left(\sum_n \omega_{jn}^{k+1}\, o_n^{k}\right)}{\partial y_i^{k}}$$

$$= \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial o_i^{k}}{\partial y_i^{k}}\,\omega_{ji}^{k+1} = f'(y_i^{k}) \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}}\,\omega_{ji}^{k+1}$$
In the speaker recognition method based on the Gaussian mixture model embedded with the time delay neural network, pre-emphasis uses the filter F(z) = 1 - 0.97 z^{-1}, framing uses a Hamming window of 20 ms length with a 10 ms window shift, the order of the linear prediction analysis is 20, and the feature vectors are 13th-order cepstral coefficients.
The invention has the advantages and effects that:
1. The advantages of the TDNN and the GMM are fully exploited: the TDNN learns the temporal information of the feature vectors and maps the feature-vector set to a subspace that increases the likelihood probability, which reduces the influence of the unreasonable independence assumption on the feature vectors, enhances the likelihood probability of the target model, and lowers that of non-target models. The GMM contributes its high recognition rate, simple training, and low demand for training data. The recognition rate of the whole speaker recognition system is therefore greatly improved.
2. Compared with the GMM used alone, the method improves speaker recognition on speech in both noise-free and noisy environments.
Other advantages and effects of the present invention will be described further below.
Drawings
FIG. 1-speaker training and recognition model.
FIG. 2-time-lapse neural network model.
FIG. 3-comparison data at 1conv4w-1conv4w without noise.
FIG. 4-comparative data on noise in a car.
FIG. 5-comparative data showing noise within a compartment.
Detailed Description
The technical solution of the present invention is further explained below with reference to the drawings and the embodiments.
Fig. 1 shows the training and recognition model for TDNN-embedded speaker recognition, which differs from the baseline GMM model (which uses only the GMM for speaker recognition) in both training and recognition.
1. Preprocessing and feature extraction
First, silence detection is performed using a method based on energy and zero-crossing rate, and noise is removed by spectral subtraction. The filter F(z) = 1 - 0.97 z^{-1} then performs pre-emphasis, 20th-order linear prediction (LPC) analysis is carried out on frames obtained with a Hamming window of 20 ms length and 10 ms window shift, and 13th-order cepstral coefficients are computed from the 20th-order LPC coefficients as the feature vector for speaker recognition.
2. Speaker model training
During training, the TDNN training process and the GMM training process are performed alternately. The TDNN is a multilayer perceptron (MLP) network, as shown in FIG. 2. The feature vector, after being delayed by the linear delay block, is used as the input of the TDNN, which learns the structure of the feature vectors and extracts the temporal information of the feature-vector sequence. The learning result is then provided to the GMM in the form of residual feature vectors (i.e., the difference between the input vector and the TDNN output), the GMM model is trained with the expectation-maximization (EM) method, and the TDNN weight coefficients are updated by back-propagation with momentum. Here, the criterion for both TDNN learning and GMM training is maximum likelihood probability; thus, through learning, the residual distribution tends to evolve in the direction of increasing likelihood probability. The specific training process is described as follows:
(1) determining a GMM model and a TDNN structure:
the probability density function of an M-order GMM is obtained by weighted summation of M gaussian probability density functions and can be expressed as follows:
$$p(x_t \mid \lambda) = \sum_{i=1}^{M} p_i\, b_i(x_t) \qquad (1)$$
where $x_t$ is a D-dimensional random vector (in speaker recognition applications, $x_t$ is a feature vector); $b_i(x_t)$, i = 1, 2, ..., M are the member densities; and $p_i$, i = 1, 2, ..., M are the mixture weights. Each member density is a D-dimensional Gaussian with mean vector $u_i$ and covariance matrix $\Sigma_i$, of the form:
$$b_i(x_t) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x_t-u_i)^T \Sigma_i^{-1} (x_t-u_i)\right\} \qquad (2)$$
wherein the mixture weights satisfy the condition $\sum_{i=1}^{M} p_i = 1$.
the complete GMM model consists of the mean vector of all member densities, the covariance matrix, and the mixture weight parameters. These parameters are collectively represented as:
λ = {(p_i, u_i, Σ_i), i = 1, 2, ..., M}    (3)
since speaker recognition generally has less training and recognition data, it is common in practice to set the covariance matrix for each gaussian mixture density to be a diagonal matrix.
Here, a TDNN without feedback is used, as shown in FIG. 2. The feature vector x(n) is delayed by the linear delay block and then used as the input of the TDNN; the TDNN applies a nonlinear transformation to the input, then a linear weighting to obtain the output vector, which is compared with the feature vector, commonly under the minimum mean-square error (MMSE) criterion. Specifically, the ratio of the number of hidden-layer neurons to input-layer neurons in the TDNN is 3:2, and the nonlinear activation (the S-shaped function marked in FIG. 2) is
$$f(y) = \frac{1}{1+e^{-y}}$$

where y is the weighted-sum input to the neuron. During training, the inertia (momentum) coefficient γ of the neural network is 0.8.
(2) Set the convergence condition and the maximum number of iterations. Specifically, the convergence condition is that the Euclidean distance between the GMM parameters and the TDNN weight coefficients of two successive iterations is less than 0.0001, and the maximum number of iterations usually does not exceed 100.
(3) Randomly determining the TDNN and GMM model parameters for the initial iteration; the initial TDNN coefficients are set to computer-generated pseudo-random numbers, the initial mixing coefficients of the GMM can be set to 1/M, where M is the number of mixture terms of the GMM, and the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M aggregation classes with the LBG (Linde, Buzo, Gray) method and computing the mean and variance of each class.
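Step (3)'s initialization can be sketched as below. A plain k-means pass stands in for the LBG procedure here (an assumption made for brevity; LBG reaches M classes by repeated codebook splitting rather than random seeding), but the outputs match the step: mixing weights of 1/M and per-class means and variances of the residual vectors. The function name `init_gmm` is illustrative:

```python
import numpy as np

def init_gmm(residuals, M, iters=20, seed=0):
    """Initialize GMM parameters from TDNN residual vectors.

    residuals: (N, D) residual vectors; M: number of mixture terms.
    Returns (weights, means, variances) with weights = 1/M and the
    per-cluster means and diagonal variances.
    """
    rng = np.random.default_rng(seed)
    centers = residuals[rng.choice(len(residuals), M, replace=False)].copy()
    labels = np.zeros(len(residuals), dtype=int)
    for _ in range(iters):
        # assign each residual to its nearest center, then re-estimate centers
        d = np.linalg.norm(residuals[:, None, :] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for i in range(M):
            if np.any(labels == i):
                centers[i] = residuals[labels == i].mean(axis=0)
    weights = np.full(M, 1.0 / M)                    # initial mixing weight 1/M
    D = residuals.shape[1]
    variances = np.array([residuals[labels == i].var(axis=0) + 1e-6
                          if np.any(labels == i) else np.ones(D)
                          for i in range(M)])        # floor avoids zero variance
    return weights, centers, variances
```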
(4) Inputting the feature vector x(n) into the TDNN, and subtracting the output feature vector o(n) of the TDNN from the feature vector x(n) that entered it to obtain all the residual vectors;
(5) Correcting the parameters of the GMM model by the expectation-maximization (EM) method;
Let the residual vector be r_t. First, the class posterior probability is calculated using equation (4):
<math><mrow><mi>p</mi><mrow><mo>(</mo><mi>i</mi><mo>|</mo><msub><mi>r</mi><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><mfrac><mrow><msub><mi>p</mi><mi>i</mi></msub><msub><mi>b</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>r</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><msub><mi>p</mi><mi>k</mi></msub><msub><mi>b</mi><mi>k</mi></msub><mrow><mo>(</mo><msub><mi>r</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>4</mn><mo>)</mo></mrow></mrow></math>
Then, the updated mixing weights, mean vectors, and covariance matrices are obtained by formulas (5), (6), and (7), respectively:
<math><mrow><mover><msub><mi>p</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>5</mn><mo>)</mo></mrow></mrow></math>
<math><mrow><mover><msub><mi>u</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mo>=</mo><mfrac><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><msub><mi>x</mi><mi>t</mi></msub></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>6</mn><mo>)</mo></mrow></mrow></math>
<math><mrow><msubsup><mover><mi>&Sigma;</mi><mo>&OverBar;</mo></mover><mi>i</mi><mn>2</mn></msubsup><mo>=</mo><mfrac><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><msubsup><mi>x</mi><mi>t</mi><mn>2</mn></msubsup></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mo>-</mo><msup><mover><msub><mi>u</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mn>2</mn></msup><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>7</mn><mo>)</mo></mrow></mrow></math>
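One EM correction pass over the residuals, following equations (4)-(7) with diagonal covariances, might look like the sketch below. The function name and array shapes are illustrative assumptions; note the patent's formulas (6) and (7) write x_t, which is applied here to the residual vectors r_t being modeled:

```python
import numpy as np

def em_step(r, weights, means, variances):
    """One EM correction of the GMM parameters (equations (4)-(7)).

    r: (N, D) residual vectors r_t; weights: (M,); means, variances: (M, D)
    with diagonal covariances kept as per-dimension variances.
    """
    N, D = r.shape
    diff = r[:, None, :] - means[None]                       # (N, M, D)
    log_b = (-0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
             - 0.5 * (diff ** 2 / variances).sum(axis=2))    # log b_i(r_t)
    post = weights * np.exp(log_b)
    post /= post.sum(axis=1, keepdims=True)                  # equation (4)
    Ni = post.sum(axis=0)                                    # effective counts
    new_w = Ni / N                                           # equation (5)
    new_u = (post.T @ r) / Ni[:, None]                       # equation (6)
    new_var = (post.T @ (r ** 2)) / Ni[:, None] - new_u ** 2 # equation (7)
    return new_w, new_u, new_var
```

With a single component the posteriors are all 1, so the update must return the sample mean and sample variance, which is a useful correctness check.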
(6) Substituting the residuals into the modified GMM model, using the weight coefficient, mean vector, and variance of each Gaussian distribution, to obtain a likelihood probability, and modifying the TDNN parameters by the backward inversion method with inertia;
Specifically, the TDNN parameters are obtained by maximizing the function in the following equation:
<math><mrow><mi>L</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mo>=</mo><munder><mrow><mi>arg</mi><mi>max</mi></mrow><msub><mi>&omega;</mi><mi>ij</mi></msub></munder><munderover><mi>&Pi;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>o</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>8</mn><mo>)</mo></mrow></mrow></math>
where p(x | λ) is given by formula (1), o_t is the output vector of the TDNN, and x_t is the feature vector input to the TDNN.
Because neural-network iteration generally seeks a minimum, and a sum is more convenient to handle than a product, the logarithm of the above formula is taken and then negated, giving:
<math><mrow><mi>G</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mo>=</mo><munder><mrow><mi>arg</mi><mi>min</mi></mrow><msub><mi>&omega;</mi><mi>ij</mi></msub></munder><mrow><mo>(</mo><mo>-</mo><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>ln</mi><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>o</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>9</mn><mo>)</mo></mrow></mrow></math>
Here, the parameter λ is embodied through the weights ω_ij^k(m), where ω_ij^k(m) is the weight at the m-th iteration connecting input x_i and output y_j, and k is the layer number of the neural network. Solving G(X) by the backward inversion method with inertia accelerates the iterative convergence and better handles the local-minimum problem; the iterative formula of the backward inversion method with inertia is:
<math><mrow><msubsup><mi>&Delta;&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>+</mo><mn>1</mn><mo>)</mo></mrow><mo>=</mo><msubsup><mi>&gamma;&Delta;&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow><mo>-</mo><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mi>&gamma;</mi><mo>)</mo></mrow><mi>&alpha;</mi><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup></mrow></mfrac><msub><mo>|</mo><mrow><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mo>=</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></mrow></msub><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>10</mn><mo>)</mo></mrow></mrow></math>
where Δω_ij^k(m) is the weight increment at the m-th iteration, α is the iteration step, F(x) = -ln p((x_t - o_t) | λ), and γ is the inertia coefficient, here γ = 0.8.
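The inertial update of equation (10) is a momentum-style gradient step. A minimal sketch follows, with an illustrative step size α (the patent fixes only γ = 0.8); the function name is an assumption:

```python
def momentum_step(w, delta_prev, grad, gamma=0.8, alpha=0.01):
    """Weight update of equation (10), the backward inversion step with inertia.

    delta_w(m+1) = gamma * delta_w(m) - (1 - gamma) * alpha * dF/dw,
    with the gradient evaluated at the current weight w(m).
    Returns the new weight w(m+1) = w(m) + delta_w(m+1) and the increment,
    which is carried to the next iteration as the inertia term.
    """
    delta = gamma * delta_prev - (1.0 - gamma) * alpha * grad
    return w + delta, delta
```

The inertia term γΔω(m) smooths successive updates, which is what lets the method escape shallow local minima and converge faster than a plain gradient step.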
In formula (10), the partial derivative ∂F(x)/∂ω_ij^k is calculated as follows:
<math><mrow><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup></mrow></mfrac><mo>=</mo><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><mfrac><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup><msubsup><mrow><mo>&PartialD;</mo><mi>&omega;</mi></mrow><mi>ij</mi><mi>k</mi></msubsup></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>11</mn><mo>)</mo></mrow></mrow></math>
the following calculation formulas for the two product terms in equation (11) are respectively found, since in a neural network:
<math><mrow><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup><mo>=</mo><munder><mi>&Sigma;</mi><mi>j</mi></munder><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><msubsup><mi>o</mi><mi>j</mi><mrow><mi>k</mi><mo>-</mo><mn>1</mn></mrow></msubsup><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>12</mn><mo>)</mo></mrow></mrow></math>
o_i^k = f(y_i^k) (13)
In the above two formulas, o_i^k is the output of the i-th neuron in layer k when sample x is input, y_i^k is the corresponding input, and f(·) is the activation function. Then:
<math><mrow><mfrac><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>&omega;</mi></mrow><mi>ij</mi><mi>k</mi></msubsup></mfrac><mo>=</mo><msubsup><mi>o</mi><mi>j</mi><mrow><mi>k</mi><mo>-</mo><mn>1</mn></mrow></msubsup><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>14</mn><mo>)</mo></mrow></mrow></math>
The solution of ∂F(x)/∂y_i^k is divided into the two cases of the output layer and the hidden layer:
(a) For the output layer:
<math><mrow><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><mi>x</mi><mo>-</mo><mi>o</mi><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mfrac><mrow><mo>&PartialD;</mo><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><mi>x</mi><mo>-</mo><mi>o</mi><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>o</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><mfrac><msubsup><mrow><mo>&PartialD;</mo><mi>o</mi></mrow><mi>i</mi><mi>k</mi></msubsup><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup></mfrac></mrow></math>
<math><mrow><mo>=</mo><mo>-</mo><mfrac><mrow><msup><mi>f</mi><mo>&prime;</mo></msup><mrow><mo>(</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup><mo>)</mo></mrow></mrow><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><mi>x</mi><mo>-</mo><mi>o</mi><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mo>&PartialD;</mo><mrow><mo>(</mo><munderover><mi>&Sigma;</mi><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munderover><msub><mi>p</mi><mi>n</mi></msub><msub><mi>c</mi><mi>n</mi></msub><msup><mi>e</mi><mrow><mo>-</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow><mi>T</mi></msup><msubsup><mi>&Sigma;</mi><mi>n</mi><mrow><mo>-</mo><mn>1</mn></mrow></msubsup><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow></mrow></msup><mo>)</mo></mrow><mo>/</mo><msubsup><mrow><mo>&PartialD;</mo><mi>o</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mrow></math>
<math><mrow><mo>=</mo><mo>-</mo><mfrac><mrow><msup><mi>f</mi><mo>&prime;</mo></msup><mrow><mo>(</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup><mo>)</mo></mrow></mrow><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><mi>x</mi><mo>-</mo><mi>o</mi><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><munderover><mi>&Sigma;</mi><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munderover><msub><mi>p</mi><mi>n</mi></msub><msub><mi>c</mi><mi>n</mi></msub><mrow><mo>(</mo><mfrac><mrow><msub><mi>a</mi><mi>n</mi></msub><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow></mrow><msubsup><mi>&sigma;</mi><mrow><mi>n</mi><mo>,</mo><mi>i</mi></mrow><mn>2</mn></msubsup></mfrac><mrow><mo>(</mo><msub><mi>x</mi><mi>i</mi></msub><mo>-</mo><msub><mi>o</mi><mi>i</mi></msub><mo>-</mo><msub><mi>u</mi><mrow><mi>n</mi><mo>,</mo><mi>i</mi></mrow></msub><mo>)</mo></mrow><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>15</mn><mo>)</mo></mrow></mrow></math>
Wherein:
<math><mrow><msub><mi>a</mi><mi>n</mi></msub><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow><mo>=</mo><msup><mi>e</mi><mrow><mo>-</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow><mi>T</mi></msup><msubsup><mi>&Sigma;</mi><mi>n</mi><mrow><mo>-</mo><mn>1</mn></mrow></msubsup><mrow><mo>(</mo><msub><mrow><mi>x</mi><mo>-</mo><mi>o</mi><mo>-</mo><mi>u</mi></mrow><mi>n</mi></msub><mo>)</mo></mrow></mrow></msup></mrow></math>
<math><mrow><msub><mi>c</mi><mi>n</mi></msub><mo>=</mo><mfrac><mn>1</mn><mrow><msup><mrow><mo>(</mo><mn>2</mn><mi>&pi;</mi><mo>)</mo></mrow><mrow><mi>D</mi><mo>/</mo><mn>2</mn></mrow></msup><msup><mrow><mo>|</mo><msub><mi>&Sigma;</mi><mi>n</mi></msub><mo>|</mo></mrow><mrow><mn>1</mn><mo>/</mo><mn>2</mn></mrow></msup></mrow></mfrac></mrow></math>
(b) For the hidden layer:
<math><mrow><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><mo>=</mo><munder><mi>&Sigma;</mi><mi>j</mi></munder><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>j</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></mfrac><mfrac><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>j</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup></mrow></mfrac></mrow></math>
<math><mrow><mo>=</mo><munder><mi>&Sigma;</mi><mi>j</mi></munder><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>j</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></mfrac><mfrac><mrow><mo>&PartialD;</mo><mrow><mo>(</mo><munder><mi>&Sigma;</mi><mi>n</mi></munder><msubsup><mi>&omega;</mi><mi>jn</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup><msubsup><mi>o</mi><mi>n</mi><mi>k</mi></msubsup><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup></mrow></mfrac></mrow></math>
<math><mrow><mo>=</mo><munder><mi>&Sigma;</mi><mi>j</mi></munder><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>j</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></mfrac><mfrac><msubsup><mrow><mo>&PartialD;</mo><mi>o</mi></mrow><mi>i</mi><mi>k</mi></msubsup><msubsup><mrow><mo>&PartialD;</mo><mi>y</mi></mrow><mi>i</mi><mi>k</mi></msubsup></mfrac><msubsup><mi>&omega;</mi><mi>ji</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></math>
<math><mrow><msup><mrow><mo>=</mo><mi>f</mi></mrow><mo>&prime;</mo></msup><mrow><mo>(</mo><msubsup><mi>y</mi><mi>i</mi><mi>k</mi></msubsup><mo>)</mo></mrow><munder><mi>&Sigma;</mi><mi>j</mi></munder><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>y</mi><mi>i</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup></mrow></mfrac><msubsup><mi>&omega;</mi><mi>ji</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msubsup><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>16</mn><mo>)</mo></mrow></mrow></math>
Since backward inversion proceeds layer by layer, ∂F(x)/∂y_j^(k+1) is already known when ∂F(x)/∂y_i^k is calculated, so it can be substituted directly into equation (16).
(7) Judging whether the convergence condition set in step (2) is met or the maximum number of iterations is reached; if so, stop training, otherwise jump to step (4).
3. Speaker recognition
During identification, the feature vectors enter the TDNN after being delayed. Since the TDNN has learned the structure and timing information of the feature space, inputting the sequence X of feature vectors to be recognized transforms the feature vectors accordingly. Because of the training, the TDNN transformation enhances the likelihood probability of the target model and reduces that of non-target models. Then, the residual sequence R obtained by subtracting the sequence O output after the TDNN transformation from X is provided to the GMM model; for the sequence of T residual vectors R = R_1, R_2, ..., R_T, its GMM probability can be written as:
<math><mrow><mi>P</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Pi;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></munderover><mi>p</mi><mrow><mo>(</mo><msub><mi>R</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>17</mn><mo>)</mo></mrow></mrow></math>
expressed in the logarithmic domain as:
<math><mrow><mi>L</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><mi>log</mi><mi>P</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></munderover><mi>log</mi><mi>p</mi><mrow><mo>(</mo><msub><mi>R</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>18</mn><mo>)</mo></mrow></mrow></math>
Applying Bayes' theorem during identification, among the models of the N unknown speakers, the speaker corresponding to the model with the maximum likelihood probability is the target speaker:
<math><mrow><msup><mi>i</mi><mo>*</mo></msup><mo>=</mo><munder><mrow><mi>arg</mi><mi>max</mi></mrow><mrow><mn>1</mn><mo>&le;</mo><mi>i</mi><mo>&le;</mo><mi>N</mi></mrow></munder><mi>L</mi><mrow><mo>(</mo><msub><mrow><mi>R</mi><mo>|</mo><mi>&lambda;</mi></mrow><mi>i</mi></msub><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>19</mn><mo>)</mo></mrow></mrow></math>
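The scoring of equations (17)-(19) can be sketched as follows for diagonal-covariance models. The function names `log_likelihood` and `identify` and the (weights, means, variances) model-tuple layout are illustrative assumptions:

```python
import numpy as np

def log_likelihood(R, weights, means, variances):
    """L(R | lambda) of equation (18): summed log mixture density over the
    residual sequence R, for a diagonal-covariance GMM."""
    D = R.shape[1]
    diff = R[:, None, :] - means[None]                       # (T, M, D)
    log_b = (-0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
             - 0.5 * (diff ** 2 / variances).sum(axis=2))    # (T, M)
    return float(np.log((weights * np.exp(log_b)).sum(axis=1)).sum())

def identify(R, models):
    """Equation (19): index of the speaker model with maximum L(R | lambda_i)."""
    scores = [log_likelihood(R, *m) for m in models]
    return int(np.argmax(scores))
```

As a sanity check, residuals drawn near one model's mean should score highest under that model.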
here we selected 107 target speakers among them, 63 male and 44 female, using 1conv4w-1conv4w tested in NIST 2006 as an experiment. Each person selected about 2 minutes of speech as the training speech and the remaining speech as the test speech, thus forming about 23000 tests.
In order to test the improvement of the method of the present invention in noisy environments, two noises from the Japan Electronics Association standard noise database were selected: noise inside a running car (2000cc class, general road; stationary noise) and noise in an exhibition display booth (non-stationary noise). These noises were superimposed on the 1conv4w-1conv4w speech at given signal-to-noise ratios (SNR) to generate noisy speech.
The correct recognition rate is adopted as the criterion for judging the speaker recognition effect: correct_ratio = N_v / N_t, where correct_ratio is the correct recognition rate, N_v is the number of correctly identified tests, and N_t is the total number of tests.
The method of the present invention (denoted TDNN-GMM) and the speaker training and recognition method using only the GMM (denoted baseline GMM) are compared here. The results of the experiments are shown in FIGS. 3-5. FIG. 3 compares the recognition effect for varying numbers of mixture terms M of the Gaussian probability density function in the GMM on 1conv4w-1conv4w without noise; it shows that the recognition effect of the GMM improves after the TDNN is embedded, and the improvement is more obvious when the number of mixture terms M is smaller, because the neural network learns better when there are fewer subclasses within a class.
Figs. 4 and 5 compare the results under different noises and signal-to-noise ratios (SNR), with M = 80. As can be seen from figs. 4 and 5, the method of the present invention improves the speaker recognition effect considerably over the baseline GMM at all signal-to-noise ratios.

Claims (4)

1. A speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network is characterized by comprising the following steps:
(1) preprocessing and feature extraction;
firstly, carrying out silence detection by using a method based on energy and a zero crossing rate, removing noise by using a spectral subtraction method, carrying out pre-emphasis and framing on a speech signal, carrying out Linear Prediction (LPC) analysis, and then obtaining a cepstrum coefficient from the obtained LPC coefficient to be used as a feature vector for speaker identification;
(2) training;
during training, delaying the extracted feature vector, then entering a Time Delay Neural Network (TDNN), wherein the TDNN learns the structure of the feature vector, and extracting time information of a feature vector sequence; then, providing the learning result to a Gaussian Mixture Model (GMM) in the form of residual error characteristic vector, training the GMM by adopting a maximum expectation method, and updating the weight coefficient of the TDNN by utilizing a backward inversion method with inertia; the specific training process is as follows:
(2-1) determining the GMM model and the TDNN structure:
the probability density function of an M-order GMM is obtained by weighted summation of M gaussian probability density functions and can be expressed as follows:
<math><mrow><mi>p</mi><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Sigma;</mi><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munderover><msub><mi>p</mi><mi>i</mi></msub><msub><mi>b</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow></math>
in the above formula, x_t is a D-dimensional feature vector, where D = 13; b_i(x_t) is the member density function, a Gaussian function with mean vector u_i and covariance matrix Σ_i:
<math><mrow><msub><mi>b</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>=</mo><mfrac><mn>1</mn><mrow><msup><mrow><mo>(</mo><mn>2</mn><mi>&pi;</mi><mo>)</mo></mrow><mrow><mi>D</mi><mo>/</mo><mn>2</mn></mrow></msup><msup><mrow><mo>|</mo><msub><mi>&Sigma;</mi><mi>i</mi></msub><mo>|</mo></mrow><mrow><mn>1</mn><mo>/</mo><mn>2</mn></mrow></msup></mrow></mfrac><mi>exp</mi><mo>{</mo><mo>-</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>u</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>T</mi></msup><msubsup><mi>&Sigma;</mi><mi>i</mi><mrow><mo>-</mo><mn>1</mn></mrow></msubsup><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>u</mi><mi>i</mi></msub><mo>)</mo></mrow><mo>}</mo></mrow></math>
p_i is the mixing weight, satisfying the condition p_1 + p_2 + ... + p_M = 1;
the complete GMM model parameters are as follows:
λ={(pi,ui,∑i),i=1,2,...,M}
the TDNN characteristic vector x (n) without feedback is used as the input of the TDNN after being delayed by a linear delay block, the TDNN performs nonlinear transformation on the input, then performs linear weighting to obtain an output vector, and then compares the output vector with the characteristic vector, wherein the commonly used criterion is minimum mean square criterion (MMSE); the ratio of the number of neurons in the hidden layer to the number of neurons in the input layer of TDNN is 3: 2, and the nonlinear activation S function isy is the input after weighted summation; during training, the inertia coefficient gamma of the neural network is 0.8;
(2-2) setting a convergence condition and a maximum iteration number; specifically, the convergence condition is that the euclidean distance between the GMM coefficients and the TDNN weight coefficients of two adjacent times is less than 0.0001, and the maximum iteration number is usually not more than 100;
(2-3) randomly determining the TDNN and GMM model parameters for the initial iteration; the initial TDNN coefficients are set to computer-generated pseudo-random numbers, the initial mixing coefficients of the GMM can be set to 1/M, where M is the number of mixture terms of the GMM, and the initial means and variances of the GMM are obtained by clustering the TDNN residual vectors into M aggregation classes with the LBG (Linde, Buzo, Gray) method and computing the mean and variance of each class;
(2-4) inputting the feature vector x(n) into the TDNN, and subtracting the output feature vector o(n) of the TDNN from the feature vector x(n) that entered it to obtain all the residual vectors;
(2-5) correcting parameters of the GMM model by adopting a maximum expectation method;
let the residual vector be r_t; first, the class posterior probability is calculated:
<math><mrow><mi>p</mi><mrow><mo>(</mo><mi>i</mi><mo>|</mo><msub><mi>r</mi><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><mfrac><mrow><msub><mi>p</mi><mi>i</mi></msub><msub><mi>b</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>r</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><msub><mi>p</mi><mi>k</mi></msub><msub><mi>b</mi><mi>k</mi></msub><mrow><mo>(</mo><msub><mi>r</mi><mi>t</mi></msub><mo>)</mo></mrow></mrow></mfrac></mrow></math>
then the updated mixing weights, mean vectors, and covariance matrices are computed:
<math><mrow><mover><msub><mi>p</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></math>
<math><mrow><mover><msub><mi>u</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mo>=</mo><mfrac><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><msub><mi>x</mi><mi>t</mi></msub></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac></mrow></math>
<math><mrow><msubsup><mover><mi>&Sigma;</mi><mo>&OverBar;</mo></mover><mi>i</mi><mn>2</mn></msubsup><mo>=</mo><mfrac><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow><msubsup><mi>x</mi><mi>t</mi><mn>2</mn></msubsup></mrow><mrow><msubsup><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>p</mi><mrow><mo>(</mo><msub><mrow><mi>i</mi><mo>|</mo><mi>r</mi></mrow><mi>t</mi></msub><mo>,</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></mfrac><mo>-</mo><msup><mover><msub><mi>u</mi><mi>i</mi></msub><mo>&OverBar;</mo></mover><mn>2</mn></msup></mrow></math>
(2-6) substituting residual errors by using the weight coefficient, mean vector and variance of each Gaussian distribution of the modified GMM model to obtain a likelihood probability, and modifying TDNN parameters by using a backward inversion method with inertia;
the TDNN parameter is obtained by maximizing the function in the following equation:
<math><mrow><mi>L</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mo>=</mo><munder><mrow><mi>arg</mi><mi>max</mi></mrow><msub><mi>&omega;</mi><mi>ij</mi></msub></munder><munderover><mi>&Pi;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>o</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></math>
wherein o_t is the output of the neural network and x_t is the input feature vector;
taking the logarithm of the above formula and then negating it gives:
<math><mrow><mi>G</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mo>=</mo><munder><mrow><mi>arg</mi><mi>min</mi></mrow><msub><mi>&omega;</mi><mi>ij</mi></msub></munder><mrow><mo>(</mo><mo>-</mo><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>ln</mi><mi>p</mi><mrow><mo>(</mo><mrow><mo>(</mo><msub><mi>x</mi><mi>t</mi></msub><mo>-</mo><msub><mi>o</mi><mi>t</mi></msub><mo>)</mo></mrow><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>)</mo></mrow></mrow></math>
G(X) is solved by the backward inversion method with inertia, whose iterative formula is as follows:
<math><mrow><msubsup><mi>&Delta;&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>+</mo><mn>1</mn><mo>)</mo></mrow><mo>=</mo><msubsup><mi>&gamma;&Delta;&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow><mo>-</mo><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mi>&gamma;</mi><mo>)</mo></mrow><mi>&alpha;</mi><mfrac><mrow><mo>&PartialD;</mo><mi>F</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow><mrow><mo>&PartialD;</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup></mrow></mfrac><msub><mo>|</mo><mrow><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mo>=</mo><msubsup><mi>&omega;</mi><mi>ij</mi><mi>k</mi></msubsup><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></mrow></msub></mrow></math>
where ω_ij^k(m) is the weight at the m-th iteration connecting input x_i and output y_j, k is the layer number of the neural network, α is the iteration step size, F(x) = -ln p((x_t - o_t) | λ), and γ is the inertia coefficient;
(2-7) judging whether the convergence condition set in the step (2-2) is met or whether the maximum iteration number is reached, if so, stopping training, otherwise, jumping to the step (2-4);
(3) identification
During identification, the feature vector sequence X is input into the TDNN after being delayed; then the residual sequence R obtained by subtracting the TDNN output sequence O from X is provided to the GMM model, and for the sequence of T residual vectors R = R_1, R_2, ..., R_T, its GMM probability can be written as:
<math><mrow><mi>P</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Pi;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></munderover><mi>p</mi><mrow><mo>(</mo><msub><mi>R</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></math>
expressed in the logarithmic domain as:
<math><mrow><mi>L</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><mi>log</mi><mi>P</mi><mrow><mo>(</mo><mi>R</mi><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow><mo>=</mo><munderover><mi>&Sigma;</mi><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></munderover><mi>log</mi><mi>p</mi><mrow><mo>(</mo><msub><mi>R</mi><mi>t</mi></msub><mo>|</mo><mi>&lambda;</mi><mo>)</mo></mrow></mrow></math>
Applying Bayes' theorem during identification, among the models of the N unknown speakers, the speaker corresponding to the model with the maximum likelihood probability is the target speaker:
<math><mrow><msup><mi>i</mi><mo>*</mo></msup><mo>=</mo><munder><mrow><mi>arg</mi><mi>max</mi></mrow><mrow><mn>1</mn><mo>&le;</mo><mi>i</mi><mo>&le;</mo><mi>N</mi></mrow></munder><mi>L</mi><mrow><mo>(</mo><msub><mrow><mi>R</mi><mo>|</mo><mi>&lambda;</mi></mrow><mi>i</mi></msub><mo>)</mo></mrow><mo>.</mo></mrow></math>
2. the speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network according to claim 1, wherein the calculation process of $\frac{\partial F(x)}{\partial \omega_{ij}^k}$ is as follows:
$$\frac{\partial F(x)}{\partial \omega_{ij}^k} = \frac{\partial F(x)}{\partial y_i^k} \frac{\partial y_i^k}{\partial \omega_{ij}^k}$$
in the TDNN network, $o_i^k$ is the output of the $i$-th neuron in the $k$-th layer when sample $x$ is input, $y_i^k$ is the corresponding input of that neuron, and $f(\cdot)$ is the activation function; then:
$$\frac{\partial y_i^k}{\partial \omega_{ij}^k} = o_j^{k-1}.$$
3. the speaker recognition method based on a Gaussian mixture model embedded with a time delay neural network according to claim 2, wherein the calculation process of $\frac{\partial F(x)}{\partial y_i^k}$ is divided into two cases, the output layer and the hidden layer of the TDNN;
for the output layer:
$$\frac{\partial F(x)}{\partial y_i^k} = -\frac{1}{p((x-o)|\lambda)} \frac{\partial p((x-o)|\lambda)}{\partial o_i^k} \frac{\partial o_i^k}{\partial y_i^k}$$

$$= -\frac{f'(y_i^k)}{p((x-o)|\lambda)} \; \partial\!\left( \sum_{n=1}^{M} p_n c_n e^{-\frac{1}{2}(x-o-u_n)^T \Sigma_n^{-1} (x-o-u_n)} \right) \Big/ \partial o_i^k$$

$$= -\frac{f'(y_i^k)}{p((x-o)|\lambda)} \sum_{n=1}^{M} p_n c_n \left( \frac{a_n(x-o-u_n)}{\sigma_{n,i}^2} (x_i - o_i - u_{n,i}) \right) \qquad (15)$$
wherein: $a_n(x-o-u_n) = e^{-\frac{1}{2}(x-o-u_n)^T \Sigma_n^{-1} (x-o-u_n)}$, $\quad c_n = \frac{1}{(2\pi)^{D/2} |\Sigma_n|^{1/2}}$
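The output-layer gradient of equation (15), stripped of the $f'(y_i^k)$ factor, is the gradient of $F = -\ln p((x-o)|\lambda)$ with respect to the output $o$. A sketch with diagonal covariances, which can be verified against finite differences (all parameter values here are illustrative):

```python
import numpy as np

def F(o, x, weights, means, variances):
    """F(o) = -log p((x - o) | lambda) for a diagonal-covariance GMM."""
    r = x - o
    D = r.shape[0]
    c = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1)))
    a = np.exp(-0.5 * np.sum((r - means) ** 2 / variances, axis=1))
    return -np.log(np.sum(weights * c * a))

def grad_F(o, x, weights, means, variances):
    """dF/do_i = -(1/p) * sum_n p_n c_n a_n (r_i - u_{n,i}) / sigma_{n,i}^2, r = x - o."""
    r = x - o
    D = r.shape[0]
    c = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1)))
    a = np.exp(-0.5 * np.sum((r - means) ** 2 / variances, axis=1))
    wca = weights * c * a
    return -np.sum(wca[:, None] * (r - means) / variances, axis=0) / np.sum(wca)
```

Multiplying this gradient by $f'(y_i^k)$ recovers the output-layer sensitivity used in the backpropagation.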
for the hidden layer:
$$\frac{\partial F(x)}{\partial y_i^k} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial y_j^{k+1}}{\partial y_i^k} = \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial \left( \sum_n \omega_{jn}^{k+1} o_n^k \right)}{\partial y_i^k}$$

$$= \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \frac{\partial o_i^k}{\partial y_i^k} \omega_{ji}^{k+1} = f'(y_i^k) \sum_j \frac{\partial F(x)}{\partial y_j^{k+1}} \omega_{ji}^{k+1}.$$
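The hidden-layer recursion above is the standard backpropagation pattern $\delta^k = f'(y^k)\,(W^{k+1})^{\mathsf T}\delta^{k+1}$. A generic sketch, with the output-layer sensitivity of equation (15) abbreviated as a supplied `delta_out` (the layer shapes and the activation in the usage are illustrative):

```python
import numpy as np

def backprop_deltas(weights, ys, fprime, delta_out):
    """Propagate sensitivities dF/dy back from the output layer.

    weights[k] maps layer-k outputs to layer-(k+1) inputs (its (j, i) entry is
    omega_{ji}^{k+1} in the patent's notation); ys[k] are the pre-activation
    inputs y^k; fprime is the activation derivative f'."""
    deltas = [None] * len(ys)
    deltas[-1] = delta_out                       # output layer: from eq. (15)
    for k in range(len(ys) - 2, -1, -1):         # hidden layers, back to front
        deltas[k] = fprime(ys[k]) * (weights[k].T @ deltas[k + 1])
    return deltas
```

For a linear activation ($f' \equiv 1$), the hidden-layer sensitivity reduces to $(W^{k+1})^{\mathsf T}\delta^{k+1}$, which is easy to check by hand on a 2-by-2 weight matrix.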
4. the speaker recognition method according to claim 1, wherein pre-emphasis is performed by the filter $F(z) = 1 - 0.97z^{-1}$; framing adopts a Hamming window with a length of 20 ms and a window shift of 10 ms; the order of the linear prediction analysis is 20; and the feature vector is composed of 13th-order cepstral coefficients.
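The front end of claim 4 (pre-emphasis $F(z)=1-0.97z^{-1}$, 20 ms Hamming frames with 10 ms shift) can be sketched as follows; the sampling rate is an assumption, and the 20th-order LPC and 13th-order cepstrum steps are omitted since the claim fixes only their orders:

```python
import numpy as np

def frontend(signal, fs=8000, pre=0.97, frame_ms=20, shift_ms=10):
    """Pre-emphasize the signal, then split it into Hamming-windowed frames."""
    # pre-emphasis filter F(z) = 1 - 0.97 z^{-1}
    emphasized = np.append(signal[0], signal[1:] - pre * signal[:-1])
    flen = int(fs * frame_ms / 1000)        # samples per 20 ms frame
    shift = int(fs * shift_ms / 1000)       # samples per 10 ms shift
    n_frames = 1 + max(0, (len(emphasized) - flen) // shift)
    window = np.hamming(flen)
    frames = np.stack([emphasized[i * shift : i * shift + flen] * window
                       for i in range(n_frames)])
    return frames
```

At an assumed 8 kHz sampling rate this yields 160-sample frames every 80 samples; LPC-cepstrum coefficients would then be computed per frame to form the 13-dimensional feature vectors.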
CN2009100354240A 2009-09-28 2009-09-28 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network Pending CN102034472A (en)

Publications (1)

Publication Number Publication Date
CN102034472A true CN102034472A (en) 2011-04-27



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication
