CN113470655A - Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio - Google Patents
- Publication number
- CN113470655A (application CN202110752463.3A)
- Authority
- CN
- China
- Prior art keywords
- vector
- phoneme
- neural network
- voice data
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/18—Artificial neural networks; Connectionist approaches
- G10L17/20—Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
Abstract
A voiceprint recognition method for a time-delay neural network based on the phoneme log-likelihood ratio, the method comprising the steps of: acquiring voice data; preprocessing the voice data; extracting a phoneme posterior probability vector from the preprocessed voice data with a phoneme recognizer; training a time-delay neural network with the preprocessed voice data and extracting an X-vector discrimination vector; training a Gaussian mixture model-universal background model using the phoneme posterior probability vector; calculating an I-vector discrimination vector using the Gaussian mixture model-universal background model; eliminating the influence of channel information in the I-vector feature space; generating a new classifier from the X-vector discrimination vector and the I-vector discrimination vector; inputting the X-vector and I-vector features into the new classifier; and acquiring and outputting the voiceprint information from the new classifier. The method identifies voiceprint information quickly and accurately, improves system robustness, and can be used across platforms.
Description
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a voiceprint recognition method of a time delay neural network based on a phoneme log-likelihood ratio.
Background
With the rapid development of disciplines such as pattern recognition and artificial intelligence, society has entered the intelligent era. Human-computer interaction through voice has gradually become a development trend. Voiceprint recognition refers to identifying a speaker by using the speaker-related information contained in a speech fragment; it is an important branch of the speech recognition field, in which a computer analyzes and processes a voice to automatically determine the identity of the person it belongs to.
Traditional voiceprint recognition comprises speech-signal feature processing and extraction, acoustic model training, and discriminative model training. In complex environments, however, the performance of traditional methods such as total variability space analysis based on statistical models degrades sharply. With the spread of neural network technology, end-to-end voiceprint recognition systems built on neural network models have been widely adopted in the field and show good prospects; among them, the time-delay neural network model achieves very high accuracy.
Traditional voiceprint recognition also tends to place heavy demands on device computation and storage and is sensitive to the environment, so new approaches are needed to remedy these deficiencies, better accommodate various complex environments, and reduce the difficulty of implementing end-to-end recognition.
Disclosure of Invention
In order to solve the above problems, the present invention provides a voiceprint recognition method for a time-delay neural network based on a phoneme log-likelihood ratio, wherein the method comprises the steps of:
acquiring voice data;
preprocessing the voice data;
extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer;
training a time-delay neural network with the preprocessed voice data and extracting an X-vector discrimination vector;
training a Gaussian mixture model-universal background model using the phoneme posterior probability vector;
calculating an I-vector discrimination vector using the Gaussian mixture model-universal background model;
eliminating the influence of channel information in the I-vector feature space;
generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector;
inputting an X-vector feature and an I-vector feature into the new classifier;
and acquiring and outputting the voiceprint information of the new classifier.
Preferably, the preprocessing the voice data comprises the steps of:
extracting acoustic features of the voice data;
performing silence detection on the voice data;
and performing voice enhancement on the voice data.
Preferably, the extracting a phoneme posterior probability vector from the preprocessed speech data by using the phoneme recognizer includes:
acquiring a phoneme recognizer;
performing phoneme log likelihood ratio training on the phoneme recognizer;
acquiring the preprocessed voice data;
inputting the speech data into the phoneme recognizer;
and acquiring the phoneme posterior probability vector output by the phoneme recognizer.
Preferably, the training of the time-delay neural network by using the preprocessed voice data and the extraction of the X-vector discrimination vector comprise the steps of:
extracting the frame-level features of the preprocessed voice data by utilizing a neural network;
extracting the segment-level information of the preprocessed voice data through a pooling layer;
mapping the preprocessed voice data to a fixed-dimension supervector to obtain a fixed-dimension representation;
training a TDNN time-delay neural network with the fixed-dimension representation;
and extracting the X-vector discrimination vector of the preprocessed voice data by using the TDNN time-delay neural network.
Preferably, the training of the Gaussian mixture model-universal background model using the phoneme posterior probability vector includes the steps of:
training a Gaussian mixture model-universal background model on a corpus;
performing maximum a posteriori (MAP) adaptation on the Gaussian mixture model-universal background model;
and iteratively optimizing the hidden parameters through an EM algorithm.
Preferably, the calculating of the I-vector discrimination vector using the Gaussian mixture model-universal background model includes the steps of:
obtaining Gaussian mixture supervectors of the phoneme log-likelihood ratio feature samples of the training speech through maximum a posteriori (MAP) adaptation of the Gaussian mixture model-universal background model;
calculating the total variability space matrix by Baum-Welch (forward-backward) parameter estimation;
obtaining an I-vector discrimination vector extractor;
and applying the I-vector discrimination vector extractor to the phoneme log-likelihood ratio features of the speech to be recognized, to extract the I-vector discrimination vectors of the training set and of the set to be recognized.
Preferably, the eliminating of the influence of channel information in the I-vector feature space includes the steps of:
obtaining a probabilistic linear discriminant analysis model;
and inputting the I-vector discrimination vector into the probabilistic linear discriminant analysis model.
Preferably, the expression of the probabilistic linear discriminant analysis model is as follows:
x_ij = u + Φβ_i + ε_ij,
wherein x_ij represents the j-th I-vector discrimination vector of the i-th speaker, u represents the mean of all I-vector discrimination vectors, β_i is the discrimination factor of the i-th speaker, Φ represents a speaker subspace matrix of a given dimension, and ε_ij represents a residual containing the effects of the channel.
The method identifies voiceprint information quickly and accurately, improves system robustness, and can be used across platforms.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a voiceprint recognition method of a time-delay neural network based on phoneme log-likelihood ratio according to the present invention;
FIG. 2 is a schematic diagram of a speech enhancement process based on deep unsupervised learning in a voiceprint recognition method of a time-delay neural network based on a phoneme log-likelihood ratio according to the present invention;
FIG. 3 is a schematic diagram of a training process of a phoneme recognizer in a voiceprint recognition method of a time-delay neural network based on a phoneme log-likelihood ratio according to the present invention;
FIG. 4 is a schematic diagram of an HMM-DNN training structure in the voiceprint recognition method of the time-delay neural network based on the phoneme log-likelihood ratio provided by the present invention;
FIG. 5 is a schematic diagram of X-vector voiceprint recognition in the voiceprint recognition method of the time delay neural network based on the phoneme log-likelihood ratio provided by the present invention;
FIG. 6 is a schematic diagram of the principle of the TDNN time-delay neural network in the voiceprint recognition method of the time-delay neural network based on the phoneme log-likelihood ratio provided by the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
In the embodiments of the present application, as shown in fig. 1 to 6, the present invention provides a voiceprint recognition method for a time-delay neural network based on phoneme log-likelihood ratio, the method including the steps of:
S1: acquiring voice data;
in the embodiment of the present application, voice data may be acquired using a voice data collecting apparatus.
S2: preprocessing the voice data;
in this embodiment of the present application, the preprocessing the voice data includes:
extracting acoustic features of the voice data;
performing silence detection on the voice data;
and performing voice enhancement on the voice data.
In the embodiment of the present application, the voice data is preprocessed as follows. Acoustic features are extracted from the received input speech signal; the acoustic features may be any of MFCC, FilterBank, or PLP features. The input voice data is processed with a signal-to-noise-ratio-based detection technique to remove the non-speech sections of the audio signal, and the voice data is enhanced by mixing in multi-environment reverberation. For silence detection, a GMM (Gaussian mixture model) that separates silence from noise is obtained by iterative training with the EM (expectation-maximization) algorithm. A deep learning algorithm can largely eliminate the various background noises in the audio; FIG. 2 shows a speech enhancement flow based on deep unsupervised learning. Data augmentation can also be performed by injecting noise into a clean data set, and a deep neural network is used to learn the nonlinear function from noisy speech to clean speech for denoising or dereverberation. More specifically, noise-injected training lets the objective function reach an optimal solution that is less sensitive to input variations.
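By way of illustration, the following minimal Python sketch covers the three preprocessing steps above: MFCC extraction, a simple energy-based stand-in for the SNR-based silence detector, and noise injection for augmentation. The frame sizes, SNR level, and function name are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np
import librosa

def preprocess(path, sr=16000, n_mfcc=20):
    """Sketch: MFCC extraction, energy-based silence removal, noise injection."""
    y, _ = librosa.load(path, sr=sr)
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)   # 25 ms frames, 10 ms hop

    # Energy-based voice activity detection (a simple stand-in for the
    # SNR-based detector described in the text).
    rms = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop)[0]
    voiced = rms > 0.1 * rms.mean()

    # Data augmentation by noise injection: add white noise at ~15 dB SNR.
    noise_power = np.mean(y ** 2) / 10 ** (15 / 10)
    y_noisy = y + np.random.randn(len(y)) * np.sqrt(noise_power)

    # Frame-level acoustic features (MFCC; FilterBank or PLP are alternatives).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop)
    n = min(mfcc.shape[1], len(voiced))
    return mfcc[:, :n][:, voiced[:n]], y_noisy   # voiced MFCCs + augmented audio
```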
S3: extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer;
in an embodiment of the present application, the extracting a phoneme posterior probability vector from the preprocessed speech data by using a phoneme recognizer includes:
acquiring a phoneme recognizer;
performing phoneme log likelihood ratio training on the phoneme recognizer;
acquiring the preprocessed voice data;
inputting the speech data into the phoneme recognizer;
and acquiring the phoneme posterior probability vector output by the phoneme recognizer.
In the embodiment of the present application, a phoneme recognizer extracts frame-level phoneme posterior probability vectors from the preprocessed voice data, following the PLLR (phoneme log-likelihood ratio) module flow shown in FIG. 3. Specifically: the PLLR training part trains a phoneme recognizer on a large amount of task-independent corpus. The recognizer maps the speech signal to frame-level phoneme posterior probability vectors; it does not decode a phoneme string or phoneme lattice, but instead applies a series of transformations to the phoneme posterior probability vector so that it takes a frame-level form, like the features of the acoustic layer. FIG. 4 is a schematic diagram of the HMM-DNN (hidden Markov model-deep neural network) structure used in the present invention. The phoneme recognizer can be trained with a mainstream speech recognition toolkit, and once trained it has the advantage of not being tied to particular voiceprints in use. For each frame of the input speech signal, a k-dimensional phoneme posterior probability vector [b(1), b(2), …, b(k)] is extracted, and a regularization operation on this vector yields the PLLR of each phoneme posterior probability. Following the standard PLLR definition, it is calculated as
PLLR(k) = log( b(k) / (1 − b(k)) ),
where b(k) represents the posterior probability of the k-th phoneme.
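A small Python sketch of this computation, assuming the standard log-odds form of the PLLR above and a clipping constant to keep the logarithm finite:

```python
import numpy as np

def pllr(posteriors, eps=1e-10):
    """PLLR of one frame's phoneme posterior vector [b(1), ..., b(k)]."""
    b = np.clip(np.asarray(posteriors, dtype=float), eps, 1.0 - eps)
    return np.log(b / (1.0 - b))

# Example: a 5-phoneme posterior vector for a single frame.
print(pllr([0.70, 0.15, 0.08, 0.05, 0.02]))
```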
S4: training a time delay neural network by using the preprocessed voice data and extracting an X-vector distinguishing vector;
in this embodiment of the present application, the training of the time-delay neural network using the preprocessed speech data and extracting the X-vector discrimination vector includes:
extracting the frame-level features of the preprocessed voice data by utilizing a neural network;
extracting the segment-level information of the preprocessed voice data through a pooling layer;
mapping the preprocessed voice data to a fixed dimension supervector to obtain a fixed dimension voice;
training a TDNN time delay neural network by using the fixed dimension voice;
and extracting the preprocessed X-vector discrimination vector of the voice data by using the TDNN time delay neural network.
In the embodiment of the present application, the time-delay neural network is trained with the preprocessed voice data and the X-vector discrimination vector is extracted as follows. Fixed-length speech feature sequences are used to train a frame-level time-delay neural network, an artificial neural network model for time-series analysis; the X-vector discrimination vectors it extracts can effectively mine the temporal correlations in sequential data. As shown in FIG. 5, a schematic diagram of X-vector voiceprint recognition, the model uses the neural network to extract frame-level features and then extracts segment-level embedding information through a pooling layer, mapping a speech segment to a fixed-dimension supervector, so that voiceprint similarity can be measured by Euclidean distance or cosine similarity. The TDNN time-delay neural network is then trained on the fixed-dimension speech features; its schematic is shown in FIG. 6. The time-delay neural network is composed of time-delay neurons. Each time-delay neuron has M inputs I_1(t), I_2(t), …, I_M(t) and a single output O(t). Each input I_i(t) carries N time delays that store the input information I_i(t − d), d = 1, 2, …, N, at the N moments before the current time, with weights w_id reflecting the influence of the different moments on the data at the current time. The time-delay neuron is computed as
O(t) = f( Σ_{i=1..M} ( b_i + Σ_{d=0..N} w_id · I_i(t − d) ) ),
where b_i is the offset of the i-th input and f is an activation function, usually a sigmoid. The output of a neuron is thus determined by the time-series data of each input at the current moment and the previous N moments, which lets the time-delay neural network handle nonlinear dynamic time-series problems effectively. Finally, the X-vector discrimination vector features are extracted: the time-delay neural network model extracts the discriminative X-vector features of the training set and of the set to be recognized.
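The following PyTorch sketch illustrates this architecture: dilated 1-D convolutions act as the time-delay layers, a statistics-pooling layer aggregates the frame-level outputs into a fixed-dimension vector, and an affine layer produces the x-vector embedding. Layer widths, context sizes, and the class count are illustrative assumptions, not the configuration claimed by the patent.

```python
import torch
import torch.nn as nn

class MiniXVector(nn.Module):
    """Sketch of a TDNN x-vector extractor: time-delay (dilated Conv1d)
    frame layers, statistics pooling, then a segment-level embedding."""
    def __init__(self, feat_dim=20, embed_dim=128, n_speakers=100):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.segment = nn.Linear(2 * 256, embed_dim)   # after stats pooling
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, x):                 # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        # Statistics pooling: concatenate mean and std over the time axis,
        # mapping variable-length speech to a fixed-dimension supervector.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        embedding = self.segment(stats)   # the x-vector discrimination vector
        return embedding, self.classifier(embedding)
```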
S5: training a Gaussian mixture model-a generic background model using the phoneme posterior probability vector;
in an embodiment of the present application, the training of the gaussian mixture model-general background model using the phoneme posterior probability vector includes the steps of:
training a Gaussian mixture model-a universal background model by utilizing the corpus;
performing maximum posterior probability algorithm self-adaptation on the Gaussian mixture model-general background model;
and iteratively optimizing the hidden parameters through an EM algorithm.
In the embodiment of the present application, when the Gaussian mixture model-universal background model is trained using the phoneme posterior probability vectors, a stable, high-order GMM-UBM (Gaussian mixture model-universal background) model that is independent of both speaker and channel is trained; it can effectively alleviate the shortage of enrollment speech in voiceprint recognition. The specific training method is as follows. A corpus is used to train the GMM-UBM, whose density is
p(x_j) = Σ_{k=1..K} w_k · p(x_j | μ_k, Σ_k),
wherein x_j is an N-dimensional observed feature vector, w_k is the mixing weight of the k-th Gaussian component, p(x_j | μ_k, Σ_k) is an N-dimensional Gaussian density, μ_k is the mean of the k-th Gaussian, and Σ_k is the covariance of the k-th component model. A speaker-independent feature distribution is then obtained through adaptation by the maximum a posteriori algorithm, and each Gaussian distribution of the UBM is fine-tuned to the actual data of the target voiceprint with the EM algorithm. The hidden parameters are optimized iteratively by the EM algorithm to train the GMM-UBM; this model is a high-order GMM whose number of components can reach 1024 or more, and the EM iteration proceeds as follows:
Step E: according to the current Gaussian mixture model parameters, compute the responsibility of component k for observation x_j:
γ_jk = w_k · p(x_j | μ_k, Σ_k) / Σ_{l=1..K} w_l · p(x_j | μ_l, Σ_l),
wherein x_j is the N-dimensional feature vector, w_k is the mixing weight of the k-th Gaussian component, and p(x_j | μ_k, Σ_k) is an N-dimensional Gaussian density.
Step M: update the parameters of the Gaussian mixture model:
μ_k = Σ_j γ_jk x_j / Σ_j γ_jk,
Σ_k = Σ_j γ_jk (x_j − μ_k)(x_j − μ_k)^T / Σ_j γ_jk,
w_k = (1/J) Σ_j γ_jk,
wherein μ_k is the mean of the k-th Gaussian, Σ_k is the covariance of the k-th component model, w_k is its mixing weight, γ_jk is the responsibility of the k-th component for observation x_j, and J is the number of observations.
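A compact sketch of this training stage follows, using scikit-learn's GaussianMixture for the E/M steps above and a hand-written MAP mean update with the classical relevance factor. The relevance value and component count are assumptions; the patent only states that the order can reach 1024 or more and does not spell out the adaptation equations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=64):
    """EM-trained GMM-UBM on pooled PLLR features (rows = frames).
    The patent uses a high order (1024+); 64 keeps the sketch light."""
    return GaussianMixture(n_components=n_components,
                           covariance_type='diag', max_iter=100).fit(features)

def map_adapt_means(ubm, X, relevance=16.0):
    """MAP adaptation of the UBM means toward target-speaker data X,
    with the classical relevance-factor interpolation (an assumed form)."""
    gamma = ubm.predict_proba(X)                   # responsibilities (n, K)
    n_k = gamma.sum(axis=0)                        # soft counts per component
    ex = (gamma.T @ X) / np.maximum(n_k, 1e-10)[:, None]  # E_k[x]
    alpha = (n_k / (n_k + relevance))[:, None]     # data/prior balance
    return alpha * ex + (1.0 - alpha) * ubm.means_
```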
S6: calculating an I-vector discrimination vector using the Gaussian mixture model-general background model;
in this embodiment of the present application, the calculating an I-vector discrimination vector using the gaussian mixture model-general background model includes:
obtaining a mixed Gaussian supervectors of training speech phoneme log-likelihood ratio characteristic samples by a maximum posterior probability algorithm self-adaptive algorithm by utilizing a mixed Gaussian model-general background model;
calculating a full-difference space matrix by a forward-backward algorithm parameter estimation method;
acquiring an I-vector identification vector extractor;
and extracting a training set and a set to be recognized of the I-vector recognition vector characteristics by using the I-vector recognition vector extractor for the phoneme log likelihood ratio characteristics of the speech to be recognized.
In the embodiment of the present application, when the I-vector discrimination vector is calculated with the Gaussian mixture model-universal background model, a fixed-dimension, low-dimensional space vector, the I-vector discrimination vector, is obtained for the different voiceprint speech signals (in this vector, both the speaker influence and the channel influence are considered to be contained in a total variability space T). The step specifically comprises two parts. I-vector (discrimination vector) training, specifically comprising: obtaining the Gaussian mixture supervectors of the PLLR (phoneme log-likelihood ratio) feature samples of the training speech by MAP adaptation of the GMM-UBM (Gaussian mixture model-universal background) model, and then calculating the total variability space matrix by the Baum-Welch (forward-backward algorithm) parameter estimation method to obtain an I-vector extractor, where the model underlying the Baum-Welch parameter estimation is:
M=m+Tw,
wherein T is the total variability matrix, w is the latent i-vector variable, which follows a Gaussian distribution, m is the speaker- and channel-independent mean supervector computed from the Gaussian mixture model-universal background model, and M is the Gaussian mixture mean supervector of the utterance. I-vector (discrimination vector) extraction is then performed, specifically comprising: applying the I-vector extractor to the PLLR (phoneme log-likelihood ratio) features of the speech to be recognized, to extract the more discriminative I-vector (discrimination vector) features of the training set and of the set to be recognized;
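For illustration, the sketch below computes the i-vector point estimate for one utterance from its Baum-Welch statistics, using the standard posterior-mean derivation for w in M = m + Tw. The derivation itself is not given in the patent, diagonal UBM covariances are assumed, and T is taken as already estimated.

```python
import numpy as np

def extract_ivector(T, ubm_means, ubm_vars, gamma, X):
    """I-vector point estimate for one utterance: the posterior mean of w
    in M = m + T w. Inputs: T (C*K x R) total variability matrix, UBM
    means/variances (C x K, diagonal covariances assumed), frame
    responsibilities gamma (n x C), and PLLR features X (n x K)."""
    C, K = ubm_means.shape
    R = T.shape[1]
    N = gamma.sum(axis=0)                           # zero-order stats, (C,)
    F = (gamma.T @ X - N[:, None] * ubm_means).reshape(C * K)  # centered stats
    prec = (1.0 / ubm_vars).reshape(C * K)          # diagonal precisions
    d = np.repeat(N, K) * prec                      # N * Sigma^{-1}, diagonal
    L = np.eye(R) + T.T @ (d[:, None] * T)          # posterior precision of w
    return np.linalg.solve(L, T.T @ (prec * F))     # the i-vector, shape (R,)
```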
S7: eliminating the influence of channel information in the I-vector feature space;
in this embodiment of the present application, the eliminating of the influence of channel information in the I-vector feature space includes the steps of:
obtaining a probability linear discriminant analysis model;
and inputting the I-vector discrimination vector into the probability linear discriminant analysis method model.
In the embodiment of the present application, the expression of the probabilistic linear discriminant analysis model is as follows:
x_ij = u + Φβ_i + ε_ij,
wherein x_ij represents the j-th I-vector discrimination vector of the i-th speaker, u represents the mean of all I-vector discrimination vectors, β_i is the discrimination factor of the i-th speaker, Φ represents a speaker subspace matrix of a given dimension, and ε_ij represents a residual containing the effects of the channel.
In the embodiment of the application, when eliminating the influence of channel information in the I-vector feature space, a PLDA (probabilistic linear discriminant analysis) model is generated. To address the outliers that arise when the I-vector is affected by the channel, it can be assumed in PLDA that both the speaker latent variables and the channel latent variables follow a Student's t distribution rather than a Gaussian distribution. The method eliminates the influence of channel information in the I-vector (discrimination vector) feature space: PLDA is a channel compensation method that decomposes the I-vector features into the speech signal and random background noise, yielding a PLDA model computed as
x_ij = u + Φβ_i + ε_ij,
where u represents the mean of all I-vector discrimination vectors, β_i is the discrimination factor of the i-th speaker and follows N(0, I), the matrix Φ represents a speaker subspace of a given dimension, and ε_ij represents the residual containing the channel effects and follows the normal distribution N(0, Σ).
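As an illustration of how this model is used, the sketch below scores a verification trial under the simplified Gaussian PLDA above: the log-likelihood ratio of the same-speaker hypothesis (shared β) against the different-speaker hypothesis for two i-vectors. This scoring rule is a standard consequence of the model, not a step spelled out in the patent.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, u, Phi, Sigma):
    """Log-likelihood ratio: same speaker (shared beta) vs. different
    speakers, for two channel-compensated i-vectors x1, x2 under
    x_ij = u + Phi beta_i + eps_ij, beta ~ N(0, I), eps ~ N(0, Sigma)."""
    B = Phi @ Phi.T                        # between-speaker covariance
    W = B + Sigma                          # marginal covariance of one vector
    pair = np.concatenate([x1 - u, x2 - u])
    zero = np.zeros_like(B)
    same = np.block([[W, B], [B, W]])      # shared beta couples the pair
    diff = np.block([[W, zero], [zero, W]])
    m = np.zeros(pair.shape[0])
    return (multivariate_normal.logpdf(pair, mean=m, cov=same)
            - multivariate_normal.logpdf(pair, mean=m, cov=diff))
```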
S8: generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector;
S9: inputting an X-vector feature and an I-vector feature into the new classifier;
S10: and acquiring and outputting the voiceprint information of the new classifier.
In the embodiment of the application, a boosting algorithm performs a fusion enhancement operation on the voiceprint features extracted by the different extractors to generate a new classifier with a voiceprint classification effect. The boosting algorithm builds the new classifier by combining several weak classifiers: all classifiers start with the same weight, each classifier's weight is then recomputed from its misclassification rate, and the weights are updated iteratively until convergence, training the fusion model. The I-vector and X-vector features serve as its input, and the classified voiceprint information is its output. This concludes the flow of the method.
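A minimal sketch of such a fusion classifier, using scikit-learn's AdaBoost as a stand-in for the unspecified boosting variant (AdaBoost reweights learners by their error rate across iterations, matching the description above); the feature layout and label encoding are assumptions:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def train_fusion_classifier(x_vectors, i_vectors, speaker_labels):
    """Boosted fusion over concatenated X-vector and I-vector features."""
    features = np.hstack([x_vectors, i_vectors])   # one row per utterance
    clf = AdaBoostClassifier(n_estimators=200)     # default weak learners
    clf.fit(features, speaker_labels)
    return clf

# At test time the fused classifier outputs the voiceprint (speaker) label:
# label = clf.predict(np.hstack([x_vec, i_vec])[None, :])
```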
The method identifies voiceprint information quickly and accurately, improves system robustness, and can be used across platforms.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.
Claims (8)
1. A voiceprint recognition method of a time delay neural network based on phoneme log-likelihood ratio is characterized by comprising the following steps:
acquiring voice data;
preprocessing the voice data;
extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer;
training a time-delay neural network with the preprocessed voice data and extracting an X-vector discrimination vector;
training a Gaussian mixture model-universal background model using the phoneme posterior probability vector;
calculating an I-vector discrimination vector using the Gaussian mixture model-universal background model;
eliminating the influence of channel information in the I-vector feature space;
generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector;
inputting an X-vector feature and an I-vector feature into the new classifier;
and acquiring and outputting the voiceprint information of the new classifier.
2. The method for voiceprint recognition based on phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein the preprocessing the speech data comprises the steps of:
extracting acoustic features of the voice data;
performing silence detection on the voice data;
and performing voice enhancement on the voice data.
3. The method for voiceprint recognition based on phoneme log-likelihood ratio time-lapse neural network as claimed in claim 1, wherein said extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer comprises the steps of:
acquiring a phoneme recognizer;
performing phoneme log likelihood ratio training on the phoneme recognizer;
acquiring the preprocessed voice data;
inputting the speech data into the phoneme recognizer;
and acquiring the phoneme posterior probability vector output by the phoneme recognizer.
4. The method for voiceprint recognition of a time-delay neural network based on phoneme log-likelihood ratio as claimed in claim 1, wherein the training of the time-delay neural network by using the preprocessed voice data and extracting the X-vector discrimination vector comprises the steps of:
extracting the frame-level features of the preprocessed voice data by utilizing a neural network;
extracting the segment-level information of the preprocessed voice data through a pooling layer;
mapping the preprocessed voice data to a fixed-dimension supervector to obtain a fixed-dimension representation;
training a TDNN time-delay neural network with the fixed-dimension representation;
and extracting the X-vector discrimination vector of the preprocessed voice data by using the TDNN time-delay neural network.
5. The method for voiceprint recognition based on the phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein the training of the Gaussian mixture model-universal background model using the phoneme posterior probability vector comprises the steps of:
training a Gaussian mixture model-universal background model on a corpus;
performing maximum a posteriori (MAP) adaptation on the Gaussian mixture model-universal background model;
and iteratively optimizing the hidden parameters through an EM algorithm.
6. The method for voiceprint recognition based on the phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein the calculating of the I-vector discrimination vector using the Gaussian mixture model-universal background model comprises the steps of:
obtaining Gaussian mixture supervectors of the phoneme log-likelihood ratio feature samples of the training speech through maximum a posteriori (MAP) adaptation of the Gaussian mixture model-universal background model;
calculating the total variability space matrix by Baum-Welch (forward-backward) parameter estimation;
obtaining an I-vector discrimination vector extractor;
and applying the I-vector discrimination vector extractor to the phoneme log-likelihood ratio features of the speech to be recognized, to extract the I-vector discrimination vectors of the training set and of the set to be recognized.
7. The method for voiceprint recognition based on the phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein the eliminating of the influence of channel information in the I-vector feature space comprises the steps of:
obtaining a probabilistic linear discriminant analysis model;
and inputting the I-vector discrimination vector into the probabilistic linear discriminant analysis model.
8. The method for recognizing the voiceprint of the time delay neural network based on the phoneme log-likelihood ratio as claimed in claim 7, wherein the expression of the probabilistic linear discriminant analysis model is as follows:
x_ij = u + Φβ_i + ε_ij,
wherein x_ij represents the j-th I-vector discrimination vector of the i-th speaker, u represents the mean of all I-vector discrimination vectors, β_i is the discrimination factor of the i-th speaker, Φ represents a speaker subspace matrix of a given dimension, and ε_ij represents a residual containing the effects of the channel.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110752463.3A CN113470655A (en) | 2021-07-02 | 2021-07-02 | Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110752463.3A CN113470655A (en) | 2021-07-02 | 2021-07-02 | Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113470655A true CN113470655A (en) | 2021-10-01 |
Family
ID=77877768
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110752463.3A Pending CN113470655A (en) | 2021-07-02 | 2021-07-02 | Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113470655A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114780786A (en) * | 2022-04-14 | 2022-07-22 | 新疆大学 | Voice keyword retrieval method based on bottleneck characteristics and residual error network |
CN114974259A (en) * | 2021-12-23 | 2022-08-30 | 号百信息服务有限公司 | Voiceprint recognition method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080195389A1 (en) * | 2007-02-12 | 2008-08-14 | Microsoft Corporation | Text-dependent speaker verification |
CN109635872A (en) * | 2018-12-17 | 2019-04-16 | 上海观安信息技术股份有限公司 | Personal identification method, electronic equipment and computer program product |
CN109801634A (en) * | 2019-01-31 | 2019-05-24 | 北京声智科技有限公司 | A kind of fusion method and device of vocal print feature |
CN111199741A (en) * | 2018-11-20 | 2020-05-26 | 阿里巴巴集团控股有限公司 | Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium |
CN111462729A (en) * | 2020-03-31 | 2020-07-28 | 因诺微科技(天津)有限公司 | Fast language identification method based on phoneme log-likelihood ratio and sparse representation |
CN111508505A (en) * | 2020-04-28 | 2020-08-07 | 讯飞智元信息科技有限公司 | Speaker identification method, device, equipment and storage medium |
CN111783939A (en) * | 2020-05-28 | 2020-10-16 | 厦门快商通科技股份有限公司 | Voiceprint recognition model training method and device, mobile terminal and storage medium |
- 2021-07-02: Application CN202110752463.3A filed; published as CN113470655A (en); status: Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080195389A1 (en) * | 2007-02-12 | 2008-08-14 | Microsoft Corporation | Text-dependent speaker verification |
CN111199741A (en) * | 2018-11-20 | 2020-05-26 | 阿里巴巴集团控股有限公司 | Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium |
CN109635872A (en) * | 2018-12-17 | 2019-04-16 | 上海观安信息技术股份有限公司 | Personal identification method, electronic equipment and computer program product |
CN109801634A (en) * | 2019-01-31 | 2019-05-24 | 北京声智科技有限公司 | A kind of fusion method and device of vocal print feature |
CN111462729A (en) * | 2020-03-31 | 2020-07-28 | 因诺微科技(天津)有限公司 | Fast language identification method based on phoneme log-likelihood ratio and sparse representation |
CN111508505A (en) * | 2020-04-28 | 2020-08-07 | 讯飞智元信息科技有限公司 | Speaker identification method, device, equipment and storage medium |
CN111783939A (en) * | 2020-05-28 | 2020-10-16 | 厦门快商通科技股份有限公司 | Voiceprint recognition model training method and device, mobile terminal and storage medium |
Non-Patent Citations (2)
Title |
---|
GAO GUANYU ET AL.: "Design and implementation of a high-performance client/server voiceprint recognition system", 2012 IEEE International Conference on Information and Automation *
XIE ERMAN ET AL.: "Large-scale speaker recognition method based on 2D-Haar acoustic features" (in Chinese), Transactions of Beijing Institute of Technology *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114974259A (en) * | 2021-12-23 | 2022-08-30 | 号百信息服务有限公司 | Voiceprint recognition method |
CN114780786A (en) * | 2022-04-14 | 2022-07-22 | 新疆大学 | Voice keyword retrieval method based on bottleneck characteristics and residual error network |
CN114780786B (en) * | 2022-04-14 | 2024-05-14 | 新疆大学 | Voice keyword retrieval method based on bottleneck characteristics and residual error network |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
An et al. | Deep CNNs with self-attention for speaker identification | |
Yu et al. | Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features | |
CN111462729B (en) | Fast language identification method based on phoneme log-likelihood ratio and sparse representation | |
Ohi et al. | Deep speaker recognition: Process, progress, and challenges | |
Khdier et al. | Deep learning algorithms based voiceprint recognition system in noisy environment | |
CN113470655A (en) | Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio | |
CN110299132B (en) | Voice digital recognition method and device | |
CN112053694A (en) | Voiceprint recognition method based on CNN and GRU network fusion | |
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device | |
Sun et al. | A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea | |
CN110827809B (en) | Language identification and classification method based on condition generation type confrontation network | |
CN117079673B (en) | Intelligent emotion recognition method based on multi-mode artificial intelligence | |
Monteiro et al. | On the performance of time-pooling strategies for end-to-end spoken language identification | |
Fujii et al. | Automatic speech recognition using hidden conditional neural fields | |
Elnaggar et al. | A new unsupervised short-utterance based speaker identification approach with parametric t-SNE dimensionality reduction | |
CN116580708A (en) | Intelligent voice processing method and system | |
Anand et al. | Text-independent speaker recognition for Ambient Intelligence applications by using information set features | |
Al-Rawahy et al. | Text-independent speaker identification system based on the histogram of DCT-cepstrum coefficients | |
CN115064175A (en) | Speaker recognition method | |
CN113539238B (en) | End-to-end language identification and classification method based on cavity convolutional neural network | |
Dustor et al. | Speaker recognition system with good generalization properties | |
CN112259107A (en) | Voiceprint recognition method under meeting scene small sample condition | |
Olsson | Text dependent speaker verification with a hybrid HMM/ANN system | |
Singh | Bayesian distance metric learning and its application in automatic speaker recognition systems | |
Wu et al. | Dku-tencent submission to oriental language recognition ap18-olr challenge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20211001 |