CN113470655A - Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio - Google Patents
- Publication number
- CN113470655A (application CN202110752463.3A)
- Authority
- CN
- China
- Prior art keywords
- vector
- phoneme
- neural network
- voice data
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/18—Artificial neural networks; Connectionist approaches
- G10L17/20—Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
Abstract
A voiceprint recognition method for a time-delay neural network based on the phoneme log-likelihood ratio, the method comprising the steps of: acquiring voice data; preprocessing the voice data; extracting a phoneme posterior probability vector from the preprocessed voice data with a phoneme recognizer; training a time-delay neural network with the preprocessed voice data and extracting an X-vector discrimination vector; training a Gaussian mixture model-universal background model using the phoneme posterior probability vector; calculating an I-vector discrimination vector using the Gaussian mixture model-universal background model; eliminating the influence of channel information in the I-vector feature space; generating a new classifier from the X-vector discrimination vector and the I-vector discrimination vector; inputting the X-vector and I-vector features into the new classifier; and acquiring and outputting the voiceprint information from the new classifier. The method identifies voiceprint information quickly and accurately, improves system robustness, and can be used across platforms.
Description
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a voiceprint recognition method of a time delay neural network based on a phoneme log-likelihood ratio.
Background
With the rapid development of disciplines such as pattern recognition and artificial intelligence, society has entered the intelligent era. Human-computer interaction through voice has gradually become a development trend. Voiceprint recognition refers to identifying a speaker by using the speaker-related information contained in a speech fragment; it is an important branch of the speech recognition field, in which a computer analyzes and processes a voice to automatically determine the identity of the person it belongs to.
Traditional voiceprint recognition comprises speech-signal feature processing and extraction, acoustic model training, and discriminative model training. In complex environments, however, the performance of traditional methods such as total variability space analysis based on statistical models degrades sharply. With the spread of neural network technology, end-to-end voiceprint recognition systems built on neural network models have been widely adopted in the field and show good prospects; among them, the time-delay neural network model achieves very high accuracy.
Traditional voiceprint recognition also tends to place heavy demands on device computation and storage and is sensitive to the environment, so new approaches are needed to remedy these deficiencies, better accommodate various complex environments, and reduce the difficulty of implementing end-to-end recognition.
Disclosure of Invention
In order to solve the above problems, the present invention provides a voiceprint recognition method for a time-delay neural network based on a phoneme log-likelihood ratio, wherein the method comprises the steps of:
acquiring voice data;
preprocessing the voice data;
extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer;
training a time-delay neural network with the preprocessed voice data and extracting an X-vector discrimination vector;
training a Gaussian mixture model-universal background model using the phoneme posterior probability vector;
calculating an I-vector discrimination vector using the Gaussian mixture model-universal background model;
eliminating the influence of channel information in the I-vector feature space;
generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector;
inputting an X-vector feature and an I-vector feature into the new classifier;
and acquiring and outputting the voiceprint information of the new classifier.
Preferably, the preprocessing the voice data comprises the steps of:
extracting acoustic features of the voice data;
performing silence detection on the voice data;
and performing voice enhancement on the voice data.
Preferably, the extracting a phoneme posterior probability vector from the preprocessed speech data by using the phoneme recognizer includes:
acquiring a phoneme recognizer;
performing phoneme log likelihood ratio training on the phoneme recognizer;
acquiring the preprocessed voice data;
inputting the speech data into the phoneme recognizer;
and acquiring the phoneme posterior probability vector output by the phoneme recognizer.
Preferably, the training of the time-delay neural network by using the preprocessed voice data and the extraction of the X-vector discrimination vector comprise the steps of:
extracting the frame-level features of the preprocessed voice data by utilizing a neural network;
extracting the segment-level information of the preprocessed voice data through a pooling layer;
mapping the preprocessed voice data to a fixed-dimension supervector to obtain a fixed-dimension representation;
training a TDNN time-delay neural network with the fixed-dimension representation;
and extracting the X-vector discrimination vector of the preprocessed voice data by using the TDNN time-delay neural network.
Preferably, the training of the Gaussian mixture model-universal background model using the phoneme posterior probability vector includes the steps of:
training a Gaussian mixture model-universal background model on a corpus;
performing maximum a posteriori (MAP) adaptation on the Gaussian mixture model-universal background model;
and iteratively optimizing the hidden parameters through an EM algorithm.
Preferably, the calculating of the I-vector discrimination vector using the Gaussian mixture model-universal background model includes the steps of:
obtaining Gaussian mixture supervectors of the phoneme log-likelihood ratio feature samples of the training speech through maximum a posteriori (MAP) adaptation of the Gaussian mixture model-universal background model;
calculating the total variability space matrix by Baum-Welch (forward-backward) parameter estimation;
obtaining an I-vector discrimination vector extractor;
and applying the I-vector discrimination vector extractor to the phoneme log-likelihood ratio features of the speech to be recognized, to extract the I-vector discrimination vectors of the training set and of the set to be recognized.
Preferably, the eliminating of the influence of channel information in the I-vector feature space includes the steps of:
obtaining a probabilistic linear discriminant analysis model;
and inputting the I-vector discrimination vector into the probabilistic linear discriminant analysis model.
Preferably, the expression of the probabilistic linear discriminant analysis model is as follows:
x_ij = u + Φβ_i + ε_ij,
wherein x_ij represents the j-th I-vector discrimination vector of the i-th speaker, u represents the mean of all I-vector discrimination vectors, β_i is the discrimination factor of the i-th speaker, Φ represents a speaker subspace matrix of a given dimension, and ε_ij represents a residual containing the effects of the channel.
The method identifies voiceprint information quickly and accurately, improves system robustness, and can be used across platforms.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a voiceprint recognition method of a time-delay neural network based on phoneme log-likelihood ratio according to the present invention;
FIG. 2 is a schematic diagram of a speech enhancement process based on deep unsupervised learning in a voiceprint recognition method of a time-delay neural network based on a phoneme log-likelihood ratio according to the present invention;
FIG. 3 is a schematic diagram of a training process of a phoneme recognizer in a voiceprint recognition method of a time-delay neural network based on a phoneme log-likelihood ratio according to the present invention;
FIG. 4 is a schematic diagram of an HMM-DNN training structure in the voiceprint recognition method of the time-delay neural network based on the phoneme log-likelihood ratio provided by the present invention;
FIG. 5 is a schematic diagram of X-vector voiceprint recognition in the voiceprint recognition method of the time delay neural network based on the phoneme log-likelihood ratio provided by the present invention;
FIG. 6 is a schematic diagram of the principle of the TDNN time-delay neural network in the voiceprint recognition method of the time-delay neural network based on the phoneme log-likelihood ratio provided by the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
In the embodiments of the present application, as shown in fig. 1 to 6, the present invention provides a voiceprint recognition method for a time-delay neural network based on phoneme log-likelihood ratio, the method including the steps of:
S1: acquiring voice data;
in the embodiment of the present application, voice data may be acquired using a voice data collecting apparatus.
S2: preprocessing the voice data;
in this embodiment of the present application, the preprocessing the voice data includes:
extracting acoustic features of the voice data;
performing silence detection on the voice data;
and performing voice enhancement on the voice data.
In the embodiment of the present application, the voice data is preprocessed as follows. Acoustic features are extracted from the received input speech signal; the acoustic features may be any of MFCC, FilterBank, or PLP features. The input voice data is processed with a signal-to-noise-ratio-based detection technique to remove the non-speech sections of the audio signal, and the voice data is enhanced by mixing in multi-environment reverberation. For silence detection, a GMM (Gaussian mixture model) that separates silence from noise is obtained by iterative training with the EM (expectation-maximization) algorithm. A deep learning algorithm can largely eliminate the various background noises in the audio; FIG. 2 shows a speech enhancement flow based on deep unsupervised learning. Data augmentation can also be performed by injecting noise into a clean data set, and a deep neural network is used to learn the nonlinear function from noisy speech to clean speech for denoising or dereverberation. More specifically, noise-injected training lets the objective function reach an optimal solution that is less sensitive to input variations.
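By way of illustration, the following minimal Python sketch covers the three preprocessing steps above: MFCC extraction, a simple energy-based stand-in for the SNR-based silence detector, and noise injection for augmentation. The frame sizes, SNR level, and function name are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np
import librosa

def preprocess(path, sr=16000, n_mfcc=20):
    """Sketch: MFCC extraction, energy-based silence removal, noise injection."""
    y, _ = librosa.load(path, sr=sr)
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)   # 25 ms frames, 10 ms hop

    # Energy-based voice activity detection (a simple stand-in for the
    # SNR-based detector described in the text).
    rms = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop)[0]
    voiced = rms > 0.1 * rms.mean()

    # Data augmentation by noise injection: add white noise at ~15 dB SNR.
    noise_power = np.mean(y ** 2) / 10 ** (15 / 10)
    y_noisy = y + np.random.randn(len(y)) * np.sqrt(noise_power)

    # Frame-level acoustic features (MFCC; FilterBank or PLP are alternatives).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop)
    n = min(mfcc.shape[1], len(voiced))
    return mfcc[:, :n][:, voiced[:n]], y_noisy   # voiced MFCCs + augmented audio
```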
S3: extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer;
in an embodiment of the present application, the extracting a phoneme posterior probability vector from the preprocessed speech data by using a phoneme recognizer includes:
acquiring a phoneme recognizer;
performing phoneme log likelihood ratio training on the phoneme recognizer;
acquiring the preprocessed voice data;
inputting the speech data into the phoneme recognizer;
and acquiring the phoneme posterior probability vector output by the phoneme recognizer.
In the embodiment of the present application, a phoneme recognizer extracts frame-level phoneme posterior probability vectors from the preprocessed voice data, following the PLLR (phoneme log-likelihood ratio) module flow shown in FIG. 3. Specifically: the PLLR training part trains a phoneme recognizer on a large amount of task-independent corpus. The recognizer maps the speech signal to frame-level phoneme posterior probability vectors; it does not decode a phoneme string or phoneme lattice, but instead applies a series of transformations to the phoneme posterior probability vector so that it takes a frame-level form, like the features of the acoustic layer. FIG. 4 is a schematic diagram of the HMM-DNN (hidden Markov model-deep neural network) structure used in the present invention. The phoneme recognizer can be trained with a mainstream speech recognition toolkit, and once trained it has the advantage of not being tied to particular voiceprints in use. For each frame of the input speech signal, a k-dimensional phoneme posterior probability vector [b(1), b(2), …, b(k)] is extracted, and a regularization operation on this vector yields the PLLR of each phoneme posterior probability. Following the standard PLLR definition, it is calculated as
PLLR(k) = log( b(k) / (1 − b(k)) ),
where b(k) represents the posterior probability of the k-th phoneme.
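A small Python sketch of this computation, assuming the standard log-odds form of the PLLR above and a clipping constant to keep the logarithm finite:

```python
import numpy as np

def pllr(posteriors, eps=1e-10):
    """PLLR of one frame's phoneme posterior vector [b(1), ..., b(k)]."""
    b = np.clip(np.asarray(posteriors, dtype=float), eps, 1.0 - eps)
    return np.log(b / (1.0 - b))

# Example: a 5-phoneme posterior vector for a single frame.
print(pllr([0.70, 0.15, 0.08, 0.05, 0.02]))
```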
S4: training a time delay neural network by using the preprocessed voice data and extracting an X-vector distinguishing vector;
in this embodiment of the present application, the training of the time-delay neural network using the preprocessed speech data and extracting the X-vector discrimination vector includes:
extracting the frame-level features of the preprocessed voice data by utilizing a neural network;
extracting the segment-level information of the preprocessed voice data through a pooling layer;
mapping the preprocessed voice data to a fixed dimension supervector to obtain a fixed dimension voice;
training a TDNN time delay neural network by using the fixed dimension voice;
and extracting the preprocessed X-vector discrimination vector of the voice data by using the TDNN time delay neural network.
In the embodiment of the present application, the time-delay neural network is trained with the preprocessed voice data and the X-vector discrimination vector is extracted as follows. Fixed-length speech feature sequences are used to train a frame-level time-delay neural network, an artificial neural network model for time-series analysis; the X-vector discrimination vectors it extracts can effectively mine the temporal correlations in sequential data. As shown in FIG. 5, a schematic diagram of X-vector voiceprint recognition, the model uses the neural network to extract frame-level features and then extracts segment-level embedding information through a pooling layer, mapping a speech segment to a fixed-dimension supervector, so that voiceprint similarity can be measured by Euclidean distance or cosine similarity. The TDNN time-delay neural network is then trained on the fixed-dimension speech features; its schematic is shown in FIG. 6. The time-delay neural network is composed of time-delay neurons. Each time-delay neuron has M inputs I_1(t), I_2(t), …, I_M(t) and a single output O(t). Each input I_i(t) carries N time delays that store the input information I_i(t − d), d = 1, 2, …, N, at the N moments before the current time, with weights w_id reflecting the influence of the different moments on the data at the current time. The time-delay neuron is computed as
O(t) = f( Σ_{i=1..M} ( b_i + Σ_{d=0..N} w_id · I_i(t − d) ) ),
where b_i is the offset of the i-th input and f is an activation function, usually a sigmoid. The output of a neuron is thus determined by the time-series data of each input at the current moment and the previous N moments, which lets the time-delay neural network handle nonlinear dynamic time-series problems effectively. Finally, the X-vector discrimination vector features are extracted: the time-delay neural network model extracts the discriminative X-vector features of the training set and of the set to be recognized.
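The following PyTorch sketch illustrates this architecture: dilated 1-D convolutions act as the time-delay layers, a statistics-pooling layer aggregates the frame-level outputs into a fixed-dimension vector, and an affine layer produces the x-vector embedding. Layer widths, context sizes, and the class count are illustrative assumptions, not the configuration claimed by the patent.

```python
import torch
import torch.nn as nn

class MiniXVector(nn.Module):
    """Sketch of a TDNN x-vector extractor: time-delay (dilated Conv1d)
    frame layers, statistics pooling, then a segment-level embedding."""
    def __init__(self, feat_dim=20, embed_dim=128, n_speakers=100):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.segment = nn.Linear(2 * 256, embed_dim)   # after stats pooling
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, x):                 # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        # Statistics pooling: concatenate mean and std over the time axis,
        # mapping variable-length speech to a fixed-dimension supervector.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        embedding = self.segment(stats)   # the x-vector discrimination vector
        return embedding, self.classifier(embedding)
```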
S5: training a Gaussian mixture model-a generic background model using the phoneme posterior probability vector;
in an embodiment of the present application, the training of the gaussian mixture model-general background model using the phoneme posterior probability vector includes the steps of:
training a Gaussian mixture model-a universal background model by utilizing the corpus;
performing maximum posterior probability algorithm self-adaptation on the Gaussian mixture model-general background model;
and iteratively optimizing the hidden parameters through an EM algorithm.
In the embodiment of the present application, when the Gaussian mixture model-universal background model is trained using the phoneme posterior probability vectors, a stable, high-order GMM-UBM (Gaussian mixture model-universal background) model that is independent of both speaker and channel is trained; it can effectively alleviate the shortage of enrollment speech in voiceprint recognition. The specific training method is as follows. A corpus is used to train the GMM-UBM, whose density is
p(x_j) = Σ_{k=1..K} w_k · p(x_j | μ_k, Σ_k),
wherein x_j is an N-dimensional observed feature vector, w_k is the mixing weight of the k-th Gaussian component, p(x_j | μ_k, Σ_k) is an N-dimensional Gaussian density, μ_k is the mean of the k-th Gaussian, and Σ_k is the covariance of the k-th component model. A speaker-independent feature distribution is then obtained through adaptation by the maximum a posteriori algorithm, and each Gaussian distribution of the UBM is fine-tuned to the actual data of the target voiceprint with the EM algorithm. The hidden parameters are optimized iteratively by the EM algorithm to train the GMM-UBM; this model is a high-order GMM whose number of components can reach 1024 or more, and the EM iteration proceeds as follows:
Step E: according to the current Gaussian mixture model parameters, compute the responsibility of component k for observation x_j:
γ_jk = w_k · p(x_j | μ_k, Σ_k) / Σ_{l=1..K} w_l · p(x_j | μ_l, Σ_l),
wherein x_j is the N-dimensional feature vector, w_k is the mixing weight of the k-th Gaussian component, and p(x_j | μ_k, Σ_k) is an N-dimensional Gaussian density.
Step M: update the parameters of the Gaussian mixture model:
μ_k = Σ_j γ_jk x_j / Σ_j γ_jk,
Σ_k = Σ_j γ_jk (x_j − μ_k)(x_j − μ_k)^T / Σ_j γ_jk,
w_k = (1/J) Σ_j γ_jk,
wherein μ_k is the mean of the k-th Gaussian, Σ_k is the covariance of the k-th component model, w_k is its mixing weight, γ_jk is the responsibility of the k-th component for observation x_j, and J is the number of observations.
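A compact sketch of this training stage follows, using scikit-learn's GaussianMixture for the E/M steps above and a hand-written MAP mean update with the classical relevance factor. The relevance value and component count are assumptions; the patent only states that the order can reach 1024 or more and does not spell out the adaptation equations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=64):
    """EM-trained GMM-UBM on pooled PLLR features (rows = frames).
    The patent uses a high order (1024+); 64 keeps the sketch light."""
    return GaussianMixture(n_components=n_components,
                           covariance_type='diag', max_iter=100).fit(features)

def map_adapt_means(ubm, X, relevance=16.0):
    """MAP adaptation of the UBM means toward target-speaker data X,
    with the classical relevance-factor interpolation (an assumed form)."""
    gamma = ubm.predict_proba(X)                   # responsibilities (n, K)
    n_k = gamma.sum(axis=0)                        # soft counts per component
    ex = (gamma.T @ X) / np.maximum(n_k, 1e-10)[:, None]  # E_k[x]
    alpha = (n_k / (n_k + relevance))[:, None]     # data/prior balance
    return alpha * ex + (1.0 - alpha) * ubm.means_
```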
S6: calculating an I-vector discrimination vector using the Gaussian mixture model-general background model;
in this embodiment of the present application, the calculating an I-vector discrimination vector using the gaussian mixture model-general background model includes:
obtaining a mixed Gaussian supervectors of training speech phoneme log-likelihood ratio characteristic samples by a maximum posterior probability algorithm self-adaptive algorithm by utilizing a mixed Gaussian model-general background model;
calculating a full-difference space matrix by a forward-backward algorithm parameter estimation method;
acquiring an I-vector identification vector extractor;
and extracting a training set and a set to be recognized of the I-vector recognition vector characteristics by using the I-vector recognition vector extractor for the phoneme log likelihood ratio characteristics of the speech to be recognized.
In the embodiment of the present application, when the I-vector discrimination vector is calculated with the Gaussian mixture model-universal background model, a fixed-dimension, low-dimensional space vector, the I-vector discrimination vector, is obtained for the different voiceprint speech signals (in this vector, both the speaker influence and the channel influence are considered to be contained in a total variability space T). The step specifically comprises two parts. I-vector (discrimination vector) training, specifically comprising: obtaining the Gaussian mixture supervectors of the PLLR (phoneme log-likelihood ratio) feature samples of the training speech by MAP adaptation of the GMM-UBM (Gaussian mixture model-universal background) model, and then calculating the total variability space matrix by the Baum-Welch (forward-backward algorithm) parameter estimation method to obtain an I-vector extractor, where the model underlying the Baum-Welch parameter estimation is:
M=m+Tw,
wherein T is the total variability matrix, w is the latent i-vector variable, which follows a Gaussian distribution, m is the speaker- and channel-independent mean supervector computed from the Gaussian mixture model-universal background model, and M is the Gaussian mixture mean supervector of the utterance. I-vector (discrimination vector) extraction is then performed, specifically comprising: applying the I-vector extractor to the PLLR (phoneme log-likelihood ratio) features of the speech to be recognized, to extract the more discriminative I-vector (discrimination vector) features of the training set and of the set to be recognized;
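For illustration, the sketch below computes the i-vector point estimate for one utterance from its Baum-Welch statistics, using the standard posterior-mean derivation for w in M = m + Tw. The derivation itself is not given in the patent, diagonal UBM covariances are assumed, and T is taken as already estimated.

```python
import numpy as np

def extract_ivector(T, ubm_means, ubm_vars, gamma, X):
    """I-vector point estimate for one utterance: the posterior mean of w
    in M = m + T w. Inputs: T (C*K x R) total variability matrix, UBM
    means/variances (C x K, diagonal covariances assumed), frame
    responsibilities gamma (n x C), and PLLR features X (n x K)."""
    C, K = ubm_means.shape
    R = T.shape[1]
    N = gamma.sum(axis=0)                           # zero-order stats, (C,)
    F = (gamma.T @ X - N[:, None] * ubm_means).reshape(C * K)  # centered stats
    prec = (1.0 / ubm_vars).reshape(C * K)          # diagonal precisions
    d = np.repeat(N, K) * prec                      # N * Sigma^{-1}, diagonal
    L = np.eye(R) + T.T @ (d[:, None] * T)          # posterior precision of w
    return np.linalg.solve(L, T.T @ (prec * F))     # the i-vector, shape (R,)
```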
S7: eliminating the influence of channel information in the I-vector feature space;
in this embodiment of the present application, the eliminating of the influence of channel information in the I-vector feature space includes the steps of:
obtaining a probability linear discriminant analysis model;
and inputting the I-vector discrimination vector into the probability linear discriminant analysis method model.
In the embodiment of the present application, the expression of the probabilistic linear discriminant analysis model is as follows:
x_ij = u + Φβ_i + ε_ij,
wherein x_ij represents the j-th I-vector discrimination vector of the i-th speaker, u represents the mean of all I-vector discrimination vectors, β_i is the discrimination factor of the i-th speaker, Φ represents a speaker subspace matrix of a given dimension, and ε_ij represents a residual containing the effects of the channel.
In the embodiment of the application, when eliminating the influence of channel information in the I-vector feature space, a PLDA (probabilistic linear discriminant analysis) model is generated. To address the outliers that arise when the I-vector is affected by the channel, it can be assumed in PLDA that both the speaker latent variables and the channel latent variables follow a Student's t distribution rather than a Gaussian distribution. The method eliminates the influence of channel information in the I-vector (discrimination vector) feature space: PLDA is a channel compensation method that decomposes the I-vector features into the speech signal and random background noise, yielding a PLDA model computed as
x_ij = u + Φβ_i + ε_ij,
where u represents the mean of all I-vector discrimination vectors, β_i is the discrimination factor of the i-th speaker and follows N(0, I), the matrix Φ represents a speaker subspace of a given dimension, and ε_ij represents the residual containing the channel effects and follows the normal distribution N(0, Σ).
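As an illustration of how this model is used, the sketch below scores a verification trial under the simplified Gaussian PLDA above: the log-likelihood ratio of the same-speaker hypothesis (shared β) against the different-speaker hypothesis for two i-vectors. This scoring rule is a standard consequence of the model, not a step spelled out in the patent.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, u, Phi, Sigma):
    """Log-likelihood ratio: same speaker (shared beta) vs. different
    speakers, for two channel-compensated i-vectors x1, x2 under
    x_ij = u + Phi beta_i + eps_ij, beta ~ N(0, I), eps ~ N(0, Sigma)."""
    B = Phi @ Phi.T                        # between-speaker covariance
    W = B + Sigma                          # marginal covariance of one vector
    pair = np.concatenate([x1 - u, x2 - u])
    zero = np.zeros_like(B)
    same = np.block([[W, B], [B, W]])      # shared beta couples the pair
    diff = np.block([[W, zero], [zero, W]])
    m = np.zeros(pair.shape[0])
    return (multivariate_normal.logpdf(pair, mean=m, cov=same)
            - multivariate_normal.logpdf(pair, mean=m, cov=diff))
```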
S8: generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector;
S9: inputting an X-vector feature and an I-vector feature into the new classifier;
S10: and acquiring and outputting the voiceprint information of the new classifier.
In the embodiment of the application, a boosting algorithm performs a fusion enhancement operation on the voiceprint features extracted by the different extractors to generate a new classifier with a voiceprint classification effect. The boosting algorithm builds the new classifier by combining several weak classifiers: all classifiers start with the same weight, each classifier's weight is then recomputed from its misclassification rate, and the weights are updated iteratively until convergence, training the fusion model. The I-vector and X-vector features serve as its input, and the classified voiceprint information is its output. This concludes the flow of the method.
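A minimal sketch of such a fusion classifier, using scikit-learn's AdaBoost as a stand-in for the unspecified boosting variant (AdaBoost reweights learners by their error rate across iterations, matching the description above); the feature layout and label encoding are assumptions:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def train_fusion_classifier(x_vectors, i_vectors, speaker_labels):
    """Boosted fusion over concatenated X-vector and I-vector features."""
    features = np.hstack([x_vectors, i_vectors])   # one row per utterance
    clf = AdaBoostClassifier(n_estimators=200)     # default weak learners
    clf.fit(features, speaker_labels)
    return clf

# At test time the fused classifier outputs the voiceprint (speaker) label:
# label = clf.predict(np.hstack([x_vec, i_vec])[None, :])
```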
The method identifies voiceprint information quickly and accurately, improves system robustness, and can be used across platforms.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.
Claims (8)
1. A voiceprint recognition method of a time delay neural network based on phoneme log-likelihood ratio is characterized by comprising the following steps:
acquiring voice data;
preprocessing the voice data;
extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer;
training a time-delay neural network with the preprocessed voice data and extracting an X-vector discrimination vector;
training a Gaussian mixture model-universal background model using the phoneme posterior probability vector;
calculating an I-vector discrimination vector using the Gaussian mixture model-universal background model;
eliminating the influence of channel information in the I-vector feature space;
generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector;
inputting an X-vector feature and an I-vector feature into the new classifier;
and acquiring and outputting the voiceprint information of the new classifier.
2. The method for voiceprint recognition based on phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein the preprocessing the speech data comprises the steps of:
extracting acoustic features of the voice data;
performing silence detection on the voice data;
and performing voice enhancement on the voice data.
3. The method for voiceprint recognition based on phoneme log-likelihood ratio time-lapse neural network as claimed in claim 1, wherein said extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer comprises the steps of:
acquiring a phoneme recognizer;
performing phoneme log likelihood ratio training on the phoneme recognizer;
acquiring the preprocessed voice data;
inputting the speech data into the phoneme recognizer;
and acquiring the phoneme posterior probability vector output by the phoneme recognizer.
4. The method for voiceprint recognition of a time-delay neural network based on phoneme log-likelihood ratio as claimed in claim 1, wherein the training of the time-delay neural network by using the preprocessed voice data and extracting the X-vector discrimination vector comprises the steps of:
extracting the frame-level features of the preprocessed voice data by utilizing a neural network;
extracting the segment-level information of the preprocessed voice data through a pooling layer;
mapping the preprocessed voice data to a fixed-dimension supervector to obtain a fixed-dimension representation;
training a TDNN time-delay neural network with the fixed-dimension representation;
and extracting the X-vector discrimination vector of the preprocessed voice data by using the TDNN time-delay neural network.
5. The method for voiceprint recognition based on the phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein the training of the Gaussian mixture model-universal background model using the phoneme posterior probability vector comprises the steps of:
training a Gaussian mixture model-universal background model on a corpus;
performing maximum a posteriori (MAP) adaptation on the Gaussian mixture model-universal background model;
and iteratively optimizing the hidden parameters through an EM algorithm.
6. The method for voiceprint recognition based on the phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein the calculating of the I-vector discrimination vector using the Gaussian mixture model-universal background model comprises the steps of:
obtaining Gaussian mixture supervectors of the phoneme log-likelihood ratio feature samples of the training speech through maximum a posteriori (MAP) adaptation of the Gaussian mixture model-universal background model;
calculating the total variability space matrix by Baum-Welch (forward-backward) parameter estimation;
obtaining an I-vector discrimination vector extractor;
and applying the I-vector discrimination vector extractor to the phoneme log-likelihood ratio features of the speech to be recognized, to extract the I-vector discrimination vectors of the training set and of the set to be recognized.
7. The method for voiceprint recognition based on the phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein the eliminating of the influence of channel information in the I-vector feature space comprises the steps of:
obtaining a probabilistic linear discriminant analysis model;
and inputting the I-vector discrimination vector into the probabilistic linear discriminant analysis model.
8. The method for recognizing the voiceprint of the time delay neural network based on the phoneme log-likelihood ratio as claimed in claim 7, wherein the expression of the probabilistic linear discriminant analysis model is as follows:
x_ij = u + Φβ_i + ε_ij,
wherein x_ij represents the j-th I-vector discrimination vector of the i-th speaker, u represents the mean of all I-vector discrimination vectors, β_i is the discrimination factor of the i-th speaker, Φ represents a speaker subspace matrix of a given dimension, and ε_ij represents a residual containing the effects of the channel.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110752463.3A CN113470655A (en) | 2021-07-02 | 2021-07-02 | Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110752463.3A CN113470655A (en) | 2021-07-02 | 2021-07-02 | Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113470655A true CN113470655A (en) | 2021-10-01 |
Family
ID=77877768
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110752463.3A Pending CN113470655A (en) | 2021-07-02 | 2021-07-02 | Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113470655A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114780786A (en) * | 2022-04-14 | 2022-07-22 | 新疆大学 | Voice keyword retrieval method based on bottleneck characteristics and residual error network |
CN114974259A (en) * | 2021-12-23 | 2022-08-30 | 号百信息服务有限公司 | Voiceprint recognition method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080195389A1 (en) * | 2007-02-12 | 2008-08-14 | Microsoft Corporation | Text-dependent speaker verification |
CN109635872A (en) * | 2018-12-17 | 2019-04-16 | 上海观安信息技术股份有限公司 | Personal identification method, electronic equipment and computer program product |
CN109801634A (en) * | 2019-01-31 | 2019-05-24 | 北京声智科技有限公司 | A kind of fusion method and device of vocal print feature |
CN111199741A (en) * | 2018-11-20 | 2020-05-26 | 阿里巴巴集团控股有限公司 | Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium |
CN111462729A (en) * | 2020-03-31 | 2020-07-28 | 因诺微科技(天津)有限公司 | Fast language identification method based on phoneme log-likelihood ratio and sparse representation |
CN111508505A (en) * | 2020-04-28 | 2020-08-07 | 讯飞智元信息科技有限公司 | Speaker identification method, device, equipment and storage medium |
CN111783939A (en) * | 2020-05-28 | 2020-10-16 | 厦门快商通科技股份有限公司 | Voiceprint recognition model training method and device, mobile terminal and storage medium |
- 2021-07-02: Application CN202110752463.3A filed; published as CN113470655A (en); status: Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080195389A1 (en) * | 2007-02-12 | 2008-08-14 | Microsoft Corporation | Text-dependent speaker verification |
CN111199741A (en) * | 2018-11-20 | 2020-05-26 | 阿里巴巴集团控股有限公司 | Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium |
CN109635872A (en) * | 2018-12-17 | 2019-04-16 | 上海观安信息技术股份有限公司 | Personal identification method, electronic equipment and computer program product |
CN109801634A (en) * | 2019-01-31 | 2019-05-24 | 北京声智科技有限公司 | A kind of fusion method and device of vocal print feature |
CN111462729A (en) * | 2020-03-31 | 2020-07-28 | 因诺微科技(天津)有限公司 | Fast language identification method based on phoneme log-likelihood ratio and sparse representation |
CN111508505A (en) * | 2020-04-28 | 2020-08-07 | 讯飞智元信息科技有限公司 | Speaker identification method, device, equipment and storage medium |
CN111783939A (en) * | 2020-05-28 | 2020-10-16 | 厦门快商通科技股份有限公司 | Voiceprint recognition model training method and device, mobile terminal and storage medium |
Non-Patent Citations (2)
Title |
---|
GAO GUANYU ET AL.: "Design and implementation of a high-performance client/server voiceprint recognition system", 2012 IEEE International Conference on Information and Automation *
XIE ERMAN ET AL.: "Large-scale speaker recognition method based on 2D-Haar acoustic features" (in Chinese), Transactions of Beijing Institute of Technology *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114974259A (en) * | 2021-12-23 | 2022-08-30 | 号百信息服务有限公司 | Voiceprint recognition method |
CN114780786A (en) * | 2022-04-14 | 2022-07-22 | 新疆大学 | Voice keyword retrieval method based on bottleneck characteristics and residual error network |
CN114780786B (en) * | 2022-04-14 | 2024-05-14 | 新疆大学 | Voice keyword retrieval method based on bottleneck characteristics and residual error network |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
An et al. | Deep CNNs with self-attention for speaker identification | |
Yu et al. | Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features | |
CN111462729B (en) | Fast language identification method based on phoneme log-likelihood ratio and sparse representation | |
Ohi et al. | Deep speaker recognition: Process, progress, and challenges | |
Khdier et al. | Deep learning algorithms based voiceprint recognition system in noisy environment | |
CN113470655A (en) | Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio | |
CN110299132B (en) | Voice digital recognition method and device | |
CN112053694A (en) | Voiceprint recognition method based on CNN and GRU network fusion | |
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device | |
Sun et al. | A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea | |
CN110827809B (en) | Language identification and classification method based on condition generation type confrontation network | |
CN117079673B (en) | Intelligent emotion recognition method based on multi-mode artificial intelligence | |
Monteiro et al. | On the performance of time-pooling strategies for end-to-end spoken language identification | |
Fujii et al. | Automatic speech recognition using hidden conditional neural fields | |
Elnaggar et al. | A new unsupervised short-utterance based speaker identification approach with parametric t-SNE dimensionality reduction | |
CN116580708A (en) | Intelligent voice processing method and system | |
Anand et al. | Text-independent speaker recognition for Ambient Intelligence applications by using information set features | |
Al-Rawahy et al. | Text-independent speaker identification system based on the histogram of DCT-cepstrum coefficients | |
CN115064175A (en) | Speaker recognition method | |
CN113539238B (en) | End-to-end language identification and classification method based on cavity convolutional neural network | |
Dustor et al. | Speaker recognition system with good generalization properties | |
CN112259107A (en) | Voiceprint recognition method under meeting scene small sample condition | |
Olsson | Text dependent speaker verification with a hybrid HMM/ANN system | |
Singh | Bayesian distance metric learning and its application in automatic speaker recognition systems | |
Wu et al. | Dku-tencent submission to oriental language recognition ap18-olr challenge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20211001 |