CN113470655A - Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio - Google Patents


Info

Publication number: CN113470655A
Application number: CN202110752463.3A
Authority: CN (China)
Prior art keywords: vector, phoneme, neural network, voice data, model
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 刘俊南, 薛辉, 缪蔚, 郭鹏, 齐心
Current assignee: Innomicro Technology Tianjin Co Ltd
Original assignee: Innomicro Technology Tianjin Co Ltd
Priority/filing date: 2021-07-02
Publication date: 2021-10-01

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/18 - Artificial neural networks; connectionist approaches
    • G10L 17/20 - Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A voiceprint recognition method for a time-delay neural network based on the phoneme log-likelihood ratio, the method comprising the steps of: acquiring voice data; preprocessing the voice data; extracting phoneme posterior probability vectors from the preprocessed voice data using a phoneme recognizer; training a time-delay neural network using the preprocessed voice data and extracting an X-vector discrimination vector; training a Gaussian mixture model-universal background model (GMM-UBM) using the phoneme posterior probability vectors; calculating an I-vector discrimination vector using the GMM-UBM; eliminating the influence of channel information in the I-vector feature space; generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector; inputting the X-vector feature and the I-vector feature into the new classifier; and acquiring and outputting the voiceprint information of the new classifier. The method and the device can identify voiceprint information quickly and accurately, improve the robustness of the system, and can be used across platforms.

Description

Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a voiceprint recognition method of a time delay neural network based on a phoneme log-likelihood ratio.
Background
With the rapid development of disciplines such as pattern recognition and artificial intelligence, society has entered the intelligent era. Human-computer interaction through voice has gradually become a development trend, and voiceprint recognition refers to identifying a speaker and judging identity from the speaker-related information contained in a voice segment. Voiceprint recognition is an important branch of the speech recognition field: a technology in which a computer analyzes and processes a voice to automatically judge the identity of the person to whom the voice belongs.
Traditional voiceprint recognition technology comprises speech signal feature processing and extraction, acoustic model training, and model discrimination training. In complex environments, however, the performance of traditional methods such as total variability space analysis based on statistical models degrades greatly. With the popularization of neural network technology, end-to-end voiceprint recognition systems built on neural network models are widely applied in the current voiceprint recognition field and have good development prospects; among them, the time-delay neural network model achieves extremely high accuracy.
Traditional voiceprint recognition tends to place high demands on the computing and storage capacity of devices and on the environment, so new approaches are needed to remedy these deficiencies, better adapt to various complex environments, and improve voiceprint recognition techniques so as to reduce the difficulty of implementing end-to-end recognition.
Disclosure of Invention
In order to solve the above problems, the present invention provides a voiceprint recognition method for a time-delay neural network based on a phoneme log-likelihood ratio, wherein the method comprises the steps of:
acquiring voice data;
preprocessing the voice data;
extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer;
training a time-delay neural network using the preprocessed voice data and extracting an X-vector discrimination vector;
training a Gaussian mixture model-universal background model (GMM-UBM) using the phoneme posterior probability vector;
calculating an I-vector discrimination vector using the GMM-UBM;
eliminating the influence of channel information in the I-vector feature space;
generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector;
inputting an X-vector feature and an I-vector feature into the new classifier;
and acquiring and outputting the voiceprint information of the new classifier.
Preferably, preprocessing the voice data comprises the steps of:
extracting acoustic features of the voice data;
performing silence detection on the voice data;
and performing voice enhancement on the voice data.
Preferably, extracting a phoneme posterior probability vector from the preprocessed speech data using the phoneme recognizer includes the steps of:
acquiring a phoneme recognizer;
performing phoneme log likelihood ratio training on the phoneme recognizer;
acquiring the preprocessed voice data;
inputting the speech data into the phoneme recognizer;
and acquiring the phoneme posterior probability vector output by the phoneme recognizer.
Preferably, training the time-delay neural network using the preprocessed voice data and extracting the X-vector discrimination vector comprises the steps of:
extracting frame-level features of the preprocessed voice data using a neural network;
extracting segment-level information of the preprocessed voice data through a pooling layer;
mapping the preprocessed voice data to a fixed-dimension supervector to obtain a fixed-dimension speech representation;
training a time-delay neural network (TDNN) with the fixed-dimension speech representation;
and extracting the X-vector discrimination vector of the preprocessed voice data using the TDNN.
Preferably, training the Gaussian mixture model-universal background model using the phoneme posterior probability vector includes the steps of:
training the Gaussian mixture model-universal background model using a corpus;
performing maximum a posteriori (MAP) adaptation of the Gaussian mixture model-universal background model;
and iteratively optimizing the hidden parameters through the EM algorithm.
Preferably, calculating the I-vector discrimination vector using the Gaussian mixture model-universal background model includes the steps of:
obtaining the Gaussian mixture supervector of the training-speech phoneme log-likelihood ratio feature samples through MAP adaptation of the Gaussian mixture model-universal background model;
calculating the total variability space matrix by Baum-Welch (forward-backward) parameter estimation;
obtaining an I-vector discrimination vector extractor;
and applying the I-vector discrimination vector extractor to the phoneme log-likelihood ratio features of the speech to be recognized to extract the training set and the set to be recognized of I-vector discrimination vector features.
Preferably, eliminating the influence of channel information in the I-vector feature space includes the steps of:
obtaining a probabilistic linear discriminant analysis (PLDA) model;
and inputting the I-vector discrimination vector into the probabilistic linear discriminant analysis model.
Preferably, the expression of the probabilistic linear discriminant analysis model is as follows:
xij = u + Φβi + εij,
wherein xij represents the j-th I-vector discrimination vector of the i-th speaker, u represents the mean of all I-vector discrimination vectors, βi represents the discrimination factor of the i-th speaker, Φ represents a speaker subspace of a given dimension, and εij represents a residual containing the effects of the channel.
The method and the device can identify voiceprint information quickly and accurately, improve the robustness of the system, and can be used across platforms.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a voiceprint recognition method of a time-delay neural network based on phoneme log-likelihood ratio according to the present invention;
FIG. 2 is a schematic diagram of a speech enhancement process based on deep unsupervised learning in a voiceprint recognition method of a time-delay neural network based on a phoneme log-likelihood ratio according to the present invention;
FIG. 3 is a schematic diagram of a training process of a phoneme recognizer in a voiceprint recognition method of a time-delay neural network based on a phoneme log-likelihood ratio according to the present invention;
FIG. 4 is a schematic diagram of an HMM-DNN training structure in the voiceprint recognition method of the delay neural network based on the phoneme log-likelihood ratio provided by the present invention;
FIG. 5 is a schematic diagram of X-vector voiceprint recognition in the voiceprint recognition method of the time delay neural network based on the phoneme log-likelihood ratio provided by the present invention;
fig. 6 is a schematic diagram of the TDNN principle of the time-delay neural network in the voiceprint recognition method of the time-delay neural network based on the phoneme log-likelihood ratio provided by the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
In the embodiments of the present application, as shown in fig. 1 to 6, the present invention provides a voiceprint recognition method for a time-delay neural network based on phoneme log-likelihood ratio, the method including the steps of:
s1: acquiring voice data;
in the embodiment of the present application, voice data may be acquired using a voice data collecting apparatus.
S2: preprocessing the voice data;
In this embodiment of the present application, preprocessing the voice data includes:
extracting acoustic features of the voice data;
performing silence detection on the voice data;
and performing voice enhancement on the voice data.
In the embodiment of the present application, the voice data is preprocessed as follows. Acoustic features are extracted from the received input speech signal; the acoustic features may be any one of MFCC, FilterBank, or PLP features. The input voice data is processed with a detection technique based on the signal-to-noise ratio to remove the non-speech sections of the audio signal, and the voice data is enhanced by mixing in multi-environment reverberation. For silence detection, a GMM (Gaussian mixture model) that can separate silence from noise is obtained by iterative training with the EM (expectation-maximization) algorithm. By adopting a deep learning algorithm, various background noises in the audio can be largely eliminated; fig. 2 shows the speech enhancement flow based on deep unsupervised learning. Data augmentation can also be performed by injecting noise into the clean data set, and a deep neural network is used to learn a nonlinear mapping from noisy speech to clean speech for denoising or dereverberation. More specifically, training with injected noise allows the objective function to reach an optimal solution that is less sensitive to input variations.
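For illustration, the following is a minimal Python sketch of this preprocessing stage, assuming the librosa library for MFCC extraction and substituting a simple energy-quantile silence detector for the SNR-based detector and GMM silence model described above (reverberation mixing and the deep-learning enhancement are omitted); all function and parameter names here are illustrative, not taken from the patent.

```python
import numpy as np
import librosa  # assumed available for feature extraction

def preprocess(wav_path, sr=16000, n_mfcc=20, energy_quantile=0.2):
    """Sketch of the preprocessing step: acoustic features + crude silence removal."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Acoustic features: MFCC (FilterBank or PLP would be drop-in alternatives).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)

    # Crude silence detection: drop frames whose energy falls below a quantile
    # threshold (the patent uses an SNR-based detector plus a GMM silence/noise
    # model trained with EM; this is only a stand-in).
    frame_energy = librosa.feature.rms(y=y)[0]               # (n_frames,)
    threshold = np.quantile(frame_energy, energy_quantile)
    voiced = frame_energy > threshold
    n = min(len(voiced), mfcc.shape[1])
    return mfcc[:, :n][:, voiced[:n]]                        # keep voiced frames only
```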
S3: extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer;
In an embodiment of the present application, extracting a phoneme posterior probability vector from the preprocessed speech data using a phoneme recognizer includes:
acquiring a phoneme recognizer;
performing phoneme log likelihood ratio training on the phoneme recognizer;
acquiring the preprocessed voice data;
inputting the speech data into the phoneme recognizer;
and acquiring the phoneme posterior probability vector output by the phoneme recognizer.
In the embodiment of the present application, when extracting a phoneme posterior probability vector from the preprocessed voice data with a phoneme recognizer, the recognizer extracts frame-level phoneme posterior probability vectors, following the PLLR (phoneme log-likelihood ratio) module flow shown in fig. 3. The procedure is as follows. The PLLR training part trains a phoneme recognizer on a large amount of unrelated corpus; the recognizer maps the speech signal to frame-level phoneme posterior probability vectors. It does not decode a phoneme string or a phoneme lattice; instead, a series of transformations is applied to the phoneme posterior probability vectors so that, like the features of the acoustic layer, they take a frame-level form. Fig. 4 is a schematic diagram of the HMM-DNN (hidden Markov model-deep neural network) structure used in the present invention. The phoneme recognizer can be trained with mainstream speech recognition tools, and the trained phoneme recognizer has the advantage of not being limited by voiceprints in use. For each frame of the input speech signal, a k-dimensional phoneme posterior probability vector [b(1), b(2), …, b(k)] is recognized, and a regularization operation is performed on this vector to obtain the PLLR (phoneme log-likelihood ratio) of each phoneme posterior probability, calculated as follows:
PLLR(k) = log( b(k) / (1 - b(k)) ),
where b(k) represents the posterior probability of the k-th phoneme.
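For concreteness, here is a small sketch of this regularization, applying the log-odds formula above to a matrix of frame-level phoneme posteriors (the numerical clipping is an implementation detail added here, not part of the patent):

```python
import numpy as np

def pllr(posteriors, eps=1e-10):
    """Phoneme log-likelihood ratios from a (n_frames, k) posterior matrix."""
    b = np.clip(np.asarray(posteriors), eps, 1.0 - eps)  # avoid log(0)
    return np.log(b / (1.0 - b))                         # PLLR(k) = log(b(k)/(1-b(k)))

# Example: one frame of posteriors over k = 4 phonemes
frame = np.array([[0.7, 0.1, 0.15, 0.05]])
print(pllr(frame))
```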
S4: training a time-delay neural network using the preprocessed voice data and extracting an X-vector discrimination vector;
In this embodiment of the present application, training the time-delay neural network using the preprocessed speech data and extracting the X-vector discrimination vector includes:
extracting frame-level features of the preprocessed voice data using a neural network;
extracting segment-level information of the preprocessed voice data through a pooling layer;
mapping the preprocessed voice data to a fixed-dimension supervector to obtain a fixed-dimension speech representation;
training a time-delay neural network (TDNN) with the fixed-dimension speech representation;
and extracting the X-vector discrimination vector of the preprocessed voice data using the TDNN.
In the embodiment of the present application, when the preprocessed voice data is used to train the time-delay neural network and extract the X-vector discrimination vector, a frame-level time-delay neural network, an artificial neural network model for time-series analysis, is trained on fixed-length speech feature sequences; the X-vector discrimination vectors extracted by this model can effectively mine the temporal correlations of time-series data. Fig. 5 is a schematic diagram of X-vector voiceprint recognition: the model extracts frame-level features with a neural network, then extracts segment-level embedding information through a pooling layer, mapping an utterance to a fixed-dimension supervector, so that the similarity of voiceprints can be measured by Euclidean distance or cosine similarity. The TDNN is then trained with the fixed-dimension speech features; its principle is shown in fig. 6. The time-delay neural network is composed of time-delay neurons. Each time-delay neuron has M inputs I1(t), I2(t), …, IM(t) and a corresponding output O(t). Each input Ii(t) is followed by N time delays that store the input information Ii(t-d), d = 1, 2, …, N, at the N times before the current time, and a weight wid, d = 0, 1, …, N, reflects the degree of influence of each time on the data at the current time. The time-delay neuron is computed as follows:
O(t) = f( Σ_{i=1}^{M} ( Σ_{d=0}^{N} wid·Ii(t-d) + bi ) ),
wherein bi is the offset of the i-th input; f is an excitation function, generally chosen as the sigmoid function. It can be seen that the output of the neuron is determined by the time-series data of each input at the current time and the previous N times, so the time-delay neural network can effectively handle nonlinear dynamic time-series problems. Finally, the X-vector discrimination vector features are extracted: the time-delay neural network model is used to extract the training set and the set to be recognized of discriminative X-vector features.
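The following PyTorch sketch illustrates the architecture described above: dilated 1-D convolutions standing in for the time-delay layers, a statistics-pooling layer that maps a variable-length utterance to a fixed-dimension vector, and an embedding layer whose output plays the role of the X-vector. The layer sizes, contexts, and speaker count are illustrative assumptions, not values specified in the patent.

```python
import torch
import torch.nn as nn

class TDNNXVector(nn.Module):
    """Minimal X-vector style network: TDNN layers + statistics pooling."""
    def __init__(self, feat_dim=20, hidden=512, emb_dim=256, n_speakers=1000):
        super().__init__()
        # Dilated 1-D convolutions emulate time-delay layers with growing context.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * hidden, emb_dim)   # after mean+std pooling
        self.classifier = nn.Linear(emb_dim, n_speakers)  # speaker-ID training head

    def forward(self, x):            # x: (batch, feat_dim, n_frames)
        h = self.frame_layers(x)     # frame-level features
        # Statistics pooling: concatenate mean and std over time -> fixed dim.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvec = self.embedding(stats) # the X-vector discrimination vector
        return self.classifier(xvec), xvec

# Usage: a batch of 2 utterances, 20-dim features, 200 frames each
logits, xvectors = TDNNXVector()(torch.randn(2, 20, 200))
print(xvectors.shape)  # torch.Size([2, 256])
```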
S5: training a Gaussian mixture model-universal background model using the phoneme posterior probability vector;
In an embodiment of the present application, training the Gaussian mixture model-universal background model using the phoneme posterior probability vector includes the steps of:
training the Gaussian mixture model-universal background model using a corpus;
performing maximum a posteriori (MAP) adaptation of the Gaussian mixture model-universal background model;
and iteratively optimizing the hidden parameters through the EM algorithm.
In the embodiment of the present application, when the Gaussian mixture model-universal background model is trained using the phoneme posterior probability vectors, a stable, high-order GMM-UBM (Gaussian mixture model-universal background model) that is independent of both speaker and channel is trained; it can effectively alleviate the problem of insufficient enrollment speech in voiceprint recognition. The specific training method is as follows. The corpus is used to train the GMM-UBM, whose density is:
p(xj) = Σ_{k=1}^{K} wk·p(xj | uk, Σk),
wherein xj is an N-dimensional observation data feature vector; wk is the mixing weight of the k-th Gaussian component; p(xj | uk, Σk) is an N-dimensional Gaussian function with mean uk and covariance Σk of the k-th component model. A speaker-independent feature distribution is then obtained after adaptation by the maximum a posteriori (MAP) algorithm, and each Gaussian distribution of the UBM is fine-tuned to the actual data of the target voiceprint using the EM algorithm. The hidden parameters are iteratively optimized by the EM algorithm so as to train the GMM-UBM; the model is a high-order GMM whose dimensionality can reach the order of 1024. The EM iteration comprises the following two steps:
Step E: according to the current Gaussian mixture model parameters, calculate the responsivity γjk of component model k to the observation data xj as follows:
γjk = wk·p(xj | uk, Σk) / Σ_{l=1}^{K} wl·p(xj | ul, Σl),
wherein xj is an N-dimensional feature vector; wk is the mixing weight of the k-th Gaussian component; p(xj | uk, Σk) is an N-dimensional Gaussian function.
Step M: update the parameters of the Gaussian mixture model with the following update formulas:
uk = Σ_{j=1}^{J} γjk·xj / Σ_{j=1}^{J} γjk,
Σk = Σ_{j=1}^{J} γjk·(xj - uk)(xj - uk)^T / Σ_{j=1}^{J} γjk,
wk = ( Σ_{j=1}^{J} γjk ) / J,
wherein uk represents the mean of the k-th Gaussian; Σk represents the covariance of the k-th component model; wk is the mixing weight of the k-th Gaussian component; γjk represents the responsivity of the k-th component model to observation xj; xj is an N-dimensional feature vector; and J is the number of observations.
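A compact NumPy sketch of these two EM steps for a diagonal-covariance GMM follows; the UBM of the patent would be trained in the same way on pooled corpus features (the diagonal covariance, dimensions, and iteration count are simplifying assumptions):

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """EM for a diagonal-covariance GMM. X: (J, N) observations."""
    rng = np.random.default_rng(seed)
    J, N = X.shape
    w = np.full(K, 1.0 / K)                       # mixing weights
    u = X[rng.choice(J, K, replace=False)]        # means (K, N)
    var = np.tile(X.var(axis=0), (K, 1)) + 1e-6   # diagonal covariances (K, N)

    for _ in range(n_iter):
        # E step: responsibilities gamma_jk of component k for observation x_j.
        log_p = (-0.5 * (np.log(2 * np.pi * var).sum(axis=1)
                 + ((X[:, None, :] - u) ** 2 / var).sum(axis=2))
                 + np.log(w))                     # (J, K): log w_k p(x_j|u_k,var_k)
        log_p -= log_p.max(axis=1, keepdims=True) # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: update w_k, u_k, var_k from the responsibilities.
        Nk = gamma.sum(axis=0)                    # effective counts (K,)
        w = Nk / J
        u = gamma.T @ X / Nk[:, None]
        var = gamma.T @ (X ** 2) / Nk[:, None] - u ** 2 + 1e-6
    return w, u, var
```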
S6: calculating an I-vector discrimination vector using the Gaussian mixture model-universal background model;
In this embodiment of the present application, calculating the I-vector discrimination vector using the Gaussian mixture model-universal background model includes:
obtaining the Gaussian mixture supervector of the training-speech phoneme log-likelihood ratio feature samples through MAP adaptation of the Gaussian mixture model-universal background model;
calculating the total variability space matrix by Baum-Welch (forward-backward) parameter estimation;
obtaining an I-vector discrimination vector extractor;
and applying the I-vector discrimination vector extractor to the phoneme log-likelihood ratio features of the speech to be recognized to extract the training set and the set to be recognized of I-vector discrimination vector features.
In the embodiment of the present application, when the I-vector discrimination vector is calculated using the Gaussian mixture model-universal background model, a fixed-dimension low-dimensional space vector, i.e., the I-vector discrimination vector, is obtained for the different voiceprint speech signals (in this view, both the influence of the speaker and the influence of the channel are contained in a total variability space T). This step specifically comprises the following two parts. I-vector training: the Gaussian mixture supervector of the PLLR (phoneme log-likelihood ratio) feature samples of the training speech is obtained from the GMM-UBM (Gaussian mixture model-universal background model) by MAP adaptation, and the total variability space matrix is then calculated by Baum-Welch (forward-backward) parameter estimation to obtain the I-vector extractor. The Baum-Welch algorithm estimates the parameters of the formula:
M = m + Tw,
wherein m is the speaker- and channel-independent mean supervector computed from the Gaussian mixture model-universal background model, T is the total variability matrix, w is the latent i-vector variable, which follows a Gaussian distribution, and M is the adapted mean supervector of the utterance. I-vector extraction: the I-vector extractor is applied to the PLLR (phoneme log-likelihood ratio) features of the speech to be recognized to extract the training set and the set to be recognized of the more discriminative I-vector features.
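For illustration, the sketch below computes the standard posterior-mean point estimate of the latent variable w in M = m + Tw from zeroth- and first-order Baum-Welch statistics, assuming a diagonal-covariance UBM and an already estimated total variability matrix T; the patent does not spell out this closed form, so it is given here only as the usual i-vector estimator, not as the patent's exact procedure.

```python
import numpy as np

def extract_ivector(T, m, var, N_c, F_c):
    """Posterior-mean i-vector estimate.

    T    : (C*D, R) total variability matrix
    m    : (C*D,)   UBM mean supervector (speaker/channel independent)
    var  : (C*D,)   diagonal UBM covariances, stacked per component
    N_c  : (C,)     zeroth-order Baum-Welch statistics per component
    F_c  : (C*D,)   first-order Baum-Welch statistics, stacked per component
    """
    CD, R = T.shape
    C = N_c.shape[0]
    D = CD // C
    N_rep = np.repeat(N_c, D)                     # expand N to supervector dim
    F_centered = F_c - N_rep * m                  # center first-order statistics
    TtSinv = T.T / var                            # T^T Sigma^{-1}, shape (R, C*D)
    precision = np.eye(R) + (TtSinv * N_rep) @ T  # I + T^T Sigma^{-1} N T
    return np.linalg.solve(precision, TtSinv @ F_centered)
```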
S7: eliminating the influence of channel information in the I-vector feature space;
In this embodiment of the present application, eliminating the influence of channel information in the I-vector feature space includes the steps of:
obtaining a probabilistic linear discriminant analysis model;
and inputting the I-vector discrimination vector into the probabilistic linear discriminant analysis model.
In the embodiment of the present application, the expression of the probabilistic linear discriminant analysis model is as follows:
xij = u + Φβi + εij,
wherein xij represents the j-th I-vector discrimination vector of the i-th speaker, u represents the mean of all I-vector discrimination vectors, βi represents the discrimination factor of the i-th speaker, Φ represents a speaker subspace of a given dimension, and εij represents a residual containing the effects of the channel.
In the embodiment of the application, when the influence of channel information in the I-vector feature space is eliminated, a PLDA (probabilistic linear discriminant analysis) model is generated to address the outliers that arise when the I-vector is affected by the channel; in this PLDA, it is assumed that both the speaker latent variables and the channel latent variables follow a Student's t distribution rather than a Gaussian distribution. The method can eliminate the influence of channel information in the I-vector (discrimination vector) feature space. PLDA is a channel compensation method: the I-vector features are decomposed into the voice signal and random background noise, giving the PLDA model
xij = u + Φβi + εij,
where u represents the mean of all I-vectors; βi is the discrimination factor of the i-th speaker, with prior N(0, I); the matrix Φ represents a speaker subspace of a given dimension; and εij represents the residual containing the channel effects, distributed as N(0, Σ).
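As an illustration of how this decomposition supports channel-compensated scoring, the sketch below evaluates the Gaussian PLDA verification log-likelihood ratio for a pair of i-vectors, comparing the same-speaker and different-speaker hypotheses. Note that the heavy-tailed variant mentioned above would replace these Gaussian densities with Student's t densities, so this Gaussian form is a common simplification, not the patent's exact model.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, u, Phi, Sigma):
    """Gaussian PLDA log-likelihood ratio: same speaker vs different speakers."""
    B = Phi @ Phi.T          # between-speaker covariance (speaker subspace part)
    W = Sigma                # within-speaker / channel covariance
    d = len(u)
    # Joint covariance of [x1; x2] under each hypothesis.
    same = np.block([[B + W, B], [B, B + W]])        # shared beta_i couples x1, x2
    diff = np.block([[B + W, np.zeros((d, d))],      # independent speakers
                     [np.zeros((d, d)), B + W]])
    z = np.concatenate([x1 - u, x2 - u])
    return (multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=same)
            - multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=diff))

# A positive score favors "same speaker"; a negative score favors "different".
```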
S8: generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector;
S9: inputting the X-vector feature and the I-vector feature into the new classifier;
S10: acquiring and outputting the voiceprint information of the new classifier.
In the embodiment of the application, a boosting algorithm is used to fuse and enhance the voiceprint features extracted by the different extractors, generating a new classifier with a voiceprint classification effect. The boosting algorithm combines multiple classifiers into a new classifier: initially all classifiers have the same weight; the weight of each classifier is then computed from its misclassification rate, and the weights are iteratively updated until convergence, so that the fusion model is trained. The I-vector and X-vector features serve as input, and the classified voiceprint information is output. This completes the flow of the method.
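A minimal sketch of such a fusion step follows, using scikit-learn's AdaBoostClassifier as a stand-in for the unspecified boosting variant and concatenating the I-vector and X-vector features as input (feature names and shapes are illustrative):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def train_fusion(ivectors, xvectors, speaker_labels):
    """Boosted classifier over concatenated [I-vector, X-vector] features."""
    features = np.hstack([ivectors, xvectors])  # (n_utterances, R + emb_dim)
    clf = AdaBoostClassifier(n_estimators=100)  # decision-stump weak learners
    return clf.fit(features, speaker_labels)

# At test time: clf.predict(np.hstack([ivec, xvec])) outputs the voiceprint class.
```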
The method and the device can identify voiceprint information quickly and accurately, improve the robustness of the system, and can be used across platforms.
It is to be understood that the above-described embodiments of the present invention merely illustrate or explain the principles of the invention and are not to be construed as limiting it. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the present invention shall be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (8)

1. A voiceprint recognition method of a time delay neural network based on phoneme log-likelihood ratio is characterized by comprising the following steps:
acquiring voice data;
preprocessing the voice data;
extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer;
training a time-delay neural network using the preprocessed voice data and extracting an X-vector discrimination vector;
training a Gaussian mixture model-universal background model using the phoneme posterior probability vector;
calculating an I-vector discrimination vector using the Gaussian mixture model-universal background model;
eliminating the influence of channel information in the I-vector feature space;
generating a new classifier using the X-vector discrimination vector and the I-vector discrimination vector;
inputting an X-vector feature and an I-vector feature into the new classifier;
and acquiring and outputting the voiceprint information of the new classifier.
2. The method for voiceprint recognition based on phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein preprocessing the voice data comprises the steps of:
extracting acoustic features of the voice data;
performing silence detection on the voice data;
and performing voice enhancement on the voice data.
3. The method for voiceprint recognition based on phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein said extracting a phoneme posterior probability vector from the preprocessed voice data by using a phoneme recognizer comprises the steps of:
acquiring a phoneme recognizer;
performing phoneme log likelihood ratio training on the phoneme recognizer;
acquiring the preprocessed voice data;
inputting the speech data into the phoneme recognizer;
and acquiring the phoneme posterior probability vector output by the phoneme recognizer.
4. The method for voiceprint recognition of a time-delay neural network based on phoneme log-likelihood ratio as claimed in claim 1, wherein the training of the time-delay neural network by using the preprocessed voice data and extracting the X-vector discrimination vector comprises the steps of:
extracting the frame-level features of the preprocessed voice data by utilizing a neural network;
extracting the segment-level information of the preprocessed voice data through a pooling layer;
mapping the preprocessed voice data to a fixed-dimension supervector to obtain a fixed-dimension speech representation;
training a time-delay neural network (TDNN) with the fixed-dimension speech representation;
and extracting the X-vector discrimination vector of the preprocessed voice data using the TDNN.
5. The method for voiceprint recognition based on phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein training the Gaussian mixture model-universal background model using the phoneme posterior probability vector comprises the steps of:
training the Gaussian mixture model-universal background model using a corpus;
performing maximum a posteriori (MAP) adaptation of the Gaussian mixture model-universal background model;
and iteratively optimizing the hidden parameters through an EM algorithm.
6. The method for voiceprint recognition based on phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein calculating the I-vector discrimination vector using the Gaussian mixture model-universal background model comprises the steps of:
obtaining the Gaussian mixture supervector of the training-speech phoneme log-likelihood ratio feature samples through MAP adaptation of the Gaussian mixture model-universal background model;
calculating the total variability space matrix by Baum-Welch (forward-backward) parameter estimation;
obtaining an I-vector discrimination vector extractor;
and applying the I-vector discrimination vector extractor to the phoneme log-likelihood ratio features of the speech to be recognized to extract the training set and the set to be recognized of I-vector discrimination vector features.
7. The method for voiceprint recognition based on phoneme log-likelihood ratio time-delay neural network as claimed in claim 1, wherein eliminating the influence of channel information in the I-vector feature space comprises the steps of:
obtaining a probabilistic linear discriminant analysis model;
and inputting the I-vector discrimination vector into the probabilistic linear discriminant analysis model.
8. The method for recognizing the voiceprint of the time delay neural network based on the phoneme log-likelihood ratio as claimed in claim 7, wherein the expression of the probabilistic linear discriminant analysis model is as follows:
xij = u + Φβi + εij,
wherein xij represents the j-th I-vector discrimination vector of the i-th speaker, u represents the mean of all I-vector discrimination vectors, βi represents the discrimination factor of the i-th speaker, Φ represents a speaker subspace of a given dimension, and εij represents a residual containing the effects of the channel.
Application CN202110752463.3A, filed 2021-07-02 (priority date 2021-07-02): Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio. Status: Pending. Publication: CN113470655A (en).

Priority Applications (1)

Application Number: CN202110752463.3A
Priority Date / Filing Date: 2021-07-02
Title: Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio

Publications (1)

Publication Number: CN113470655A
Publication Date: 2021-10-01

Family

ID=77877768
Country: CN (China)



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 2021-10-01)