CN108091326B - Voiceprint recognition method and system based on linear regression - Google Patents
Voiceprint recognition method and system based on linear regression
- Publication number
- CN108091326B (application CN201810141059.0A)
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- vector
- feature vector
- voiceprint feature
- linear regression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
Abstract
The application provides a voiceprint recognition method and system based on linear regression: a first voiceprint feature vector is obtained from voice data, a pre-trained linear regression model maps the first voiceprint feature vector to a second voiceprint feature vector, and the second voiceprint feature vector is then classified for recognition. The linear regression model is innovatively introduced into the field of voiceprint recognition, and experiments prove that the accuracy of voiceprint recognition is effectively improved.
Description
Technical Field
The application relates to the field of electronic information, in particular to a voiceprint recognition method and system based on linear regression.
Background
Voiceprint recognition systems typically include two parts, a voiceprint feature extraction front-end and a voiceprint recognition back-end.
The voiceprint feature extraction front end extracts the speaker's voiceprint features from the speaker's utterances: that is, a model maps a sentence of speech of arbitrary length into a vector of fixed length. Common algorithms used by the voiceprint feature extraction front end include: the Gaussian mixture model-based universal background model (GMM-UBM)/identity vector (i-vector) algorithm (GMM/i-vector algorithm for short); the deep-learning-based variant that replaces the universal background model with a speech recognition acoustic model (DNN/i-vector algorithm for short); and the d-vector algorithm, which classifies speakers with a deep learning model and takes the output of the top hidden layer as the speaker's voiceprint vector.
The voiceprint recognition back end classifies the speaker's voiceprint vector through a supervised machine learning algorithm. It can be divided into two parts: the first maps the voiceprint feature vector into another new voiceprint feature vector by a supervised machine learning method, and the second classifies the new, dimension-reduced voiceprint feature vector by a supervised machine learning method. For the first part, common mapping methods include Linear Discriminant Analysis (LDA), Within-Class Covariance Normalization (WCCN), and Nuisance Attribute Projection (NAP), among others. For the second part, common classifiers include the cosine distance classifier, the Support Vector Machine (SVM) classifier, the Probabilistic Linear Discriminant Analysis (PLDA) classifier, and the like. Among the back-end algorithms, the LDA + PLDA combination has achieved the best performance in many standardized tests and is widely adopted by practical systems.
The voiceprint feature extraction front end and the voiceprint recognition back end can be combined freely to form a voiceprint recognition system. However, the accuracy of current voiceprint recognition still needs to be improved.
Disclosure of Invention
The application provides a voiceprint recognition method and system based on linear regression, and aims to solve the problem of how to improve the accuracy of voiceprint recognition.
In order to achieve the above object, the present application provides the following technical solutions:
a voiceprint recognition method based on linear regression comprises the following steps:
acquiring a first voiceprint feature vector from voice data;
mapping the first voiceprint feature vector into a second voiceprint feature vector by using a pre-trained linear regression model;
and performing classification recognition on the second voiceprint feature vector.
Optionally, the mapping the first voiceprint feature vector to the second voiceprint feature vector includes:
using the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
Optionally, the training process of the linear regression model includes:
obtaining training data {(x_{i,j}, y_{i,j})} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from each utterance in the voiceprint database, i = 1, …, n, j = 1, …, M_i; n is the number of speakers in the voiceprint database, and the i-th speaker corresponds to M_i utterances; y_{i,j} is the n-dimensional indicator vector of the i-th speaker, y_{i,j} = [0, …, 1, …, 0]^T; and d is a preset value;
using A = (XX^T)^(-1)XY^T to obtain the linear regression model, wherein X = [x_{1,1}, …, x_{n,M_n}] is the matrix formed by the voiceprint feature vectors of the training data and Y = [y_{1,1}, …, y_{n,M_n}] is the matrix formed by the indicator vectors of the training data.
Optionally, the performing classification recognition on the second voiceprint feature vector includes:
using a cosine classifier to perform classification recognition on the second voiceprint feature vector.
Optionally, the obtaining the first voiceprint feature vector from the voice data includes:
the first voiceprint feature vector is obtained from the speech data using a GMM/i-vector algorithm, a DNN/i-vector algorithm, or a d-vector algorithm.
A system for voiceprint recognition based on linear regression, comprising:
the voice print feature extraction front end is used for acquiring a first voice print feature vector from voice data;
a voiceprint recognition back end, the voiceprint recognition back end comprising a voiceprint feature mapping module and a voiceprint classifier, the voiceprint feature mapping module being configured to map the first voiceprint feature vector to a second voiceprint feature vector using a pre-trained linear regression model; and the voiceprint classifier is used for classifying and identifying the second voiceprint feature vector.
Optionally, the voiceprint feature mapping module is configured to map the first voiceprint feature vector to a second voiceprint feature vector by using a pre-trained linear regression model, and includes:
the voiceprint feature mapping module is specifically configured to use the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
Optionally, the voiceprint feature mapping module is further configured to:
obtaining training data {(x_{i,j}, y_{i,j})} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from each utterance in the voiceprint database, i = 1, …, n, j = 1, …, M_i; n is the number of speakers in the voiceprint database, and the i-th speaker corresponds to M_i utterances; y_{i,j} is the n-dimensional indicator vector of the i-th speaker, y_{i,j} = [0, …, 1, …, 0]^T; and d is a preset value;
using A = (XX^T)^(-1)XY^T to obtain the linear regression model, wherein X = [x_{1,1}, …, x_{n,M_n}] is the matrix formed by the voiceprint feature vectors of the training data and Y = [y_{1,1}, …, y_{n,M_n}] is the matrix formed by the indicator vectors of the training data.
Optionally, the voiceprint classifier includes: and a cosine classifier.
Optionally, the voiceprint feature extraction front end includes:
a GMM/i-vector front end, a DNN/i-vector front end, or a d-vector front end.
The method and the system for voiceprint recognition based on linear regression acquire a first voiceprint feature vector from voice data, map the first voiceprint feature vector into a second voiceprint feature vector by using a pre-trained linear regression model, and perform classification recognition on the second voiceprint feature vector. The linear regression model is innovatively introduced into the field of voiceprint recognition, and experiments prove that the accuracy of voiceprint recognition can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a voiceprint recognition system;
fig. 2 is a flowchart of a voiceprint recognition method based on linear regression disclosed in the embodiment of the present application.
Detailed Description
FIG. 1 is a schematic diagram of a voiceprint recognition system including a voiceprint feature extraction front end and a voiceprint recognition back end. The voiceprint recognition back end also comprises a voiceprint feature mapping module and a voiceprint classifier.
In order to improve the accuracy of voiceprint recognition, in the embodiment of the present application, the first part in the voiceprint recognition backend, i.e. the voiceprint feature mapping module, is improved. The core point of the method is that a trained Linear Regression (LR) model is used for mapping a voiceprint feature vector extracted from a voiceprint feature extraction front end into a new voiceprint feature vector, and the new voiceprint feature vector is used as a basis for voiceprint classification so as to improve accuracy of subsequent voiceprint classification.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The workflow of the back-end of the voiceprint recognition system shown in figure 1 can be divided into three phases: a training phase, a registration phase and a testing phase. The training of the LR model is performed during a training phase, and both the registration phase and the testing phase require the use of a trained LR model.
The above three stages are explained in detail below. Fig. 2 shows a voiceprint recognition method based on linear regression, which includes the following steps:
first, training phase
S201: training data is prepared.
Suppose the voiceprint database contains speech data of n speakers and the i-th speaker corresponds to M_i utterances. The voiceprint feature extraction front end extracts a d-dimensional voiceprint feature vector x_{i,j} from each utterance, where i = 1, …, n and j = 1, …, M_i. d is a predetermined value; depending on the task it may be 200 to 800, and it is set to 400 in this embodiment.
Each of the n speakers is assigned a number: the first speaker is numbered 1, …, the i-th speaker is numbered i, …, and the n-th speaker is numbered n, so the numbers of all speakers form the sequence 1, …, n. Each number is then expanded into a 0/1-coded indicator vector: the indicator vector of the i-th speaker is the n-dimensional vector y_{i,j} = [0, …, 1, …, 0]^T, in which the 1 appears at the i-th position (e.g., the indicator vector of the speaker numbered 2 is y_{2,j} = [0, 1, …, 0]^T).
S202: train the LR model using the supervised training data obtained above.
Specifically, the LR model is obtained using equation (1):
A = (XX^T)^(-1)XY^T    (1)
where X = [x_{1,1}, …, x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data and Y = [y_{1,1}, …, y_{n,M_n}] is the matrix formed by the indicator vectors of the training data.
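By way of illustration only, equation (1) is the ordinary least-squares fit of the indicator matrix Y onto the voiceprint matrix X and can be computed in closed form. The minimal numpy sketch below assumes the front-end voiceprint vectors and integer speaker labels are already available; all function and variable names are illustrative, not part of the patent, and the small ridge term is a numerical safeguard not specified in the method.

```python
import numpy as np

def train_lr_model(voiceprints, speaker_ids, n_speakers):
    """Closed-form training of the LR model A = (X X^T)^-1 X Y^T (equation (1)).

    voiceprints: list of d-dimensional vectors, one per training utterance.
    speaker_ids: integer label in [0, n_speakers) for each utterance.
    """
    X = np.stack(voiceprints, axis=1)          # d x N voiceprint matrix
    Y = np.eye(n_speakers)[:, speaker_ids]     # n x N one-hot indicator matrix
    # Solve (X X^T) A = X Y^T instead of forming the inverse explicitly;
    # the tiny ridge term keeps X X^T well conditioned (an assumption,
    # not prescribed by the patent).
    d = X.shape[0]
    A = np.linalg.solve(X @ X.T + 1e-6 * np.eye(d), X @ Y.T)
    return A                                   # d x n mapping matrix
```

With d = 400 and n training speakers, A is a 400 × n matrix, matching the embodiments described below.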
Second, registration stage
S203: acquire the voice data of the registrants and extract the registration data x^enroll from it, where the superscript enroll denotes the registration phase.
The process of extracting the registration data may follow the process of extracting the training data in step S201, and is not repeated here.
S204: map the registration data into a new voiceprint feature vector using the trained LR model; the new voiceprint feature vector can be regarded as the registrant's voiceprint feature model.
Specifically, the mapping is performed using equation (2):
z = A^T x    (2)
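For illustration, the mapping of equation (2) is a single matrix-vector product; a sketch with illustrative names, assuming numpy arrays as in the training sketch above:

```python
def map_voiceprint(A, x):
    """Map a first voiceprint feature vector x (d-dim) to a second
    voiceprint feature vector z (n-dim) via z = A^T x (equation (2))."""
    return A.T @ x
```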
third, testing stage
S205: obtain test voice data and extract the test data x^test from it, where the superscript test denotes the testing phase.
S206: map the test data into a new voiceprint feature vector using the trained LR model.
S207: compare the new voiceprint feature vector obtained in step S206 with the voiceprint feature models of the registrants and identify the registrant corresponding to the test voice data, i.e., the registrant who uttered the test voice data.
As can be seen from the steps in fig. 2, the back end of the voiceprint recognition system (i.e., the voiceprint recognition back end) adopts a register-first, recognize-later mechanism: a user first registers in the system, and the system obtains the registrant's voiceprint feature model using the trained LR model. In the testing stage, the system can then recognize which registrant uttered the collected voice, so that the voice data is recognized.
During research, the applicant experimented with a large number of machine learning models and found that the voiceprint feature vectors mapped by the LR model give higher accuracy in subsequent classification recognition.
The voiceprint recognition back end using the flow shown in fig. 2 can be used in combination with a conventional voiceprint feature extraction front end to constitute the voiceprint recognition system shown in fig. 1. The following will exemplify the working flow of three voiceprint recognition systems in which the voiceprint recognition back end of the flow shown in fig. 2 is combined with different voiceprint feature extraction front ends.
The GMM/i-vector + LR + cosine voiceprint recognition system comprises the following components:
the system adopts GMM/i-vector as a voiceprint recognition front end, adopts LR shown in FIG. 2 as a voiceprint feature mapping module at a voiceprint recognition rear end, and adopts cosine similarity as a voiceprint classifier. The three stages are as follows:
1) a training stage:
Step 1: the voiceprint recognition front end filters out the mute sections and noise sections of each audio recording by using voice endpoint detection, and reserves the audio segments only containing the training speakers' voices.
Step 2: the voiceprint recognition front end segments all audio in the training database into fixed-length segments of 3 to 30 seconds; this embodiment segments the audio into 15-second segments.
Step 3: the voiceprint recognition front end divides each audio segment into a plurality of frames with a frame length of 15 to 30 milliseconds and a frame shift of 5 to 15 milliseconds, and extracts acoustic features from each frame. The frame length of this embodiment is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic feature of this embodiment is the 20-dimensional MFCC feature (including the 1-dimensional energy feature) + the 13-dimensional RASTA-PLP feature + the first-order difference features + the second-order difference features, 99 dimensions in total.
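As one possible realization of the framing and feature extraction in steps 2 and 3, the sketch below uses librosa with the embodiment's 25 ms frame length and 10 ms frame shift. It covers only the MFCC part of the 99-dimensional feature (librosa has no built-in RASTA-PLP, so that 13-dimensional stream is omitted); the function name and the 60-dimensional output are illustrative assumptions.

```python
import librosa
import numpy as np

def frame_features(wav_path):
    """Per-frame features: 20 MFCCs (C0 carries the energy term) plus
    first- and second-order deltas; 25 ms frames, 10 ms frame shift."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                win_length=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    d1 = librosa.feature.delta(mfcc)            # first-order differences
    d2 = librosa.feature.delta(mfcc, order=2)   # second-order differences
    # The embodiment's 99 dims = (20 MFCC + 13 RASTA-PLP) x 3; this sketch
    # returns only the 60 MFCC-derived dims.
    return np.vstack([mfcc, d1, d2]).T          # frames x 60
```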
Step 4: the voiceprint recognition front end trains a Gaussian mixture model with U Gaussian components using the existing GMM-UBM method to obtain the model Σ. This embodiment trains a Gaussian mixture model containing 2048 Gaussian components.
Step 5: the voiceprint recognition front end adopts the GMM-UBM method and applies the Gaussian mixture model Σ to calculate the zero-order and first-order statistics of each audio segment, which together form a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment is 204800-dimensional.
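To illustrate step 5: the high-dimensional vector is the concatenation of the per-component zero-order counts and first-order sums, given frame-level responsibilities against the UBM. With U = 2048 components and 99-dimensional features this yields 2048 + 2048 × 99 = 204800 dimensions, matching the embodiment. The responsibility computation itself is omitted and the names are illustrative.

```python
import numpy as np

def baum_welch_stats(frames, responsibilities):
    """Zero- and first-order statistics of one audio segment.

    frames:           T x d acoustic feature matrix.
    responsibilities: T x U posterior of each UBM Gaussian for each frame.
    Returns the (U + U*d)-dimensional high-dimensional feature vector.
    """
    N = responsibilities.sum(axis=0)     # U zero-order statistics
    F = responsibilities.T @ frames      # U x d first-order statistics
    return np.concatenate([N, F.ravel()])
```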
Step 6: the voiceprint recognition front end trains an i-vector model using the existing i-vector method to obtain the T matrix.
Step 7: the voiceprint recognition front end adopts the i-vector method and applies the T matrix to reduce the high-dimensional feature vector output by the GMM-UBM to a low-dimensional space. The output feature dimension of this embodiment is 400, i.e., the 204800-dimensional feature of each audio segment is mapped to a 400-dimensional feature.
Step 8: the voiceprint feature mapping module trains the linear regression model using formula (1) of the linear regression method to obtain the A matrix. The A matrix of this embodiment is a 400 × n matrix.
2) Registration phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of registered audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the registered speaker.
Step 2: the voiceprint recognition front-end segments all audio in the registered speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into a plurality of frames with a frame length of 15 to 30 milliseconds and a frame shift of 5 to 15 milliseconds, and extracts acoustic features from each frame. The frame length of this embodiment is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic feature of this embodiment is the 20-dimensional MFCC feature (including the 1-dimensional energy feature) + the 13-dimensional RASTA-PLP feature + the first-order difference features + the second-order difference features, 99 dimensions in total.
Step 4: the voiceprint recognition front end adopts the GMM-UBM method and applies the Gaussian mixture model obtained in the training stage to calculate the zero-order and first-order statistics of each audio segment, which together form a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment is 204800-dimensional.
Step 5: the voiceprint recognition front end adopts the i-vector method and applies the T matrix obtained in the training stage to reduce the high-dimensional feature vector output by the GMM-UBM to a low-dimensional space. The output feature dimension of this embodiment is 400, i.e., the 204800-dimensional feature of each audio segment is mapped to a 400-dimensional feature.
Step 6: the voiceprint feature mapping module adopts formula (2) of the linear regression method provided by the invention and applies the A matrix obtained in the training stage to further map the i-vector feature into an n-dimensional voiceprint feature z^enroll (n is the number of speakers in the training set).
Step 7: the voiceprint feature mapping module averages the voiceprint feature vectors z^enroll of all audio segments of the registered speaker to obtain the registered speaker's voiceprint feature model z̄^enroll.
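Steps 6 and 7 together amount to mapping every enrollment segment with A and averaging; a minimal sketch with illustrative names:

```python
import numpy as np

def enroll_speaker(A, segment_ivectors):
    """Voiceprint feature model of a registered speaker: map each segment's
    400-dim i-vector with z = A^T x and average over all segments."""
    return np.mean([A.T @ x for x in segment_ivectors], axis=0)
```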
3) Testing phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of test audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the test speaker.
Step 2: the voiceprint recognition front-end segments all audio in the test speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into a plurality of frames with a frame length of 15 to 30 milliseconds and a frame shift of 5 to 15 milliseconds, and extracts acoustic features from each frame. The frame length of this embodiment is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic feature of this embodiment is the 20-dimensional MFCC feature (including the 1-dimensional energy feature) + the 13-dimensional RASTA-PLP feature + the first-order difference features + the second-order difference features, 99 dimensions in total.
Step 4: the voiceprint recognition front end adopts the GMM-UBM method and applies the Gaussian mixture model obtained in the training stage to calculate the zero-order and first-order statistics of each audio segment, which together form a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment is 204800-dimensional.
Step 5: the voiceprint recognition front end adopts the i-vector method and applies the T matrix obtained in the training stage to reduce the high-dimensional feature vector output by the GMM-UBM to a low-dimensional space. The output feature dimension of this embodiment is 400, i.e., the 204800-dimensional feature of each audio segment is mapped to a 400-dimensional feature.
Step 6: the voiceprint feature mapping module adopts formula (2) and applies the A matrix obtained in the training stage to further map the i-vector feature into an n-dimensional voiceprint feature z^test (n is the number of speakers in the training set).
Step 7: the voiceprint feature mapping module averages the voiceprint feature vectors z^test of all audio segments of any test speaker to obtain the test speaker's voiceprint feature model z̄^test.
Step 8: the voiceprint classifier adopts the cosine similarity classifier to calculate the similarity between z̄^test and z̄^enroll: score = (z̄^test · z̄^enroll) / (‖z̄^test‖ ‖z̄^enroll‖), and compares it with a decision threshold δ to decide whether the test speaker corresponding to z̄^test and the registered speaker corresponding to z̄^enroll are the same speaker.
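The decision in step 8 is a normalized dot product compared against δ; δ itself is not prescribed by the method and would be tuned on held-out data. A sketch with illustrative names:

```python
import numpy as np

def same_speaker(z_test, z_enroll, delta):
    """Cosine similarity between the test and enrolled voiceprint models,
    thresholded at the decision threshold delta."""
    score = z_test @ z_enroll / (np.linalg.norm(z_test) * np.linalg.norm(z_enroll))
    return score >= delta
```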
(II) DNN/i-vector + LR + cosine voiceprint recognition system:
the system adopts DNN/i-vector as a voiceprint recognition front end, adopts LR shown in FIG. 2 as a voiceprint feature mapping module of a voiceprint recognition rear end, and adopts cosine similarity as a voiceprint classifier. The three stages are as follows:
1) a training stage:
Step 1: the voiceprint recognition front end filters out the mute sections and noise sections of each audio recording by using voice endpoint detection, and reserves the audio segments only containing the training speakers' voices.
Step 2: the voiceprint recognition front end segments all audio in the training database into fixed-length segments of 3 to 30 seconds; this embodiment segments the audio into 15-second segments.
Step 3: the voiceprint recognition front end divides each audio segment into a plurality of frames with a frame length of 15 to 30 milliseconds and a frame shift of 5 to 15 milliseconds, and extracts acoustic features from each frame. The frame length of this embodiment is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic feature of this embodiment is the 20-dimensional MFCC feature (including the 1-dimensional energy feature) + the 13-dimensional RASTA-PLP feature + the first-order difference features + the second-order difference features, 99 dimensions in total.
Step 4: the voiceprint recognition front end adopts the DNN-UBM method and uses an independent speech recognition database containing speech content annotations to train a deep neural network acoustic model Λ with U_DNN output states. The acoustic model used in this embodiment outputs 8073 states.
Step 5: the voiceprint recognition front end adopts the DNN-UBM method, uses the acoustic model Λ to recognize the audio segments in the training database, and extracts the U_DNN-dimensional posterior probability vector of each frame of data. The posterior probability vector of each frame of data obtained in this embodiment is 8073-dimensional.
Step 6: the voiceprint recognition front end adopts the DNN-UBM method, discards the output states with lower posterior probability and retains only the U'_DNN output states with higher posterior probability. Accordingly, the posterior probability vector of each frame of data is also reduced to U'_DNN dimensions. In this embodiment U'_DNN is set to 3096.
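The patent does not spell out exactly how the retained states are chosen or whether the pruned posteriors are renormalized; one plausible reading, sketched below with illustrative names, ranks states by their average posterior over the training data, keeps the top U'_DNN = 3096, and renormalizes each frame over the kept states.

```python
import numpy as np

def prune_output_states(posteriors, n_keep=3096):
    """posteriors: T x U_DNN frame posteriors from the acoustic model.
    Keeps the n_keep states with the highest average posterior; the kept
    index set is fixed at training time and reused at enrollment/test."""
    keep = np.argsort(posteriors.mean(axis=0))[-n_keep:]
    pruned = posteriors[:, keep]
    pruned = pruned / pruned.sum(axis=1, keepdims=True)  # renormalize frames
    return pruned, keep
```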
Step 7: the voiceprint recognition front end adopts the DNN-UBM method and trains a Gaussian mixture model containing U'_DNN Gaussian components to obtain the model Σ_DNN. This embodiment trains a Gaussian mixture model containing 3096 Gaussian components.
Step 8: the voiceprint recognition front end adopts the GMM-UBM method and applies the Gaussian mixture model Σ_DNN to calculate the zero-order and first-order statistics of each audio segment, which together form a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment is 309600-dimensional.
Step 9: the voiceprint recognition front end trains an i-vector model using the i-vector method to obtain the T_DNN matrix.
Step 10: the voiceprint recognition front end adopts the i-vector method and uses the T_DNN matrix to reduce the high-dimensional feature vector output by the DNN-UBM to a low-dimensional space. The output feature dimension of this embodiment is 400, i.e., the 309600-dimensional feature of each audio segment is mapped to a 400-dimensional feature.
Step 11: the voiceprint feature mapping module trains the linear regression model using formula (1) to obtain the A_DNN matrix. The A_DNN matrix of this embodiment is a 400 × n matrix.
2) Registration phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of registered audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the registered speaker.
Step 2: the voiceprint recognition front-end segments all audio in the registered speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into a plurality of frames with a frame length of 15 to 30 milliseconds and a frame shift of 5 to 15 milliseconds, and extracts acoustic features from each frame. The frame length of this embodiment is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic feature of this embodiment is the 20-dimensional MFCC feature (including the 1-dimensional energy feature) + the 13-dimensional RASTA-PLP feature + the first-order difference features + the second-order difference features, 99 dimensions in total.
Step 4: the voiceprint recognition front end adopts the DNN-UBM method, uses the acoustic model Λ to recognize the audio segments of the registered speaker, and extracts the U_DNN-dimensional posterior probability vector of each frame of data. The posterior probability vector of each frame of data obtained in this embodiment is 8073-dimensional.
Step 5: the voiceprint recognition front end adopts the DNN-UBM method, discards the output states with lower posterior probability and retains only the U'_DNN output states with higher posterior probability (which states are retained is determined in the training phase). Accordingly, the posterior probability vector of each frame of data is also reduced to U'_DNN dimensions. In this embodiment U'_DNN is set to 3096.
Step 6: the voiceprint recognition front end adopts the GMM-UBM method and applies the Gaussian mixture model Σ_DNN to calculate the zero-order and first-order statistics of each audio segment, which together form a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment is 309600-dimensional.
Step 7: the voiceprint recognition front end adopts the i-vector method and uses the T_DNN matrix obtained in the training stage to reduce the high-dimensional feature vector output by the DNN-UBM to a low-dimensional space. The output feature dimension of this embodiment is 400, i.e., the 309600-dimensional feature of each audio segment is mapped to a 400-dimensional feature.
Step 8: the voiceprint feature mapping module adopts formula (2) and applies the A_DNN matrix obtained in the training stage to further map the i-vector feature into an n-dimensional voiceprint feature z^enroll (n is the number of speakers in the training set).
Step 9: the voiceprint feature mapping module averages the voiceprint feature vectors z^enroll of all audio segments of any registered speaker to obtain the registered speaker's voiceprint feature model z̄^enroll.
3) Testing phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of test audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the test speaker.
Step 2: the voiceprint recognition front-end segments all audio in the test speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into a plurality of frames with a frame length of 15 to 30 milliseconds and a frame shift of 5 to 15 milliseconds, and extracts acoustic features from each frame. The frame length of this embodiment is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic feature of this embodiment is the 20-dimensional MFCC feature (including the 1-dimensional energy feature) + the 13-dimensional RASTA-PLP feature + the first-order difference features + the second-order difference features, 99 dimensions in total.
Step 4: the voiceprint recognition front end adopts the DNN-UBM method, uses the acoustic model Λ to recognize the audio segments of the test speaker, and extracts the U_DNN-dimensional posterior probability vector of each frame of data. The posterior probability vector of each frame of data obtained in this embodiment is 8073-dimensional.
Step 5: the voiceprint recognition front end adopts the DNN-UBM method, discards the output states with lower posterior probability and retains only the U'_DNN output states with higher posterior probability (which states are retained is determined in the training phase). Accordingly, the posterior probability vector of each frame of data is also reduced to U'_DNN dimensions. In this embodiment U'_DNN is set to 3096.
Step 6: the voiceprint recognition front end adopts the GMM-UBM method and applies the Gaussian mixture model Σ_DNN to calculate the zero-order and first-order statistics of each audio segment, which together form a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment is 309600-dimensional.
Step 7: the voiceprint recognition front end adopts the i-vector method and uses the T_DNN matrix obtained in the training stage to reduce the high-dimensional feature vector output by the DNN-UBM to a low-dimensional space. The output feature dimension of this embodiment is 400, i.e., the 309600-dimensional feature of each audio segment is mapped to a 400-dimensional feature.
Step 8: the voiceprint feature mapping module adopts formula (2) and applies the A_DNN matrix obtained in the training stage to further map the i-vector feature into an n-dimensional voiceprint feature z^test (n is the number of speakers in the training set).
Step 9: the voiceprint feature mapping module averages the voiceprint feature vectors z^test of all audio segments of any test speaker to obtain the test speaker's voiceprint feature model z̄^test.
Step 10: the voiceprint classifier adopts the cosine similarity classifier to calculate the similarity between z̄^test and z̄^enroll, and compares it with the decision threshold δ to decide whether the test speaker and the registered speaker are the same speaker.
(III) d-vector + LR + cosine voiceprint recognition system:
the system adopts d-vector as the voiceprint recognition front end, the LR of the invention as the voiceprint feature mapping module of the voiceprint recognition rear end and cosine similarity as the voiceprint classifier. The three stages are as follows:
1) a training stage:
Step 1: the voiceprint recognition front end filters out the mute sections and noise sections of each audio recording by using voice endpoint detection, and reserves the audio segments only containing the training speakers' voices.
Step 2: the voiceprint recognition front end segments all audio in the training database into fixed-length segments of 3 to 30 seconds; this embodiment segments the audio into 15-second segments.
Step 3: the voiceprint recognition front end divides each audio segment into a plurality of frames with a frame length of 15 to 30 milliseconds and a frame shift of 5 to 15 milliseconds, and extracts acoustic features from each frame. The frame length of this embodiment is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic feature of this embodiment is the 20-dimensional MFCC feature (including the 1-dimensional energy feature) + the 13-dimensional RASTA-PLP feature + the first-order difference features + the second-order difference features, 99 dimensions in total.
Step 4: the voiceprint recognition front end trains a deep neural network containing n output neurons using the existing d-vector method to obtain the model Σ_d-vector, where n is the number of speakers in the training dataset. Suppose the top hidden layer of the Σ_d-vector model contains U_d-vector hidden neurons. In this embodiment U_d-vector is set to 400.
Step 5: the voiceprint recognition front end adopts the d-vector method: the Σ_d-vector model processes each frame of speech, the output of the model's top hidden layer is taken as the feature of that frame, and the features of all frames of each audio segment are averaged to obtain the U_d-vector-dimensional feature vector of the segment. In this embodiment U_d-vector is set to 400.
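For illustration, a d-vector front end of this kind could look like the PyTorch sketch below; the patent fixes only the 400-unit top hidden layer and the n output neurons, so the network depth and other layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DVectorNet(nn.Module):
    def __init__(self, feat_dim=99, top_hidden=400, n_speakers=402):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, top_hidden), nn.ReLU())    # top hidden layer, U_d-vector = 400
        self.out = nn.Linear(top_hidden, n_speakers)  # speaker softmax, training only

    def forward(self, frames):                        # frames: T x feat_dim
        return self.out(self.hidden(frames))

    def d_vector(self, frames):
        """Segment d-vector: average of the top hidden layer over all frames."""
        with torch.no_grad():
            return self.hidden(frames).mean(dim=0)    # 400-dim
```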
Step 6: the voiceprint feature mapping module trains the linear regression model using formula (1) of the linear regression method to obtain the A matrix. The A matrix of this embodiment is a 400 × n matrix.
2) Registration phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of registered audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the registered speaker.
Step 2: the voiceprint recognition front-end segments all audio in the registered speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into a plurality of frames with a frame length of 15 to 30 milliseconds and a frame shift of 5 to 15 milliseconds, and extracts acoustic features from each frame. The frame length of this embodiment is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic feature of this embodiment is the 20-dimensional MFCC feature (including the 1-dimensional energy feature) + the 13-dimensional RASTA-PLP feature + the first-order difference features + the second-order difference features, 99 dimensions in total.
Step 4: the voiceprint recognition front end adopts the d-vector method: the Σ_d-vector model processes each frame of speech, the output of the model's top hidden layer is taken as the feature of that frame, and the features of all frames of each audio segment are averaged to obtain the U_d-vector-dimensional feature vector of the segment. In this embodiment U_d-vector is set to 400.
Step 5: the voiceprint feature mapping module adopts formula (2) and applies the A matrix obtained in the training stage to further map the d-vector feature into an n-dimensional voiceprint feature z^enroll (n is the number of speakers in the training set).
Step 6: the voiceprint feature mapping module averages the voiceprint feature vectors z^enroll of all audio segments of any registered speaker to obtain the registered speaker's voiceprint feature model z̄^enroll.
3) Testing phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of test audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the test speaker.
Step 2: the voiceprint recognition front-end segments all audio in the test speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into a plurality of frames with a frame length of 15 to 30 milliseconds and a frame shift of 5 to 15 milliseconds, and extracts acoustic features from each frame. The frame length of this embodiment is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic feature of this embodiment is the 20-dimensional MFCC feature (including the 1-dimensional energy feature) + the 13-dimensional RASTA-PLP feature + the first-order difference features + the second-order difference features, 99 dimensions in total.
Step 4: the voiceprint recognition front end adopts the d-vector method: the Σ_d-vector model processes each frame of speech, the output of the model's top hidden layer is taken as the feature of that frame, and the features of all frames of each audio segment are averaged to obtain the U_d-vector-dimensional feature vector of the segment. In this embodiment U_d-vector is set to 400.
Step 5: the voiceprint feature mapping module adopts formula (2) and applies the A matrix obtained in the training stage to further map the d-vector feature into an n-dimensional voiceprint feature z^test (n is the number of speakers in the training set).
Step 6: the voiceprint feature mapping module averages the voiceprint feature vectors z^test of all audio segments of any test speaker to obtain the test speaker's voiceprint feature model z̄^test.
Step 7: the voiceprint classifier adopts the cosine similarity classifier to calculate the similarity between z̄^test and z̄^enroll: score = (z̄^test · z̄^enroll) / (‖z̄^test‖ ‖z̄^enroll‖), and compares it with the decision threshold δ to decide whether the test speaker and the registered speaker are the same speaker.
Experimental validation was performed on the NIST SRE 2006 and NIST SRE 2008 datasets for the three examples above. The 8-conversation (8conv) data in the NIST SRE 2006 dataset was used as the training set: 402 speakers in total, with about 100 hours of effective speech. The 8-conversation data in the NIST SRE 2008 dataset was used as the enrollment and test sets: 395 speakers in total. Each test speaker's speech length is fixed at 30 seconds (cut into 2 segments of 15 seconds each), and each enrollment speaker's speech length is 150 seconds (cut into 10 segments of 15 seconds each). Approximately 15 million test trials were constructed over the enrolled and test speakers. The DNN acoustic model in the second example was trained using the Switchboard-1 database, which contains about 300 hours of accurately annotated speech.
Using the above test trials, the recognition error rate of the LR + cosine voiceprint recognition back end used in the three examples above was compared with that of other voiceprint recognition back ends; the comparison results are shown in Table 1:
TABLE 1 (rendered as an image in the original publication)
As can be seen from Table 1, LR + cosine achieves a lower recognition error rate than the conventional cosine, WCCN + cosine, LDA + cosine, and LDA + PLDA classifiers with the same front end.
Among the three examples above, GMM/i-vector + LR + cosine achieves the best performance of all the methods compared, a relative improvement of 27.19% over the best baseline system compared, GMM/i-vector + LDA + PLDA. DNN/i-vector + LR + cosine achieves a relative improvement of 23.39% over the best system using the same DNN/i-vector front end, DNN/i-vector + LDA + cosine. d-vector + LR + cosine achieves a relative improvement of 7.31% over the best system using the same d-vector front end.
It should be noted that the above embodiments are only specific examples of this disclosure; any algorithm that uses linear regression to obtain a voiceprint feature vector in a voiceprint recognition system falls within the protection scope of this patent.
The functions described in the method of the embodiment of the present application, if implemented in the form of software functional units and sold or used as independent products, may be stored in a storage medium readable by a computing device. Based on such understanding, part of the contribution to the prior art of the embodiments of the present application or part of the technical solution may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (8)
1. A voiceprint recognition method based on linear regression is characterized by comprising the following steps:
acquiring a first voiceprint feature vector from voice data;
mapping the first voiceprint feature vector into a second voiceprint feature vector by using a pre-trained linear regression model;
performing classification recognition on the second voiceprint feature vector;
wherein the training process of the linear regression model comprises the following steps:
obtaining training data {(x_{i,j}, y_{i,j})} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from each utterance in the voiceprint database, i = 1, …, n, j = 1, …, M_i; n is the number of speakers in the voiceprint database, and the i-th speaker corresponds to M_i utterances; y_{i,j} is the n-dimensional indicator vector of the i-th speaker, y_{i,j} = [0, …, 1, …, 0]^T; d is a preset value;
and using A = (XX^T)^(-1)XY^T to obtain the linear regression model, wherein X is the matrix formed by the voiceprint feature vectors of the training data and Y is the matrix formed by the indicator vectors of the training data.
2. The method of claim 1, wherein mapping the first voiceprint feature vector to a second voiceprint feature vector comprises:
using the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
3. The method of claim 1, wherein the performing classification recognition on the second voiceprint feature vector comprises:
using a cosine classifier to perform classification recognition on the second voiceprint feature vector.
4. The method of claim 1, wherein obtaining the first voiceprint feature vector from the speech data comprises:
the first voiceprint feature vector is obtained from the speech data using a GMM/i-vector algorithm, a DNN/i-vector algorithm, or a d-vector algorithm.
5. A system for voiceprint recognition based on linear regression, comprising:
the voice print feature extraction front end is used for acquiring a first voice print feature vector from voice data;
a voiceprint recognition back end, the voiceprint recognition back end comprising a voiceprint feature mapping module and a voiceprint classifier, the voiceprint feature mapping module being configured to map the first voiceprint feature vector to a second voiceprint feature vector using a pre-trained linear regression model; the voiceprint classifier is used for classifying and identifying the second voiceprint feature vector;
wherein the voiceprint feature mapping module is further configured to:
obtaining training data {(x_{i,j}, y_{i,j})} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from each utterance in the voiceprint database, i = 1, …, n, j = 1, …, M_i; n is the number of speakers in the voiceprint database, and the i-th speaker corresponds to M_i utterances; y_{i,j} is the n-dimensional indicator vector of the i-th speaker, y_{i,j} = [0, …, 1, …, 0]^T; d is a preset value;
and using A = (XX^T)^(-1)XY^T to obtain the linear regression model, wherein X is the matrix formed by the voiceprint feature vectors of the training data and Y is the matrix formed by the indicator vectors of the training data.
6. The system of claim 5, wherein the voiceprint feature mapping module is configured to map the first voiceprint feature vector to a second voiceprint feature vector using a pre-trained linear regression model comprising:
the voiceprint feature mapping module is specifically configured to use the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
7. The system of claim 5, wherein the voiceprint classifier comprises: and a cosine classifier.
8. The system of claim 5, wherein the voiceprint feature extraction front end comprises:
a GMM/i-vector front end, a DNN/i-vector front end, or a d-vector front end.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810141059.0A CN108091326B (en) | 2018-02-11 | 2018-02-11 | Voiceprint recognition method and system based on linear regression |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810141059.0A CN108091326B (en) | 2018-02-11 | 2018-02-11 | Voiceprint recognition method and system based on linear regression |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108091326A CN108091326A (en) | 2018-05-29 |
CN108091326B true CN108091326B (en) | 2021-08-06 |
Family
ID=62194472
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810141059.0A Active CN108091326B (en) | 2018-02-11 | 2018-02-11 | Voiceprint recognition method and system based on linear regression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108091326B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065028B (en) * | 2018-06-11 | 2022-12-30 | 平安科技(深圳)有限公司 | Speaker clustering method, speaker clustering device, computer equipment and storage medium |
CN109119069B (en) * | 2018-07-23 | 2020-08-14 | 深圳大学 | Specific crowd identification method, electronic device and computer readable storage medium |
CN109367350B (en) * | 2018-10-11 | 2020-08-11 | 山东科技大学 | Automatic starting method and system for vehicle air conditioner |
CN111462760B (en) * | 2019-01-21 | 2023-09-26 | 阿里巴巴集团控股有限公司 | Voiceprint recognition system, voiceprint recognition method, voiceprint recognition device and electronic equipment |
CN110517698B (en) * | 2019-09-05 | 2022-02-01 | 科大讯飞股份有限公司 | Method, device and equipment for determining voiceprint model and storage medium |
CN110610709A (en) * | 2019-09-26 | 2019-12-24 | 浙江百应科技有限公司 | Identity distinguishing method based on voiceprint recognition |
CN110853654B (en) * | 2019-11-17 | 2021-12-21 | 西北工业大学 | Model generation method, voiceprint recognition method and corresponding device |
CN111933147B (en) * | 2020-06-22 | 2023-02-14 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1366295A (en) * | 2000-07-05 | 2002-08-28 | 松下电器产业株式会社 | Speaker's inspection and speaker's identification system and method based on prior knowledge |
CN103971690A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Voiceprint recognition method and device |
CN106601258A (en) * | 2016-12-12 | 2017-04-26 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Speaker identification method capable of information channel compensation based on improved LSDA algorithm |
CN107517207A (en) * | 2017-03-13 | 2017-12-26 | 平安科技(深圳)有限公司 | Server, auth method and computer-readable recording medium |
CN107623614A (en) * | 2017-09-19 | 2018-01-23 | 百度在线网络技术(北京)有限公司 | Method and apparatus for pushed information |
CN107633845A (en) * | 2017-09-11 | 2018-01-26 | 清华大学 | A kind of duscriminant local message distance keeps the method for identifying speaker of mapping |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100571574B1 (en) * | 2004-07-26 | 2006-04-17 | 한양대학교 산학협력단 | Similar Speaker Recognition Method Using Nonlinear Analysis and Its System |
- 2018-02-11 CN CN201810141059.0A patent/CN108091326B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1366295A (en) * | 2000-07-05 | 2002-08-28 | 松下电器产业株式会社 | Speaker's inspection and speaker's identification system and method based on prior knowledge |
CN103971690A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Voiceprint recognition method and device |
CN106601258A (en) * | 2016-12-12 | 2017-04-26 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Speaker identification method capable of information channel compensation based on improved LSDA algorithm |
CN107517207A (en) * | 2017-03-13 | 2017-12-26 | 平安科技(深圳)有限公司 | Server, auth method and computer-readable recording medium |
CN107633845A (en) * | 2017-09-11 | 2018-01-26 | 清华大学 | A kind of duscriminant local message distance keeps the method for identifying speaker of mapping |
CN107623614A (en) * | 2017-09-19 | 2018-01-23 | 百度在线网络技术(北京)有限公司 | Method and apparatus for pushed information |
Also Published As
Publication number | Publication date |
---|---|
CN108091326A (en) | 2018-05-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |