CN108091326B - Voiceprint recognition method and system based on linear regression - Google Patents

Voiceprint recognition method and system based on linear regression

Info

Publication number
CN108091326B
CN108091326B CN201810141059.0A CN201810141059A
Authority
CN
China
Prior art keywords
voiceprint
vector
feature vector
voiceprint feature
linear regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810141059.0A
Other languages
Chinese (zh)
Other versions
CN108091326A (en)
Inventor
张晓雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201810141059.0A
Publication of CN108091326A
Application granted
Publication of CN108091326B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a voiceprint recognition method and system based on linear regression, wherein a first voiceprint feature vector is obtained from voice data, a pre-trained linear regression model is used for mapping the first voiceprint feature vector to a second voiceprint feature vector, and the second voiceprint feature vector is subjected to classification recognition. The linear regression model is innovatively introduced into the field of voiceprint recognition, and experiments prove that the accuracy of voiceprint recognition can be effectively improved.

Description

Voiceprint recognition method and system based on linear regression
Technical Field
The application relates to the field of electronic information, in particular to a voiceprint recognition method and system based on linear regression.
Background
Voiceprint recognition systems typically include two parts, a voiceprint feature extraction front-end and a voiceprint recognition back-end.
The voiceprint feature extraction front end extracts the speaker's voiceprint features from the speaker's utterances: a sentence of speech of arbitrary length is mapped by a model into a vector of fixed length. Common algorithms used by the voiceprint feature extraction front end include: the universal background model based on Gaussian mixture models combined with the identity vector (i-vector) algorithm (GMM/i-vector for short); the universal background model based on a deep-learning speech recognition acoustic model combined with the i-vector algorithm (DNN/i-vector for short); and the d-vector algorithm, which classifies speakers with a deep learning model and outputs its top hidden layer as the speaker's voiceprint vector.
The voiceprint recognition back end classifies the speaker's voiceprint vector with a supervised machine learning algorithm. It can be divided into two parts: the first part maps the voiceprint feature vector into another, new voiceprint feature vector with a supervised machine learning method, and the second part classifies the new, dimension-reduced voiceprint feature vector with a supervised machine learning method. For the first part, common mapping methods include Linear Discriminant Analysis (LDA), Within-Class Covariance Normalization (WCCN), and Nuisance Attribute Projection (NAP), among others. For the second part, common classifiers include the cosine distance classifier, the Support Vector Machine (SVM) classifier, and the Probabilistic Linear Discriminant Analysis (PLDA) classifier, among others. The LDA + PLDA combination of back-end algorithms achieves the best performance in many standardized tests and is widely adopted by practical systems at present.
The voiceprint feature extraction front end and the voiceprint recognition back end can be combined freely to form a voiceprint recognition system. However, the accuracy of current voiceprint recognition still needs to be improved.
Disclosure of Invention
The application provides a voiceprint recognition method and system based on linear regression, and aims to solve the problem of how to improve the accuracy of voiceprint recognition.
In order to achieve the above object, the present application provides the following technical solutions:
a voiceprint recognition method based on linear regression comprises the following steps:
acquiring a first voiceprint feature vector from voice data;
mapping the first voiceprint feature vector into a second voiceprint feature vector by using a pre-trained linear regression model;
and carrying out classification recognition on the second voiceprint feature vector.
Optionally, the mapping the first voiceprint feature vector to the second voiceprint feature vector includes:
using the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
Optionally, the training process of the linear regression model includes:
obtaining training data {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from the j-th sentence of the i-th speaker in the voiceprint database, n is the number of speakers in the voiceprint database, any speaker i corresponds to M_i sentences, y_{i,j} = [0, ..., 1, ..., 0]^T is the n-dimensional indicative vector of the i-th speaker, and d is a preset value;
using A = (XX^T)^{-1}XY^T to obtain the linear regression model, wherein X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
Optionally, the carrying out classification recognition on the second voiceprint feature vector includes:
using a cosine classifier to classify and recognize the second voiceprint feature vector.
Optionally, the obtaining the first voiceprint feature vector from the voice data includes:
the first voiceprint feature vector is obtained from the speech data using a GMM/i-vector algorithm, a DNN/i-vector algorithm, or a d-vector algorithm.
A system for voiceprint recognition based on linear regression, comprising:
the voice print feature extraction front end is used for acquiring a first voice print feature vector from voice data;
a voiceprint recognition back end, the voiceprint recognition back end comprising a voiceprint feature mapping module and a voiceprint classifier, the voiceprint feature mapping module being configured to map the first voiceprint feature vector to a second voiceprint feature vector using a pre-trained linear regression model; and the voiceprint classifier is used for classifying and identifying the second voiceprint feature vector.
Optionally, the voiceprint feature mapping module is configured to map the first voiceprint feature vector to a second voiceprint feature vector by using a pre-trained linear regression model, and includes:
the voiceprint feature mapping module is specifically configured to use the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
Optionally, the voiceprint feature mapping module is further configured to:
obtaining training data {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from the j-th utterance of the i-th speaker in the voiceprint database, n is the number of speakers in the voiceprint database, any speaker i corresponds to M_i utterances, y_{i,j} = [0, ..., 1, ..., 0]^T is the n-dimensional indicative vector of the i-th speaker, and d is a preset value;
using A = (XX^T)^{-1}XY^T to obtain the linear regression model, wherein X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
Optionally, the voiceprint classifier includes: and a cosine classifier.
Optionally, the voiceprint feature extraction front end includes:
a GMM/i-vector front end, a DNN/i-vector front end, or a d-vector front end.
The method and the system for voiceprint recognition based on linear regression acquire a first voiceprint feature vector from voice data, map the first voiceprint feature vector into a second voiceprint feature vector by using a pre-trained linear regression model, and perform classification recognition on the second voiceprint feature vector. The linear regression model is innovatively introduced into the field of voiceprint recognition, and experiments prove that the accuracy of voiceprint recognition can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a voiceprint recognition system;
fig. 2 is a flowchart of a voiceprint recognition method based on linear regression disclosed in the embodiment of the present application.
Detailed Description
FIG. 1 is a schematic diagram of a voiceprint recognition system including a voiceprint feature extraction front end and a voiceprint recognition back end. The voiceprint recognition back end also comprises a voiceprint feature mapping module and a voiceprint classifier.
In order to improve the accuracy of voiceprint recognition, in the embodiment of the present application, the first part in the voiceprint recognition backend, i.e. the voiceprint feature mapping module, is improved. The core point of the method is that a trained Linear Regression (LR) model is used for mapping a voiceprint feature vector extracted from a voiceprint feature extraction front end into a new voiceprint feature vector, and the new voiceprint feature vector is used as a basis for voiceprint classification so as to improve accuracy of subsequent voiceprint classification.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The workflow of the back-end of the voiceprint recognition system shown in figure 1 can be divided into three phases: a training phase, a registration phase and a testing phase. The training of the LR model is performed during a training phase, and both the registration phase and the testing phase require the use of a trained LR model.
The above three stages are explained in detail below. Fig. 2 is a voiceprint recognition method based on linear regression, which includes the following steps:
first, training phase
S201: training data is prepared.
Suppose that the voiceprint database contains the speech data of n speakers and that the i-th speaker corresponds to M_i sentences. The voiceprint feature extraction front end extracts a d-dimensional voiceprint feature vector x_{i,j} from each sentence, where i = 1, ..., n and j = 1, ..., M_i. d is a predetermined value that may be chosen between 200 and 800 depending on the task; in this embodiment it is set to 400.
Each of the n speakers is assigned a number: the first speaker is numbered 1, the i-th speaker is numbered i, and the n-th speaker is numbered n, so the numbers of all speakers form the sequence 1, ..., n. Each number is expanded into a 0/1-coded indicative vector, i.e. the indicative vector of the i-th speaker is the n-dimensional vector y_{i,j} = [0, ..., 1, ..., 0]^T, where the 1 appears in the i-th position (for example, the indicative vector of the speaker numbered 2 is y_{2,j} = [0, 1, ..., 0]^T).
In this embodiment, the supervised training data is D^train = {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i}, where the superscript 'train' denotes the training phase.
S202: the LR model was trained using the supervised training data obtained above.
Specifically, the LR model is obtained using equation (1):
A = (XX^T)^{-1}XY^T    (1)
where X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
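As an illustration only, the following is a minimal NumPy sketch of how the closed-form solution of equation (1) might be computed; the function and variable names (train_lr_model, voiceprint_vectors, speaker_ids) are hypothetical and not part of the patent.

import numpy as np

def train_lr_model(voiceprint_vectors, speaker_ids, n_speakers):
    """Train the linear regression mapping A = (X X^T)^{-1} X Y^T of equation (1).

    voiceprint_vectors: list of d-dimensional front-end vectors x_{i,j}
    speaker_ids: list of 0-based speaker indices, one per vector
    n_speakers: n, the number of speakers in the training set
    """
    X = np.stack(voiceprint_vectors, axis=1)        # d x N matrix of voiceprint vectors
    Y = np.zeros((n_speakers, X.shape[1]))          # n x N matrix of indicative (one-hot) vectors
    Y[speaker_ids, np.arange(X.shape[1])] = 1.0
    # Solve (X X^T) A = X Y^T rather than forming the matrix inverse explicitly.
    A = np.linalg.solve(X @ X.T, X @ Y.T)           # d x n linear regression model
    return A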
Second, registration stage
S203: voice data of the registered persons is acquired, and the registration data {x^enroll} is extracted from it, where the superscript 'enroll' denotes the registration phase.
The process of extracting the registration data may be the same as the process of extracting the training data in step S201, and is not described here again.
S204: the registration data is mapped into a new voiceprint feature vector using the LR model obtained by training; the new voiceprint feature vector can be regarded as the voiceprint feature model of the registrant.
Specifically, the mapping is performed using equation (2):
z = A^T x    (2)
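Continuing the sketch above, the following hedged example applies equation (2) and averages a registrant's mapped segment vectors into a voiceprint feature model, as described in S204; the helper names are again hypothetical.

import numpy as np

def map_voiceprint(A, x):
    """Equation (2): map a first voiceprint feature vector x to z = A^T x."""
    return A.T @ x

def enroll_speaker(A, enrollment_vectors):
    """Map every enrollment segment with the LR model and average the results
    to obtain the registrant's voiceprint feature model (an n-dimensional vector)."""
    mapped = [map_voiceprint(A, x) for x in enrollment_vectors]
    return np.mean(mapped, axis=0)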
third, testing stage
S205: test voice data is obtained, and the test data {x^test} is extracted from it, where the superscript 'test' denotes the test phase.
S206: the test data is mapped into a new voiceprint feature vector using the LR model obtained by training.
S207: the new voiceprint feature vector obtained in step S206 is compared with the voiceprint feature models of the registrants to identify the registrant corresponding to the test voice data, i.e. the registrant who produced the test voice data.
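A minimal sketch of the comparison in S207, assuming the cosine similarity classifier used in the embodiments below; the registry dictionary and the decision threshold are illustrative assumptions, not values given in the patent.

import numpy as np

def cosine_similarity(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify_speaker(test_model, enrolled_models, threshold=0.5):
    """Compare a test voiceprint model with every registrant's model.

    enrolled_models: dict mapping registrant name -> enrolled voiceprint model
    Returns the best-matching registrant, or None if no score exceeds the threshold.
    """
    scores = {name: cosine_similarity(test_model, model)
              for name, model in enrolled_models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None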
As can be seen from the steps in fig. 2, the back end of the voiceprint recognition system (i.e., the voiceprint recognition back end) adopts a register-then-recognize mechanism: a user first registers in the system, and the system obtains the registrant's voiceprint feature model using the trained LR model. In the testing stage, the system can then recognize which registrant produced the collected voice, so that the voice data is recognized.
In the research process, through experiments with a large number of machine learning models, the applicant found that the voiceprint feature vector mapped by the LR model gives the subsequent classification and recognition a higher accuracy.
The voiceprint recognition back end using the flow shown in fig. 2 can be used in combination with a conventional voiceprint feature extraction front end to constitute the voiceprint recognition system shown in fig. 1. The following will exemplify the working flow of three voiceprint recognition systems in which the voiceprint recognition back end of the flow shown in fig. 2 is combined with different voiceprint feature extraction front ends.
(I) The GMM/i-vector + LR + cosine voiceprint recognition system:
The system adopts GMM/i-vector as the voiceprint recognition front end, the LR shown in FIG. 2 as the voiceprint feature mapping module of the voiceprint recognition back end, and cosine similarity as the voiceprint classifier. The three stages are as follows:
1) a training stage:
Step 1: the voiceprint recognition front end filters out the silent sections and noise sections of each piece of audio using voice endpoint detection, and retains only the audio segments containing the training speakers' speech.
Step 2: the voiceprint recognition front-end segments all audio in the training database into fixed length segments of 3 to 30 seconds in length, the present embodiment segments the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
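As a rough sketch of this framing and feature-extraction step, the snippet below computes the MFCC part with librosa (an assumed tool, not named in the patent); the 13-dimensional RASTA-PLP features and their deltas are not computed here, so the sketch yields only the 60 MFCC-related dimensions of the 99-dimensional features described above.

import librosa
import numpy as np

def extract_mfcc_with_deltas(audio, sr):
    """25 ms frames with a 10 ms shift; 20 MFCCs (the patent counts a 1-dimensional
    energy feature within these 20) plus first- and second-order differences,
    giving 60-dimensional frame features."""
    n_fft = int(0.025 * sr)
    hop = int(0.010 * sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T  # shape (num_frames, 60)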
Step 4: the voiceprint recognition front end trains a Gaussian mixture model with U Gaussian components using the existing GMM-UBM method to obtain the model Σ. This embodiment trains a Gaussian mixture model containing 2048 Gaussian components.
Step 5: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates the zeroth-order and first-order statistics into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 204800 dimensions.
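One way these statistics could be assembled is sketched below with NumPy; frame_posteriors and frame_features are hypothetical inputs (the per-frame UBM component responsibilities and the acoustic features). With 2048 components and 99-dimensional features this gives 2048 + 2048 × 99 = 204800 dimensions, matching the embodiment.

import numpy as np

def baum_welch_supervector(frame_posteriors, frame_features):
    """frame_posteriors: (T, C) responsibilities of C UBM components for T frames
    frame_features:   (T, D) acoustic features
    Returns the concatenated zeroth-order (C) and first-order (C*D) statistics."""
    N = frame_posteriors.sum(axis=0)           # zeroth-order statistics, shape (C,)
    F = frame_posteriors.T @ frame_features    # first-order statistics, shape (C, D)
    return np.concatenate([N, F.ravel()])      # (C + C*D,)-dimensional vector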
Step 6: the voiceprint recognition front end trains an i-vector model using the existing i-vector method to obtain the T matrix.
Step 7: the voiceprint recognition front end uses the i-vector method and applies the T matrix to reduce the high-dimensional feature vector output by the GMM-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 204800-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 8: the voiceprint feature mapping module trains a linear regression model using formula (1) of the linear regression method to obtain the A matrix. The A matrix in this embodiment is a 400 × n matrix.
2) Registration phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of registered audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the registered speaker.
Step 2: the voiceprint recognition front-end segments all audio in the registered speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model obtained in the training stage to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 204800 dimensions.
Step 5: the voiceprint recognition front end uses the i-vector method and applies the T matrix obtained in the training stage to reduce the high-dimensional feature vector output by the GMM-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 204800-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 6: the voiceprint feature mapping module further maps the i-vector feature into n-dimensional voiceprint features (n is the number of speakers in the training set) by applying the A matrix obtained in the training stage by adopting the formula (2) in the linear regression method provided by the invention
Figure BDA0001577612330000081
And 7: the voiceprint feature mapping module is used for obtaining voiceprint feature vectors of all audio segments of the registered speaker
Figure BDA0001577612330000082
Averaging
Figure BDA0001577612330000083
A voiceprint feature model of the registered speaker is obtained.
3) Testing phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of test audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the test speaker.
Step 2: the voiceprint recognition front-end segments all audio in the test speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model obtained in the training stage to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 204800 dimensions.
Step 5: the voiceprint recognition front end uses the i-vector method and applies the T matrix obtained in the training stage to reduce the high-dimensional feature vector output by the GMM-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 204800-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 6: voiceprint feature mapping module samplingUsing formula (2) and applying the A matrix obtained in the training stage to further map the i-vector characteristics into n-dimensional voiceprint characteristics (n is the number of speakers in the training set)
Figure BDA0001577612330000091
And 7: the voiceprint feature mapping module obtains the voiceprint feature vectors of all the audio frequency segments of any test speaker
Figure BDA0001577612330000092
Averaging
Figure BDA0001577612330000093
And obtaining a voiceprint characteristic model of the test speaker.
And 8: the voiceprint classifier adopts a cosine similarity classifier to calculate
Figure BDA0001577612330000094
And
Figure BDA0001577612330000095
similarity of (c):
Figure BDA0001577612330000096
and comparing with a decision threshold delta to decide
Figure BDA0001577612330000097
Whether or not to cooperate with
Figure BDA0001577612330000098
Are the same speaker.
(II) DNN/i-vector + LR + cosine voiceprint recognition system:
the system adopts DNN/i-vector as a voiceprint recognition front end, adopts LR shown in FIG. 2 as a voiceprint feature mapping module of a voiceprint recognition rear end, and adopts cosine similarity as a voiceprint classifier. The three stages are as follows:
1) a training stage:
Step 1: the voiceprint recognition front end filters out the silent sections and noise sections of each piece of audio using voice endpoint detection, and retains only the audio segments containing the training speakers' speech.
Step 2: the voiceprint recognition front-end segments all audio in the training database into fixed length segments of 3 to 30 seconds in length, the present embodiment segments the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the DNN-UBM method and trains, on an independent speech recognition database containing phonetic content annotations, a deep neural network acoustic model Λ with U_DNN output states. The acoustic model used in this embodiment has 8073 output states.
Step 5: the voiceprint recognition front end uses the DNN-UBM method, applies the acoustic model Λ to the audio segments in the training database, and extracts a U_DNN-dimensional posterior probability vector from each frame of data. The posterior probability vector of each frame of data in this embodiment has 8073 dimensions.
Step 6: the voiceprint recognition front end adopts a DNN-UBM method, discards output states with lower posterior probability and only retains
Figure BDA0001577612330000101
And (4) an output state with a high posterior probability. Accordingly, the posterior probability vector of each frame of data is also adjusted to
Figure BDA0001577612330000102
And (5) maintaining. Of the present embodiment
Figure BDA0001577612330000103
3096 was set.
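One possible reading of this pruning step, sketched with NumPy; the function names are hypothetical, and renormalizing the retained posteriors is an assumption not stated in the text.

import numpy as np

def select_states(training_posteriors, k):
    """Pick the k output states with the highest average posterior over the training data."""
    avg = training_posteriors.mean(axis=0)      # (U_DNN,) average posterior per state
    return np.sort(np.argsort(avg)[-k:])        # indices of the retained states

def prune_posteriors(frame_posteriors, kept_states):
    """Reduce each frame's posterior vector to the retained states.
    Renormalizing so each frame sums to 1 is an assumption made for this sketch."""
    p = frame_posteriors[:, kept_states]
    return p / p.sum(axis=1, keepdims=True)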
Step 7: the voiceprint recognition front end uses the DNN-UBM method and trains a Gaussian mixture model containing U'_DNN Gaussian components to obtain the model Σ_DNN. This embodiment trains a Gaussian mixture model containing 3096 Gaussian components.
Step 8: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model Σ_DNN to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 309600 dimensions.
Step 9: the voiceprint recognition front end trains an i-vector model using the i-vector method to obtain the T_DNN matrix.
Step 10: the voiceprint recognition front end uses the i-vector method and applies the T_DNN matrix to reduce the high-dimensional feature vector output by the DNN-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 309600-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 11: the voiceprint feature mapping module trains a linear regression model using formula (1) to obtain the A_DNN matrix. The A_DNN matrix in this embodiment is a 400 × n matrix.
2) Registration phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of registered audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the registered speaker.
Step 2: the voiceprint recognition front-end segments all audio in the registered speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the DNN-UBM method, applies the acoustic model Λ to the registered speaker's audio segments, and extracts a U_DNN-dimensional posterior probability vector from each frame of data. The posterior probability vector of each frame of data in this embodiment has 8073 dimensions.
Step 5: the voiceprint recognition front end uses the DNN-UBM method, discards the output states with lower posterior probability, and retains only the U'_DNN output states with higher posterior probability (which states are retained is determined by the training stage). Accordingly, the posterior probability vector of each frame of data is also reduced to U'_DNN dimensions. In this embodiment U'_DNN is set to 3096.
Step 6: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model Σ_DNN to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 309600 dimensions.
Step 7: the voiceprint recognition front end uses the i-vector method and applies the T_DNN matrix obtained in the training stage to reduce the high-dimensional feature vector output by the DNN-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 309600-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 8: the voiceprint feature mapping module uses formula (2) and applies the A_DNN matrix obtained in the training stage to further map the i-vector features into n-dimensional voiceprint features z^enroll (n is the number of speakers in the training set).
Step 9: the voiceprint feature mapping module averages the voiceprint feature vectors z^enroll of all audio segments of any registered speaker to obtain the average vector m^enroll, which is the voiceprint feature model of the registered speaker.
3) Testing phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of test audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the test speaker.
Step 2: the voiceprint recognition front-end segments all audio in the test speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the DNN-UBM method, applies the acoustic model Λ to the test speaker's audio segments, and extracts a U_DNN-dimensional posterior probability vector from each frame of data. The posterior probability vector of each frame of data in this embodiment has 8073 dimensions.
Step 5: the voiceprint recognition front end uses the DNN-UBM method, discards the output states with lower posterior probability, and retains only the U'_DNN output states with higher posterior probability (which states are retained is determined by the training stage). Accordingly, the posterior probability vector of each frame of data is also reduced to U'_DNN dimensions. In this embodiment U'_DNN is set to 3096.
Step 6: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model Σ_DNN to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 309600 dimensions.
Step 7: the voiceprint recognition front end uses the i-vector method and applies the T_DNN matrix obtained in the training stage to reduce the high-dimensional feature vector output by the DNN-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 309600-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 8: the voiceprint feature mapping module uses formula (2) and applies the A_DNN matrix obtained in the training stage to further map the i-vector features into n-dimensional voiceprint features z^test (n is the number of speakers in the training set).
Step 9: the voiceprint feature mapping module averages the voiceprint feature vectors z^test of all audio segments of any test speaker to obtain the average vector m^test, which is the voiceprint feature model of the test speaker.
Step 10: the voiceprint classifier uses a cosine similarity classifier to calculate the similarity between m^enroll and m^test:
s = cos(m^enroll, m^test) = (m^enroll)^T m^test / (||m^enroll|| ||m^test||)
and compares it with a decision threshold δ to decide whether the test speaker and the registered speaker are the same speaker.
(III) d-vector + LR + cosine voiceprint recognition system:
the system adopts d-vector as the voiceprint recognition front end, the LR of the invention as the voiceprint feature mapping module of the voiceprint recognition rear end and cosine similarity as the voiceprint classifier. The three stages are as follows:
1) a training stage:
Step 1: the voiceprint recognition front end filters out the silent sections and noise sections of each piece of audio using voice endpoint detection, and retains only the audio segments containing the training speakers' speech.
Step 2: the voiceprint recognition front-end segments all audio in the training database into fixed length segments of 3 to 30 seconds in length, the present embodiment segments the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end trains a deep neural network containing n output neurons using the existing d-vector method to obtain the model Σ_d-vector, where n is the number of speakers in the training set. Suppose the highest hidden layer of the Σ_d-vector model contains U_d-vector hidden neurons. In this embodiment U_d-vector is set to 400.
Step 5: the voiceprint recognition front end uses the d-vector method: it runs the Σ_d-vector model on each frame of speech, takes the output of the highest hidden layer of the Σ_d-vector model as the feature of that frame, and averages the features of all frames of each audio segment to obtain a U_d-vector-dimensional feature vector for each audio segment. In this embodiment U_d-vector is set to 400.
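To make the d-vector extraction concrete, here is a toy NumPy forward pass whose last hidden layer output is averaged over frames; the layer structure, weights and names are placeholders, not the trained Σ_d-vector model.

import numpy as np

def extract_d_vector(frame_features, weights, biases):
    """frame_features: (T, D) acoustic features of one audio segment
    weights/biases: parameters of the hidden layers of a trained DNN
    (the softmax output layer is ignored at extraction time).
    Returns the segment-level d-vector: the last hidden layer averaged over frames."""
    h = frame_features
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)    # ReLU hidden layers
    return h.mean(axis=0)                 # (U_d-vector,)-dimensional d-vector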
Step 6: the voiceprint feature mapping module trains a linear regression model by adopting a formula (1) in the linear regression method to obtain an A matrix. The a matrix of this embodiment is a 400 × n matrix.
2) Registration phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of registered audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the registered speaker.
Step 2: the voiceprint recognition front-end segments all audio in the registered speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the d-vector method: it runs the Σ_d-vector model on each frame of speech, takes the output of the highest hidden layer of the Σ_d-vector model as the feature of that frame, and averages the features of all frames of each audio segment to obtain a U_d-vector-dimensional feature vector for each audio segment. In this embodiment U_d-vector is set to 400.
Step 6: the voiceprint feature mapping module further maps the i-vector features into n-dimensional voiceprint features (n is the number of speakers in the training set) by applying an A matrix obtained in a training stage by adopting a formula (2)
Figure BDA0001577612330000151
And 7: the voiceprint feature mapping module obtains the voiceprint feature vector from all the audio segments of any registered speaker
Figure BDA0001577612330000152
Averaging
Figure BDA0001577612330000153
A voiceprint feature model of the registered speaker is obtained.
3) Testing phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of test audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the test speaker.
Step 2: the voiceprint recognition front-end segments all audio in the test speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the d-vector method: it runs the Σ_d-vector model on each frame of speech, takes the output of the highest hidden layer of the Σ_d-vector model as the feature of that frame, and averages the features of all frames of each audio segment to obtain a U_d-vector-dimensional feature vector for each audio segment. In this embodiment U_d-vector is set to 400.
Step 5: the voiceprint feature mapping module uses formula (2) and applies the A matrix obtained in the training stage to further map the d-vector features into n-dimensional voiceprint features z^test (n is the number of speakers in the training set).
Step 6: the voiceprint feature mapping module averages the voiceprint feature vectors z^test of all audio segments of any test speaker to obtain the average vector m^test, which is the voiceprint feature model of the test speaker.
Step 7: the voiceprint classifier uses a cosine similarity classifier to calculate the similarity between m^enroll and m^test:
s = cos(m^enroll, m^test) = (m^enroll)^T m^test / (||m^enroll|| ||m^test||)
and compares it with a decision threshold δ to decide whether the test speaker and the registered speaker are the same speaker.
Experimental validation was performed on the NIST SRE 2006 and NIST SRE 2008 data sets for the three examples above. The 8-conversation data in the NIST SRE 2006 dataset was used as the training set, with 402 speakers in total and about 100 hours of effective speech. The 8-conversation data in the NIST SRE 2008 dataset was used as the enrollment and test sets, with 395 speakers in total. The voice length of each test speaker is fixed at 30 seconds (cut into 2 segments of 15 seconds each), and the voice length of each enrolled speaker is 150 seconds (cut into 10 segments of 15 seconds each). Approximately 15 million test trials were constructed between the enrolled speakers and the test speakers. The DNN acoustic model in the second example was trained using the Switchboard-1 database, which contains about 300 hours of precisely annotated speech.
Using the above test samples, the recognition error rates of the voiceprint recognition back end of LR + cosine used in the above three examples and other voiceprint recognition back ends were compared, and the comparison results are shown in table 1:
TABLE 1
(Table 1 is provided as an image in the original publication; it lists, for each of the three front ends, the recognition error rates of the cosine, WCCN + cosine, LDA + cosine, LDA + PLDA and LR + cosine back ends.)
As can be seen from table 1, the LR + cosine has a lower recognition error rate than the conventional cosine, WCCN + cosine, LDA + cosine and LDA + PLDA classifiers with the same front end.
In the above three examples, GMM/i-vector + LR + cosine achieves the best performance among all the methods compared, a relative improvement of 27.19% over GMM/i-vector + LDA + PLDA, the best competing voiceprint recognition system in the comparison. DNN/i-vector + LR + cosine achieves a relative improvement of 23.39% over DNN/i-vector + LDA + cosine, the best competing system using the same DNN/i-vector front end. d-vector + LR + cosine achieves a relative improvement of 7.31% over the best competing system using the same d-vector front end.
It should be noted that the above embodiments are only specific examples of the patent disclosure, and all the algorithms that use a linear regression algorithm for obtaining a voiceprint feature vector in a voiceprint recognition system are within the scope of the patent protection.
The functions described in the method of the embodiment of the present application, if implemented in the form of software functional units and sold or used as independent products, may be stored in a storage medium readable by a computing device. Based on such understanding, part of the contribution to the prior art of the embodiments of the present application or part of the technical solution may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A voiceprint recognition method based on linear regression is characterized by comprising the following steps:
acquiring a first voiceprint feature vector from voice data;
mapping the first voiceprint feature vector into a second voiceprint feature vector by using a pre-trained linear regression model;
performing classification recognition on the second voiceprint feature vector;
wherein the training process of the linear regression model comprises the following steps:
obtaining training data {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from the j-th utterance of the i-th speaker in the voiceprint database, n is the number of speakers in the voiceprint database, any speaker i corresponds to M_i utterances, y_{i,j} = [0, ..., 1, ..., 0]^T is the n-dimensional indicative vector of the i-th speaker, and d is a preset value;
using A = (XX^T)^{-1}XY^T to obtain the linear regression model, wherein X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
2. The method of claim 1, wherein mapping the first voiceprint feature vector to a second voiceprint feature vector comprises:
using the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
3. The method of claim 1, wherein the performing classification recognition on the second voiceprint feature vector comprises:
using a cosine classifier to classify and recognize the second voiceprint feature vector.
4. The method of claim 1, wherein obtaining the first voiceprint feature vector from the speech data comprises:
the first voiceprint feature vector is obtained from the speech data using a GMM/i-vector algorithm, a DNN/i-vector algorithm, or a d-vector algorithm.
5. A system for voiceprint recognition based on linear regression, comprising:
the voice print feature extraction front end is used for acquiring a first voice print feature vector from voice data;
a voiceprint recognition back end, the voiceprint recognition back end comprising a voiceprint feature mapping module and a voiceprint classifier, the voiceprint feature mapping module being configured to map the first voiceprint feature vector to a second voiceprint feature vector using a pre-trained linear regression model; the voiceprint classifier is used for classifying and identifying the second voiceprint feature vector;
wherein the voiceprint feature mapping module is further configured to:
obtaining training data {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from the j-th utterance of the i-th speaker in the voiceprint database, n is the number of speakers in the voiceprint database, any speaker i corresponds to M_i utterances, y_{i,j} = [0, ..., 1, ..., 0]^T is the n-dimensional indicative vector of the i-th speaker, and d is a preset value;
using A = (XX^T)^{-1}XY^T to obtain the linear regression model, wherein X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
6. The system of claim 5, wherein the voiceprint feature mapping module is configured to map the first voiceprint feature vector to a second voiceprint feature vector using a pre-trained linear regression model comprising:
the voiceprint feature mapping module is specifically configured to use the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
7. The system of claim 5, wherein the voiceprint classifier comprises: and a cosine classifier.
8. The system of claim 5, wherein the voiceprint feature extraction front end comprises:
a GMM/i-vector front end, a DNN/i-vector front end, or a d-vector front end.
CN201810141059.0A 2018-02-11 2018-02-11 Voiceprint recognition method and system based on linear regression Active CN108091326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810141059.0A CN108091326B (en) 2018-02-11 2018-02-11 Voiceprint recognition method and system based on linear regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810141059.0A CN108091326B (en) 2018-02-11 2018-02-11 Voiceprint recognition method and system based on linear regression

Publications (2)

Publication Number Publication Date
CN108091326A CN108091326A (en) 2018-05-29
CN108091326B true CN108091326B (en) 2021-08-06

Family

ID=62194472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810141059.0A Active CN108091326B (en) 2018-02-11 2018-02-11 Voiceprint recognition method and system based on linear regression

Country Status (1)

Country Link
CN (1) CN108091326B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109119069B (en) * 2018-07-23 2020-08-14 深圳大学 Specific crowd identification method, electronic device and computer readable storage medium
CN109367350B (en) * 2018-10-11 2020-08-11 山东科技大学 Automatic starting method and system for vehicle air conditioner
CN111462760B (en) * 2019-01-21 2023-09-26 阿里巴巴集团控股有限公司 Voiceprint recognition system, voiceprint recognition method, voiceprint recognition device and electronic equipment
CN110517698B (en) * 2019-09-05 2022-02-01 科大讯飞股份有限公司 Method, device and equipment for determining voiceprint model and storage medium
CN110610709A (en) * 2019-09-26 2019-12-24 浙江百应科技有限公司 Identity distinguishing method based on voiceprint recognition
CN110853654B (en) * 2019-11-17 2021-12-21 西北工业大学 Model generation method, voiceprint recognition method and corresponding device
CN111933147B (en) * 2020-06-22 2023-02-14 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1366295A (en) * 2000-07-05 2002-08-28 松下电器产业株式会社 Speaker's inspection and speaker's identification system and method based on prior knowledge
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN106601258A (en) * 2016-12-12 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Speaker identification method capable of information channel compensation based on improved LSDA algorithm
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
CN107623614A (en) * 2017-09-19 2018-01-23 百度在线网络技术(北京)有限公司 Method and apparatus for pushed information
CN107633845A (en) * 2017-09-11 2018-01-26 清华大学 A kind of duscriminant local message distance keeps the method for identifying speaker of mapping

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100571574B1 (en) * 2004-07-26 2006-04-17 한양대학교 산학협력단 Similar Speaker Recognition Method Using Nonlinear Analysis and Its System

Also Published As

Publication number Publication date
CN108091326A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN108091326B (en) Voiceprint recognition method and system based on linear regression
US11636860B2 (en) Word-level blind diarization of recorded calls with arbitrary number of speakers
US10109280B2 (en) Blind diarization of recorded calls with arbitrary number of speakers
Villalba et al. State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and speakers in the wild evaluations
JP5853029B2 (en) Passphrase modeling device and method for speaker verification, and speaker verification system
Soltane et al. Face and speech based multi-modal biometric authentication
US7475013B2 (en) Speaker recognition using local models
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
US11837236B2 (en) Speaker recognition based on signal segments weighted by quality
US20120232900A1 (en) Speaker recognition from telephone calls
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Haris et al. Robust speaker verification with joint sparse coding over learned dictionaries
Prasetio et al. Generalized Discriminant Methods for Improved X-Vector Back-end Based Stress Speech Recognition
US20220405363A1 (en) Methods for improving the performance of neural networks used for biometric authenticatio
Chandrakala et al. Combination of generative models and SVM based classifier for speech emotion recognition
Silovsky et al. Speech, speaker and speaker's gender identification in automatically processed broadcast stream
Dm et al. Speech based emotion recognition using combination of features 2-D HMM model
Valanchery Analysis of different classifier for the detection of double compressed AMR audio
Errity et al. A comparative study of linear and nonlinear dimensionality reduction for speaker identification
Trabelsi et al. Learning vector quantization for adapted gaussian mixture models in automatic speaker identification
Feng et al. Duration Normalization Algorithm Based on Feature Space Trajectory in Pathological Speech Recognition
Tashan et al. Two stage speaker verification using self organising map and multilayer perceptron neural network
Ye et al. Discriminant kernel learning for acoustic scene classification with multiple observations
Heryanto et al. A new direct access framework for speaker identification system
Madhusudhana Rao et al. Machine hearing system for teleconference authentication with effective speech analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant