CN108091326B - Voiceprint recognition method and system based on linear regression - Google Patents

Voiceprint recognition method and system based on linear regression

Info

Publication number
CN108091326B
CN108091326B CN201810141059.0A CN201810141059A
Authority
CN
China
Prior art keywords
voiceprint
vector
feature vector
voiceprint feature
linear regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810141059.0A
Other languages
Chinese (zh)
Other versions
CN108091326A (en)
Inventor
张晓雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201810141059.0A
Publication of CN108091326A
Application granted
Publication of CN108091326B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a voiceprint recognition method and system based on linear regression, wherein a first voiceprint feature vector is obtained from voice data, a pre-trained linear regression model is used for mapping the first voiceprint feature vector to a second voiceprint feature vector, and the second voiceprint feature vector is subjected to classification recognition. The linear regression model is innovatively introduced into the field of voiceprint recognition, and experiments prove that the accuracy of voiceprint recognition can be effectively improved.

Description

Voiceprint recognition method and system based on linear regression
Technical Field
The application relates to the field of electronic information, in particular to a voiceprint recognition method and system based on linear regression.
Background
Voiceprint recognition systems typically include two parts, a voiceprint feature extraction front-end and a voiceprint recognition back-end.
The voiceprint feature extraction front end extracts the speaker's voiceprint features from the speaker's utterances: a sentence of speech of arbitrary length is mapped by a model into a vector of fixed length. Common algorithms used by the voiceprint feature extraction front end include: the universal background model based on Gaussian mixture models combined with the identity vector (i-vector) algorithm (GMM/i-vector for short); the universal background model based on a deep-learning speech recognition acoustic model combined with the i-vector algorithm (DNN/i-vector for short); and the d-vector algorithm, which classifies speakers with a deep learning model and outputs its top hidden layer as the speaker's voiceprint vector.
The voiceprint recognition back end classifies the speaker's voiceprint vector with a supervised machine learning algorithm. It can be divided into two parts: the first part maps the voiceprint feature vector into another, new voiceprint feature vector with a supervised machine learning method, and the second part classifies the new, dimension-reduced voiceprint feature vector with a supervised machine learning method. For the first part, common mapping methods include Linear Discriminant Analysis (LDA), Within-Class Covariance Normalization (WCCN), and Nuisance Attribute Projection (NAP), among others. For the second part, common classifiers include the cosine distance classifier, the Support Vector Machine (SVM) classifier, and the Probabilistic Linear Discriminant Analysis (PLDA) classifier, among others. The LDA + PLDA combination of back-end algorithms achieves the best performance in many standardized tests and is widely adopted by practical systems at present.
The voiceprint feature extraction front end and the voiceprint recognition back end can be combined freely to form a voiceprint recognition system. However, the accuracy of current voiceprint recognition still needs to be improved.
Disclosure of Invention
The application provides a voiceprint recognition method and system based on linear regression, and aims to solve the problem of how to improve the accuracy of voiceprint recognition.
In order to achieve the above object, the present application provides the following technical solutions:
a voiceprint recognition method based on linear regression comprises the following steps:
acquiring a first voiceprint feature vector from voice data;
mapping the first voiceprint feature vector into a second voiceprint feature vector by using a pre-trained linear regression model;
and carrying out classification recognition on the second voiceprint feature vector.
Optionally, the mapping the first voiceprint feature vector to the second voiceprint feature vector includes:
using the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
Optionally, the training process of the linear regression model includes:
obtaining training data {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from the j-th sentence of the i-th speaker in the voiceprint database, n is the number of speakers in the voiceprint database, any speaker i corresponds to M_i sentences, y_{i,j} = [0, ..., 1, ..., 0]^T is the n-dimensional indicative vector of the i-th speaker, and d is a preset value;
using A = (XX^T)^{-1}XY^T to obtain the linear regression model, wherein X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
Optionally, the carrying out classification recognition on the second voiceprint feature vector includes:
using a cosine classifier to classify and recognize the second voiceprint feature vector.
Optionally, the obtaining the first voiceprint feature vector from the voice data includes:
the first voiceprint feature vector is obtained from the speech data using a GMM/i-vector algorithm, a DNN/i-vector algorithm, or a d-vector algorithm.
A system for voiceprint recognition based on linear regression, comprising:
the voice print feature extraction front end is used for acquiring a first voice print feature vector from voice data;
a voiceprint recognition back end, the voiceprint recognition back end comprising a voiceprint feature mapping module and a voiceprint classifier, the voiceprint feature mapping module being configured to map the first voiceprint feature vector to a second voiceprint feature vector using a pre-trained linear regression model; and the voiceprint classifier is used for classifying and identifying the second voiceprint feature vector.
Optionally, the voiceprint feature mapping module is configured to map the first voiceprint feature vector to a second voiceprint feature vector by using a pre-trained linear regression model, and includes:
the voiceprint feature mapping module is specifically configured to use the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
Optionally, the voiceprint feature mapping module is further configured to:
obtaining training data {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from the j-th utterance of the i-th speaker in the voiceprint database, n is the number of speakers in the voiceprint database, any speaker i corresponds to M_i utterances, y_{i,j} = [0, ..., 1, ..., 0]^T is the n-dimensional indicative vector of the i-th speaker, and d is a preset value;
using A = (XX^T)^{-1}XY^T to obtain the linear regression model, wherein X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
Optionally, the voiceprint classifier includes: and a cosine classifier.
Optionally, the voiceprint feature extraction front end includes:
a GMM/i-vector front end, a DNN/i-vector front end, or a d-vector front end.
The method and the system for voiceprint recognition based on linear regression acquire a first voiceprint feature vector from voice data, map the first voiceprint feature vector into a second voiceprint feature vector by using a pre-trained linear regression model, and perform classification recognition on the second voiceprint feature vector. The linear regression model is innovatively introduced into the field of voiceprint recognition, and experiments prove that the accuracy of voiceprint recognition can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a voiceprint recognition system;
fig. 2 is a flowchart of a voiceprint recognition method based on linear regression disclosed in the embodiment of the present application.
Detailed Description
FIG. 1 is a schematic diagram of a voiceprint recognition system including a voiceprint feature extraction front end and a voiceprint recognition back end. The voiceprint recognition back end also comprises a voiceprint feature mapping module and a voiceprint classifier.
In order to improve the accuracy of voiceprint recognition, in the embodiment of the present application, the first part in the voiceprint recognition backend, i.e. the voiceprint feature mapping module, is improved. The core point of the method is that a trained Linear Regression (LR) model is used for mapping a voiceprint feature vector extracted from a voiceprint feature extraction front end into a new voiceprint feature vector, and the new voiceprint feature vector is used as a basis for voiceprint classification so as to improve accuracy of subsequent voiceprint classification.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The workflow of the back-end of the voiceprint recognition system shown in figure 1 can be divided into three phases: a training phase, a registration phase and a testing phase. The training of the LR model is performed during a training phase, and both the registration phase and the testing phase require the use of a trained LR model.
The above three stages are explained in detail below. Fig. 2 is a voiceprint recognition method based on linear regression, which includes the following steps:
first, training phase
S201: training data is prepared.
Suppose that the voiceprint database contains the speech data of n speakers and that the i-th speaker corresponds to M_i sentences. The voiceprint feature extraction front end extracts a d-dimensional voiceprint feature vector x_{i,j} from each sentence, where i = 1, ..., n and j = 1, ..., M_i. d is a predetermined value that may be chosen between 200 and 800 depending on the task; in this embodiment it is set to 400.
Each of the n speakers is assigned a number: the first speaker is numbered 1, the i-th speaker is numbered i, and the n-th speaker is numbered n, so the numbers of all speakers form the sequence 1, ..., n. Each number is expanded into a 0/1-coded indicative vector, i.e. the indicative vector of the i-th speaker is the n-dimensional vector y_{i,j} = [0, ..., 1, ..., 0]^T, where the 1 appears in the i-th position (for example, the indicative vector of the speaker numbered 2 is y_{2,j} = [0, 1, ..., 0]^T).
In this embodiment, the supervised training data is D^train = {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i}, where the superscript 'train' denotes the training phase.
S202: the LR model was trained using the supervised training data obtained above.
Specifically, the LR model is obtained using equation (1):
A = (XX^T)^{-1}XY^T    (1)
where X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
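As an illustration only, the following is a minimal NumPy sketch of how the closed-form solution of equation (1) might be computed; the function and variable names (train_lr_model, voiceprint_vectors, speaker_ids) are hypothetical and not part of the patent.

import numpy as np

def train_lr_model(voiceprint_vectors, speaker_ids, n_speakers):
    """Train the linear regression mapping A = (X X^T)^{-1} X Y^T of equation (1).

    voiceprint_vectors: list of d-dimensional front-end vectors x_{i,j}
    speaker_ids: list of 0-based speaker indices, one per vector
    n_speakers: n, the number of speakers in the training set
    """
    X = np.stack(voiceprint_vectors, axis=1)        # d x N matrix of voiceprint vectors
    Y = np.zeros((n_speakers, X.shape[1]))          # n x N matrix of indicative (one-hot) vectors
    Y[speaker_ids, np.arange(X.shape[1])] = 1.0
    # Solve (X X^T) A = X Y^T rather than forming the matrix inverse explicitly.
    A = np.linalg.solve(X @ X.T, X @ Y.T)           # d x n linear regression model
    return A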
Second, registration stage
S203: voice data of the registered persons is acquired, and the registration data {x^enroll} is extracted from it, where the superscript 'enroll' denotes the registration phase.
The process of extracting the registration data may be the same as the process of extracting the training data in step S201, and is not described here again.
S204: the registration data is mapped into a new voiceprint feature vector using the LR model obtained by training; the new voiceprint feature vector can be regarded as the voiceprint feature model of the registrant.
Specifically, the mapping is performed using equation (2):
z = A^T x    (2)
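Continuing the sketch above, the following hedged example applies equation (2) and averages a registrant's mapped segment vectors into a voiceprint feature model, as described in S204; the helper names are again hypothetical.

import numpy as np

def map_voiceprint(A, x):
    """Equation (2): map a first voiceprint feature vector x to z = A^T x."""
    return A.T @ x

def enroll_speaker(A, enrollment_vectors):
    """Map every enrollment segment with the LR model and average the results
    to obtain the registrant's voiceprint feature model (an n-dimensional vector)."""
    mapped = [map_voiceprint(A, x) for x in enrollment_vectors]
    return np.mean(mapped, axis=0)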
third, testing stage
S205: test voice data is obtained, and the test data {x^test} is extracted from it, where the superscript 'test' denotes the test phase.
S206: the test data is mapped into a new voiceprint feature vector using the LR model obtained by training.
S207: the new voiceprint feature vector obtained in step S206 is compared with the voiceprint feature models of the registrants to identify the registrant corresponding to the test voice data, i.e. the registrant who produced the test voice data.
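A minimal sketch of the comparison in S207, assuming the cosine similarity classifier used in the embodiments below; the registry dictionary and the decision threshold are illustrative assumptions, not values given in the patent.

import numpy as np

def cosine_similarity(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify_speaker(test_model, enrolled_models, threshold=0.5):
    """Compare a test voiceprint model with every registrant's model.

    enrolled_models: dict mapping registrant name -> enrolled voiceprint model
    Returns the best-matching registrant, or None if no score exceeds the threshold.
    """
    scores = {name: cosine_similarity(test_model, model)
              for name, model in enrolled_models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None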
As can be seen from the steps in fig. 2, the back end of the voiceprint recognition system (i.e., the voiceprint recognition back end) adopts a register-then-recognize mechanism: a user first registers in the system, and the system obtains the registrant's voiceprint feature model using the trained LR model. In the testing stage, the system can then recognize which registrant produced the collected voice, so that the voice data is recognized.
In the research process, through experiments with a large number of machine learning models, the applicant found that the voiceprint feature vector mapped by the LR model gives the subsequent classification and recognition a higher accuracy.
The voiceprint recognition back end using the flow shown in fig. 2 can be used in combination with a conventional voiceprint feature extraction front end to constitute the voiceprint recognition system shown in fig. 1. The following will exemplify the working flow of three voiceprint recognition systems in which the voiceprint recognition back end of the flow shown in fig. 2 is combined with different voiceprint feature extraction front ends.
(I) The GMM/i-vector + LR + cosine voiceprint recognition system:
The system adopts GMM/i-vector as the voiceprint recognition front end, the LR shown in FIG. 2 as the voiceprint feature mapping module of the voiceprint recognition back end, and cosine similarity as the voiceprint classifier. The three stages are as follows:
1) a training stage:
Step 1: the voiceprint recognition front end filters out the silent sections and noise sections of each piece of audio using voice endpoint detection, and retains only the audio segments containing the training speakers' speech.
Step 2: the voiceprint recognition front-end segments all audio in the training database into fixed length segments of 3 to 30 seconds in length, the present embodiment segments the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
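As a rough sketch of this framing and feature-extraction step, the snippet below computes the MFCC part with librosa (an assumed tool, not named in the patent); the 13-dimensional RASTA-PLP features and their deltas are not computed here, so the sketch yields only the 60 MFCC-related dimensions of the 99-dimensional features described above.

import librosa
import numpy as np

def extract_mfcc_with_deltas(audio, sr):
    """25 ms frames with a 10 ms shift; 20 MFCCs (the patent counts a 1-dimensional
    energy feature within these 20) plus first- and second-order differences,
    giving 60-dimensional frame features."""
    n_fft = int(0.025 * sr)
    hop = int(0.010 * sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T  # shape (num_frames, 60)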
Step 4: the voiceprint recognition front end trains a Gaussian mixture model with U Gaussian components using the existing GMM-UBM method to obtain the model Σ. This embodiment trains a Gaussian mixture model containing 2048 Gaussian components.
Step 5: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates the zeroth-order and first-order statistics into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 204800 dimensions.
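One way these statistics could be assembled is sketched below with NumPy; frame_posteriors and frame_features are hypothetical inputs (the per-frame UBM component responsibilities and the acoustic features). With 2048 components and 99-dimensional features this gives 2048 + 2048 × 99 = 204800 dimensions, matching the embodiment.

import numpy as np

def baum_welch_supervector(frame_posteriors, frame_features):
    """frame_posteriors: (T, C) responsibilities of C UBM components for T frames
    frame_features:   (T, D) acoustic features
    Returns the concatenated zeroth-order (C) and first-order (C*D) statistics."""
    N = frame_posteriors.sum(axis=0)           # zeroth-order statistics, shape (C,)
    F = frame_posteriors.T @ frame_features    # first-order statistics, shape (C, D)
    return np.concatenate([N, F.ravel()])      # (C + C*D,)-dimensional vector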
Step 6: the voiceprint recognition front end trains an i-vector model using the existing i-vector method to obtain the T matrix.
Step 7: the voiceprint recognition front end uses the i-vector method and applies the T matrix to reduce the high-dimensional feature vector output by the GMM-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 204800-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 8: the voiceprint feature mapping module trains a linear regression model using formula (1) of the linear regression method to obtain the A matrix. The A matrix in this embodiment is a 400 × n matrix.
2) Registration phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of registered audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the registered speaker.
Step 2: the voiceprint recognition front-end segments all audio in the registered speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model obtained in the training stage to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 204800 dimensions.
Step 5: the voiceprint recognition front end uses the i-vector method and applies the T matrix obtained in the training stage to reduce the high-dimensional feature vector output by the GMM-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 204800-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 6: the voiceprint feature mapping module further maps the i-vector feature into n-dimensional voiceprint features (n is the number of speakers in the training set) by applying the A matrix obtained in the training stage by adopting the formula (2) in the linear regression method provided by the invention
Figure BDA0001577612330000081
And 7: the voiceprint feature mapping module is used for obtaining voiceprint feature vectors of all audio segments of the registered speaker
Figure BDA0001577612330000082
Averaging
Figure BDA0001577612330000083
A voiceprint feature model of the registered speaker is obtained.
3) Testing phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of test audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the test speaker.
Step 2: the voiceprint recognition front-end segments all audio in the test speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model obtained in the training stage to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 204800 dimensions.
Step 5: the voiceprint recognition front end uses the i-vector method and applies the T matrix obtained in the training stage to reduce the high-dimensional feature vector output by the GMM-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 204800-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 6: voiceprint feature mapping module samplingUsing formula (2) and applying the A matrix obtained in the training stage to further map the i-vector characteristics into n-dimensional voiceprint characteristics (n is the number of speakers in the training set)
Figure BDA0001577612330000091
And 7: the voiceprint feature mapping module obtains the voiceprint feature vectors of all the audio frequency segments of any test speaker
Figure BDA0001577612330000092
Averaging
Figure BDA0001577612330000093
And obtaining a voiceprint characteristic model of the test speaker.
And 8: the voiceprint classifier adopts a cosine similarity classifier to calculate
Figure BDA0001577612330000094
And
Figure BDA0001577612330000095
similarity of (c):
Figure BDA0001577612330000096
and comparing with a decision threshold delta to decide
Figure BDA0001577612330000097
Whether or not to cooperate with
Figure BDA0001577612330000098
Are the same speaker.
(II) DNN/i-vector + LR + cosine voiceprint recognition system:
the system adopts DNN/i-vector as a voiceprint recognition front end, adopts LR shown in FIG. 2 as a voiceprint feature mapping module of a voiceprint recognition rear end, and adopts cosine similarity as a voiceprint classifier. The three stages are as follows:
1) a training stage:
Step 1: the voiceprint recognition front end filters out the silent sections and noise sections of each piece of audio using voice endpoint detection, and retains only the audio segments containing the training speakers' speech.
Step 2: the voiceprint recognition front-end segments all audio in the training database into fixed length segments of 3 to 30 seconds in length, the present embodiment segments the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the DNN-UBM method and trains, on an independent speech recognition database containing phonetic content annotations, a deep neural network acoustic model Λ with U_DNN output states. The acoustic model used in this embodiment has 8073 output states.
Step 5: the voiceprint recognition front end uses the DNN-UBM method, applies the acoustic model Λ to the audio segments in the training database, and extracts a U_DNN-dimensional posterior probability vector from each frame of data. The posterior probability vector of each frame of data in this embodiment has 8073 dimensions.
Step 6: the voiceprint recognition front end adopts a DNN-UBM method, discards output states with lower posterior probability and only retains
Figure BDA0001577612330000101
And (4) an output state with a high posterior probability. Accordingly, the posterior probability vector of each frame of data is also adjusted to
Figure BDA0001577612330000102
And (5) maintaining. Of the present embodiment
Figure BDA0001577612330000103
3096 was set.
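One possible reading of this pruning step, sketched with NumPy; the function names are hypothetical, and renormalizing the retained posteriors is an assumption not stated in the text.

import numpy as np

def select_states(training_posteriors, k):
    """Pick the k output states with the highest average posterior over the training data."""
    avg = training_posteriors.mean(axis=0)      # (U_DNN,) average posterior per state
    return np.sort(np.argsort(avg)[-k:])        # indices of the retained states

def prune_posteriors(frame_posteriors, kept_states):
    """Reduce each frame's posterior vector to the retained states.
    Renormalizing so each frame sums to 1 is an assumption made for this sketch."""
    p = frame_posteriors[:, kept_states]
    return p / p.sum(axis=1, keepdims=True)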
Step 7: the voiceprint recognition front end uses the DNN-UBM method and trains a Gaussian mixture model containing U'_DNN Gaussian components to obtain the model Σ_DNN. This embodiment trains a Gaussian mixture model containing 3096 Gaussian components.
Step 8: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model Σ_DNN to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 309600 dimensions.
Step 9: the voiceprint recognition front end trains an i-vector model using the i-vector method to obtain the T_DNN matrix.
Step 10: the voiceprint recognition front end uses the i-vector method and applies the T_DNN matrix to reduce the high-dimensional feature vector output by the DNN-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 309600-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 11: the voiceprint feature mapping module trains a linear regression model using formula (1) to obtain the A_DNN matrix. The A_DNN matrix in this embodiment is a 400 × n matrix.
2) Registration phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of registered audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the registered speaker.
Step 2: the voiceprint recognition front-end segments all audio in the registered speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the DNN-UBM method, applies the acoustic model Λ to the registered speaker's audio segments, and extracts a U_DNN-dimensional posterior probability vector from each frame of data. The posterior probability vector of each frame of data in this embodiment has 8073 dimensions.
Step 5: the voiceprint recognition front end uses the DNN-UBM method, discards the output states with lower posterior probability, and retains only the U'_DNN output states with higher posterior probability (which states are retained is determined by the training stage). Accordingly, the posterior probability vector of each frame of data is also reduced to U'_DNN dimensions. In this embodiment U'_DNN is set to 3096.
Step 6: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model Σ_DNN to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 309600 dimensions.
Step 7: the voiceprint recognition front end uses the i-vector method and applies the T_DNN matrix obtained in the training stage to reduce the high-dimensional feature vector output by the DNN-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 309600-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 8: the voiceprint feature mapping module uses formula (2) and applies the A_DNN matrix obtained in the training stage to further map the i-vector features into n-dimensional voiceprint features z^enroll (n is the number of speakers in the training set).
Step 9: the voiceprint feature mapping module averages the voiceprint feature vectors z^enroll of all audio segments of any registered speaker to obtain the average vector m^enroll, which is the voiceprint feature model of the registered speaker.
3) Testing phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of test audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the test speaker.
Step 2: the voiceprint recognition front-end segments all audio in the test speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the DNN-UBM method, applies the acoustic model Λ to the test speaker's audio segments, and extracts a U_DNN-dimensional posterior probability vector from each frame of data. The posterior probability vector of each frame of data in this embodiment has 8073 dimensions.
Step 5: the voiceprint recognition front end uses the DNN-UBM method, discards the output states with lower posterior probability, and retains only the U'_DNN output states with higher posterior probability (which states are retained is determined by the training stage). Accordingly, the posterior probability vector of each frame of data is also reduced to U'_DNN dimensions. In this embodiment U'_DNN is set to 3096.
Step 6: the voiceprint recognition front end uses the GMM-UBM method, applies the Gaussian mixture model Σ_DNN to calculate the zeroth-order and first-order statistics of each audio segment, and concatenates them into a high-dimensional feature vector. The high-dimensional feature vector extracted in this embodiment has 309600 dimensions.
Step 7: the voiceprint recognition front end uses the i-vector method and applies the T_DNN matrix obtained in the training stage to reduce the high-dimensional feature vector output by the DNN-UBM to a low-dimensional space. The feature output dimension in this embodiment is 400, i.e. the 309600-dimensional features of each audio segment are mapped to 400-dimensional features.
Step 8: the voiceprint feature mapping module uses formula (2) and applies the A_DNN matrix obtained in the training stage to further map the i-vector features into n-dimensional voiceprint features z^test (n is the number of speakers in the training set).
Step 9: the voiceprint feature mapping module averages the voiceprint feature vectors z^test of all audio segments of any test speaker to obtain the average vector m^test, which is the voiceprint feature model of the test speaker.
Step 10: the voiceprint classifier uses a cosine similarity classifier to calculate the similarity between m^enroll and m^test:
s = cos(m^enroll, m^test) = (m^enroll)^T m^test / (||m^enroll|| ||m^test||)
and compares it with a decision threshold δ to decide whether the test speaker and the registered speaker are the same speaker.
(III) d-vector + LR + cosine voiceprint recognition system:
the system adopts d-vector as the voiceprint recognition front end, the LR of the invention as the voiceprint feature mapping module of the voiceprint recognition rear end and cosine similarity as the voiceprint classifier. The three stages are as follows:
1) a training stage:
Step 1: the voiceprint recognition front end filters out the silent sections and noise sections of each piece of audio using voice endpoint detection, and retains only the audio segments containing the training speakers' speech.
Step 2: the voiceprint recognition front-end segments all audio in the training database into fixed length segments of 3 to 30 seconds in length, the present embodiment segments the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end trains a deep neural network containing n output neurons using the existing d-vector method to obtain the model Σ_d-vector, where n is the number of speakers in the training set. Suppose the highest hidden layer of the Σ_d-vector model contains U_d-vector hidden neurons. In this embodiment U_d-vector is set to 400.
Step 5: the voiceprint recognition front end uses the d-vector method: it runs the Σ_d-vector model on each frame of speech, takes the output of the highest hidden layer of the Σ_d-vector model as the feature of that frame, and averages the features of all frames of each audio segment to obtain a U_d-vector-dimensional feature vector for each audio segment. In this embodiment U_d-vector is set to 400.
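To make the d-vector extraction concrete, here is a toy NumPy forward pass whose last hidden layer output is averaged over frames; the layer structure, weights and names are placeholders, not the trained Σ_d-vector model.

import numpy as np

def extract_d_vector(frame_features, weights, biases):
    """frame_features: (T, D) acoustic features of one audio segment
    weights/biases: parameters of the hidden layers of a trained DNN
    (the softmax output layer is ignored at extraction time).
    Returns the segment-level d-vector: the last hidden layer averaged over frames."""
    h = frame_features
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)    # ReLU hidden layers
    return h.mean(axis=0)                 # (U_d-vector,)-dimensional d-vector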
Step 6: the voiceprint feature mapping module trains a linear regression model by adopting a formula (1) in the linear regression method to obtain an A matrix. The a matrix of this embodiment is a 400 × n matrix.
2) Registration phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of registered audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the registered speaker.
Step 2: the voiceprint recognition front-end segments all audio in the registered speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the d-vector method: it runs the Σ_d-vector model on each frame of speech, takes the output of the highest hidden layer of the Σ_d-vector model as the feature of that frame, and averages the features of all frames of each audio segment to obtain a U_d-vector-dimensional feature vector for each audio segment. In this embodiment U_d-vector is set to 400.
Step 6: the voiceprint feature mapping module further maps the i-vector features into n-dimensional voiceprint features (n is the number of speakers in the training set) by applying an A matrix obtained in a training stage by adopting a formula (2)
Figure BDA0001577612330000151
And 7: the voiceprint feature mapping module obtains the voiceprint feature vector from all the audio segments of any registered speaker
Figure BDA0001577612330000152
Averaging
Figure BDA0001577612330000153
A voiceprint feature model of the registered speaker is obtained.
3) Testing phase
Step 1: the voiceprint recognition front end filters out a mute section and a noise section of each section of test audio by using voice endpoint detection, and reserves an audio segment only containing the voice of the test speaker.
Step 2: the voiceprint recognition front-end segments all audio in the test speaker into fixed length segments of 3 to 30 seconds in length, with this embodiment segmenting the audio into 15 second segments.
Step 3: the voiceprint recognition front end divides each audio segment into frames with a frame length of 15-30 milliseconds and a frame shift of 5-15 milliseconds, and extracts acoustic features from each frame. In this embodiment the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The acoustic features are 20-dimensional MFCC features (including a 1-dimensional energy feature) + 13-dimensional RASTA-PLP features + their first-order differences + their second-order differences, for a total of 99 dimensions.
Step 4: the voiceprint recognition front end uses the d-vector method: it runs the Σ_d-vector model on each frame of speech, takes the output of the highest hidden layer of the Σ_d-vector model as the feature of that frame, and averages the features of all frames of each audio segment to obtain a U_d-vector-dimensional feature vector for each audio segment. In this embodiment U_d-vector is set to 400.
Step 5: the voiceprint feature mapping module uses formula (2) and applies the A matrix obtained in the training stage to further map the d-vector features into n-dimensional voiceprint features z^test (n is the number of speakers in the training set).
Step 6: the voiceprint feature mapping module averages the voiceprint feature vectors z^test of all audio segments of any test speaker to obtain the average vector m^test, which is the voiceprint feature model of the test speaker.
Step 7: the voiceprint classifier uses a cosine similarity classifier to calculate the similarity between m^enroll and m^test:
s = cos(m^enroll, m^test) = (m^enroll)^T m^test / (||m^enroll|| ||m^test||)
and compares it with a decision threshold δ to decide whether the test speaker and the registered speaker are the same speaker.
Experimental validation was performed on the NIST SRE 2006 and NIST SRE 2008 data sets for the three examples above. The 8-conversation data in the NIST SRE 2006 dataset was used as the training set, with 402 speakers in total and about 100 hours of effective speech. The 8-conversation data in the NIST SRE 2008 dataset was used as the enrollment and test sets, with 395 speakers in total. The voice length of each test speaker is fixed at 30 seconds (cut into 2 segments of 15 seconds each), and the voice length of each enrolled speaker is 150 seconds (cut into 10 segments of 15 seconds each). Approximately 15 million test trials were constructed between the enrolled speakers and the test speakers. The DNN acoustic model in the second example was trained using the Switchboard-1 database, which contains about 300 hours of precisely annotated speech.
Using the above test samples, the recognition error rates of the voiceprint recognition back end of LR + cosine used in the above three examples and other voiceprint recognition back ends were compared, and the comparison results are shown in table 1:
TABLE 1
(Table 1 is provided as an image in the original publication; it lists, for each of the three front ends, the recognition error rates of the cosine, WCCN + cosine, LDA + cosine, LDA + PLDA and LR + cosine back ends.)
As can be seen from table 1, the LR + cosine has a lower recognition error rate than the conventional cosine, WCCN + cosine, LDA + cosine and LDA + PLDA classifiers with the same front end.
In the above three examples, GMM/i-vector + LR + cosine achieves the best performance among all the methods compared, a relative improvement of 27.19% over GMM/i-vector + LDA + PLDA, the best competing voiceprint recognition system in the comparison. DNN/i-vector + LR + cosine achieves a relative improvement of 23.39% over DNN/i-vector + LDA + cosine, the best competing system using the same DNN/i-vector front end. d-vector + LR + cosine achieves a relative improvement of 7.31% over the best competing system using the same d-vector front end.
It should be noted that the above embodiments are only specific examples of the patent disclosure, and all the algorithms that use a linear regression algorithm for obtaining a voiceprint feature vector in a voiceprint recognition system are within the scope of the patent protection.
The functions described in the method of the embodiment of the present application, if implemented in the form of software functional units and sold or used as independent products, may be stored in a storage medium readable by a computing device. Based on such understanding, part of the contribution to the prior art of the embodiments of the present application or part of the technical solution may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A voiceprint recognition method based on linear regression is characterized by comprising the following steps:
acquiring a first voiceprint feature vector from voice data;
mapping the first voiceprint feature vector into a second voiceprint feature vector by using a pre-trained linear regression model;
performing classification recognition on the second voiceprint feature vector;
wherein the training process of the linear regression model comprises the following steps:
obtaining training data {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from the j-th utterance of the i-th speaker in the voiceprint database, n is the number of speakers in the voiceprint database, any speaker i corresponds to M_i utterances, y_{i,j} = [0, ..., 1, ..., 0]^T is the n-dimensional indicative vector of the i-th speaker, and d is a preset value;
using A = (XX^T)^{-1}XY^T to obtain the linear regression model, wherein X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
2. The method of claim 1, wherein mapping the first voiceprint feature vector to a second voiceprint feature vector comprises:
using the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
3. The method of claim 1, wherein the performing classification recognition on the second voiceprint feature vector comprises:
using a cosine classifier to classify and recognize the second voiceprint feature vector.
4. The method of claim 1, wherein obtaining the first voiceprint feature vector from the speech data comprises:
the first voiceprint feature vector is obtained from the speech data using a GMM/i-vector algorithm, a DNN/i-vector algorithm, or a d-vector algorithm.
5. A system for voiceprint recognition based on linear regression, comprising:
the voice print feature extraction front end is used for acquiring a first voice print feature vector from voice data;
a voiceprint recognition back end, the voiceprint recognition back end comprising a voiceprint feature mapping module and a voiceprint classifier, the voiceprint feature mapping module being configured to map the first voiceprint feature vector to a second voiceprint feature vector using a pre-trained linear regression model; the voiceprint classifier is used for classifying and identifying the second voiceprint feature vector;
wherein the voiceprint feature mapping module is further configured to:
obtaining training data {(x_{i,j}, y_{i,j}) | i = 1, ..., n; j = 1, ..., M_i} from a voiceprint database, wherein x_{i,j} is the d-dimensional voiceprint feature vector extracted from the j-th utterance of the i-th speaker in the voiceprint database, n is the number of speakers in the voiceprint database, any speaker i corresponds to M_i utterances, y_{i,j} = [0, ..., 1, ..., 0]^T is the n-dimensional indicative vector of the i-th speaker, and d is a preset value;
using A = (XX^T)^{-1}XY^T to obtain the linear regression model, wherein X = [x_{1,1}, ..., x_{n,M_n}] is the matrix formed by the voiceprint vectors of the training data, and Y = [y_{1,1}, ..., y_{n,M_n}] is the matrix formed by the indicative vectors of the training data.
6. The system of claim 5, wherein the voiceprint feature mapping module is configured to map the first voiceprint feature vector to a second voiceprint feature vector using a pre-trained linear regression model comprising:
the voiceprint feature mapping module is specifically configured to use the mapping relationship z = A^T x to map the first voiceprint feature vector to the second voiceprint feature vector, wherein A is the pre-trained linear regression model, x is the first voiceprint feature vector, and z is the second voiceprint feature vector.
7. The system of claim 5, wherein the voiceprint classifier comprises: and a cosine classifier.
8. The system of claim 5, wherein the voiceprint feature extraction front end comprises:
a GMM/i-vector front end, a DNN/i-vector front end, or a d-vector front end.
CN201810141059.0A 2018-02-11 2018-02-11 Voiceprint recognition method and system based on linear regression Active CN108091326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810141059.0A CN108091326B (en) 2018-02-11 2018-02-11 Voiceprint recognition method and system based on linear regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810141059.0A CN108091326B (en) 2018-02-11 2018-02-11 Voiceprint recognition method and system based on linear regression

Publications (2)

Publication Number Publication Date
CN108091326A CN108091326A (en) 2018-05-29
CN108091326B true CN108091326B (en) 2021-08-06

Family

ID=62194472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810141059.0A Active CN108091326B (en) 2018-02-11 2018-02-11 Voiceprint recognition method and system based on linear regression

Country Status (1)

Country Link
CN (1) CN108091326B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109119069B (en) * 2018-07-23 2020-08-14 深圳大学 Specific crowd identification method, electronic device and computer readable storage medium
CN109367350B (en) * 2018-10-11 2020-08-11 山东科技大学 Automatic starting method and system for vehicle air conditioner
CN111462760B (en) * 2019-01-21 2023-09-26 阿里巴巴集团控股有限公司 Voiceprint recognition system, voiceprint recognition method, voiceprint recognition device and electronic equipment
CN110517698B (en) * 2019-09-05 2022-02-01 科大讯飞股份有限公司 Method, device and equipment for determining voiceprint model and storage medium
CN110610709A (en) * 2019-09-26 2019-12-24 浙江百应科技有限公司 Identity distinguishing method based on voiceprint recognition
CN110853654B (en) * 2019-11-17 2021-12-21 西北工业大学 Model generation method, voiceprint recognition method and corresponding device
CN111933147B (en) * 2020-06-22 2023-02-14 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1366295A (en) * 2000-07-05 2002-08-28 松下电器产业株式会社 Speaker's inspection and speaker's identification system and method based on prior knowledge
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN106601258A (en) * 2016-12-12 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Speaker identification method capable of information channel compensation based on improved LSDA algorithm
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
CN107623614A (en) * 2017-09-19 2018-01-23 百度在线网络技术(北京)有限公司 Method and apparatus for pushed information
CN107633845A (en) * 2017-09-11 2018-01-26 清华大学 A kind of duscriminant local message distance keeps the method for identifying speaker of mapping

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100571574B1 (en) * 2004-07-26 2006-04-17 한양대학교 산학협력단 Similar Speaker Recognition Method Using Nonlinear Analysis and Its System

Also Published As

Publication number Publication date
CN108091326A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN108091326B (en) Voiceprint recognition method and system based on linear regression
US11636860B2 (en) Word-level blind diarization of recorded calls with arbitrary number of speakers
US10109280B2 (en) Blind diarization of recorded calls with arbitrary number of speakers
Villalba et al. State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and speakers in the wild evaluations
JP5853029B2 (en) Passphrase modeling device and method for speaker verification, and speaker verification system
Soltane et al. Face and speech based multi-modal biometric authentication
US7475013B2 (en) Speaker recognition using local models
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
US11837236B2 (en) Speaker recognition based on signal segments weighted by quality
US20120232900A1 (en) Speaker recognition from telephone calls
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Haris et al. Robust speaker verification with joint sparse coding over learned dictionaries
Prasetio et al. Generalized Discriminant Methods for Improved X-Vector Back-end Based Stress Speech Recognition
US20220405363A1 (en) Methods for improving the performance of neural networks used for biometric authenticatio
Chandrakala et al. Combination of generative models and SVM based classifier for speech emotion recognition
Silovsky et al. Speech, speaker and speaker's gender identification in automatically processed broadcast stream
Dm et al. Speech based emotion recognition using combination of features 2-D HMM model
Valanchery Analysis of different classifier for the detection of double compressed AMR audio
Errity et al. A comparative study of linear and nonlinear dimensionality reduction for speaker identification
Trabelsi et al. Learning vector quantization for adapted gaussian mixture models in automatic speaker identification
Feng et al. Duration Normalization Algorithm Based on Feature Space Trajectory in Pathological Speech Recognition
Tashan et al. Two stage speaker verification using self organising map and multilayer perceptron neural network
Ye et al. Discriminant kernel learning for acoustic scene classification with multiple observations
Heryanto et al. A new direct access framework for speaker identification system
Madhusudhana Rao et al. Machine hearing system for teleconference authentication with effective speech analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant