CN105931646A - Speaker identification method base on simple direct tolerance learning algorithm - Google Patents

Speaker identification method base on simple direct tolerance learning algorithm

Info

Publication number
CN105931646A
CN105931646A
Authority
CN
China
Prior art keywords
vector
speaker
similar sample
mahalanobis distance
samples
Prior art date
Legal status
Pending
Application number
CN201610281884.1A
Other languages
Chinese (zh)
Inventor
雷震春
杨印根
朱明华
Current Assignee
Jiangxi Normal University
Original Assignee
Jiangxi Normal University
Priority date
Filing date
Publication date
Application filed by Jiangxi Normal University filed Critical Jiangxi Normal University
Priority to CN201610281884.1A priority Critical patent/CN105931646A/en
Publication of CN105931646A publication Critical patent/CN105931646A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building

Abstract

The invention provides a speaker identification method based on a simple direct metric learning algorithm. The method comprises the following steps: collecting speech samples from multiple speakers, extracting the i-vector of every sample, applying channel compensation with the LDA or WCCN method, performing length normalization, and forming a training sample set; constructing a similar-pair set and a dissimilar-pair set from the i-vectors of the training sample set and the speaker identities; training on the similar-pair set and the dissimilar-pair set with the KISS algorithm to obtain a metric matrix; and, for two new utterances, first extracting their i-vectors, applying LDA or WCCN channel compensation and length normalization, then using the previously computed metric matrix to calculate the Mahalanobis distance between the two i-vectors and comparing it with a threshold, thereby deciding whether the two new utterances belong to the same speaker. The Mahalanobis metric matrix obtained in this way reflects the similarities and distinctions of the sample space more faithfully and thus improves the performance of the speaker identification system.

Description

A speaker recognition method based on a simple direct metric learning algorithm
Technical field
The present invention is a speaker recognition method based on a simple direct metric learning algorithm, and can be widely applied in fields such as speaker recognition, pattern recognition, metric learning and machine learning.
Background technology
Speaker Recognition (SR), also known as voiceprint recognition, is a technology that determines a speaker's identity by processing and analyzing the speaker's voice. How to measure the similarity between speaker speech samples effectively is one of the hot issues in current speaker recognition research. In the field of pattern recognition there are many methods for measuring the similarity between samples; the more common ones are distance scoring methods such as cosine distance scoring and Mahalanobis distance scoring.
Cosine distance scoring measures the similarity between samples by computing the cosine of the angle between sample vectors in the inner-product space. It discriminates only by the difference in vector direction and cannot capture differences in magnitude along the vector dimensions. The cosine distance $d_C(x_i, x_j)$ is computed as:
$d_C(x_i, x_j) = \dfrac{x_i^T x_j}{\sqrt{x_i^T x_i}\,\sqrt{x_j^T x_j}}$
where $x_i$ is the i-vector of the i-th utterance and T denotes transposition.
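For illustration only (not part of the original disclosure), a minimal NumPy sketch of cosine distance scoring between two i-vectors might look as follows; the function and variable names are assumptions:

```python
import numpy as np

def cosine_score(x_i, x_j):
    """Cosine distance score between two i-vectors (larger means more similar)."""
    return float(x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j)))
```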
The Mahalanobis distance $d_M(x_i, x_j)$ between two vectors $x_i$ and $x_j$ is defined as:
$d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$
where $x_i$ is the i-vector of the i-th utterance, M is the metric matrix, and T denotes transposition.
Only with a positive semidefinite metric matrix M that truly reflects the similarity of similar samples and the distinctiveness of dissimilar samples in the sample space can the computed Mahalanobis distance effectively measure sample similarity, but the limited amount of training data makes such a metric matrix difficult to obtain.
Metric learning methods typically rely on the class information contained in the training samples and automatically learn a distance metric matrix, which is commonly used to compute the Mahalanobis distance score between target samples and thereby predict the similarity of unknown data. The basic goal of a metric learning algorithm is to exploit the prior information of the training samples and, while satisfying certain conditions as far as possible, find a global, linear-transform distance metric matrix M by optimizing:
$\min_M \; \ell(M) + \lambda R(M)$
where $\ell(M)$ is the loss function, $R(M)$ is a regularization term on the distance metric matrix M that constrains and corrects the loss $\ell(M)$ against over-fitting during training, and $\lambda \ge 0$ is a balance parameter. The metric matrix M is used to compute the Mahalanobis distance between samples $x_i$ and $x_j$:
$d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$
where $x_i$ is the i-vector of the i-th utterance.
The number of samples used to train the metric matrix keeps growing, and the huge data volume makes the analysis and processing of large-scale data very troublesome, leading to the so-called "curse of dimensionality". As the data dimension rises, such high-dimensional data often exhibit substantial correlation and redundancy.
Summary of the invention
The object of the present invention is to provide a speaker recognition method based on a simple direct metric learning algorithm. The metric matrix obtained by this method can effectively reflect the similarity and distinctiveness of the speaker space; using this metric matrix in a Mahalanobis distance scoring classifier for testing target speaker speech samples allows the speaker recognition system to achieve a good recognition performance.
To achieve the above object, the present invention adopts the following technical solution:
A speaker recognition method based on the simple direct metric learning algorithm (Keep it simple and straight!, KISS), characterized in that: the KISS algorithm is trained on the processed i-vectors, and the Mahalanobis distance between a speaker test speech sample and a target sample is then computed;
The Keep-it-simple-and-straight (KISS) metric learning algorithm is simple and effective, has a globally optimal solution, and can quickly obtain a metric matrix satisfying the constraints; for training it only needs to know whether a pair of samples belongs to the same class. The metric matrix obtained does not over-fit and is easy to compute. The KISS algorithm scales well: no iterative optimization is required, and only two small covariance matrices need to be computed. The resulting metric matrix can effectively reflect the similarity and distinctiveness of the speaker space; used in a Mahalanobis distance scoring classifier for testing target speaker speech samples, it gives the speaker recognition system a good recognition performance. Its performance is excellent and the metric matrix training process is fast.
The object of the present invention is achieved through the following technical steps:
Collect the speech samples of multiple speakers and extract the i-vector of every sample;
Apply channel compensation to the i-vectors of all samples using the LDA or WCCN method and perform length normalization, forming a training sample set;
Construct a similar-pair set and a dissimilar-pair set from the i-vectors of the training sample set and the speaker identities;
Use the KISS algorithm to train on the similar-pair set and the dissimilar-pair set and obtain a metric matrix;
For two new utterances, extract their i-vectors and apply the above channel compensation and length normalization, then compute the Mahalanobis distance between the two i-vectors using the previously computed metric matrix;
Compare the obtained Mahalanobis distance with a threshold and, based on the comparison, decide whether the two new utterances belong to the same speaker.
Further, applying channel compensation to the i-vectors of all samples using the LDA method specifically includes:
minimizing the within-class (similar-sample) separation and maximizing the between-class (dissimilar-sample) separation through a projection matrix.
Further, applying channel compensation to the i-vectors of all samples using the WCCN method specifically includes:
making the basis of the target sample space as orthogonal as possible.
Further, the method also includes:
performing length normalization on the i-vectors extracted from all samples.
Further, it is characterized in that using the KISS algorithm to train on the similar-pair set and the dissimilar-pair set and obtain the metric matrix specifically includes:
computing the covariance of the similar sample pairs and the covariance of the dissimilar sample pairs in the target samples, respectively;
computing the metric matrix from the covariance of the similar sample pairs and the covariance of the dissimilar sample pairs.
Further, the method also includes:
computing the Mahalanobis distance between two i-vectors according to the obtained metric matrix.
Further, comparing the obtained Mahalanobis distance with a threshold and deciding, based on the comparison, whether the two new utterances belong to the same speaker specifically includes:
if the obtained Mahalanobis distance is greater than the threshold, the two new utterances do not belong to the same speaker;
if the obtained Mahalanobis distance is within the threshold, the two new utterances belong to the same speaker.
The present invention discloses a speaker recognition method based on a simple direct metric learning algorithm. The Keep-it-simple-and-straight (KISS) metric learning algorithm uses the constraint information of paired training samples to train a Mahalanobis distance metric matrix; the pairwise constraints guide the metric learning process, so that when the metric matrix is trained on labelled similar and dissimilar pairs the similarity and dissimilarity information between the training data is fully exploited. The resulting metric matrix reflects the distinctiveness of the speaker space more faithfully, and a Mahalanobis distance scoring classifier can then better predict the similarity between unknown speaker speech samples. During metric matrix training, the covariances of the similar pairs and of the dissimilar pairs are computed, and the difference between their inverses is taken as the Mahalanobis metric matrix; the trained metric matrix achieves a good recognition performance for the speaker recognition system.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of a speaker recognition method based on a simple direct metric learning algorithm according to the present invention.
Detailed description of the invention
A speaker recognition method based on a simple direct metric learning algorithm according to an embodiment of the present invention is described in detail below with reference to the accompanying drawing. Fig. 1 shows a flow chart of one embodiment of the method of the present invention; the method comprises the following steps:
In step S110, collect the speech samples of multiple speakers and extract the i-vector of every sample;
In step S120, apply channel compensation to the i-vectors of all samples using the LDA or WCCN method and perform length normalization, forming a training sample set;
In step S130, construct a similar-pair set and a dissimilar-pair set from the i-vectors of the training sample set and the speaker identities (a sketch of this pair-set construction is given after this step list);
In step S140, use the KISS algorithm to train on the similar-pair set and the dissimilar-pair set and obtain a metric matrix;
In step S150, for two new utterances, extract their i-vectors and apply the above channel compensation and length normalization, then compute the Mahalanobis distance between the two i-vectors using the previously computed metric matrix;
In step S160, compare the obtained Mahalanobis distance with a threshold and, based on the comparison, decide whether the two new utterances belong to the same speaker.
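For illustration only (not part of the original disclosure), step S130, building the similar-pair and dissimilar-pair sets from labelled, channel-compensated i-vectors, could be sketched in Python as follows; the function and variable names are assumptions, and the KISS training of step S140 is sketched further below after its formulas:

```python
from itertools import combinations

def build_pair_sets(ivectors, labels):
    """Step S130: split all utterance pairs into a similar-pair set (same speaker)
    and a dissimilar-pair set (different speakers)."""
    same_pairs, diff_pairs = [], []
    for i, j in combinations(range(len(ivectors)), 2):
        pair = (ivectors[i], ivectors[j])
        (same_pairs if labels[i] == labels[j] else diff_pairs).append(pair)
    return same_pairs, diff_pairs
```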
Further, applying channel compensation to the i-vectors of all samples using the Linear Discriminant Analysis (LDA) method specifically includes:
the goal of linear discriminant analysis (LDA) is to minimize the within-class (similar-sample) separation and maximize the between-class (dissimilar-sample) separation through a projection matrix.
Specifically, the between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$ are defined as
$S_b = \sum_{s=1}^{S} (\bar{x}_s - \bar{x})(\bar{x}_s - \bar{x})^T, \qquad S_w = \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (x_i^{s} - \bar{x}_s)(x_i^{s} - \bar{x}_s)^T$
where $S_b$ is the speaker between-class scatter matrix, $S_w$ is the speaker within-class scatter matrix, $n_s$ is the number of utterances of speaker s, $\bar{x}$ is the mean of all speakers' i-vectors, and $\bar{x}_s$ is the i-vector mean of the s-th speaker.
The projection matrix A is composed of the eigenvectors corresponding to the eigenvalues $\lambda$ of the generalized eigenvalue problem
$S_b v = \lambda S_w v$
where $S_b$ is the speaker between-class scatter matrix, $S_w$ is the speaker within-class scatter matrix, $\lambda$ is the diagonal matrix of eigenvalues, and v is a projection direction in speaker space.
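As an illustration (not part of the original disclosure), a small NumPy/SciPy sketch of this LDA channel compensation might look as follows; the exact weighting of the scatter matrices and the variable names are assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(ivectors, labels, dim):
    """Train an LDA projection A that maximizes between-speaker scatter and
    minimizes within-speaker scatter (Sb v = lambda * Sw v)."""
    ivectors = np.asarray(ivectors)
    labels = np.asarray(labels)
    d = ivectors.shape[1]
    mean_all = ivectors.mean(axis=0)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for spk in np.unique(labels):
        x_s = ivectors[labels == spk]
        mean_s = x_s.mean(axis=0)
        diff = (mean_s - mean_all)[:, None]
        Sb += diff @ diff.T                              # between-class scatter
        Sw += (x_s - mean_s).T @ (x_s - mean_s) / len(x_s)  # within-class scatter
    # generalized eigenproblem Sb v = lambda Sw v; keep the top `dim` directions
    eigvals, eigvecs = eigh(Sb, Sw)
    A = eigvecs[:, np.argsort(eigvals)[::-1][:dim]]
    return A

# usage: project an i-vector with x_lda = A.T @ x
```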
Further, applying channel compensation to the i-vectors of all samples using the Within-Class Covariance Normalization (WCCN) method specifically includes:
the goal of within-class covariance normalization (WCCN) is to make the basis of the sample space as orthogonal as possible.
The within-class covariance matrix W is computed as
$W = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (x_i^{s} - \bar{x}_s)(x_i^{s} - \bar{x}_s)^T$
where there are S speakers in total, $n_s$ is the number of utterances of speaker s, $\bar{x}$ is the mean of all speakers' i-vectors, and $\bar{x}_s$ is the i-vector mean of the s-th speaker. The feature vectors are then mapped as $x \mapsto B^T x$, where B is the Cholesky factor of $W^{-1}$, i.e. $W^{-1} = B B^T$.
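A minimal NumPy sketch of WCCN under these definitions might look as follows (illustrative only; the function and variable names are assumptions):

```python
import numpy as np

def wccn_projection(ivectors, labels):
    """Compute the WCCN projection B with W^{-1} = B B^T (Cholesky factor)."""
    ivectors = np.asarray(ivectors)
    labels = np.asarray(labels)
    speakers = np.unique(labels)
    d = ivectors.shape[1]
    W = np.zeros((d, d))
    for spk in speakers:
        x_s = ivectors[labels == spk]
        centered = x_s - x_s.mean(axis=0)
        W += centered.T @ centered / len(x_s)    # per-speaker covariance
    W /= len(speakers)
    B = np.linalg.cholesky(np.linalg.inv(W))     # W^{-1} = B B^T, B lower-triangular
    return B

# usage: x_wccn = B.T @ x
```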
Further, the method also includes:
performing length normalization on the i-vectors extracted from all samples.
Further, using the KISS algorithm to train on the similar-pair set and the dissimilar-pair set and obtain the metric matrix specifically includes:
computing the covariance of the similar sample pairs and the covariance of the dissimilar sample pairs in the target samples, respectively, and computing the metric matrix from the covariance of the similar sample pairs and the covariance of the dissimilar sample pairs.
Specifically, the covariance of all similar sample pairs, $\Sigma_{y_{ij}=1}$, and the covariance of all dissimilar sample pairs, $\Sigma_{y_{ij}=0}$, are first computed:
$\Sigma_{y_{ij}=1} = \sum_{y_{ij}=1} (x_i - x_j)(x_i - x_j)^T$
$\Sigma_{y_{ij}=0} = \sum_{y_{ij}=0} (x_i - x_j)(x_i - x_j)^T$
where $x_i$ denotes the i-vector of the i-th utterance, $y_{ij}=0$ indicates that the i-th and j-th utterances come from different speakers, and $y_{ij}=1$ indicates that they come from the same speaker. The metric matrix M can then be obtained as
$M = \Sigma_{y_{ij}=1}^{-1} - \Sigma_{y_{ij}=0}^{-1}$
where $\Sigma_{y_{ij}=1}$ is the covariance of the similar pairs and $\Sigma_{y_{ij}=0}$ is the covariance of the dissimilar pairs; M is the metric matrix finally sought.
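For illustration only, a compact NumPy sketch of this KISS training step following the formulas above; the function and argument names are assumptions, and the pair covariances are assumed to be invertible:

```python
import numpy as np

def train_kiss_metric(pairs_same, pairs_diff):
    """Train the KISS metric matrix M = inv(Sigma_same) - inv(Sigma_diff).

    pairs_same / pairs_diff: lists of (x_i, x_j) i-vector tuples with
    y_ij = 1 (same speaker) and y_ij = 0 (different speakers), respectively.
    """
    def pair_covariance(pairs):
        diffs = np.array([xi - xj for xi, xj in pairs])
        return diffs.T @ diffs   # sum over pairs of (xi - xj)(xi - xj)^T
    sigma_same = pair_covariance(pairs_same)
    sigma_diff = pair_covariance(pairs_diff)
    return np.linalg.inv(sigma_same) - np.linalg.inv(sigma_diff)
```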
Further, computing the Mahalanobis distance between two i-vectors according to the obtained metric matrix specifically includes: using the previously obtained metric matrix M, the Mahalanobis distance between two i-vectors $x_i$ and $x_j$ is computed as
$d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$
where $x_i$ denotes the i-vector of the i-th utterance, M is the metric matrix, and $d_M(x_i, x_j)$ is the Mahalanobis distance between the two i-vectors $x_i$ and $x_j$.
Further, comparing the obtained Mahalanobis distance with a threshold and deciding, based on the comparison, whether the two new utterances belong to the same speaker specifically includes:
computing from the Mahalanobis distance a similarity score between the two i-vectors $x_i$ and $x_j$:
$\mathrm{Score}_M(x_i, x_j) = -(x_i - x_j)^T M (x_i - x_j)$
where $\mathrm{Score}_M(x_i, x_j)$ is the Mahalanobis distance score, M is the metric matrix, and $x_i$ is the i-vector of the i-th utterance.
The obtained score $\mathrm{Score}_M(x_i, x_j)$ is compared with a threshold; equivalently, if the Mahalanobis distance exceeds the threshold, the two new utterances do not belong to the same speaker, and if the Mahalanobis distance is within the threshold, the two new utterances belong to the same speaker.
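A short NumPy sketch of this scoring and decision step (illustrative only; function and variable names are assumptions):

```python
import numpy as np

def mahalanobis_score(x_i, x_j, M):
    """Mahalanobis similarity score: the negated Mahalanobis distance."""
    diff = x_i - x_j
    return -float(diff @ M @ diff)

def same_speaker(x_i, x_j, M, dist_threshold):
    """Accept the pair as the same speaker when the distance is within the threshold."""
    return -mahalanobis_score(x_i, x_j, M) <= dist_threshold
```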
In the present embodiment, S is the number of speakers, $n_s$ is the number of utterances of speaker s, $\bar{x}$ is the mean of all speakers' i-vectors, and $\bar{x}_s$ is the i-vector mean of the s-th speaker.
To facilitate understanding of the technical solution of the present invention, the effect achieved and the practicability of the method provided by the embodiment are illustrated below with a concrete experimental test application scenario:
The experiment was conducted in the MATLAB environment, and the speech data for the speaker test samples come from the core speech corpora of the U.S. National Institute of Standards and Technology (NIST) Speaker Recognition Evaluations (SRE) of 2004, 2005, 2006 and 2008. The speaker recognition system first performs de-redundancy and noise reduction on the collected speech data of the target samples of the multiple speakers, and converts the analog speech signal into a discrete digital speech signal. The speech signal is split into overlapping frames with a 20 ms window (10 ms frame shift). Thirteen Mel-frequency cepstral coefficients (MFCC) are extracted and combined with their first- and second-order differences into 39-dimensional feature vectors representing the speech signal. The NIST SRE04, 05 and 06 speech data sets are used to train a 512-mixture gender-dependent UBM, on the basis of which the i-vectors (400-dimensional) of the target samples of all speakers are trained; the i-vectors are then subjected to robustness processing such as LDA, WCCN and length normalization for subsequent processing. The 2008 speech data serve as the target samples and speaker test samples for the similarity evaluation.
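For illustration only, this front-end feature extraction (13 MFCCs plus first- and second-order deltas, 20 ms windows with a 10 ms shift) could be sketched with librosa roughly as follows; the sample rate and other parameter choices are assumptions, and the remaining components (VAD, UBM, i-vector extractor) are assumed to exist elsewhere:

```python
import numpy as np
import librosa

def mfcc_39(wav_path):
    """13 MFCCs + deltas + delta-deltas -> (num_frames, 39) feature matrix."""
    y, sr = librosa.load(wav_path, sr=8000)             # telephone speech assumed
    win = int(0.020 * sr)                                # 20 ms window
    hop = int(0.010 * sr)                                # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, win_length=win, hop_length=hop)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T            # (frames, 39)
```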
Before the metric learning experiment, the similar-pair set and the dissimilar-pair set used for training are constructed. The present embodiment uses 6609 utterances from 491 male speakers and 9136 utterances from 703 female speakers in the NIST SRE04, 05 and 06 corpora to construct the similar-pair set S and the dissimilar-pair set D.
The i-vectors extracted from the speech are processed by LDA or WCCN channel compensation, a Mahalanobis distance metric matrix is trained with the KISS algorithm, and the Mahalanobis distance is computed to obtain the similarity score between a target i-vector and a test i-vector.
Let S be the number of speakers, $n_s$ the number of utterances of speaker s, $\bar{x}$ the mean of all speakers' i-vectors, and $\bar{x}_s$ the i-vector mean of each speaker.
The goal of linear discriminant analysis (LDA) is to minimize the within-class (similar-sample) separation and maximize the between-class (dissimilar-sample) separation through a projection. The between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$ are defined as above, where $S_b$ is the speaker between-class scatter matrix, $S_w$ is the speaker within-class scatter matrix, $n_s$ is the number of utterances of speaker s, $\bar{x}$ is the overall i-vector mean, and $\bar{x}_s$ is the i-vector mean of the s-th speaker.
The projection matrix A is composed of the eigenvectors corresponding to the eigenvalues $\lambda$ of
$S_b v = \lambda S_w v$
where $S_b$ is the speaker between-class scatter matrix, $S_w$ is the speaker within-class scatter matrix, $\lambda$ is the diagonal matrix of eigenvalues, and v is a projection direction in speaker space.
The goal of within-class covariance normalization (WCCN) is to make the basis of the sample space as orthogonal as possible. The within-class covariance matrix W is computed as above, with S speakers in total, $n_s$ utterances for speaker s, $\bar{x}$ the overall i-vector mean, and $\bar{x}_s$ the i-vector mean of the s-th speaker.
The feature vectors are mapped as $x \mapsto B^T x$, where B is the Cholesky factor of $W^{-1}$, i.e. $W^{-1} = B B^T$.
Length normalization of the i-vectors improves system performance.
The KISS algorithm is as follows:
First the covariance of all similar sample pairs, $\Sigma_{y_{ij}=1}$, and the covariance of all dissimilar sample pairs, $\Sigma_{y_{ij}=0}$, are computed:
$\Sigma_{y_{ij}=1} = \sum_{y_{ij}=1} (x_i - x_j)(x_i - x_j)^T$
$\Sigma_{y_{ij}=0} = \sum_{y_{ij}=0} (x_i - x_j)(x_i - x_j)^T$
where $x_i$ denotes the i-vector of the i-th utterance, $y_{ij}=0$ indicates that the i-th and j-th utterances come from different speakers, and $y_{ij}=1$ indicates that they come from the same speaker. The metric matrix M can then be obtained as
$M = \Sigma_{y_{ij}=1}^{-1} - \Sigma_{y_{ij}=0}^{-1}$
where $\Sigma_{y_{ij}=1}$ is the covariance of the similar pairs and $\Sigma_{y_{ij}=0}$ is the covariance of the dissimilar pairs; M is the metric matrix finally sought and is used to compute the Mahalanobis distance between a speaker test sample and a target sample $(x_i, x_j)$:
$d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$
where $x_i$ denotes the i-vector of the i-th utterance, M is the metric matrix, and $d_M(x_i, x_j)$ is the Mahalanobis distance between the speaker test sample and the target sample $(x_i, x_j)$.
From this distance the similarity score between the speaker samples $(x_i, x_j)$ is computed:
$\mathrm{Score}_M(x_i, x_j) = -(x_i - x_j)^T M (x_i - x_j)$
where $\mathrm{Score}_M(x_i, x_j)$ is the Mahalanobis distance score, M is the metric matrix, and $x_i$ is the i-vector of the i-th utterance.
With the method provided by this embodiment, the Keep-it-simple-and-straight (KISS) algorithm is simple and effective, has a globally optimal solution, and can quickly obtain a distance metric matrix satisfying the constraints; for training it only needs to know whether a pair of samples belongs to the same class. The metric matrix to be solved does not over-fit and is easy to obtain; the KISS algorithm scales well, requires no iterative optimization, and only needs two small covariance matrices to be computed. The metric matrix effectively reflects the similarity and distinctiveness of the speaker space; used in a Mahalanobis distance scoring classifier for testing target speaker speech samples, it gives the speaker recognition system a good recognition performance. Its performance is close to that of currently popular metric learning algorithms, the metric matrix training process is faster than that of other algorithms, and the trained Mahalanobis metric matrix reflects the similarity and distinctiveness of the sample space more faithfully, thereby improving the performance of the speaker recognition system.
It may be noted that, according to implementation needs, each step described in this application may be split into more steps, and two or more steps or partial operations of steps may be combined into new steps, in order to achieve the object of the present invention.
The above method according to the invention may be implemented in hardware or firmware, or be implemented as software or computer code storable in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk or magneto-optical disk), or be implemented as computer code that is originally stored in a remote recording medium or a non-volatile machine-readable medium, downloaded over a network and stored in a local recording medium, so that the method described here can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage component (for example RAM, ROM, flash memory, etc.) capable of storing or receiving software or computer code which, when accessed and executed by the computer, processor or hardware, implements the processing method described here. Furthermore, when a general-purpose computer accesses code for implementing the processing shown here, execution of the code converts the general-purpose computer into a special-purpose computer for performing the processing shown here.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field may readily conceive of changes or substitutions within the technical scope disclosed by the invention, and all of these shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the scope of the claims.

Claims (7)

1. A speaker recognition method based on a simple direct metric learning algorithm, characterized in that the method comprises the following steps:
collecting the speech samples of multiple speakers and extracting the i-vector of every sample;
applying channel compensation to the i-vectors of all samples using the LDA or WCCN method and performing length normalization, forming a training sample set;
constructing a similar-pair set and a dissimilar-pair set from the i-vectors of the training sample set and the speaker identities;
using the KISS algorithm to train on the similar-pair set and the dissimilar-pair set and obtain a metric matrix;
for two new utterances, extracting their i-vectors and applying the above channel compensation and length normalization, then computing the Mahalanobis distance between the two i-vectors using the previously computed metric matrix;
comparing the obtained Mahalanobis distance with a threshold and, based on the comparison, deciding whether the two new utterances belong to the same speaker.
2. The method of claim 1, characterized in that applying channel compensation to the i-vectors of all samples using the LDA or WCCN method specifically includes:
minimizing the within-class (similar-sample) separation and maximizing the between-class (dissimilar-sample) separation through a projection matrix.
3. The method of claim 1, characterized in that applying channel compensation to the i-vectors of all samples using the LDA or WCCN method specifically includes:
making the basis of the target sample space as orthogonal as possible.
4. The method of claim 1, characterized in that the method also includes:
performing length normalization on the i-vectors extracted from all samples.
5. The method of claim 1, characterized in that using the KISS algorithm to train on the similar-pair set and the dissimilar-pair set and obtain the metric matrix specifically includes:
computing the covariance of the similar sample pairs and the covariance of the dissimilar sample pairs in all samples, respectively;
computing the metric matrix from the covariance of the similar sample pairs and the covariance of the dissimilar sample pairs.
6. The method of claim 1, characterized in that the method also includes:
computing the Mahalanobis distance between two i-vectors according to the obtained metric matrix.
7. The method of claim 1, characterized in that comparing the obtained Mahalanobis distance with a threshold and, based on the comparison, deciding whether the two new utterances belong to the same speaker specifically includes:
if the obtained Mahalanobis distance is greater than the threshold, the two new utterances do not belong to the same speaker;
if the obtained Mahalanobis distance is within the threshold, the two new utterances belong to the same speaker.
CN201610281884.1A 2016-04-29 2016-04-29 Speaker identification method base on simple direct tolerance learning algorithm Pending CN105931646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610281884.1A CN105931646A (en) 2016-04-29 2016-04-29 Speaker identification method base on simple direct tolerance learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610281884.1A CN105931646A (en) 2016-04-29 2016-04-29 Speaker identification method base on simple direct tolerance learning algorithm

Publications (1)

Publication Number Publication Date
CN105931646A true CN105931646A (en) 2016-09-07

Family

ID=56837754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610281884.1A Pending CN105931646A (en) 2016-04-29 2016-04-29 Speaker identification method base on simple direct tolerance learning algorithm

Country Status (1)

Country Link
CN (1) CN105931646A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103959323A (en) * 2011-09-21 2014-07-30 搜诺思公司 Methods and systems to share media

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ZHENCHUN LEI, JIAN LUO, YANHONG WAN, AND YINGEN YANG: "A Mahalanobis Distance Scoring with KISS Metric", 《BIOMETRIC RECOGNITION》 *
沈媛媛, 严严, 王菡子: "Research progress on supervised distance metric learning algorithms", Acta Automatica Sinica (《自动化学报》) *
钱强, 陈松灿: "A matrix metric learning algorithm based on the likelihood ratio test of matrix normal distributions", Journal of Shandong University (Engineering Science) (《山东大学学报(工学版)》) *
雷震春, 万艳红, 罗剑, 朱明华: "Research on speaker recognition models based on the Mahalanobis distance", Proceedings of the 13th National Conference on Man-Machine Speech Communication (NCMMSC2015) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147799A (en) * 2018-10-18 2019-01-04 广州势必可赢网络科技有限公司 A kind of method, apparatus of speech recognition, equipment and computer storage medium
CN109377984A (en) * 2018-11-22 2019-02-22 北京中科智加科技有限公司 A kind of audio recognition method and device based on ArcFace
CN109377984B (en) * 2018-11-22 2022-05-03 北京中科智加科技有限公司 ArcFace-based voice recognition method and device
CN110188641A (en) * 2019-05-20 2019-08-30 北京迈格威科技有限公司 Image recognition and the training method of neural network model, device and system
CN110188641B (en) * 2019-05-20 2022-02-01 北京迈格威科技有限公司 Image recognition and neural network model training method, device and system
CN111179914A (en) * 2019-12-04 2020-05-19 华南理工大学 Voice sample screening method based on improved dynamic time warping algorithm
CN111179914B (en) * 2019-12-04 2022-12-16 华南理工大学 Voice sample screening method based on improved dynamic time warping algorithm
CN111462762A (en) * 2020-03-25 2020-07-28 清华大学 Speaker vector regularization method and device, electronic equipment and storage medium
CN111462762B (en) * 2020-03-25 2023-02-24 清华大学 Speaker vector regularization method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105931646A (en) Speaker identification method base on simple direct tolerance learning algorithm
CN110120218B (en) Method for identifying highway large-scale vehicles based on GMM-HMM
CN110197502B (en) Multi-target tracking method and system based on identity re-identification
US9355642B2 (en) Speaker recognition method through emotional model synthesis based on neighbors preserving principle
CN103415825A (en) System and method for gesture recognition
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN105261367A (en) Identification method of speaker
CN109977213B (en) Optimal answer selection method for intelligent question-answering system
Senoussaoui et al. Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering
JP2014026455A (en) Media data analysis device, method and program
Bahari Speaker age estimation using Hidden Markov Model weight supervectors
CN108648760A (en) Real-time sound-groove identification System and method for
CN107945791A (en) A kind of audio recognition method based on deep learning target detection
CN114519351A (en) Subject text rapid detection method based on user intention embedded map learning
Ghaemmaghami et al. A study of speaker clustering for speaker attribution in large telephone conversation datasets
Prasad et al. Improving the performance of speech clustering method
CN112052880A (en) Underwater sound target identification method based on weight updating support vector machine
Ruiz-Muñoz et al. Enhancing the dissimilarity-based classification of birdsong recordings
CN112465054B (en) FCN-based multivariate time series data classification method
Chandrakala et al. Combination of generative models and SVM based classifier for speech emotion recognition
CN114242066A (en) Speech processing method, speech processing model training method, apparatus and medium
Aronowitz Trainable speaker diarization
CN113823326A (en) Method for using training sample of efficient voice keyword detector
CN116230012B (en) Two-stage abnormal sound detection method based on metadata comparison learning pre-training
Pao et al. Audio-visual speech recognition with weighted KNN-based classification in mandarin database

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20160907)