CN105810199A - Identity verification method and device for speakers - Google Patents

Identity verification method and device for speakers

Info

Publication number
CN105810199A
CN105810199A
Authority
CN
China
Prior art keywords
subspace
vector
jfa
speaker
super
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410844272.XA
Other languages
Chinese (zh)
Inventor
李志锋 (Zhifeng Li)
李娜 (Na Li)
乔宇 (Yu Qiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201410844272.XA priority Critical patent/CN105810199A/en
Publication of CN105810199A publication Critical patent/CN105810199A/en
Pending legal-status Critical Current

Abstract

The invention belongs to the technical field of speech processing and provides an identity verification method and device for speakers. The method comprises: extracting a joint factor analysis (JFA) supervector from each training utterance and generating a first subvector from it; projecting the first subvector into a first subspace using the PCA algorithm; randomly sampling the first subspace to obtain Q second subspaces; mapping the vectors in the Q second subspaces into Q third subspaces; modeling the Q third subspaces with nonparametric linear discriminant analysis; projecting the JFA supervector of every training and test utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q target-speaker reference vectors and Q test reference vectors; fusing the outputs of the Q resulting classifiers; and identifying the speaker of the training utterance with the highest fused score as the speaker of the test utterance. The method and device substantially improve the performance of a speaker verification system.

Description

Identity verification method and device for a speaker
Technical field
The invention belongs to the field of speech technology, and in particular relates to an identity verification method and device for a speaker.
Background art
The rapid development of network information technology allows people to obtain all kinds of information conveniently, but it has also created a variety of information security problems, against which identity authentication technology is particularly important. Compared with authentication means such as fingerprints, irises, faces and handwritten signatures, the human voice has become a focus of identity authentication research because it is easy to collect, easy to store and difficult to imitate; the key technical problem is speaker verification.
Converting speech data of different durations from a speaker into high-dimensional feature vectors of a fixed dimension through a suitable algorithm is currently a popular approach to speaker verification. To address the "curse of dimensionality" and the small-sample problem brought by such high-dimensional features, researchers have proposed speaker verification algorithms based on subspace analysis. However, current subspace analysis methods still suffer from the following problem: the dimensionality of the subspace has a large impact on the performance of the speaker verification system.
Summary of the invention
The purpose of the embodiments of the present invention is to provide an identity verification method and device for a speaker, aiming to solve the problem that, in current subspace-analysis-based speaker verification methods, the dimensionality of the subspace has a large impact on the performance of the verification system.
The embodiments of the present invention are realized as follows: an identity verification method for a speaker, comprising:
Extracting a joint factor analysis (JFA) supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] from each training utterance, where M_ih denotes the JFA supervector of the h-th training utterance of the i-th speaker in the training set;
Extracting k mean vectors from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate a first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk];
Projecting the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk] into a first subspace of dimension J using the principal component analysis (PCA) algorithm;
Randomly sampling the first subspace to obtain Q second subspaces;
Applying within-class covariance normalization (WCCN) to the vectors projected into the Q second subspaces, training a projection matrix W2, and then mapping the vectors in the Q second subspaces into Q third subspaces through the projection matrix W2;
Modeling the Q third subspaces with nonparametric linear discriminant analysis to obtain a projection matrix W3;
Projecting the JFA supervector of each training utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q target-speaker reference vectors;
Extracting the JFA supervector of a test utterance;
Projecting the JFA supervector of the test utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q test reference vectors;
Computing the cosine distance between the test reference vector and the target-speaker reference vector in each of the Q third subspaces to obtain the outputs of Q classifiers;
Fusing the outputs of the Q classifiers with a preset algorithm; and
Identifying the speaker of the training utterance whose fused score is highest as the speaker of the test utterance.
Another purpose of the embodiments of the present invention is to provide an identity verification device for a speaker, comprising:
A first extraction unit, configured to extract a joint factor analysis (JFA) supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] from each training utterance, where M_ih denotes the JFA supervector of the h-th training utterance of the i-th speaker in the training set;
A first dimensionality reduction unit, configured to extract k mean vectors from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate a first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk];
A second dimensionality reduction unit, configured to project the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk] into a first subspace of dimension J using the principal component analysis (PCA) algorithm;
A random sampling unit, configured to randomly sample the first subspace to obtain Q second subspaces;
A WCCN processing unit, configured to apply within-class covariance normalization (WCCN) to the vectors projected into the Q second subspaces, train a projection matrix W2, and then map the vectors in the Q second subspaces into Q third subspaces through the projection matrix W2;
A nonparametric linear discriminant analysis unit, configured to model the Q third subspaces with nonparametric linear discriminant analysis to obtain a projection matrix W3;
A first reference vector generation unit, configured to project the JFA supervector of each training utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q target-speaker reference vectors;
A second extraction unit, configured to extract the JFA supervector of a test utterance;
A second reference vector generation unit, configured to project the JFA supervector of the test utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q test reference vectors;
An output unit, configured to compute the cosine distance between the test reference vector and the target-speaker reference vector in each of the Q third subspaces to obtain the outputs of Q classifiers;
A fusion unit, configured to fuse the outputs of the Q classifiers with a preset algorithm; and
A confirmation unit, configured to identify the speaker of the training utterance whose fused score is highest as the speaker of the test utterance.
The embodiments of the present invention adopt an algorithm framework based on two-layer subspace sampling: in addition to directly applying subspace analysis to reduce the dimensionality of the original high-dimensional feature space, a random subspace sampling method is used to construct several lower-dimensional subspaces, a classifier is trained for each subspace, and the final decision is obtained by fusing the outputs of the multiple classifiers, which substantially improves the performance of the speaker verification system.
Brief description of the drawings
Fig. 1 is a flowchart of the speaker identity verification method provided by an embodiment of the present invention;
Fig. 2 is the algorithm framework diagram of the speaker identity verification method provided by an embodiment of the present invention;
Fig. 3 is a structural block diagram of the speaker identity verification device provided by an embodiment of the present invention.
Detailed description of the invention
To make the purpose, technical solution and advantages of the present invention clearer, the present invention is further described below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it.
Fig. 1 shows the implementation flow of the speaker identity verification method provided by an embodiment of the present invention, detailed as follows:
In S101, a joint factor analysis (Joint Factor Analysis, JFA) supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] is extracted from each training utterance, where M_ih denotes the JFA supervector of the h-th training utterance of the i-th speaker in the training set.
According to JFA theory, in the speaker verification framework based on the Gaussian Mixture Model-Universal Background Model (GMM-UBM), the mean supervector of a speaker model obtained through maximum a posteriori (MAP) adaptation mainly contains two components of information, speaker and channel, and both components follow Gaussian distributions. Removing the channel information from the speaker model according to the JFA method can greatly improve the performance of a speaker recognition system. The embodiments of the present invention therefore exploit the advantage of JFA under channel mismatch and use the mean supervector of the JFA-denoised speaker model as the speaker feature. First, in S101, the JFA method is used to extract a JFA supervector from each training utterance in the training set; this JFA supervector is the vector obtained by splicing together, in order, the mean vectors of all Gaussian components of the speaker model.
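As an illustration only (not the patented implementation: the component count N, feature dimension F and the random stand-in means are all assumptions for this sketch, with N = 2048 and F = 51 chosen to match the experiments described later), the following Python snippet shows how such a supervector is spliced together from per-component mean vectors:

```python
import numpy as np

N, F = 2048, 51                      # assumed: 2048 Gaussians, 51-dim features
rng = np.random.default_rng(0)

# Stand-in for the channel-compensated per-component means of one utterance's
# speaker model (in the patent these come from JFA denoising).
component_means = rng.standard_normal((N, F))

# M_ih = [m_ih1, m_ih2, ..., m_ihN]: splice the N mean vectors in order.
M_ih = component_means.reshape(-1)   # shape (N * F,) = (104448,)
```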
In S102, k mean vectors are extracted from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate a first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk].
To remove part of the redundant information from the high-dimensional original feature space generated in S101, S102 selects a subset of the mean vectors that make up the JFA supervector, forming a lower-dimensional subspace that retains most of the useful information in the JFA supervector; the vector corresponding to the JFA supervector in this subspace is denoted as the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk].
As an embodiment of the present invention, S102 is specifically:
Extracting the mean vectors ranked in the first k positions from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk].
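A minimal sketch of this first-layer selection (a non-random truncation to the first k component means; the supervector layout and the value k = 1280 are taken from the sketch above and the experiments below):

```python
import numpy as np

N, F, k = 2048, 51, 1280                                  # assumed dimensions
M_ih = np.random.default_rng(0).standard_normal(N * F)    # stand-in supervector

# Keep only the first k of the N component means.
S_ih = M_ih.reshape(N, F)[:k].reshape(-1)                 # S_ih = [m'_ih1, ..., m'_ihk]
```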
In S103, the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk] is projected into a first subspace of dimension J using the principal component analysis (Principal Component Analysis, PCA) algorithm, generating a second subvector O_ih = [o_1, o_2, ..., o_J].
Because the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk] still has a rather high dimensionality and the values in each dimension are sparsely distributed, it still contains a large amount of redundant information. The PCA method is therefore applied to the first subvector to perform an optimal dimensionality-reducing compression: it is projected through a projection matrix W1 into a low-dimensional subspace of dimension J, yielding the second subvector O_ih = [o_1, o_2, ..., o_J].
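A hedged PCA sketch follows (the toy sizes are assumptions; the real input dimension is k times the acoustic feature dimension). Since there are typically far fewer utterances than supervector dimensions, it eigendecomposes the small Gram matrix instead of the full covariance, a standard trick that is not necessarily the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_utts, dim, J = 500, 5000, 300       # toy sizes; real dim = k * 51

X = rng.standard_normal((n_utts, dim))
mean = X.mean(axis=0)
Xc = X - mean                         # centered first subvectors, one per row

# "Snapshot" PCA: eigendecompose the (n_utts x n_utts) Gram matrix.
G = Xc @ Xc.T / (n_utts - 1)
vals, vecs = np.linalg.eigh(G)
order = np.argsort(vals)[::-1][:J]

W1 = Xc.T @ vecs[:, order]            # (dim, J) projection matrix
W1 /= np.linalg.norm(W1, axis=0)      # unit-length principal directions

O = Xc @ W1                           # second subvectors O_ih, shape (n_utts, J)
```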
In S104, the first subspace is randomly sampled to obtain Q second subspaces T_1, T_2, ..., T_Q.
In S105, within-class covariance normalization (Within-Class Covariance Normalization, WCCN) is applied to the vectors projected into the Q second subspaces, a projection matrix W2 is trained, and the vectors in the Q second subspaces are then mapped into Q third subspaces through the projection matrix W2.
In this embodiment, each random subspace obtained in S104 is processed in stages: first, WCCN is applied to the development-set data projected into each random subspace, and the WCCN projection matrix W2 is trained on the development-set data; the low-dimensional feature vectors in the random subspace are then mapped through W2 into a new subspace, namely the third subspace, thereby obtaining Q new random subspaces.
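An illustrative WCCN sketch under assumed toy data (the development set, speaker labels and dimensions are invented for the example). It uses the standard construction in which the projection is a Cholesky factor of the inverse within-class covariance, so that the projected within-class covariance becomes the identity:

```python
import numpy as np

rng = np.random.default_rng(2)
n_spk, per_spk, d = 50, 8, 120        # toy development set: 50 speakers x 8 utts
labels = np.repeat(np.arange(n_spk), per_spk)
X = rng.standard_normal((n_spk * per_spk, d))

# Average per-speaker covariance = within-class covariance estimate.
Wc = np.zeros((d, d))
for s in range(n_spk):
    Xs = X[labels == s]
    Xs = Xs - Xs.mean(axis=0)
    Wc += Xs.T @ Xs / len(Xs)
Wc /= n_spk

W2 = np.linalg.cholesky(np.linalg.inv(Wc))   # W2 @ W2.T = Wc^{-1}
X_wccn = X @ W2                              # vectors mapped into the third subspace
```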
In S106, nonparametric linear discriminant analysis is used to model the Q third subspaces, obtaining a projection matrix W3.
At this point, the projection matrix of each third subspace is W2*W3.
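The patent does not spell out the nonparametric discriminant computation. The sketch below follows the usual Fukunaga-style construction, where the between-class scatter is built from each sample's K nearest neighbours in the other classes (emphasizing samples near class boundaries); the weighting terms are omitted for brevity and all data are random stand-ins, so this is an assumption-laden illustration rather than the claimed algorithm. K = 4 matches the experiments below:

```python
import numpy as np

def nda_projection(X, y, K=4, out_dim=10):
    """Simplified nonparametric discriminant analysis projection."""
    d = X.shape[1]
    Sb = np.zeros((d, d))                 # nonparametric between-class scatter
    Sw = np.zeros((d, d))                 # within-class scatter
    for c in np.unique(y):
        Xc, Xo = X[y == c], X[y != c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        for x in Xc:
            # local mean of the K nearest out-of-class neighbours of x
            nn = np.argsort(np.linalg.norm(Xo - x, axis=1))[:K]
            diff = (x - Xo[nn].mean(axis=0))[:, None]
            Sb += diff @ diff.T
    # leading eigenvectors of Sw^{-1} Sb give the projection W3
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1][:out_dim]
    return vecs.real[:, order]

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 20)) + np.repeat(np.eye(20)[:10] * 3, 20, axis=0)
y = np.repeat(np.arange(10), 20)          # 10 toy classes, 20 samples each
W3 = nda_projection(X, y)                 # (20, 10) projection matrix
```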
The subspace analysis of S101 to S106 completes the training process of the speaker verification system; for the Q third subspaces, each subspace yields a corresponding subspace classifier.
What follows is the test process (i.e., the classification process) of the speaker verification system:
In S107, the JFA supervector of each training utterance is projected into the Q third subspaces through the projection matrix W2*W3, obtaining Q target-speaker reference vectors R_train(q), q = 1, 2, ..., Q.
In S108, the JFA supervector of the test utterance is extracted.
For the test utterance, specifically, J = m + Vy + Dz can be used to convert the speaker's speech into a JFA supervector, where J denotes the JFA supervector, m denotes the UBM mean supervector, V and D denote the speaker-space loading matrix and the residual-space loading matrix respectively, and y and z denote the speaker factors and the residual factors respectively.
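A toy numerical reading of this synthesis equation (all matrices are random stand-ins rather than trained JFA parameters; the rank of V is set to 300 to match the experiments below, and the supervector dimension is shrunk for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
sv_dim, r_v = 1000, 300                  # toy supervector dim; rank of V = 300

m = rng.standard_normal(sv_dim)          # UBM mean supervector
V = rng.standard_normal((sv_dim, r_v))   # speaker-space loading matrix
D = np.diag(rng.random(sv_dim))          # diagonal residual loading matrix
y = rng.standard_normal(r_v)             # speaker factors
z = rng.standard_normal(sv_dim)          # residual factors

J = m + V @ y + D @ z                    # JFA supervector of the test utterance
```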
In S109, the JFA supervector of the test utterance is projected into the Q third subspaces through the projection matrix W2*W3, obtaining Q test reference vectors R_test(q), q = 1, 2, ..., Q.
S109 uses the same processing method as S107.
In S110, the cosine distance between the test reference vector and the target-speaker reference vector is computed in each of the Q third subspaces, giving the outputs of Q classifiers.
The computation in S110 is:
D(R_train, R_test) = (R_train^T R_test) / ( sqrt(R_train^T R_train) * sqrt(R_test^T R_test) )
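This scoring rule translates directly into code; in the sketch below the 550-dimensional vectors are random stand-ins whose size matches the nonparametric discriminant projection used in the experiments:

```python
import numpy as np

def cosine_score(r_train, r_test):
    """Cosine similarity between two reference vectors."""
    num = r_train @ r_test
    den = np.sqrt(r_train @ r_train) * np.sqrt(r_test @ r_test)
    return num / den

rng = np.random.default_rng(5)
r_train = rng.standard_normal(550)   # target-speaker reference vector
r_test = rng.standard_normal(550)    # test reference vector
print(cosine_score(r_train, r_test))
```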
In S111, the outputs of the Q classifiers are fused with a preset algorithm.
In this embodiment, the Q classifiers of S110 produce Q computed scores; then, in S111, these Q scores are fused according to the preset algorithm.
As an embodiment of the present invention, S111 is specifically: linearly fusing the outputs of the Q classifiers.
Alternatively, the outputs of the Q classifiers can also be fused by a voting method.
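Both fusion rules are simple to state in code. In the following sketch the score matrix is a random stand-in and the equal weights of the linear rule are an assumption; the argmax over the fused scores also illustrates the final decision of S112:

```python
import numpy as np

def linear_fusion(scores, weights=None):
    """Weighted sum of the Q classifier scores (equal weights by default)."""
    w = np.ones(len(scores)) / len(scores) if weights is None else weights
    return w @ scores                    # fused score per target speaker

def vote_fusion(scores):
    """Each classifier votes for its top-scoring target speaker."""
    return np.bincount(scores.argmax(axis=1), minlength=scores.shape[1])

rng = np.random.default_rng(6)
scores = rng.random((10, 5))             # Q = 10 classifiers, 5 target speakers
print(linear_fusion(scores).argmax())    # decision under linear fusion
print(vote_fusion(scores).argmax())      # decision under voting
```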
In S112, the speaker of the training utterance whose fused score is highest is identified as the speaker of the test utterance.
For different training utterances, S107 to S111 produce different score outputs; according to the score output corresponding to each training utterance, the speaker of the training utterance whose fused score is highest is identified as the speaker of the test utterance.
Corresponding to the speaker identity verification method above and to Fig. 1, Fig. 2 shows the algorithm framework of the method provided by an embodiment of the present invention. As can be seen from Fig. 2, the embodiments of the present invention combine subspace sampling in the original feature space with random sampling in the subspace obtained after dimensionality reduction. The first-layer subspace sampling in this framework operates on the means of the Gaussian components that make up the JFA supervector; its purpose is to remove part of the redundant information and determine a subspace of suitable dimension. Since a JFA supervector has the same structure as a GMM mean supervector and can be regarded as the ordered splicing of the mean vectors of the Gaussian components of a GMM, the first-layer subspace sampling in the framework of Fig. 2 takes the mean vectors in the JFA supervector as its basic unit. The second layer then randomly samples the lower-dimensional subspace obtained after PCA dimensionality reduction of the first-layer subspace, forming several new subspaces. The two layers choose their subspaces differently: the first-layer choice is deterministic, while the second-layer choice is random.
Next, the effect of the speaker identity verification method provided by the embodiments of the present invention on the performance of a speaker verification system is evaluated through concrete experiments:
In these experiments, the data are taken from the National Institute of Standards and Technology (NIST) 2008 speaker recognition evaluation database; the training and test utterances are the male telephone training and male telephone test portions of the core evaluation task. The UBM training data are the telephone speech of Switchboard II Phase 2, Switchboard II Phase 3, Switchboard Cellular Part 2 and NIST SRE 2004, 2005 and 2006, and the UBM has 2048 Gaussian components. The data used to train the nonparametric linear discriminant analysis projection matrix W3 are telephone utterances from the NIST SRE 2004, 2005 and 2006 databases, comprising 563 speakers in total, each with 8 utterances. The UBM training data of the JFA system are the same as above; the rank of the speaker-space loading matrix V is 300, the rank of the eigenchannel-space loading matrix U is 100, and the residual loading matrix D is spliced from the diagonal elements of the diagonal covariance matrices of the Gaussian components of the UBM. Unless otherwise specified, in these experiments the dimensions of the PCA, WCCN and nonparametric linear discriminant analysis projection matrices are (51 × k) × J, (E1+E2) × 799 and 799 × 550 respectively, the number of random subspaces and the number of base classifiers Q are both set to 10, and the number of neighbour samples in the nonparametric linear discriminant analysis is set to 4.
In the first-layer subspace sampling of the algorithm framework, after the original feature space has been generated by S101, S102 selects the first 1280 Gaussian mean vectors of the sorted JFA supervector to obtain the first subvector S_ih. However, the dimensionality of this first subvector is still very high relative to the number of training samples in the training set; therefore, to train reliable and stable subspace classifiers, the first subvector must be projected further into a low-dimensional PCA subspace. In these experiments, let J denote the dimension of the feature vector after PCA dimensionality reduction.
In the second-layer subspace sampling of the algorithm framework, before random sampling is performed, and in order to guarantee the performance of each subspace base classifier, the first E1 principal components of the first subspace, which carry the most information, are fixed (that is, the first E1 most informative principal components of the first subspace are chosen); the random sampling algorithm is applied only to the remaining J − E1 principal components of the first subspace, from which E2 principal components are drawn at random, yielding a random subspace of dimension E1 + E2.
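The index bookkeeping of this two-layer selection can be sketched as follows; the particular split E1 = 700, E2 = 100 is one illustrative assumption among the combinations tested, and J, E1 + E2 and Q match the experimental settings:

```python
import numpy as np

J, E1, E2, Q = 1200, 700, 100, 10     # E1 + E2 = 800, as in the experiments
rng = np.random.default_rng(7)

subspaces = []
for _ in range(Q):
    fixed = np.arange(E1)             # top-E1 components, always kept
    sampled = rng.choice(np.arange(E1, J), size=E2, replace=False)
    subspaces.append(np.concatenate([fixed, sampled]))

# Each index set selects E1 + E2 PCA dimensions for one base classifier.
```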
In the experiments on second-layer subspace sampling, the best value of J is determined by cross-validation; the value of J is fixed at 1200 or 1300 and the value of E1 + E2 is fixed at 800. For each different (E1, E2), 10 subspaces are created at random, that is, 10 base classifiers are created, and the final result is obtained by linear fusion.
Table 1 lists the experimental results of the two-layer sampling subspace speaker verification method on the NIST SRE 2008 database for seven combinations of (E1, E2), giving for each the best single-classifier result, the worst single-classifier result and the fused-system result. Each entry in the table is (EER (%), minDCF × 100), where EER is the equal error rate and minDCF is the minimum detection cost function; both are standard measures of system performance:
Table 1
From Table 1 the following can be observed:
(1) For every combination of E1 and E2, the performance of a single base classifier is unstable; this can be seen from the EER and minDCF results of the best and worst single base classifiers;
(2) In the first six groups of experiments, the EER of the best single base classifier in each group is lower than that of the combined system of the last group (800, 0); since E2 is 0 in the last group, that group does not use random subspaces. This shows that the magnitude of an eigenvalue is not an absolute measure of the discriminative power of its corresponding principal component: the principal components corresponding to some smaller eigenvalues may carry more discriminative information, which is also the underlying reason why the base classifiers constructed from the random subspaces have a degree of diversity and complementarity;
(3) The results after multi-classifier fusion are lower in both EER and minDCF than those of the base classifiers, and within each group the fused results are more stable overall, showing that multi-classifier fusion can effectively improve system performance;
(4) Compared with a single speaker verification system that does not use random subspaces, the two-layer sampling subspace speaker verification system proposed by the present invention reduces the EER from 4.32 to 4.01, and the minDCF also drops slightly.
The embodiments of the present invention adopt an algorithm framework based on two-layer subspace sampling: in addition to directly applying subspace analysis to reduce the dimensionality of the original high-dimensional feature space, a random subspace sampling method is used to construct several lower-dimensional subspaces, a classifier is trained for each subspace, and the final decision is obtained by fusing the outputs of the multiple classifiers, which substantially improves the performance of the speaker verification system.
Corresponding to the embodiments described above, Fig. 3 shows the structural block diagram of the speaker identity verification device provided by an embodiment of the present invention. For ease of description, only the parts related to this embodiment are shown.
With reference to Fig. 3, this device includes:
A first extraction unit 301, which extracts a JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] from each training utterance, where M_ih denotes the JFA supervector of the h-th training utterance of the i-th speaker in the training set.
A first dimensionality reduction unit 302, which extracts k mean vectors from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate a first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk].
A second dimensionality reduction unit 303, which projects the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk] into a first subspace of dimension J using the PCA algorithm.
A random sampling unit 304, which randomly samples the first subspace to obtain Q second subspaces.
A WCCN processing unit 305, which applies WCCN to the vectors projected into the Q second subspaces, trains a projection matrix W2, and then maps the vectors in the Q second subspaces into Q third subspaces through the projection matrix W2.
A nonparametric linear discriminant analysis unit 306, which models the Q third subspaces with nonparametric linear discriminant analysis to obtain a projection matrix W3.
A first reference vector generation unit 307, which projects the JFA supervector of each training utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q target-speaker reference vectors.
A second extraction unit 308, which extracts the JFA supervector of the test utterance.
A second reference vector generation unit 309, which projects the JFA supervector of the test utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q test reference vectors.
An output unit 310, which computes the cosine distance between the test reference vector and the target-speaker reference vector in each of the Q third subspaces to obtain the outputs of Q classifiers.
A fusion unit 311, which fuses the outputs of the Q classifiers with a preset algorithm.
A confirmation unit 312, which identifies the speaker of the training utterance whose fused score is highest as the speaker of the test utterance.
Optionally, the first dimensionality reduction unit 302 is specifically configured to:
Extract the mean vectors ranked in the first k positions from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk].
Optionally, the random sampling unit 304 includes:
A first selection subunit, which chooses the first E1 most informative principal components of the first subspace.
A second selection subunit, which randomly selects E2 principal components from the remaining J − E1 principal components of the first subspace by a random sampling algorithm.
A generation subunit, which generates Q second subspaces of dimension E1 + E2.
Optionally, the second extraction unit 308 is specifically configured to:
Use J = m + Vy + Dz to convert the test utterance into the JFA supervector of the test utterance, where J denotes the JFA supervector, m denotes the universal background model (UBM) mean supervector, V and D denote the speaker-space loading matrix and the residual-space loading matrix respectively, and y and z denote the speaker factors and the residual factors respectively.
Optionally, the fusion unit 311 is specifically configured to:
Linearly fuse the outputs of the Q classifiers.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. An identity verification method for a speaker, characterized by comprising:
Extracting a joint factor analysis (JFA) supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] from each training utterance, where M_ih denotes the JFA supervector of the h-th training utterance of the i-th speaker in the training set;
Extracting k mean vectors from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate a first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk];
Projecting the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk] into a first subspace of dimension J using the principal component analysis (PCA) algorithm;
Randomly sampling the first subspace to obtain Q second subspaces;
Applying within-class covariance normalization (WCCN) to the vectors projected into the Q second subspaces, training a projection matrix W2, and then mapping the vectors in the Q second subspaces into Q third subspaces through the projection matrix W2;
Modeling the Q third subspaces with nonparametric linear discriminant analysis to obtain a projection matrix W3;
Projecting the JFA supervector of each training utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q target-speaker reference vectors;
Extracting the JFA supervector of a test utterance;
Projecting the JFA supervector of the test utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q test reference vectors;
Computing the cosine distance between the test reference vector and the target-speaker reference vector in each of the Q third subspaces to obtain the outputs of Q classifiers;
Fusing the outputs of the Q classifiers with a preset algorithm; and
Identifying the speaker of the training utterance whose fused score is highest as the speaker of the test utterance.
2. The method of claim 1, characterized in that extracting k mean vectors from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk] comprises:
Extracting the mean vectors ranked in the first k positions from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk].
3. The method of claim 1, characterized in that randomly sampling the first subspace to obtain Q second subspaces comprises:
Choosing the first E1 most informative principal components of the first subspace;
Randomly selecting E2 principal components from the remaining J − E1 principal components of the first subspace by a random sampling algorithm; and
Generating Q second subspaces of dimension E1 + E2.
4. The method of claim 1, characterized in that extracting the JFA supervector of the test utterance comprises:
Using J = m + Vy + Dz to convert the test utterance into the JFA supervector of the test utterance, where J denotes the JFA supervector, m denotes the universal background model (UBM) mean supervector, V and D denote the speaker-space loading matrix and the residual-space loading matrix respectively, and y and z denote the speaker factors and the residual factors respectively.
5. The method of claim 1, characterized in that fusing the outputs of the Q classifiers with the preset algorithm comprises:
Linearly fusing the outputs of the Q classifiers.
6. An identity verification device for a speaker, characterized by comprising:
A first extraction unit, configured to extract a joint factor analysis (JFA) supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] from each training utterance, where M_ih denotes the JFA supervector of the h-th training utterance of the i-th speaker in the training set;
A first dimensionality reduction unit, configured to extract k mean vectors from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate a first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk];
A second dimensionality reduction unit, configured to project the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk] into a first subspace of dimension J using the principal component analysis (PCA) algorithm;
A random sampling unit, configured to randomly sample the first subspace to obtain Q second subspaces;
A WCCN processing unit, configured to apply within-class covariance normalization (WCCN) to the vectors projected into the Q second subspaces, train a projection matrix W2, and then map the vectors in the Q second subspaces into Q third subspaces through the projection matrix W2;
A nonparametric linear discriminant analysis unit, configured to model the Q third subspaces with nonparametric linear discriminant analysis to obtain a projection matrix W3;
A first reference vector generation unit, configured to project the JFA supervector of each training utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q target-speaker reference vectors;
A second extraction unit, configured to extract the JFA supervector of a test utterance;
A second reference vector generation unit, configured to project the JFA supervector of the test utterance into the Q third subspaces through the projection matrix W2*W3 to obtain Q test reference vectors;
An output unit, configured to compute the cosine distance between the test reference vector and the target-speaker reference vector in each of the Q third subspaces to obtain the outputs of Q classifiers;
A fusion unit, configured to fuse the outputs of the Q classifiers with a preset algorithm; and
A confirmation unit, configured to identify the speaker of the training utterance whose fused score is highest as the speaker of the test utterance.
7. The device of claim 6, characterized in that the first dimensionality reduction unit is specifically configured to:
Extract the mean vectors ranked in the first k positions from the JFA supervector M_ih = [m_ih1, m_ih2, ..., m_ihN] of the training utterance to generate the first subvector S_ih = [m'_ih1, m'_ih2, ..., m'_ihk].
8. The device of claim 6, characterized in that the random sampling unit comprises:
A first selection subunit, configured to choose the first E1 most informative principal components of the first subspace;
A second selection subunit, configured to randomly select E2 principal components from the remaining J − E1 principal components of the first subspace by a random sampling algorithm; and
A generation subunit, configured to generate Q second subspaces of dimension E1 + E2.
9. The device of claim 6, characterized in that the second extraction unit is specifically configured to:
Use J = m + Vy + Dz to convert the test utterance into the JFA supervector of the test utterance, where J denotes the JFA supervector, m denotes the universal background model (UBM) mean supervector, V and D denote the speaker-space loading matrix and the residual-space loading matrix respectively, and y and z denote the speaker factors and the residual factors respectively.
10. The device of claim 6, characterized in that the fusion unit is specifically configured to:
Linearly fuse the outputs of the Q classifiers.
CN201410844272.XA 2014-12-30 2014-12-30 Identity verification method and device for speakers Pending CN105810199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410844272.XA CN105810199A (en) 2014-12-30 2014-12-30 Identity verification method and device for speakers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410844272.XA CN105810199A (en) 2014-12-30 2014-12-30 Identity verification method and device for speakers

Publications (1)

Publication Number Publication Date
CN105810199A 2016-07-27

Family

ID=56420092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410844272.XA Pending CN105810199A (en) 2014-12-30 2014-12-30 Identity verification method and device for speakers

Country Status (1)

Country Link
CN (1) CN105810199A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531170A (en) * 2016-12-12 2017-03-22 姜卫武 Spoken language assessment identity authentication method based on speaker recognition technology
CN109065022A (en) * 2018-06-06 2018-12-21 平安科技(深圳)有限公司 I-vector extraction method, speaker recognition method, device, equipment and medium
CN109145148A (en) * 2017-06-28 2019-01-04 百度在线网络技术(北京)有限公司 Information processing method and device
CN109165726A (en) * 2018-08-17 2019-01-08 联智科技(天津)有限责任公司 A neural network embedded system for text-independent speaker verification
CN110010137A (en) * 2019-04-04 2019-07-12 杭州电子科技大学 A speaker verification method and system based on tensor structure and sparse representation
WO2019237519A1 (en) * 2018-06-11 2019-12-19 平安科技(深圳)有限公司 General vector training method, voice clustering method, apparatus, device and medium
US10909991B2 (en) 2018-04-24 2021-02-02 ID R&D, Inc. System for text-dependent speaker recognition and method thereof
US10970573B2 (en) 2018-04-27 2021-04-06 ID R&D, Inc. Method and system for free text keystroke biometric authentication

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758263A (en) * 2005-10-31 2006-04-12 浙江大学 Multi-model ID recognition method based on scoring difference weight compromised
CN101847208A (en) * 2010-06-11 2010-09-29 哈尔滨工程大学 Secondary classification fusion identification method for fingerprint and finger vein bimodal identification
CN101894550A (en) * 2010-07-19 2010-11-24 东南大学 Speech emotion classifying method for emotion-based characteristic optimization
CN102045162A (en) * 2009-10-16 2011-05-04 电子科技大学 Personal identification system of permittee with tri-modal biometric characteristic and control method thereof
CN102664011A (en) * 2012-05-17 2012-09-12 吉林大学 Method for quickly recognizing speaker
CN103077720A (en) * 2012-12-19 2013-05-01 中国科学院声学研究所 Speaker identification method and system
CN103514170A (en) * 2012-06-20 2014-01-15 中国移动通信集团安徽有限公司 Speech-recognition text classification method and device
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758263A (en) * 2005-10-31 2006-04-12 浙江大学 Multi-model ID recognition method based on scoring difference weight compromised
CN102045162A (en) * 2009-10-16 2011-05-04 电子科技大学 Personal identification system of permittee with tri-modal biometric characteristic and control method thereof
CN101847208A (en) * 2010-06-11 2010-09-29 哈尔滨工程大学 Secondary classification fusion identification method for fingerprint and finger vein bimodal identification
CN101894550A (en) * 2010-07-19 2010-11-24 东南大学 Speech emotion classifying method for emotion-based characteristic optimization
CN102664011A (en) * 2012-05-17 2012-09-12 吉林大学 Method for quickly recognizing speaker
CN103514170A (en) * 2012-06-20 2014-01-15 中国移动通信集团安徽有限公司 Speech-recognition text classification method and device
CN103077720A (en) * 2012-12-19 2013-05-01 中国科学院声学研究所 Speaker identification method and system
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NA LI ET AL: "An analysis framework of two-level sampling subspace for speaker verification", 2013 IEEE International Conference of IEEE Region 10 (TENCON 2013) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531170A (en) * 2016-12-12 2017-03-22 姜卫武 Spoken language assessment identity authentication method based on speaker recognition technology
CN109145148A (en) * 2017-06-28 2019-01-04 百度在线网络技术(北京)有限公司 Information processing method and device
US10909991B2 (en) 2018-04-24 2021-02-02 ID R&D, Inc. System for text-dependent speaker recognition and method thereof
US10970573B2 (en) 2018-04-27 2021-04-06 ID R&D, Inc. Method and system for free text keystroke biometric authentication
CN109065022A (en) * 2018-06-06 2018-12-21 平安科技(深圳)有限公司 I-vector extraction method, speaker recognition method, device, equipment and medium
CN109065022B (en) * 2018-06-06 2022-08-09 平安科技(深圳)有限公司 Method for extracting i-vector, method, device, equipment and medium for speaker recognition
WO2019237519A1 (en) * 2018-06-11 2019-12-19 平安科技(深圳)有限公司 General vector training method, voice clustering method, apparatus, device and medium
CN109165726A (en) * 2018-08-17 2019-01-08 联智科技(天津)有限责任公司 A neural network embedded system for text-independent speaker verification
CN110010137A (en) * 2019-04-04 2019-07-12 杭州电子科技大学 A speaker verification method and system based on tensor structure and sparse representation
CN110010137B (en) * 2019-04-04 2021-09-28 杭州电子科技大学 Speaker confirmation method and system based on tensor structure and sparse representation

Similar Documents

Publication Publication Date Title
CN105810199A (en) Identity verification method and device for speakers
CN110147726B (en) Service quality inspection method and device, storage medium and electronic device
CN107194341B (en) Face recognition method and system based on fusion of Maxout multi-convolution neural network
CN104167208B (en) A speaker recognition method and device
He et al. Performance evaluation of score level fusion in multimodal biometric systems
CN100356388C (en) Biocharacteristics fusioned identity distinguishing and identification method
CN101226590B (en) Method for recognizing human face
CN105261367B (en) A speaker recognition method
Sarhan et al. Multimodal biometric systems: a comparative study
CN101976360B (en) Sparse characteristic face recognition method based on multilevel classification
CN104538035B (en) A speaker recognition method and system based on Fisher supervectors
CN105160299A (en) Human face emotion identifying method based on Bayes fusion sparse representation classifier
Song et al. Speech emotion recognition using transfer non-negative matrix factorization
Mane et al. Review of multimodal biometrics: applications, challenges and research areas
CN103246880B (en) Based on the face identification method of the remarkable pattern feature statistics in multistage local
KR101016758B1 (en) Method for identifying image face and system thereof
CN113158777A (en) Quality scoring method, quality scoring model training method and related device
Mondal et al. Secure and hassle-free EVM through deep learning based face recognition
Ou et al. LinCos-softmax: Learning angle-discriminative face representations with linearity-enhanced cosine logits
CN102237089A (en) Method for reducing the error identification rate of a text-independent speaker identification system
Diez et al. New insight into the use of phone log-likelihood ratios as features for language recognition
CN116152870A (en) Face recognition method, device, electronic equipment and computer readable storage medium
CN103116758A (en) Color face identification method based on RGB (red, green and blue) color feature double identification analysis
Baaqeel et al. A self-adapting face authentication system with deep learning
Hu et al. On-line signature verification based on fusion of global and local information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160727

RJ01 Rejection of invention patent application after publication