CN105976819A - Rnorm score normalization based speaker verification method - Google Patents

Rnorm score normalization based speaker verification method

Info

Publication number
CN105976819A
CN105976819A (application CN201610172918.3A)
Authority
CN
China
Prior art keywords
sigma
omega
rnorm
test
gamma
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610172918.3A
Other languages
Chinese (zh)
Inventor
Chen Haoliang (陈昊亮)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Speakin Network Technology Co Ltd
Original Assignee
Guangzhou Speakin Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Speakin Network Technology Co Ltd filed Critical Guangzhou Speakin Network Technology Co Ltd
Priority to CN201610172918.3A priority Critical patent/CN105976819A/en
Publication of CN105976819A publication Critical patent/CN105976819A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/12 Score normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a speaker verification method based on Rnorm score normalization. The method comprises the steps of: obtaining the identity vector ω_tar of a target speaker and the identity vector ω_UBM of a universal background model in the training phase; obtaining the identity vector ω_test of the test utterance in the test phase; computing a score Λ6(ω_test, ω_clm) by Rnorm score normalization from the target speaker identity vector ω_tar, the universal background model identity vector ω_UBM and the test utterance identity vector ω_test; and judging whether the score Λ6(ω_test, ω_clm) exceeds a threshold, in which case the claimed identity is confirmed and accepted, and otherwise rejected. The method greatly reduces computational complexity and saves computation time while maintaining high verification accuracy.

Description

Speaker verification method based on Rnorm score normalization
Technical field
The invention belongs to the field of speaker recognition technology, and specifically relates to a speaker verification method based on Rnorm score normalization.
Background technology
The final step of speaker verification is the decision. In this step, the likelihood score obtained by comparing the input speech signal with the claimed speaker's model is compared with a preset decision threshold: if the likelihood exceeds the threshold, the claimed speaker is accepted; otherwise the claim is rejected. Setting the decision threshold is very difficult, and in practice it is usually chosen empirically.
The score is affected by many factors:
● speakers pronounce differently, influenced by mood, age, health status and the speaker's own vocal tract;
● training data differ across speakers in quality, content and duration;
● the environmental noise and channel at training time do not match those at test time.
Traditional score normalization methods (Znorm, Tnorm, ZTnorm and TZnorm) were proposed for GMM-UBM based speaker verification systems and have been applied to them successfully. The situation is different for a speaker verification system based on identity vectors (an i-vector SV, or i-SV, system). In an i-vector system, the main purpose of the training phase is to train, for each speaker tar, a corresponding i-vector model from that speaker's training speech. The main purpose of the test phase is, given a test utterance and a claimed speaker clm, to judge whether the test utterance was spoken by clm; the decision is based on the similarity between the claimed speaker model ω_clm and the test utterance model ω_test. Training speech may carry many kinds of noise, such as channel noise, and this noise shifts the trained i-vector model. For example, the claimed speaker model ω_clm is trained from the claimed speaker's enrollment speech, while ω'_test is the i-vector model obtained from the test utterance after channel noise is removed. As shown in Fig. 1, define θ_{clm,test} as the angle between ω_clm and ω_test, θ_{clm,test'} as the angle between ω_clm and ω'_test, and θ_{non-clm,test} as the angle between ω_non-clm and ω_test.
In theory, ω'_test is close to ω_clm: if θ_{clm,test'} is small enough, below the threshold we set, the SV system concludes that the test utterance was spoken by speaker clm. In practice, however, channel noise is present, so ω'_test may drift to ω_test. The angle actually used in the decision is then θ_{clm,test}; as Fig. 2(a) shows, θ_{clm,test} is large, above the threshold, so the speaker verification system no longer believes the test utterance was spoken by clm. This is a false rejection caused by channel mismatch.
Meanwhile, Fig. 2(b) shows the model ω_non-clm, i.e. the i-vector model of a speaker other than the claimed one; the position of ω_test relative to ω_non-clm illustrates that the channel effect differs from speaker to speaker. This creates a threshold-setting problem: a different threshold would have to be set for each speaker, which greatly increases the complexity of the verification system.
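For reference, the cosine similarity scoring (CSS) that these angles correspond to can be sketched in a few lines of Python (a minimal illustration under our own naming; the patent itself gives no code):

```python
import numpy as np

def cosine_score(w_a: np.ndarray, w_b: np.ndarray) -> float:
    """Cosine of the angle between two i-vectors: large when the angle
    theta between the vectors is small."""
    return float(w_a @ w_b / (np.linalg.norm(w_a) * np.linalg.norm(w_b)))

# Plain CSS decision: accept the claimed identity when theta_{clm,test}
# is small enough, i.e. when cos(theta) exceeds a preset threshold.
# Channel noise shifts w_test, shrinking cos(theta) and causing the
# false rejections described above.
def css_decision(w_test: np.ndarray, w_clm: np.ndarray, threshold: float) -> bool:
    return cosine_score(w_test, w_clm) >= threshold
```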
Summary of the invention
In order to solve the above problems, the object of the present invention is to provide a speaker verification method based on Rnorm score normalization which greatly simplifies the computational complexity and saves computation time while maintaining high verification accuracy.
To achieve the above object, the present invention is realized by the following technical scheme:
The speaker verification method based on Rnorm score normalization of the present invention is characterised in that it comprises the following steps:
Obtain the identity vector ω_tar of the target speaker and the identity vector ω_UBM of the universal background model in the training phase;
Obtain the identity vector ω_test of the test utterance in the test phase;
Compute the score Λ6(ω_test, ω_clm) by Rnorm score normalization from the target speaker identity vector ω_tar, the universal background model identity vector ω_UBM and the test utterance identity vector ω_test;
Judge whether the score Λ6(ω_test, ω_clm) exceeds a threshold; if so, the claim is confirmed and accepted; otherwise it is rejected.
Further, the steps of obtaining the identity vector ω_tar of the target speaker in the training phase are as follows:
Calculate the Baum-Welch statistics of any utterance y_J(t) of speaker J;
Using the trained total variability space matrix T, calculate the i-vector model of utterance y_J(t) with the following formula;
The formula (the posterior mean of ω, consistent with the posterior distribution given below) is:
\omega = (I + T^T \Sigma^{-1} N(s) T)^{-1}\, T^T \Sigma^{-1} \tilde{F}(s)
Further, the calculation steps of the total variability space matrix T are as follows:
Calculate the Baum-Welch statistics corresponding to each speaker s in the training speech;
Randomly generate the initial value of the total variability space matrix T;
Calculate the posterior distribution of ω;
Re-estimate by maximum likelihood and update the total variability space matrix T;
The update formula for the total variability space matrix T is as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP. The steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
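A compact numpy sketch of this EM loop, under our own simplifying assumptions (a diagonal UBM covariance stored as a supervector, per-utterance Baum-Welch statistics already computed; all names are illustrative, not the patent's):

```python
import numpy as np

def update_T(T, stats, Sigma_inv, n_iters=10):
    """EM re-estimation of the total variability matrix T, shape (C*F, R).
    stats: list of per-utterance (N, F_tilde) pairs, where N is the (C,)
    vector of zeroth-order statistics and F_tilde the (C*F,) centered
    first-order supervector; Sigma_inv: (C*F,) diagonal of Sigma^{-1}.
    T is assumed randomly initialised by the caller (the "random initial
    value" step above)."""
    CF, R = T.shape
    C = stats[0][0].shape[0]
    F = CF // C
    for _ in range(n_iters):
        Phi = np.zeros((C, R, R))      # accumulators Phi_c
        Omega = np.zeros((CF, R))      # accumulator Omega
        for N, F_tilde in stats:
            # Posterior of omega: l = I + T' Sigma^-1 N T
            N_diag = np.repeat(N, F)   # expand N(s) to the C*F diagonal
            l = np.eye(R) + (T.T * (Sigma_inv * N_diag)) @ T
            l_inv = np.linalg.inv(l)
            E_w = l_inv @ (T.T @ (Sigma_inv * F_tilde))   # E[omega]
            E_wwT = l_inv + np.outer(E_w, E_w)            # E[omega omega']
            Phi += N[:, None, None] * E_wwT[None]         # Phi_c += N_c E[ww']
            Omega += np.outer(F_tilde, E_w)               # Omega += F~ E[w]'
        # Maximum-likelihood update: solve T_i Phi_c = Omega_i blockwise
        for c in range(C):
            rows = slice(c * F, (c + 1) * F)
            T[rows] = np.linalg.solve(Phi[c].T, Omega[rows].T).T
    return T
```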
Further, the Baum-Welch statistics corresponding to each speaker s in the training speech are calculated as follows:
Given speaker s, s = 1, 2, …, S, and its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}. For each Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
Define the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating the F̃_c(s); and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of the S̃_c(s).
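A small numpy sketch of these statistics for one utterance, assuming a diagonal-covariance UBM whose weights w, means m and variances v are given as arrays (function and variable names are ours):

```python
import numpy as np

def baum_welch_stats(X, w, m, v):
    """Zeroth- and centered first-order Baum-Welch statistics of one
    utterance (the quantities N_c(s) and F_tilde_c(s) above).
    X: (P, D) feature frames; w: (C,) UBM weights;
    m: (C, D) UBM means; v: (C, D) diagonal UBM variances."""
    # log N(x_t | m_c, v_c) for every frame/component (diagonal Gaussians)
    log_p = -0.5 * (np.log(2 * np.pi * v).sum(axis=1)
                    + ((X[:, None, :] - m[None]) ** 2 / v[None]).sum(axis=2))
    log_wp = np.log(w)[None] + log_p                 # (P, C)
    log_wp -= log_wp.max(axis=1, keepdims=True)      # numerical stabilisation
    gamma = np.exp(log_wp)
    gamma /= gamma.sum(axis=1, keepdims=True)        # occupancies gamma_t(c)
    N = gamma.sum(axis=0)                            # N_c(s)
    F = gamma.T @ X                                  # F_c(s), shape (C, D)
    F_tilde = F - N[:, None] * m                     # centered first order
    return N, F_tilde.reshape(-1)                    # supervector (C*D,)
```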
Further, the steps for calculating the posterior distribution of ω are as follows:
Given speaker s, s = 1, 2, …, S, and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix. Then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
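The posterior mean E[ω_{s,h}] is exactly the i-vector of the utterance; a minimal extraction sketch consistent with the formulas above (assuming, as before, a diagonal Σ stored as a supervector; names are ours):

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N, F_tilde):
    """E[omega] = l^{-1} T' Sigma^{-1} F_tilde with l = I + T' Sigma^{-1} N T.
    T: (C*D, R); Sigma_inv: (C*D,) diagonal of Sigma^{-1};
    N: (C,) zeroth-order stats; F_tilde: (C*D,) centered first-order stats."""
    CD, R = T.shape
    D = CD // N.shape[0]
    N_diag = np.repeat(N, D)              # expand N(s) to the C*D diagonal
    l = np.eye(R) + (T.T * (Sigma_inv * N_diag)) @ T
    return np.linalg.solve(l, T.T @ (Sigma_inv * F_tilde))
```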
Further, the identity vector ω_UBM of the universal background model in the training phase is obtained by the expectation-maximization (EM) algorithm.
Further, the steps of obtaining the identity vector ω_test of the test utterance in the test phase are as follows:
Calculate the Baum-Welch statistics of the test utterance y_test(t);
Using the trained total variability space matrix T, calculate the i-vector model of the test utterance y_test(t) with the following formula;
The formula is:
\omega_{test} = (I + T^T \Sigma^{-1} N T)^{-1}\, T^T \Sigma^{-1} \tilde{F}
Further, the calculation steps of the total variability space matrix T are as follows:
Calculate the Baum-Welch statistics corresponding to the test utterance;
Randomly generate the initial value of the total variability space matrix T;
Calculate the posterior distribution of ω;
Re-estimate by maximum likelihood and update the total variability space matrix T;
The update formula for the total variability space matrix T is as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP. The steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
Further, the Baum-Welch statistics corresponding to the test utterance y_test(t) are calculated as follows:
Given the test utterance and its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}. For Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
Define the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating the F̃_c(s); and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of the S̃_c(s);
The steps for calculating the posterior distribution of ω are as follows:
Given the test utterance and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix. Then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
Further, the formula for computing the score by Rnorm score normalization is:
\Lambda_6(\omega_{test}, \omega_{clm}) = \frac{\mathrm{score}(\omega_{test}, \omega_{clm})}{\mathrm{score}(\omega_{test}, \omega_{UBM})}
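A minimal sketch of the Rnorm decision, assuming cosine similarity as the underlying score(·,·) (consistent with the CSS baseline used in the experiments below); the threshold value is illustrative and would be tuned on development data:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rnorm_verify(w_test, w_clm, w_ubm, threshold=1.0):
    """Lambda_6 = score(test, clm) / score(test, UBM); accept when it
    exceeds the threshold. Dividing by the UBM score lets the UBM play
    the role of the non-claimed speaker model, so one global threshold
    suffices instead of a per-speaker threshold."""
    lam6 = cos(w_test, w_clm) / cos(w_test, w_ubm)
    return ("accept" if lam6 > threshold else "reject"), lam6
```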
Compared with the prior art, the beneficial effects of the invention are:
First, the speaker verification method based on Rnorm score normalization of the present invention obtains the identity vectors of the target speaker and the universal background model in the training phase and of the test utterance in the test phase, then computes a score by Rnorm score normalization and compares it with the set threshold: if the score exceeds the threshold, the claim is confirmed and accepted; otherwise it is rejected.
Second, the method combines the advantages of i-vector based speaker verification and represents the non-claimed speaker model directly by the universal background model, so that no separate non-claimed speaker model needs to be built for each speaker. This simplifies the computation, saves a corresponding amount of time, and still achieves high verification accuracy.
Brief description of the drawings
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the design idea of the Rnorm algorithm in the background art of the speaker verification method based on Rnorm score normalization of the present invention;
Fig. 2(a) is a schematic diagram of the scoring principle in the background art when the test utterance is spoken by the claimed speaker;
Fig. 2(b) is a schematic diagram of the scoring principle in the background art when the test utterance is not spoken by the claimed speaker;
Fig. 3 is the flow chart of the speaker verification method based on Rnorm score normalization of the present invention;
Fig. 4 shows the DET curves of the method on the TIMIT database;
Fig. 5 shows the DET curves of the method on the "3convs-1conv" task.
Detailed description of the invention
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are intended only to illustrate and explain the present invention, not to limit it.
The speaker verification method based on Rnorm (ratio normalization) score normalization of the present invention builds on traditional score normalization, exploits its advantages, and combines it with an i-vector based speaker verification system, which can achieve a comparatively high verification rate. However, when an i-vector based verification system computes the final normalized score, a different decision threshold has to be set for each speaker, which makes the final decision complex and time-consuming. To solve this problem, the universal background model is used directly in place of the non-claimed speaker model, so that only a single threshold needs to be set to complete the final decision. This greatly reduces the computational complexity and saves time.
The flow chart of the speaker verification method based on Rnorm score normalization of the present invention is shown in Fig. 3. The specific steps are as follows:
S01: obtain the identity vector ω_tar of the target speaker in the training phase. The specific steps are:
(1) calculate the Baum-Welch statistics of any utterance y_J(t) of speaker J;
(2) using the trained total variability space matrix T, calculate the i-vector model of utterance y_J(t) with the following formula:
\omega = (I + T^T \Sigma^{-1} N(s) T)^{-1}\, T^T \Sigma^{-1} \tilde{F}(s)
The calculation steps of the total variability space matrix T are as follows:
(a) Calculate the Baum-Welch statistics corresponding to each speaker s in the training speech: given speaker s, s = 1, 2, …, S, and its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}; for each Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
Define the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating the F̃_c(s); and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of the S̃_c(s).
(b) Randomly generate the initial value of the total variability space matrix T.
(c) Calculate the posterior distribution of ω:
Given speaker s, s = 1, 2, …, S, and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix. Then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
(d) Re-estimate by maximum likelihood and update the total variability space matrix T; the update formula is as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP. The steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
S02: obtain the identity vector ω_UBM of the universal background model:
The identity vector ω_UBM of the universal background model is obtained with the expectation-maximization (EM) algorithm.
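The patent does not spell out the EM recipe for the UBM; as one possible sketch, a diagonal-covariance UBM can be fitted with an off-the-shelf EM implementation such as scikit-learn's GaussianMixture (illustrative only; large UBMs are usually trained with specialised toolkits such as the MSR Identity Toolbox used in the experiments below):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features: np.ndarray, n_components: int = 2028):
    """Fit a diagonal-covariance GMM-UBM by EM on features pooled over
    all background speakers; background_features: (n_frames, dim)."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=20)
    ubm.fit(background_features)
    # w_c, m_c and the diagonal covariances used in the statistics above
    return ubm.weights_, ubm.means_, ubm.covariances_
```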
S03: obtain the identity vector ω_test of the test utterance in the test phase:
(1) calculate the Baum-Welch statistics of the test utterance y_test(t);
(2) using the trained total variability space matrix T, calculate the i-vector model of the test utterance y_test(t) with the following formula:
\omega_{test} = (I + T^T \Sigma^{-1} N T)^{-1}\, T^T \Sigma^{-1} \tilde{F}
The calculation steps of the total variability space matrix T are as follows:
(a) Calculate the Baum-Welch statistics corresponding to the test utterance:
Given the test utterance and its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}; for Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
Define the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating the F̃_c(s); and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of the S̃_c(s).
(b) Randomly generate the initial value of the total variability space matrix T.
(c) Calculate the posterior distribution of ω:
Given the test utterance and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix. Then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
(d) Re-estimate by maximum likelihood and update the total variability space matrix T; the update formula is as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP. The steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
S04: compute the score Λ6(ω_test, ω_clm) by Rnorm score normalization from the target speaker identity vector ω_tar, the universal background model identity vector ω_UBM and the test utterance identity vector ω_test;
where Λ_6(ω_test, ω_clm) = score(ω_test, ω_clm) / score(ω_test, ω_UBM). Here ω_clm and ω_tar refer to the same concept: ω_tar belongs to the training phase and ω_clm to the test phase, and the two are computed in the same way.
S05: judge whether the score Λ6(ω_test, ω_clm) exceeds the threshold; if so, the claim is confirmed and accepted; otherwise it is rejected.
In the experiments, the MSR Identity Toolbox was used to implement the text-independent i-vector based speaker verification system as the baseline system. Two speech databases are used: TIMIT and NIST SRE 2004. The MFCC dimension is 20, with the first dimension being log-energy; first-order and second-order differences of the 20-dimensional MFCCs are appended, giving a final feature dimension of 60. The 60-dimensional features are processed with feature warping and cepstral mean normalization. The UBM training data come from 792 utterances in the 8-side and 16-side conditions of the NIST SRE 2004 database (each about 3 min to 5 min), 4620 utterances from the TIMIT database (each about 3 s to 5 s), and 15 noise recordings from NOISEX-92. The trained UBM is a gender-dependent GMM with 2028 components. The i-vector dimension is 400.
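The 60-dimensional feature configuration above (20 MFCCs with log-energy as the first coefficient, plus first- and second-order differences, then feature warping and cepstral mean normalization) can be approximated, for example, with librosa; a sketch under our own assumptions about frame settings, which the patent does not specify (feature warping is omitted here, and librosa's zeroth cepstral coefficient only approximates log-energy):

```python
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """20 MFCCs + deltas + delta-deltas = 60 dims per frame, then CMN."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # (20, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    feats = np.vstack([mfcc, d1, d2]).T                  # (n_frames, 60)
    feats -= feats.mean(axis=0)                          # cepstral mean norm
    return feats
```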
The TIMIT speech database, released by the international LDC organization, was the first available speech database with a large number of speakers. It was jointly developed by MIT, SRI International and Texas Instruments, and is therefore widely used in speaker recognition research. TIMIT was recorded in public places by 630 speakers (438 male, 192 female) covering eight dialects of English; each speaker reads 10 sentences, each about 3 s to 5 s long. The recordings were made with fixed microphones over a fixed transmission channel, the recorded content is English sentences, and there are no constraints on recording intervals. The TIMIT database is sampled at 16 kHz with 16-bit quantization.
Since the NIST SRE evaluations began in 1996, each subsequent evaluation database has been built on the previous years' evaluations and adjusted according to the research level and practical conditions of the time, so a large amount of data has gradually accumulated. The NIST SRE 2004 database is sampled at 8 kHz with 8-bit μ-law quantization in sph file format. It is mainly derived from the Mixer-1 corpus and contains 616 speakers in total (370 female, 246 male). The data in NIST SRE 2004 are everyday conversational telephone speech, recorded mainly with cordless phones, landline phones and mobile phones. The database also considers multilingual and bilingual speakers, with the languages evenly distributed over Arabic, English, Russian, French and Chinese. Since the design of NIST SRE 2004 takes both language and channel into account, this database is commonly used to train UBM models, channel spaces and the like. As shown in Table 1, NIST SRE 2004 includes 7 training conditions and 4 test conditions.
Table 1: NIST SRE 2004 evaluation task conditions
To examine the performance of the speaker verification method based on Rnorm score normalization of the present invention without channel mismatch, the TIMIT database is used: it is a standard speech database with a single recording mode, English content and a clean recording environment. 108 speakers are taken from the TIMIT test set, with 9 utterances for training and 1 utterance for testing per speaker, and 600 sentences as impostor speech.
Table 2: EER and minDCF of different scoring methods in the i-SV system on the TIMIT database
As can be seen from Fig. 4 and Table 2, the speaker verification method based on Rnorm score normalization of the present invention reduces the EER of the i-SV system by 0.4% compared with the original cosine similarity scoring (CSS) method, and outperforms the CSS-Znorm, CSS-Tnorm, CSS-ZTnorm and CSS-TZnorm scoring methods, while the minimum detection cost changes relatively little. Fig. 4 also shows that the i-SV systems based on CSS-Rnorm and CSS-ZTnorm scoring have similar overall trends and similar behaviour. The reason is that TIMIT speech is very clean and free of channel mismatch, so good results are already obtained with plain CSS scoring; the various normalization methods therefore have little effect on the TIMIT results, yet they can still improve system performance.
To verify the performance of the proposed i-CSS-Rnorm-SV system under channel mismatch, the speaker verification method based on Rnorm score normalization of the present invention uses the NIST SRE 2004 database. NIST SRE 2004 contains multiple voice channels, including microphone and telephone channels, and its collection environments are diverse. The "3convs-1conv" task in NIST SRE 2004 is used: each speaker's training data are 3 conversations of two-channel recorded telephone speech, each about 5 min, and the test data are 1 conversation, for a total of 22899 test trials.
As can be seen from Fig. 5, on the "3convs-1conv" task of the NIST SRE 2004 database the i-SV system based on CSS-Rnorm scoring achieves the best result, reducing the EER by 4.5% compared with the i-SV system based on plain CSS scoring; CSS-Tnorm is better than CSS-Znorm, while the EER and minDCF of the CSS-ZTnorm and CSS-TZnorm based i-SV systems are very close, with little difference. However, as Table 3 shows, i-CSS-Znorm-SV achieves the best minimum detection cost and demonstrates its advantage in system complexity and speed. The reason for this phenomenon is that the Znorm score normalization is computed offline, so it can achieve the lowest minDCF, whereas the Tnorm score normalization is computed at test time, so its minDCF is worse than Znorm's. Because the Rnorm score normalization proposed by the speaker verification method of the present invention takes into account both the characteristics of i-vector model scores and the effect on threshold setting, the EER is minimized under channel mismatch conditions.
Table 3: EER and minDCF of different scoring methods in the i-SV system on the "3convs-1conv" task
The above are only preferred embodiments of the present invention, and the present invention is not limited in any form. Any modifications, equivalent changes and variations made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still fall within the scope of the technical solution of the present invention.

Claims (10)

1. A speaker verification method based on Rnorm score normalization, characterised in that it comprises the following steps:
obtaining the identity vector ω_tar of the target speaker and the identity vector ω_UBM of the universal background model in the training phase;
obtaining the identity vector ω_test of the test utterance in the test phase;
computing the score Λ6(ω_test, ω_clm) by Rnorm score normalization from the target speaker identity vector ω_tar, the universal background model identity vector ω_UBM and the test utterance identity vector ω_test;
judging whether the score Λ6(ω_test, ω_clm) exceeds a threshold; if so, the claim is confirmed and accepted; otherwise it is rejected.
2. The speaker verification method based on Rnorm score normalization according to claim 1, characterised in that:
the steps of obtaining the identity vector ω_tar of the target speaker in the training phase are as follows:
calculating the Baum-Welch statistics of any utterance y_J(t) of speaker J;
using the trained total variability space matrix T, calculating the i-vector model of utterance y_J(t) with the following formula:
\omega = (I + T^T \Sigma^{-1} N(s) T)^{-1}\, T^T \Sigma^{-1} \tilde{F}(s)
3. The speaker verification method based on Rnorm score normalization according to claim 2, characterised in that:
the calculation steps of the total variability space matrix T are as follows:
calculating the Baum-Welch statistics corresponding to each speaker s in the training speech;
randomly generating the initial value of the total variability space matrix T;
calculating the posterior distribution of ω;
re-estimating by maximum likelihood and updating the total variability space matrix T;
the update formula for the total variability space matrix T being as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP; the steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
4. The speaker verification method based on Rnorm score normalization according to claim 3, characterised in that:
the Baum-Welch statistics corresponding to each speaker s in the training speech are calculated as follows:
given speaker s, s = 1, 2, …, S, and its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}; for each Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) are defined as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating F̃_c(s), c = 1, 2, …, C; and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s), c = 1, 2, …, C.
5. The speaker verification method based on Rnorm score normalization according to claim 3, characterised in that:
the steps for calculating the posterior distribution of ω are as follows:
given speaker s, s = 1, 2, …, S, and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix; then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
6. The speaker verification method based on Rnorm score normalization according to claim 1, characterised in that:
the identity vector ω_UBM of the universal background model in the training phase is obtained by the expectation-maximization (EM) algorithm.
7. The speaker verification method based on Rnorm score normalization according to claim 1, characterised in that:
the steps of obtaining the identity vector ω_test of the test utterance in the test phase are as follows:
calculating the Baum-Welch statistics of the test utterance y_test(t);
using the trained total variability space matrix T, calculating the i-vector model of the test utterance y_test(t) with the following formula:
\omega_{test} = (I + T^T \Sigma^{-1} N T)^{-1}\, T^T \Sigma^{-1} \tilde{F}
8. The speaker verification method based on Rnorm score normalization according to claim 7, characterised in that:
the calculation steps of the total variability space matrix T are as follows:
calculating the Baum-Welch statistics corresponding to the test utterance;
randomly generating the initial value of the total variability space matrix T;
calculating the posterior distribution of ω;
re-estimating by maximum likelihood and updating the total variability space matrix T;
the update formula for the total variability space matrix T being as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP; the steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
9. The speaker verification method based on Rnorm score normalization according to claim 7, characterised in that:
the Baum-Welch statistics corresponding to the test utterance y_test(t) are calculated as follows:
given the test utterance and its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}; for Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) are defined as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating F̃_c(s), c = 1, 2, …, C; and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s), c = 1, 2, …, C;
the steps for calculating the posterior distribution of ω are as follows:
given the test utterance and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix; then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
10. The speaker verification method based on Rnorm score normalization according to claim 1, characterised in that the formula for computing the score by Rnorm score normalization is:
\Lambda_6(\omega_{test}, \omega_{clm}) = \frac{\mathrm{score}(\omega_{test}, \omega_{clm})}{\mathrm{score}(\omega_{test}, \omega_{UBM})}
CN201610172918.3A 2016-03-23 2016-03-23 Rnorm score normalization based speaker verification method Pending CN105976819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610172918.3A CN105976819A (en) 2016-03-23 2016-03-23 Rnorm score normalization based speaker verification method

Publications (1)

Publication Number Publication Date
CN105976819A true CN105976819A (en) 2016-09-28

Family

ID=56989505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610172918.3A Pending CN105976819A (en) 2016-03-23 2016-03-23 Rnorm score normalization based speaker verification method

Country Status (1)

Country Link
CN (1) CN105976819A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109997185A (en) * 2016-11-07 2019-07-09 思睿逻辑国际半导体有限公司 Method and apparatus for the biometric authentication in electronic equipment
CN110110790A (en) * 2019-05-08 2019-08-09 中国科学技术大学 Using the regular method for identifying speaker of Unsupervised clustering score
WO2020034628A1 (en) * 2018-08-14 2020-02-20 平安科技(深圳)有限公司 Accent identification method and device, computer device, and storage medium
CN111883142A (en) * 2020-07-30 2020-11-03 山东理工大学 Speaker confirmation method based on log-likelihood value normalization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9165555B2 (en) * 2005-01-12 2015-10-20 At&T Intellectual Property Ii, L.P. Low latency real-time vocal tract length normalization
CN102486922A (en) * 2010-12-03 2012-06-06 株式会社理光 Speaker recognition method, device and system
US20140244257A1 (en) * 2013-02-25 2014-08-28 Nuance Communications, Inc. Method and Apparatus for Automated Speaker Parameters Adaptation in a Deployed Speaker Verification System
CN103345923A (en) * 2013-07-26 2013-10-09 电子科技大学 Sparse representation based short-voice speaker recognition method
CN103730121A (en) * 2013-12-24 2014-04-16 中山大学 Method and device for recognizing disguised sounds

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Elhocine Boutellaa et al.: "Improving Online Signature Verification by User-Specific Likelihood Ratio Score Normalization", The 8th International Workshop on Systems, Signal Processing and Their Applications 2013: Special Sessions *
Hongke Ning et al.: "A New Score Normalization for Text-Independent Speaker Verification", Proceedings of the 19th International Conference on Digital Signal Processing *
Jiang Ye (蒋晔): "Research on Speaker Recognition Based on Short Utterances and Channel Variation", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Similar Documents

Publication Publication Date Title
Zeinali et al. DeepMine Speech Processing Database: Text-Dependent and Independent Speaker Verification and Speech Recognition in Persian and English.
CN104143326B (en) A kind of voice command identification method and device
McLaren et al. Exploring the role of phonetic bottleneck features for speaker and language recognition
Lei et al. Dialect classification via text-independent training and testing for Arabic, Spanish, and Chinese
Heck et al. Robustness to telephone handset distortion in speaker recognition by discriminative feature design
Reynolds Automatic speaker recognition: Current approaches and future trends
Zeinali et al. Text-dependent speaker verification based on i-vectors, neural networks and hidden Markov models
CN105976819A (en) Rnorm score normalization based speaker verification method
CN104240706A (en) Speaker recognition method based on GMM Token matching similarity correction scores
Novotný et al. Analysis of Speaker Recognition Systems in Realistic Scenarios of the SITW 2016 Challenge.
CN110364168A (en) A kind of method for recognizing sound-groove and system based on environment sensing
Charisma et al. Speaker recognition using mel-frequency cepstrum coefficients and sum square error
Ajili Reliability of voice comparison for forensic applications
CN114220419A (en) Voice evaluation method, device, medium and equipment
Meyer et al. Autonomous measurement of speech intelligibility utilizing automatic speech recognition.
Wildermoth et al. GMM based speaker recognition on readily available databases
Sadeghian et al. Towards an automated screening tool for pediatric speech delay
Sarkar et al. Incorporating pass-phrase dependent background models for text-dependent speaker verification
CN109273012A (en) A kind of identity identifying method based on Speaker Identification and spoken digit recognition
Kenai et al. Forensic gender speaker recognition under clean and noisy environments
Dumpala et al. Analysis of the Effect of Speech-Laugh on Speaker Recognition System.
CN108694950A (en) A kind of method for identifying speaker based on depth mixed model
Karbasi et al. Blind Non-Intrusive Speech Intelligibility Prediction Using Twin-HMMs.
Rozi et al. Language-aware PLDA for multilingual speaker recognition
Beigi Effects of time lapse on speaker recognition results

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160928