CN105976819A - Rnorm score normalization based speaker verification method - Google Patents

Rnorm score normalization based speaker verification method

Info

Publication number
CN105976819A
CN105976819A (application CN201610172918.3A)
Authority
CN
China
Prior art keywords
sigma
omega
rnorm
test
gamma
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610172918.3A
Other languages
Chinese (zh)
Inventor
Chen Haoliang (陈昊亮)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Speakin Network Technology Co Ltd
Original Assignee
Guangzhou Speakin Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Speakin Network Technology Co Ltd filed Critical Guangzhou Speakin Network Technology Co Ltd
Priority to CN201610172918.3A priority Critical patent/CN105976819A/en
Publication of CN105976819A publication Critical patent/CN105976819A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/12 Score normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a speaker verification method based on Rnorm score normalization. The method comprises the steps of: obtaining the identity vector ω_tar of a target speaker and the identity vector ω_UBM of a universal background model in the training phase; obtaining the identity vector ω_test of the test utterance in the test phase; computing a score Λ6(ω_test, ω_clm) by Rnorm score normalization from the target speaker identity vector ω_tar, the universal background model identity vector ω_UBM and the test utterance identity vector ω_test; and judging whether the score Λ6(ω_test, ω_clm) exceeds a threshold, in which case the claimed identity is confirmed and accepted, and otherwise rejected. The method greatly reduces computational complexity and saves computation time while maintaining high verification accuracy.

Description

Speaker verification method based on Rnorm score normalization
Technical field
The invention belongs to the field of speaker recognition technology, and specifically relates to a speaker verification method based on Rnorm score normalization.
Background technology
The final step of speaker verification is the decision. In this step, the likelihood score obtained by comparing the input speech signal with the claimed speaker's model is compared with a preset decision threshold: if the likelihood exceeds the threshold, the claimed speaker is accepted; otherwise the claim is rejected. Setting the decision threshold is very difficult, and in practice it is usually chosen empirically.
The score is affected by many factors:
● speakers pronounce differently, influenced by mood, age, health status and the speaker's own vocal tract;
● training data differ across speakers in quality, content and duration;
● the environmental noise and channel at training time do not match those at test time.
Traditional score normalization methods (Znorm, Tnorm, ZTnorm and TZnorm) were proposed for GMM-UBM based speaker verification systems and have been applied to them successfully. The situation is different for a speaker verification system based on identity vectors (an i-vector SV, or i-SV, system). In an i-vector system, the main purpose of the training phase is to train, for each speaker tar, a corresponding i-vector model from that speaker's training speech. The main purpose of the test phase is, given a test utterance and a claimed speaker clm, to judge whether the test utterance was spoken by clm; the decision is based on the similarity between the claimed speaker model ω_clm and the test utterance model ω_test. Training speech may carry many kinds of noise, such as channel noise, and this noise shifts the trained i-vector model. For example, the claimed speaker model ω_clm is trained from the claimed speaker's enrollment speech, while ω'_test is the i-vector model obtained from the test utterance after channel noise is removed. As shown in Fig. 1, define θ_{clm,test} as the angle between ω_clm and ω_test, θ_{clm,test'} as the angle between ω_clm and ω'_test, and θ_{non-clm,test} as the angle between ω_non-clm and ω_test.
In theory, ω'_test is close to ω_clm: if θ_{clm,test'} is small enough, below the threshold we set, the SV system concludes that the test utterance was spoken by speaker clm. In practice, however, channel noise is present, so ω'_test may drift to ω_test. The angle actually used in the decision is then θ_{clm,test}; as Fig. 2(a) shows, θ_{clm,test} is large, above the threshold, so the speaker verification system no longer believes the test utterance was spoken by clm. This is a false rejection caused by channel mismatch.
Meanwhile, Fig. 2(b) shows the model ω_non-clm, i.e. the i-vector model of a speaker other than the claimed one; the position of ω_test relative to ω_non-clm illustrates that the channel effect differs from speaker to speaker. This creates a threshold-setting problem: a different threshold would have to be set for each speaker, which greatly increases the complexity of the verification system.
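For reference, the cosine similarity scoring (CSS) that these angles correspond to can be sketched in a few lines of Python (a minimal illustration under our own naming; the patent itself gives no code):

```python
import numpy as np

def cosine_score(w_a: np.ndarray, w_b: np.ndarray) -> float:
    """Cosine of the angle between two i-vectors: large when the angle
    theta between the vectors is small."""
    return float(w_a @ w_b / (np.linalg.norm(w_a) * np.linalg.norm(w_b)))

# Plain CSS decision: accept the claimed identity when theta_{clm,test}
# is small enough, i.e. when cos(theta) exceeds a preset threshold.
# Channel noise shifts w_test, shrinking cos(theta) and causing the
# false rejections described above.
def css_decision(w_test: np.ndarray, w_clm: np.ndarray, threshold: float) -> bool:
    return cosine_score(w_test, w_clm) >= threshold
```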
Summary of the invention
In order to solve the above problems, the object of the present invention is to provide a speaker verification method based on Rnorm score normalization which greatly simplifies the computational complexity and saves computation time while maintaining high verification accuracy.
To achieve the above object, the present invention is realized by the following technical scheme:
The speaker verification method based on Rnorm score normalization of the present invention is characterised in that it comprises the following steps:
Obtain the identity vector ω_tar of the target speaker and the identity vector ω_UBM of the universal background model in the training phase;
Obtain the identity vector ω_test of the test utterance in the test phase;
Compute the score Λ6(ω_test, ω_clm) by Rnorm score normalization from the target speaker identity vector ω_tar, the universal background model identity vector ω_UBM and the test utterance identity vector ω_test;
Judge whether the score Λ6(ω_test, ω_clm) exceeds a threshold; if so, the claim is confirmed and accepted; otherwise it is rejected.
Further, the steps of obtaining the identity vector ω_tar of the target speaker in the training phase are as follows:
Calculate the Baum-Welch statistics of any utterance y_J(t) of speaker J;
Using the trained total variability space matrix T, calculate the i-vector model of utterance y_J(t) with the following formula;
The formula (the posterior mean of ω, consistent with the posterior distribution given below) is:
\omega = (I + T^T \Sigma^{-1} N(s) T)^{-1}\, T^T \Sigma^{-1} \tilde{F}(s)
Further, the calculation steps of the total variability space matrix T are as follows:
Calculate the Baum-Welch statistics corresponding to each speaker s in the training speech;
Randomly generate the initial value of the total variability space matrix T;
Calculate the posterior distribution of ω;
Re-estimate by maximum likelihood and update the total variability space matrix T;
The update formula for the total variability space matrix T is as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP. The steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
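A compact numpy sketch of this EM loop, under our own simplifying assumptions (a diagonal UBM covariance stored as a supervector, per-utterance Baum-Welch statistics already computed; all names are illustrative, not the patent's):

```python
import numpy as np

def update_T(T, stats, Sigma_inv, n_iters=10):
    """EM re-estimation of the total variability matrix T, shape (C*F, R).
    stats: list of per-utterance (N, F_tilde) pairs, where N is the (C,)
    vector of zeroth-order statistics and F_tilde the (C*F,) centered
    first-order supervector; Sigma_inv: (C*F,) diagonal of Sigma^{-1}.
    T is assumed randomly initialised by the caller (the "random initial
    value" step above)."""
    CF, R = T.shape
    C = stats[0][0].shape[0]
    F = CF // C
    for _ in range(n_iters):
        Phi = np.zeros((C, R, R))      # accumulators Phi_c
        Omega = np.zeros((CF, R))      # accumulator Omega
        for N, F_tilde in stats:
            # Posterior of omega: l = I + T' Sigma^-1 N T
            N_diag = np.repeat(N, F)   # expand N(s) to the C*F diagonal
            l = np.eye(R) + (T.T * (Sigma_inv * N_diag)) @ T
            l_inv = np.linalg.inv(l)
            E_w = l_inv @ (T.T @ (Sigma_inv * F_tilde))   # E[omega]
            E_wwT = l_inv + np.outer(E_w, E_w)            # E[omega omega']
            Phi += N[:, None, None] * E_wwT[None]         # Phi_c += N_c E[ww']
            Omega += np.outer(F_tilde, E_w)               # Omega += F~ E[w]'
        # Maximum-likelihood update: solve T_i Phi_c = Omega_i blockwise
        for c in range(C):
            rows = slice(c * F, (c + 1) * F)
            T[rows] = np.linalg.solve(Phi[c].T, Omega[rows].T).T
    return T
```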
Further, the Baum-Welch statistics corresponding to each speaker s in the training speech are calculated as follows:
Given speaker s, s = 1, 2, …, S, and its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}. For each Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
Define the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating the F̃_c(s); and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of the S̃_c(s).
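A small numpy sketch of these statistics for one utterance, assuming a diagonal-covariance UBM whose weights w, means m and variances v are given as arrays (function and variable names are ours):

```python
import numpy as np

def baum_welch_stats(X, w, m, v):
    """Zeroth- and centered first-order Baum-Welch statistics of one
    utterance (the quantities N_c(s) and F_tilde_c(s) above).
    X: (P, D) feature frames; w: (C,) UBM weights;
    m: (C, D) UBM means; v: (C, D) diagonal UBM variances."""
    # log N(x_t | m_c, v_c) for every frame/component (diagonal Gaussians)
    log_p = -0.5 * (np.log(2 * np.pi * v).sum(axis=1)
                    + ((X[:, None, :] - m[None]) ** 2 / v[None]).sum(axis=2))
    log_wp = np.log(w)[None] + log_p                 # (P, C)
    log_wp -= log_wp.max(axis=1, keepdims=True)      # numerical stabilisation
    gamma = np.exp(log_wp)
    gamma /= gamma.sum(axis=1, keepdims=True)        # occupancies gamma_t(c)
    N = gamma.sum(axis=0)                            # N_c(s)
    F = gamma.T @ X                                  # F_c(s), shape (C, D)
    F_tilde = F - N[:, None] * m                     # centered first order
    return N, F_tilde.reshape(-1)                    # supervector (C*D,)
```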
Further, the steps for calculating the posterior distribution of ω are as follows:
Given speaker s, s = 1, 2, …, S, and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix. Then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
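The posterior mean E[ω_{s,h}] is exactly the i-vector of the utterance; a minimal extraction sketch consistent with the formulas above (assuming, as before, a diagonal Σ stored as a supervector; names are ours):

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N, F_tilde):
    """E[omega] = l^{-1} T' Sigma^{-1} F_tilde with l = I + T' Sigma^{-1} N T.
    T: (C*D, R); Sigma_inv: (C*D,) diagonal of Sigma^{-1};
    N: (C,) zeroth-order stats; F_tilde: (C*D,) centered first-order stats."""
    CD, R = T.shape
    D = CD // N.shape[0]
    N_diag = np.repeat(N, D)              # expand N(s) to the C*D diagonal
    l = np.eye(R) + (T.T * (Sigma_inv * N_diag)) @ T
    return np.linalg.solve(l, T.T @ (Sigma_inv * F_tilde))
```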
Further, the identity vector ω_UBM of the universal background model in the training phase is obtained by the expectation-maximization (EM) algorithm.
Further, the steps of obtaining the identity vector ω_test of the test utterance in the test phase are as follows:
Calculate the Baum-Welch statistics of the test utterance y_test(t);
Using the trained total variability space matrix T, calculate the i-vector model of the test utterance y_test(t) with the following formula;
The formula is:
\omega_{test} = (I + T^T \Sigma^{-1} N T)^{-1}\, T^T \Sigma^{-1} \tilde{F}
Further, the calculation steps of the total variability space matrix T are as follows:
Calculate the Baum-Welch statistics corresponding to the test utterance;
Randomly generate the initial value of the total variability space matrix T;
Calculate the posterior distribution of ω;
Re-estimate by maximum likelihood and update the total variability space matrix T;
The update formula for the total variability space matrix T is as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP. The steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
Further, the Baum-Welch statistics corresponding to the test utterance y_test(t) are calculated as follows:
Given the test utterance and its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}. For Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
Define the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating the F̃_c(s); and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of the S̃_c(s);
The steps for calculating the posterior distribution of ω are as follows:
Given the test utterance and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix. Then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
Further, the formula for computing the score by Rnorm score normalization is:
\Lambda_6(\omega_{test}, \omega_{clm}) = \frac{\mathrm{score}(\omega_{test}, \omega_{clm})}{\mathrm{score}(\omega_{test}, \omega_{UBM})}
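A minimal sketch of the Rnorm decision, assuming cosine similarity as the underlying score(·,·) (consistent with the CSS baseline used in the experiments below); the threshold value is illustrative and would be tuned on development data:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rnorm_verify(w_test, w_clm, w_ubm, threshold=1.0):
    """Lambda_6 = score(test, clm) / score(test, UBM); accept when it
    exceeds the threshold. Dividing by the UBM score lets the UBM play
    the role of the non-claimed speaker model, so one global threshold
    suffices instead of a per-speaker threshold."""
    lam6 = cos(w_test, w_clm) / cos(w_test, w_ubm)
    return ("accept" if lam6 > threshold else "reject"), lam6
```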
Compared with the prior art, the beneficial effects of the invention are:
First, the speaker verification method based on Rnorm score normalization of the present invention obtains the identity vectors of the target speaker and the universal background model in the training phase and of the test utterance in the test phase, then computes a score by Rnorm score normalization and compares it with the set threshold: if the score exceeds the threshold, the claim is confirmed and accepted; otherwise it is rejected.
Second, the method combines the advantages of i-vector based speaker verification and represents the non-claimed speaker model directly by the universal background model, so that no separate non-claimed speaker model needs to be built for each speaker. This simplifies the computation, saves a corresponding amount of time, and still achieves high verification accuracy.
Brief description of the drawings
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the design idea of the Rnorm algorithm in the background art of the speaker verification method based on Rnorm score normalization of the present invention;
Fig. 2(a) is a schematic diagram of the scoring principle in the background art when the test utterance is spoken by the claimed speaker;
Fig. 2(b) is a schematic diagram of the scoring principle in the background art when the test utterance is not spoken by the claimed speaker;
Fig. 3 is the flow chart of the speaker verification method based on Rnorm score normalization of the present invention;
Fig. 4 shows the DET curves of the method on the TIMIT database;
Fig. 5 shows the DET curves of the method on the "3convs-1conv" task.
Detailed description of the invention
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are intended only to illustrate and explain the present invention, not to limit it.
The speaker verification method based on Rnorm (ratio normalization) score normalization of the present invention builds on traditional score normalization, exploits its advantages, and combines it with an i-vector based speaker verification system, which can achieve a comparatively high verification rate. However, when an i-vector based verification system computes the final normalized score, a different decision threshold has to be set for each speaker, which makes the final decision complex and time-consuming. To solve this problem, the universal background model is used directly in place of the non-claimed speaker model, so that only a single threshold needs to be set to complete the final decision. This greatly reduces the computational complexity and saves time.
The flow chart of the speaker verification method based on Rnorm score normalization of the present invention is shown in Fig. 3. The specific steps are as follows:
S01: obtain the identity vector ω_tar of the target speaker in the training phase. The specific steps are:
(1) calculate the Baum-Welch statistics of any utterance y_J(t) of speaker J;
(2) using the trained total variability space matrix T, calculate the i-vector model of utterance y_J(t) with the following formula:
\omega = (I + T^T \Sigma^{-1} N(s) T)^{-1}\, T^T \Sigma^{-1} \tilde{F}(s)
The calculation steps of the total variability space matrix T are as follows:
(a) Calculate the Baum-Welch statistics corresponding to each speaker s in the training speech: given speaker s, s = 1, 2, …, S, and its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}; for each Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
Define the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating the F̃_c(s); and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of the S̃_c(s).
(b) Randomly generate the initial value of the total variability space matrix T.
(c) Calculate the posterior distribution of ω:
Given speaker s, s = 1, 2, …, S, and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix. Then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
(d) Re-estimate by maximum likelihood and update the total variability space matrix T; the update formula is as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP. The steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
S02: obtain the identity vector ω_UBM of the universal background model:
The identity vector ω_UBM of the universal background model is obtained with the expectation-maximization (EM) algorithm.
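The patent does not spell out the EM recipe for the UBM; as one possible sketch, a diagonal-covariance UBM can be fitted with an off-the-shelf EM implementation such as scikit-learn's GaussianMixture (illustrative only; large UBMs are usually trained with specialised toolkits such as the MSR Identity Toolbox used in the experiments below):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features: np.ndarray, n_components: int = 2028):
    """Fit a diagonal-covariance GMM-UBM by EM on features pooled over
    all background speakers; background_features: (n_frames, dim)."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=20)
    ubm.fit(background_features)
    # w_c, m_c and the diagonal covariances used in the statistics above
    return ubm.weights_, ubm.means_, ubm.covariances_
```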
S03: obtain the identity vector ω_test of the test utterance in the test phase:
(1) calculate the Baum-Welch statistics of the test utterance y_test(t);
(2) using the trained total variability space matrix T, calculate the i-vector model of the test utterance y_test(t) with the following formula:
\omega_{test} = (I + T^T \Sigma^{-1} N T)^{-1}\, T^T \Sigma^{-1} \tilde{F}
The calculation steps of the total variability space matrix T are as follows:
(a) Calculate the Baum-Welch statistics corresponding to the test utterance:
Given the test utterance and its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}; for Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
Define the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating the F̃_c(s); and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of the S̃_c(s).
(b) Randomly generate the initial value of the total variability space matrix T.
(c) Calculate the posterior distribution of ω:
Given the test utterance and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix. Then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
(d) Re-estimate by maximum likelihood and update the total variability space matrix T; the update formula is as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP. The steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
S04: compute the score Λ6(ω_test, ω_clm) by Rnorm score normalization from the target speaker identity vector ω_tar, the universal background model identity vector ω_UBM and the test utterance identity vector ω_test;
where Λ_6(ω_test, ω_clm) = score(ω_test, ω_clm) / score(ω_test, ω_UBM). Here ω_clm and ω_tar refer to the same concept: ω_tar belongs to the training phase and ω_clm to the test phase, and the two are computed in the same way.
S05: judge whether the score Λ6(ω_test, ω_clm) exceeds the threshold; if so, the claim is confirmed and accepted; otherwise it is rejected.
In the experiments, the MSR Identity Toolbox was used to implement the text-independent i-vector based speaker verification system as the baseline system. Two speech databases are used: TIMIT and NIST SRE 2004. The MFCC dimension is 20, with the first dimension being log-energy; first-order and second-order differences of the 20-dimensional MFCCs are appended, giving a final feature dimension of 60. The 60-dimensional features are processed with feature warping and cepstral mean normalization. The UBM training data come from 792 utterances in the 8-side and 16-side conditions of the NIST SRE 2004 database (each about 3 min to 5 min), 4620 utterances from the TIMIT database (each about 3 s to 5 s), and 15 noise recordings from NOISEX-92. The trained UBM is a gender-dependent GMM with 2028 components. The i-vector dimension is 400.
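The 60-dimensional feature configuration above (20 MFCCs with log-energy as the first coefficient, plus first- and second-order differences, then feature warping and cepstral mean normalization) can be approximated, for example, with librosa; a sketch under our own assumptions about frame settings, which the patent does not specify (feature warping is omitted here, and librosa's zeroth cepstral coefficient only approximates log-energy):

```python
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """20 MFCCs + deltas + delta-deltas = 60 dims per frame, then CMN."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # (20, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    feats = np.vstack([mfcc, d1, d2]).T                  # (n_frames, 60)
    feats -= feats.mean(axis=0)                          # cepstral mean norm
    return feats
```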
The TIMIT speech database, released by the international LDC organization, was the first available speech database with a large number of speakers. It was jointly developed by MIT, SRI International and Texas Instruments, and is therefore widely used in speaker recognition research. TIMIT was recorded in public places by 630 speakers (438 male, 192 female) covering eight dialects of English; each speaker reads 10 sentences, each about 3 s to 5 s long. The recordings were made with fixed microphones over a fixed transmission channel, the recorded content is English sentences, and there are no constraints on recording intervals. The TIMIT database is sampled at 16 kHz with 16-bit quantization.
Since the NIST SRE evaluations began in 1996, each subsequent evaluation database has been built on the previous years' evaluations and adjusted according to the research level and practical conditions of the time, so a large amount of data has gradually accumulated. The NIST SRE 2004 database is sampled at 8 kHz with 8-bit μ-law quantization in sph file format. It is mainly derived from the Mixer-1 corpus and contains 616 speakers in total (370 female, 246 male). The data in NIST SRE 2004 are everyday conversational telephone speech, recorded mainly with cordless phones, landline phones and mobile phones. The database also considers multilingual and bilingual speakers, with the languages evenly distributed over Arabic, English, Russian, French and Chinese. Since the design of NIST SRE 2004 takes both language and channel into account, this database is commonly used to train UBM models, channel spaces and the like. As shown in Table 1, NIST SRE 2004 includes 7 training conditions and 4 test conditions.
Table 1: NIST SRE 2004 evaluation task conditions
To examine the performance of the speaker verification method based on Rnorm score normalization of the present invention without channel mismatch, the TIMIT database is used: it is a standard speech database with a single recording mode, English content and a clean recording environment. 108 speakers are taken from the TIMIT test set, with 9 utterances for training and 1 utterance for testing per speaker, and 600 sentences as impostor speech.
Table 2: EER and minDCF of different scoring methods in the i-SV system on the TIMIT database
As can be seen from Fig. 4 and Table 2, the speaker verification method based on Rnorm score normalization of the present invention reduces the EER of the i-SV system by 0.4% compared with the original cosine similarity scoring (CSS) method, and outperforms the CSS-Znorm, CSS-Tnorm, CSS-ZTnorm and CSS-TZnorm scoring methods, while the minimum detection cost changes relatively little. Fig. 4 also shows that the i-SV systems based on CSS-Rnorm and CSS-ZTnorm scoring have similar overall trends and similar behaviour. The reason is that TIMIT speech is very clean and free of channel mismatch, so good results are already obtained with plain CSS scoring; the various normalization methods therefore have little effect on the TIMIT results, yet they can still improve system performance.
To verify the performance of the proposed i-CSS-Rnorm-SV system under channel mismatch, the speaker verification method based on Rnorm score normalization of the present invention uses the NIST SRE 2004 database. NIST SRE 2004 contains multiple voice channels, including microphone and telephone channels, and its collection environments are diverse. The "3convs-1conv" task in NIST SRE 2004 is used: each speaker's training data are 3 conversations of two-channel recorded telephone speech, each about 5 min, and the test data are 1 conversation, for a total of 22899 test trials.
As can be seen from Fig. 5, on the "3convs-1conv" task of the NIST SRE 2004 database the i-SV system based on CSS-Rnorm scoring achieves the best result, reducing the EER by 4.5% compared with the i-SV system based on plain CSS scoring; CSS-Tnorm is better than CSS-Znorm, while the EER and minDCF of the CSS-ZTnorm and CSS-TZnorm based i-SV systems are very close, with little difference. However, as Table 3 shows, i-CSS-Znorm-SV achieves the best minimum detection cost and demonstrates its advantage in system complexity and speed. The reason for this phenomenon is that the Znorm score normalization is computed offline, so it can achieve the lowest minDCF, whereas the Tnorm score normalization is computed at test time, so its minDCF is worse than Znorm's. Because the Rnorm score normalization proposed by the speaker verification method of the present invention takes into account both the characteristics of i-vector model scores and the effect on threshold setting, the EER is minimized under channel mismatch conditions.
Table 3: EER and minDCF of different scoring methods in the i-SV system on the "3convs-1conv" task
The above are only preferred embodiments of the present invention, and the present invention is not limited in any form. Any modifications, equivalent changes and variations made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still fall within the scope of the technical solution of the present invention.

Claims (10)

1. A speaker verification method based on Rnorm score normalization, characterised in that it comprises the following steps:
obtaining the identity vector ω_tar of the target speaker and the identity vector ω_UBM of the universal background model in the training phase;
obtaining the identity vector ω_test of the test utterance in the test phase;
computing the score Λ6(ω_test, ω_clm) by Rnorm score normalization from the target speaker identity vector ω_tar, the universal background model identity vector ω_UBM and the test utterance identity vector ω_test;
judging whether the score Λ6(ω_test, ω_clm) exceeds a threshold; if so, the claim is confirmed and accepted; otherwise it is rejected.
2. The speaker verification method based on Rnorm score normalization according to claim 1, characterised in that:
the steps of obtaining the identity vector ω_tar of the target speaker in the training phase are as follows:
calculating the Baum-Welch statistics of any utterance y_J(t) of speaker J;
using the trained total variability space matrix T, calculating the i-vector model of utterance y_J(t) with the following formula:
\omega = (I + T^T \Sigma^{-1} N(s) T)^{-1}\, T^T \Sigma^{-1} \tilde{F}(s)
3. The speaker verification method based on Rnorm score normalization according to claim 2, characterised in that:
the calculation steps of the total variability space matrix T are as follows:
calculating the Baum-Welch statistics corresponding to each speaker s in the training speech;
randomly generating the initial value of the total variability space matrix T;
calculating the posterior distribution of ω;
re-estimating by maximum likelihood and updating the total variability space matrix T;
the update formula for the total variability space matrix T being as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP; the steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
4. The speaker verification method based on Rnorm score normalization according to claim 3, characterised in that:
the Baum-Welch statistics corresponding to each speaker s in the training speech are calculated as follows:
given speaker s, s = 1, 2, …, S, and its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}; for each Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) are defined as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating F̃_c(s), c = 1, 2, …, C; and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s), c = 1, 2, …, C.
5. The speaker verification method based on Rnorm score normalization according to claim 3, characterised in that:
the steps for calculating the posterior distribution of ω are as follows:
given speaker s, s = 1, 2, …, S, and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th utterance y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix; then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
6. The speaker verification method based on Rnorm score normalization according to claim 1, characterised in that:
the identity vector ω_UBM of the universal background model in the training phase is obtained by the expectation-maximization (EM) algorithm.
7. The speaker verification method based on Rnorm score normalization according to claim 1, characterised in that:
the steps of obtaining the identity vector ω_test of the test utterance in the test phase are as follows:
calculating the Baum-Welch statistics of the test utterance y_test(t);
using the trained total variability space matrix T, calculating the i-vector model of the test utterance y_test(t) with the following formula:
\omega_{test} = (I + T^T \Sigma^{-1} N T)^{-1}\, T^T \Sigma^{-1} \tilde{F}
8. The speaker verification method based on Rnorm score normalization according to claim 7, characterised in that:
the calculation steps of the total variability space matrix T are as follows:
calculating the Baum-Welch statistics corresponding to the test utterance;
randomly generating the initial value of the total variability space matrix T;
calculating the posterior distribution of ω;
re-estimating by maximum likelihood and updating the total variability space matrix T;
the update formula for the total variability space matrix T being as follows:
T_i \Phi_c = \Omega_i
\Phi_c = \sum_s \sum_h N_{c,h}(s)\, E[\omega_{s,h}\,\omega_{s,h}^T], \quad c = 1, 2, \ldots, C
\Omega = \sum_s \sum_h \tilde{F}_h(s)\, E[\omega_{s,h}^T]
where T_i denotes the i-th row of T and Ω_i the i-th row of Ω, i = 1, 2, …, CP; the steps "calculate the posterior distribution of ω" and "maximum likelihood re-estimation, update the total variability space matrix T" are repeated 10 times, after which the training of the total variability space matrix T is complete.
9. The speaker verification method based on Rnorm score normalization according to claim 7, characterised in that:
the Baum-Welch statistics corresponding to the test utterance y_test(t) are calculated as follows:
given the test utterance and its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, extract the feature sequence X = {x_t | t = 1, 2, …, P}; for Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = \sum_t \gamma_t(c)
F_c(s) = \sum_t \gamma_t(c)\, x_t
S_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\, x_t x_t^T\Big)
where, for any frame t, γ_t(c) is the occupation probability of feature vector x_t for Gaussian component c, i.e. the posterior probability that the t-th frame feature x_t falls into state c:
\gamma_t(c) = \frac{w_c\, p_c(x_t)}{\sum_{i=1}^{C} w_i\, p_i(x_t)}
where w_c is the mixture weight of the c-th Gaussian in the universal background model (UBM);
the first-order central statistic F̃_c(s) and the second-order central statistic S̃_c(s) are defined as:
\tilde{F}_c(s) = \sum_t \gamma_t(c)\,(x_t - m_c) = F_c(s) - N_c(s)\, m_c
\tilde{S}_c(s) = \mathrm{diag}\Big(\sum_t \gamma_t(c)\,(x_t - m_c)(x_t - m_c)^T\Big) = S_c(s) - \mathrm{diag}\big(F_c(s)\, m_c^T + m_c F_c(s)^T - N_c(s)\, m_c m_c^T\big)
where m_c is the mean vector of the c-th Gaussian in the UBM;
let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s) I, c = 1, …, C; let F̃(s) be the supervector obtained by concatenating F̃_c(s), c = 1, 2, …, C; and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s), c = 1, 2, …, C;
the steps for calculating the posterior distribution of ω are as follows:
given the test utterance and the feature sequence X = {x_t | t = 1, 2, …, P} extracted from its h-th segment y_{s,h}(t), h = 1, 2, …, N_s, let l(s) = I + TᵀΣ⁻¹N_h(s)T, where Σ denotes the supervector form of the UBM covariance matrix; then the posterior distribution of ω_{s,h} is Gaussian with mean l⁻¹(s)TᵀΣ⁻¹F̃_h(s) and covariance matrix l⁻¹(s), so that:
E[\omega_{s,h}] = l^{-1}(s)\, T^T \Sigma^{-1} \tilde{F}_h(s)
E[\omega_{s,h}\,\omega_{s,h}^T] = E[\omega_{s,h}]\, E[\omega_{s,h}^T] + l^{-1}(s)
10. The speaker verification method based on Rnorm score normalization according to claim 1, characterised in that the formula for computing the score by Rnorm score normalization is:
\Lambda_6(\omega_{test}, \omega_{clm}) = \frac{\mathrm{score}(\omega_{test}, \omega_{clm})}{\mathrm{score}(\omega_{test}, \omega_{UBM})}
CN201610172918.3A 2016-03-23 2016-03-23 Rnorm score normalization based speaker verification method Pending CN105976819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610172918.3A CN105976819A (en) 2016-03-23 2016-03-23 Rnorm score normalization based speaker verification method

Publications (1)

Publication Number Publication Date
CN105976819A true CN105976819A (en) 2016-09-28

Family

ID=56989505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610172918.3A Pending CN105976819A (en) 2016-03-23 2016-03-23 Rnorm score normalization based speaker verification method

Country Status (1)

Country Link
CN (1) CN105976819A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109997185A (en) * 2016-11-07 2019-07-09 思睿逻辑国际半导体有限公司 Method and apparatus for the biometric authentication in electronic equipment
CN110110790A (en) * 2019-05-08 2019-08-09 中国科学技术大学 Using the regular method for identifying speaker of Unsupervised clustering score
WO2020034628A1 (en) * 2018-08-14 2020-02-20 平安科技(深圳)有限公司 Accent identification method and device, computer device, and storage medium
CN111883142A (en) * 2020-07-30 2020-11-03 山东理工大学 Speaker confirmation method based on log-likelihood value normalization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9165555B2 (en) * 2005-01-12 2015-10-20 At&T Intellectual Property Ii, L.P. Low latency real-time vocal tract length normalization
CN102486922A (en) * 2010-12-03 2012-06-06 株式会社理光 Speaker recognition method, device and system
US20140244257A1 (en) * 2013-02-25 2014-08-28 Nuance Communications, Inc. Method and Apparatus for Automated Speaker Parameters Adaptation in a Deployed Speaker Verification System
CN103345923A (en) * 2013-07-26 2013-10-09 电子科技大学 Sparse representation based short-voice speaker recognition method
CN103730121A (en) * 2013-12-24 2014-04-16 中山大学 Method and device for recognizing disguised sounds

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Elhocine Boutellaa et al.: "Improving Online Signature Verification by User-Specific Likelihood Ratio Score Normalization", The 8th International Workshop on Systems, Signal Processing and Their Applications 2013: Special Sessions *
Hongke Ning et al.: "A New Score Normalization for Text-Independent Speaker Verification", Proceedings of the 19th International Conference on Digital Signal Processing *
Jiang Ye (蒋晔): "Research on Speaker Recognition Based on Short Utterances and Channel Variation", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Similar Documents

Publication Publication Date Title
Zeinali et al. DeepMine Speech Processing Database: Text-Dependent and Independent Speaker Verification and Speech Recognition in Persian and English.
CN104143326B (en) A kind of voice command identification method and device
McLaren et al. Exploring the role of phonetic bottleneck features for speaker and language recognition
Lei et al. Dialect classification via text-independent training and testing for Arabic, Spanish, and Chinese
Heck et al. Robustness to telephone handset distortion in speaker recognition by discriminative feature design
Reynolds Automatic speaker recognition: Current approaches and future trends
Zeinali et al. Text-dependent speaker verification based on i-vectors, neural networks and hidden Markov models
CN105976819A (en) Rnorm score normalization based speaker verification method
CN104240706A (en) Speaker recognition method based on GMM Token matching similarity correction scores
Novotný et al. Analysis of Speaker Recognition Systems in Realistic Scenarios of the SITW 2016 Challenge.
CN110364168A (en) A kind of method for recognizing sound-groove and system based on environment sensing
Charisma et al. Speaker recognition using mel-frequency cepstrum coefficients and sum square error
Ajili Reliability of voice comparison for forensic applications
CN114220419A (en) Voice evaluation method, device, medium and equipment
Meyer et al. Autonomous measurement of speech intelligibility utilizing automatic speech recognition.
Wildermoth et al. GMM based speaker recognition on readily available databases
Sadeghian et al. Towards an automated screening tool for pediatric speech delay
Sarkar et al. Incorporating pass-phrase dependent background models for text-dependent speaker verification
CN109273012A (en) A kind of identity identifying method based on Speaker Identification and spoken digit recognition
Kenai et al. Forensic gender speaker recognition under clean and noisy environments
Dumpala et al. Analysis of the Effect of Speech-Laugh on Speaker Recognition System.
CN108694950A (en) A kind of method for identifying speaker based on depth mixed model
Karbasi et al. Blind Non-Intrusive Speech Intelligibility Prediction Using Twin-HMMs.
Rozi et al. Language-aware PLDA for multilingual speaker recognition
Beigi Effects of time lapse on speaker recognition results

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160928