CN105976819A - Rnorm score normalization based speaker verification method - Google Patents
- Publication number
- CN105976819A CN105976819A CN201610172918.3A CN201610172918A CN105976819A CN 105976819 A CN105976819 A CN 105976819A CN 201610172918 A CN201610172918 A CN 201610172918A CN 105976819 A CN105976819 A CN 105976819A
- Authority
- CN
- China
- Prior art keywords
- sigma
- omega
- rnorm
- test
- gamma
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/12—Score normalisation
Abstract
The invention discloses a speaker verification method based on Rnorm score normalization, comprising the steps of: acquiring the identity authentication vector ω_tar of a target speaker and the identity authentication vector W_UBM of a universal background model in the training phase; acquiring the identity authentication vector ω_test of the test speech in the test phase; calculating a score Λ_6(ω_test, ω_clm) by Rnorm score normalization from ω_tar, W_UBM and ω_test; and judging whether the score Λ_6(ω_test, ω_clm) exceeds a threshold, the claimed identity being confirmed and accepted if so and rejected otherwise. On the basis of maintaining a high verification accuracy, the method greatly simplifies the computation and saves computation time.
Description
Technical field
The invention belongs to the field of speaker recognition technology, and specifically relates to a speaker verification method based on Rnorm score normalization.
Background art
The final step of speaker verification is the decision: the likelihood value obtained by comparing the input speech signal with the claimed speaker's model is compared against a preset decision threshold; if the likelihood value is above the threshold, the claimed speaker is accepted, otherwise rejected. Tuning the decision threshold is very difficult, and in general it is determined empirically.
The score is affected by many factors:
● speakers' pronunciation varies, influenced by mood, age, health status and the vocal tract itself;
● training data differ across speakers in quality, content and duration;
● the noise environment and the channel at training-data collection time may not match those at test time.
The traditional score normalizations Znorm, Tnorm, ZTnorm and TZnorm were proposed for GMM-UBM based speaker verification systems, where they have been applied with considerable success. The situation differs for a speaker verification system based on the identity authentication vector (i-vector), the i-SV system. In an i-vector system, the main purpose of the training phase is to train, for each speaker tar, a corresponding i-vector model from that speaker's training speech; the main purpose of the test phase is, given a segment of test speech and a claimed speaker clm, to judge whether the test speech was uttered by speaker clm, the decision criterion being the similarity between the claimed speaker model ω_clm and the test speech model ω_test. Training speech can carry many kinds of noise, such as channel noise, and such noise shifts the trained i-vector model. For example, the claimed speaker model ω_clm is obtained by training on the claimed speaker's training speech, while ω'_test is the i-vector model that would be obtained from the test speech with the channel noise removed. As shown in Fig. 1, define θ_clm,test as the angle between ω_clm and ω_test, θ_clm,test' as the angle between ω_clm and ω'_test, and θ_non-clm,test as the angle between ω_non-clm and ω_test.
In theory ω'_test is close to ω_clm, so the angle between them is small; if the decision angle is below the threshold we set, the SV system judges that the test speech was uttered by speaker clm. In practice, however, channel noise exists, so ω'_test may be shifted to ω_test. The angle actually used in the decision is then θ_clm,test, and as Fig. 2(a) shows, θ_clm,test can be large and exceed the threshold; in that case the speaker verification system no longer considers the test speech to have been uttered by speaker clm. This is a misjudgment caused by channel mismatch.
Meanwhile, Fig. 2(b) shows the model ω_non-clm, the i-vector model of a speaker other than the claimed one. It can be seen that the distance from ω_test to ω_non-clm also varies: the effect differs from speaker to speaker, which creates a problem for threshold setting. Because the effect differs across speakers, a different threshold would have to be set for every person, which greatly increases the complexity of the verification system.
Summary of the invention
To solve the above problems, the object of the invention is to provide a speaker verification method based on Rnorm score normalization which, while ensuring a high verification accuracy, greatly simplifies the computation and saves computation time.
To achieve the above object, the invention is realized by the following technical scheme:
The speaker verification method based on Rnorm score normalization according to the invention is characterized by comprising the following steps:
obtain the identity authentication vector ω_tar of the target speaker and the identity authentication vector W_UBM of the universal background model in the training phase;
obtain the identity authentication vector ω_test of the test speech in the test phase;
calculate the score Λ_6(ω_test, ω_clm) by Rnorm score normalization from the identity authentication vector ω_tar of the target speaker, the identity authentication vector W_UBM of the universal background model and the identity authentication vector ω_test of the test speech;
judge whether the score Λ_6(ω_test, ω_clm) exceeds a threshold: if so, the identity is confirmed and the speaker is accepted; otherwise, rejected.
Further, obtaining the identity authentication vector ω_tar of the target speaker in the training phase comprises the following steps:
compute the Baum-Welch statistics of any speech segment y_J(t) of a speaker J;
compute the identity authentication vector (i-vector) model of y_J(t) from the trained total variability space matrix T using the formula:
ω = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
Further, the total variability space matrix T is computed as follows:
compute the Baum-Welch statistics corresponding to each speaker s in the training speech;
randomly initialize the total variability space matrix T;
compute the posterior distribution of ω;
re-estimate by maximum likelihood and update T, the update formula being:
T_i Φ_c = Ω_i
where T_i denotes the i-th row of T, Ω_i denotes the i-th row of Ω, i = 1, 2, ..., CP; the steps "compute the posterior distribution of ω" and "maximum-likelihood re-estimation, update T" are repeated 10 times, after which the training of T is complete.
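Since the update T_iΦ_c = Ω_i is only stated row-wise, the following NumPy sketch shows one common realization of this ten-iteration EM loop; all names are illustrative, and the per-utterance Baum-Welch statistics are those defined in the next paragraph:

```python
import numpy as np

def train_T(stats, C, F, R, sigma, n_iter=10, seed=0):
    """EM training of the total variability matrix T (a compact sketch).

    stats : list of (N, Fc) per utterance, where N is the (C,) vector of
            zeroth-order statistics and Fc the (C*F,) centred first-order
            supervector; sigma is the (C*F,) UBM covariance supervector.
    Returns T of shape (C*F, R)."""
    rng = np.random.default_rng(seed)
    T = rng.standard_normal((C * F, R)) * sigma.mean()    # random init
    for _ in range(n_iter):                               # 10 iterations in the text
        A = np.zeros((C, R, R))                           # per-component accumulators
        Cacc = np.zeros((C * F, R))
        for N, Fc in stats:
            Nexp = np.repeat(N, F)                        # expand counts to C*F dims
            TtSi = (T / sigma[:, None]).T                 # T' Sigma^-1
            l = np.eye(R) + TtSi @ (Nexp[:, None] * T)    # l = I + T' Sigma^-1 N T
            cov = np.linalg.inv(l)                        # posterior covariance
            Ew = cov @ (TtSi @ Fc)                        # posterior mean E[w]
            Eww = cov + np.outer(Ew, Ew)                  # E[w w']
            Cacc += np.outer(Fc, Ew)
            A += N[:, None, None] * Eww                   # accumulate per component
        for c in range(C):                                # M-step: solve row blocks
            T[c * F:(c + 1) * F] = Cacc[c * F:(c + 1) * F] @ np.linalg.inv(A[c])
    return T
```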
Further, the Baum-Welch statistics corresponding to each speaker s in the training speech are computed as follows:
Given speaker s, s = 1, 2, ..., S, and its h-th speech segment y_s,h(t), h = 1, 2, ..., N_s, extract the feature sequence X = {x_t | t = 1, 2, ..., P}. For each Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = ∑_t γ_t(c)
F_c(s) = ∑_t γ_t(c) x_t
S_c(s) = diag(∑_t γ_t(c) x_t x_tᵀ)
where, for any frame t, γ_t(c) is the state occupancy of feature vector x_t for Gaussian component c, i.e. the posterior probability that feature x_t of frame t falls into state c:
γ_t(c) = w_c p_c(x_t) / ∑_{j=1..C} w_j p_j(x_t)
with p_c(·) the density of the c-th Gaussian and w_c the mixture weight of the c-th Gaussian in the universal background (UBM) model.
Define the first-order centered statistic F̃_c(s) and the second-order centered statistic S̃_c(s) as:
F̃_c(s) = ∑_t γ_t(c)(x_t − m_c)
S̃_c(s) = diag(∑_t γ_t(c)(x_t − m_c)(x_t − m_c)ᵀ)
where m_c is the mean vector of the c-th Gaussian in the UBM.
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s)I, c = 1, ..., C; let F̃(s) be the supervector obtained by concatenating F̃_c(s), c = 1, 2, ..., C; and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s), c = 1, 2, ..., C.
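A NumPy sketch of these statistics against a diagonal-covariance UBM (all names are illustrative; the UBM parameters are assumed given):

```python
import numpy as np

def baum_welch_stats(X, ubm_weights, ubm_means, ubm_covs):
    """Zeroth- and first-order Baum-Welch statistics of features X
    (P frames x F dims) against a diagonal-covariance UBM (a sketch)."""
    # log N(x_t | m_c, Sigma_c) for every frame/component pair
    D = X[:, None, :] - ubm_means[None, :, :]                 # (P, C, F)
    log_gauss = -0.5 * (np.sum(D * D / ubm_covs, axis=2)
                        + np.sum(np.log(2 * np.pi * ubm_covs), axis=1))
    log_post = np.log(ubm_weights) + log_gauss                # unnormalised log posteriors
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                 # gamma_t(c), occupancies
    N = gamma.sum(axis=0)                                     # N_c = sum_t gamma_t(c)
    F1 = gamma.T @ X                                          # F_c = sum_t gamma_t(c) x_t
    F_centred = F1 - N[:, None] * ubm_means                   # centred first-order stats
    return N, F_centred.reshape(-1)                           # counts and supervector
```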
Further, the posterior distribution of ω is computed as follows:
Given speaker s, s = 1, 2, ..., S, and the feature sequence X = {x_t | t = 1, 2, ..., P} extracted from its h-th speech segment y_s,h(t), h = 1, 2, ..., N_s, let l(s) = I + Tᵀ Σ⁻¹ N_h(s) T, where Σ denotes the supervector of UBM covariance matrices. The posterior distribution of ω_s,h is then Gaussian with mean l⁻¹(s) Tᵀ Σ⁻¹ F̃(s) and covariance matrix l⁻¹(s), so that:
E[ω_s,h] = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
E[ω_s,h ω_s,hᵀ] = E[ω_s,h] E[ω_s,h]ᵀ + l⁻¹(s).
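These posterior quantities map directly to code, and the posterior mean is exactly the extracted i-vector; a sketch reusing the statistics from the previous snippet (names illustrative):

```python
import numpy as np

def ivector_posterior(T, sigma, N, F_centred, F_dim):
    """Posterior of omega: mean l^-1 T' Sigma^-1 F~ and covariance l^-1
    (a sketch; N are per-component counts, F_centred the supervector)."""
    R = T.shape[1]
    Nexp = np.repeat(N, F_dim)                  # expand counts to C*F dims
    TtSi = (T / sigma[:, None]).T               # T' Sigma^-1
    l = np.eye(R) + TtSi @ (Nexp[:, None] * T)  # l(s) = I + T' Sigma^-1 N(s) T
    cov = np.linalg.inv(l)                      # posterior covariance l^-1(s)
    mean = cov @ (TtSi @ F_centred)             # the extracted i-vector E[omega]
    return mean, cov
```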
Further, the identity authentication vector W_UBM of the universal background model in the training phase is obtained by the expectation-maximization (EM) algorithm.
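The patent does not detail the EM step; a sketch under the assumption that a standard diagonal-covariance GMM-UBM is fit with EM (scikit-learn based; the component count and the placeholder data are illustrative, not the patent's configuration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a diagonal-covariance GMM-UBM on pooled background features (rows = frames).
background = np.random.randn(10000, 60)          # placeholder for real MFCC frames
ubm = GaussianMixture(n_components=512, covariance_type="diag",
                      max_iter=20, random_state=0).fit(background)
# ubm.weights_, ubm.means_, ubm.covariances_ feed the statistics above;
# W_UBM can then be obtained from them in the same way as any i-vector.
```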
Further, obtaining the identity authentication vector ω_test of the test speech in the test phase comprises the following steps:
compute the Baum-Welch statistics of the test speech y_test(t);
compute the identity authentication vector (i-vector) model of y_test(t) from the trained total variability space matrix T using the formula:
ω = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
Further, the total variability space matrix T is computed as follows:
compute the Baum-Welch statistics corresponding to the test speech;
randomly initialize the total variability space matrix T;
compute the posterior distribution of ω;
re-estimate by maximum likelihood and update T, the update formula being:
T_i Φ_c = Ω_i
where T_i denotes the i-th row of T, Ω_i denotes the i-th row of Ω, i = 1, 2, ..., CP; the steps "compute the posterior distribution of ω" and "maximum-likelihood re-estimation, update T" are repeated 10 times, after which the training of T is complete.
Further, the Baum-Welch statistics corresponding to the test speech y_test(t) are computed as follows:
Given the test speech and its h-th segment y_s,h(t), h = 1, 2, ..., N_s, extract the feature sequence X = {x_t | t = 1, 2, ..., P}. For Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = ∑_t γ_t(c)
F_c(s) = ∑_t γ_t(c) x_t
S_c(s) = diag(∑_t γ_t(c) x_t x_tᵀ)
where, for any frame t, γ_t(c) is the state occupancy of feature vector x_t for Gaussian component c, i.e. the posterior probability that feature x_t of frame t falls into state c:
γ_t(c) = w_c p_c(x_t) / ∑_{j=1..C} w_j p_j(x_t)
with w_c the mixture weight of the c-th Gaussian in the UBM.
Define the first-order centered statistic F̃_c(s) and the second-order centered statistic S̃_c(s) as:
F̃_c(s) = ∑_t γ_t(c)(x_t − m_c)
S̃_c(s) = diag(∑_t γ_t(c)(x_t − m_c)(x_t − m_c)ᵀ)
where m_c is the mean vector of the c-th Gaussian in the UBM.
Let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s)I, c = 1, ..., C; let F̃(s) be the supervector obtained by concatenating F̃_c(s); and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s).
The posterior distribution of ω is computed as follows:
For the test speech and the feature sequence X = {x_t | t = 1, 2, ..., P} extracted from its h-th segment y_s,h(t), h = 1, 2, ..., N_s, let l(s) = I + Tᵀ Σ⁻¹ N_h(s) T, where Σ denotes the supervector of UBM covariance matrices. The posterior distribution of ω_s,h is then Gaussian with mean l⁻¹(s) Tᵀ Σ⁻¹ F̃(s) and covariance matrix l⁻¹(s), so that:
E[ω_s,h] = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
E[ω_s,h ω_s,hᵀ] = E[ω_s,h] E[ω_s,h]ᵀ + l⁻¹(s).
Further, the formula for calculating the score by Rnorm score normalization is as follows:
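The formula itself appears only as an image in the source document. Consistent with the name Rnorm (ratio normalization) and with the use of W_UBM in place of the non-claimed speaker model, one plausible form, stated here as an assumption rather than the patent's confirmed formula, is the ratio of cosine similarities:

```latex
\Lambda_6(\omega_{test},\omega_{clm}) =
  \frac{\cos(\omega_{test},\,\omega_{clm})}{\cos(\omega_{test},\,W_{UBM})},
\qquad
\cos(a,b)=\frac{\langle a,b\rangle}{\lVert a\rVert\,\lVert b\rVert}.
```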
Compared with the prior art, the beneficial effects of the invention are:
First, the speaker verification method based on Rnorm score normalization obtains the identity authentication vectors of the target speaker and the universal background model in the training phase and of the test speech in the test phase, then calculates a score by Rnorm score normalization and compares it with a preset threshold; if the score exceeds the threshold, the identity is confirmed and the speaker is accepted, otherwise rejected.
Second, the method combines the advantages of the i-vector based speaker verification system and directly uses the universal background model to represent the non-claimed speaker model, so that no separate non-claimed speaker model has to be built for each speaker. This greatly simplifies the computation, saves a corresponding amount of time, and still achieves a very high verification accuracy.
Brief description of the drawings
The embodiments of the invention are described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the design idea of the Rnorm algorithm in the background of the speaker verification method based on Rnorm score normalization;
Fig. 2(a) is a schematic diagram of the scoring principle when the test speech is uttered by the claimed speaker;
Fig. 2(b) is a schematic diagram of the scoring principle when the test speech is not uttered by the claimed speaker;
Fig. 3 is the flow chart of the speaker verification method based on Rnorm score normalization;
Fig. 4 is the DET curve of the method on the TIMIT database;
Fig. 5 is the DET curve of the method on the "3convs-1conv" task.
Detailed description of the invention
The preferred embodiments of the invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here are intended only to illustrate and explain the invention, not to limit it.
The speaker verification method based on Rnorm (ratio normalization) score normalization is built on traditional score normalization, exploits its advantages, and combines it with a speaker verification system based on the identity authentication vector, which can achieve a comparatively high verification rate. However, when such an i-vector verification system computes the final normalized score, a different threshold has to be set for each speaker in order to make the decision, which complicates the final decision and takes considerable time. To solve this problem, the universal background model is used directly in place of the non-claimed speaker model; only a single threshold then has to be set to complete the final decision, which greatly reduces the computational complexity and saves time.
The flow chart of the speaker verification method based on Rnorm score normalization is shown in Fig. 3. The specific steps are as follows:
S01: obtain the identity authentication vector ω_tar of the target speaker in the training phase:
(1) compute the Baum-Welch statistics of any speech segment y_J(t) of a speaker J;
(2) compute the i-vector model of y_J(t) from the trained total variability space matrix T using the formula:
ω = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
The total variability space matrix T is computed as follows:
(a) compute the Baum-Welch statistics corresponding to each speaker s in the training speech: given speaker s, s = 1, 2, ..., S, and its h-th speech segment y_s,h(t), h = 1, 2, ..., N_s, extract the feature sequence X = {x_t | t = 1, 2, ..., P}; for each Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are:
N_c(s) = ∑_t γ_t(c)
F_c(s) = ∑_t γ_t(c) x_t
S_c(s) = diag(∑_t γ_t(c) x_t x_tᵀ)
where, for any frame t, γ_t(c) is the state occupancy of feature vector x_t for Gaussian component c, i.e. the posterior probability that feature x_t of frame t falls into state c:
γ_t(c) = w_c p_c(x_t) / ∑_{j=1..C} w_j p_j(x_t)
with w_c the mixture weight of the c-th Gaussian in the UBM;
define the first-order centered statistic F̃_c(s) and the second-order centered statistic S̃_c(s) as:
F̃_c(s) = ∑_t γ_t(c)(x_t − m_c)
S̃_c(s) = diag(∑_t γ_t(c)(x_t − m_c)(x_t − m_c)ᵀ)
where m_c is the mean vector of the c-th Gaussian in the UBM;
let N(s) be the CP × CP diagonal matrix with diagonal blocks N_c(s)I, c = 1, ..., C, let F̃(s) be the supervector concatenating F̃_c(s), and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s).
(b) randomly initialize the total variability space matrix T;
(c) compute the posterior distribution of ω:
given speaker s and the feature sequence X = {x_t | t = 1, 2, ..., P} extracted from its h-th segment y_s,h(t), let l(s) = I + Tᵀ Σ⁻¹ N_h(s) T, where Σ denotes the supervector of UBM covariance matrices; the posterior of ω_s,h is then Gaussian with mean l⁻¹(s) Tᵀ Σ⁻¹ F̃(s) and covariance l⁻¹(s), so that:
E[ω_s,h] = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
E[ω_s,h ω_s,hᵀ] = E[ω_s,h] E[ω_s,h]ᵀ + l⁻¹(s).
(d) maximum-likelihood re-estimation: update T with
T_i Φ_c = Ω_i
where T_i denotes the i-th row of T, Ω_i the i-th row of Ω, i = 1, 2, ..., CP; steps (c) and (d) are repeated 10 times, after which the training of T is complete.
S02: obtain the identity authentication vector W_UBM of the universal background model:
W_UBM is obtained by the expectation-maximization (EM) algorithm.
S03: obtain the identity authentication vector ω_test of the test speech in the test phase:
(1) compute the Baum-Welch statistics of the test speech y_test(t);
(2) compute the i-vector model of y_test(t) from the trained total variability space matrix T using the formula:
ω = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
Here the total variability space matrix T is computed as follows:
(a) compute the Baum-Welch statistics corresponding to the test speech: given the test speech and its h-th segment y_s,h(t), h = 1, 2, ..., N_s, extract the feature sequence X = {x_t | t = 1, 2, ..., P}; for Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are:
N_c(s) = ∑_t γ_t(c)
F_c(s) = ∑_t γ_t(c) x_t
S_c(s) = diag(∑_t γ_t(c) x_t x_tᵀ)
where, for any frame t, γ_t(c) is the state occupancy of feature vector x_t for Gaussian component c, i.e. the posterior probability that feature x_t of frame t falls into state c:
γ_t(c) = w_c p_c(x_t) / ∑_{j=1..C} w_j p_j(x_t)
with w_c the mixture weight of the c-th Gaussian in the UBM;
define the first-order centered statistic F̃_c(s) and the second-order centered statistic S̃_c(s) as:
F̃_c(s) = ∑_t γ_t(c)(x_t − m_c)
S̃_c(s) = diag(∑_t γ_t(c)(x_t − m_c)(x_t − m_c)ᵀ)
where m_c is the mean vector of the c-th Gaussian in the UBM;
let N(s) be the CP × CP diagonal matrix with diagonal blocks N_c(s)I, c = 1, ..., C, let F̃(s) be the supervector concatenating F̃_c(s), and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s).
(b) randomly initialize the total variability space matrix T;
(c) compute the posterior distribution of ω:
for the test speech and the feature sequence X = {x_t | t = 1, 2, ..., P} extracted from its h-th segment y_s,h(t), h = 1, 2, ..., N_s, let l(s) = I + Tᵀ Σ⁻¹ N_h(s) T, where Σ denotes the supervector of UBM covariance matrices; the posterior of ω_s,h is then Gaussian with mean l⁻¹(s) Tᵀ Σ⁻¹ F̃(s) and covariance l⁻¹(s), so that:
E[ω_s,h] = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
E[ω_s,h ω_s,hᵀ] = E[ω_s,h] E[ω_s,h]ᵀ + l⁻¹(s).
(d) maximum-likelihood re-estimation: update T with
T_i Φ_c = Ω_i
where T_i denotes the i-th row of T, Ω_i the i-th row of Ω, i = 1, 2, ..., CP; steps (c) and (d) are repeated 10 times, after which the training of T is complete.
S04: calculate the score Λ_6(ω_test, ω_clm) by Rnorm score normalization from the identity authentication vector ω_tar of the target speaker, the identity authentication vector W_UBM of the universal background model and the identity authentication vector ω_test of the test speech, using the Rnorm formula given above. Here ω_clm and ω_tar denote the same concept: ω_tar belongs to the training phase and ω_clm to the test phase, and the two are computed in the same way.
S05: judge whether the score Λ_6(ω_test, ω_clm) exceeds the threshold: if so, the identity is confirmed and the speaker is accepted; otherwise, rejected.
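A minimal Python sketch of steps S04 and S05, under the same assumed ratio-of-cosines form of the Rnorm score as above (function and variable names are illustrative):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two i-vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rnorm_verify(w_test, w_clm, w_ubm, threshold):
    """S04: Rnorm score (assumed ratio form); S05: single global threshold."""
    score = cosine(w_test, w_clm) / cosine(w_test, w_ubm)
    return "accept" if score > threshold else "reject"
```

Because W_UBM stands in for every non-claimed speaker, a single global threshold suffices, which is exactly the complexity saving described above.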
The MSR Identity Toolbox was used to implement the text-independent i-vector speaker verification (i-SV) baseline system. Two speech databases were used in the experiments: TIMIT and NIST SRE 2004. The MFCC dimension is 20, the first dimension being log energy; first- and second-order differences of the 20-dimensional MFCCs are appended, giving a final feature dimension of 60. The 60-dimensional features are processed with feature warping and cepstral mean normalization. The UBM training data consist of 792 utterances from the 8-side and 16-side conversations of the NIST SRE 2004 database, each about 3 min to 5 min; 4620 utterances from the TIMIT database, each about 3 s to 5 s; and 15 noise recordings from Noise-92. The trained UBM is a 2028-dimensional gender-dependent GMM. The i-vector dimension is 400.
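A librosa-based sketch of the described 60-dimensional front end (an assumption: the patent does not name its feature-extraction tool, librosa's zeroth coefficient is not exactly log energy, and feature warping is omitted for brevity):

```python
import librosa
import numpy as np

def extract_features(wav_path):
    """60-dim features as described: 20 MFCCs plus first and second
    differences, followed by cepstral mean normalisation."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # (20, frames)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),            # first difference
                       librosa.feature.delta(mfcc, order=2)])  # second difference
    feats -= feats.mean(axis=1, keepdims=True)                 # cepstral mean norm
    return feats.T                                             # (frames, 60)
```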
The TIMIT speech database, released by the international LDC organization, was the first available speech database containing a large number of speakers. Jointly developed by MIT, SRI International and Texas Instruments, it is widely used in speaker recognition research. TIMIT was recorded in public places by 630 speakers (438 male, 192 female) whose speech covers eight dialects of English; each speaker reads 10 sentences, each about 3 s to 5 s long. The recordings were made with fixed microphones, the recorded content is English sentences, and there is no constraint on recording intervals. The TIMIT database is sampled at 16 kHz with 16-bit quantization.
Since the NIST SRE evaluations began in 1996, each subsequent NIST evaluation database has been built on the databases of previous years, suitably adjusted and recorded according to the research level and practical conditions of the time, so that a large amount of data has gradually accumulated. The NIST SRE 2004 database is sampled at 8 kHz with 8-bit μ-law compression in SPH file format. It derives mainly from the Mixer 1 corpus and contains 616 speakers in total, of whom 370 are female and 246 male. The data in NIST SRE 2004 are everyday conversational recordings; the recording devices mainly include cordless phones, landline phones and mobile phones. The database also takes the multilingual and bilingual problems of speakers into account, the languages being evenly distributed over Arabic, English, Russian, French and Chinese. Because the design of NIST SRE 2004 considers both languages and channels, this database is commonly used to train UBM models, channel spaces and the like. As Table 1 shows, NIST SRE 2004 includes 7 training conditions and 4 test conditions.
Table 1: NIST SRE 2004 evaluation tasks
To examine the performance of the speaker verification method based on Rnorm score normalization without channel mismatch, the TIMIT database was used: it is a standard speech database with a single recording mode, English recording content and a clean recording environment. 108 speakers were taken from the TIMIT test set, with 9 utterances used for training and 1 utterance for testing, and 600 sentences used as impostor speech.
Table 2: EER and minDCF of different scoring methods in the i-SV system on the TIMIT database
As can be seen from Fig. 4 and Table 2, the speaker verification method based on Rnorm score normalization reduces the EER of the i-SV system by 0.4% compared with the original cosine similarity scoring (CSS) method and outperforms the CSS-Znorm, CSS-Tnorm, CSS-ZTnorm and CSS-TZnorm scoring methods, while the minimum detection cost changes comparatively little. Fig. 4 also shows that the i-SV systems based on the CSS-Rnorm and CSS-ZTnorm scoring methods have similar overall trends and similar behavior. The reason is that TIMIT speech is very clean and free of channel mismatch, so good results are already obtained under plain CSS scoring; the various normalization methods therefore have little influence on the TIMIT test results, though they can still change the system's performance.
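The EER used in these comparisons can be computed from trial scores with a simple threshold sweep (a sketch; minDCF would additionally weight misses and false alarms by the NIST cost function):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: the operating point where the false-rejection rate and the
    false-acceptance rate cross, found by sweeping candidate thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    eer, gap = 1.0, np.inf
    for th in thresholds:
        frr = np.mean(target_scores < th)     # true speakers rejected
        far = np.mean(impostor_scores >= th)  # impostors accepted
        if abs(frr - far) < gap:
            gap, eer = abs(frr - far), (frr + far) / 2
    return eer
```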
To verify the performance of the proposed i-CSS-Rnorm-SV system under channel mismatch, the speaker verification method based on Rnorm score normalization was evaluated on the NIST SRE 2004 database. NIST SRE 2004 contains multiple voice channels, including microphone channels and telephone channels, and its collection environments are diverse. The "3convs-1conv" task of NIST SRE 2004 was used: each speaker's training speech consists of 3 conversations of telephone dialogue, recorded in two channels, each about 5 min; the test speech is 1 conversation, for a total of 22899 test trials.
As can be seen from Fig. 5, for the "3convs-1conv" task of the NIST SRE 2004 database, the i-SV system based on CSS-Rnorm scoring achieves the best result, reducing the EER by 4.5% compared with the i-SV system based on plain CSS scoring; CSS-Tnorm is better than CSS-Znorm, while the EER and minDCF of the i-SV systems based on CSS-ZTnorm and CSS-TZnorm are very close, with little difference. As Table 3 shows, however, i-CSS-Znorm-SV achieves the best minimum detection cost, demonstrating its advantage in system complexity and speed. The reason for this phenomenon is that the Znorm score normalization is computed offline, so the smallest minDCF can be achieved, whereas the Tnorm score normalization is computed at test time, so its minDCF is worse than Znorm's. Because the Rnorm score normalization proposed by the speaker verification method of the invention takes into account the characteristics of i-vector model scores and their influence on threshold setting, the EER is minimized under channel-mismatch conditions.
Table 3: EER and minDCF of different scoring methods in the i-SV system on the "3convs-1conv" task
The above are only preferred embodiments of the invention and do not limit the invention in any form. Any modification, equivalent variation or alteration of the above embodiments made according to the technical spirit of the invention, without departing from the content of the technical scheme of the invention, still falls within the scope of the technical scheme of the invention.
Claims (10)
1. A speaker verification method based on Rnorm score normalization, characterized by comprising the following steps:
obtain the identity authentication vector ω_tar of the target speaker and the identity authentication vector W_UBM of the universal background model in the training phase;
obtain the identity authentication vector ω_test of the test speech in the test phase;
calculate the score Λ_6(ω_test, ω_clm) by Rnorm score normalization from the identity authentication vector ω_tar of the target speaker, the identity authentication vector W_UBM of the universal background model and the identity authentication vector ω_test of the test speech;
judge whether the score Λ_6(ω_test, ω_clm) exceeds a threshold: if so, the identity is confirmed and the speaker is accepted; otherwise, rejected.
2. The speaker verification method based on Rnorm score normalization according to claim 1, characterized in that obtaining the identity authentication vector ω_tar of the target speaker in the training phase comprises:
computing the Baum-Welch statistics of any speech segment y_J(t) of a speaker J;
computing the identity authentication vector (i-vector) model of y_J(t) from the trained total variability space matrix T using the formula:
ω = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
3. The speaker verification method based on Rnorm score normalization according to claim 2, characterized in that the total variability space matrix T is computed as follows:
compute the Baum-Welch statistics corresponding to each speaker s in the training speech;
randomly initialize the total variability space matrix T;
compute the posterior distribution of ω;
re-estimate by maximum likelihood and update T, the update formula being:
T_i Φ_c = Ω_i
where T_i denotes the i-th row of T, Ω_i denotes the i-th row of Ω, i = 1, 2, ..., CP; the steps "compute the posterior distribution of ω" and "maximum-likelihood re-estimation, update T" are repeated 10 times, after which the training of T is complete.
4. The speaker verification method based on Rnorm score normalization according to claim 3, characterized in that computing the Baum-Welch statistics corresponding to each speaker s in the training speech is as follows:
given speaker s, s = 1, 2, ..., S, and its h-th speech segment y_s,h(t), h = 1, 2, ..., N_s, extract the feature sequence X = {x_t | t = 1, 2, ..., P}; for each Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = ∑_t γ_t(c)
F_c(s) = ∑_t γ_t(c) x_t
S_c(s) = diag(∑_t γ_t(c) x_t x_tᵀ)
where, for any frame t, γ_t(c) is the state occupancy of feature vector x_t for Gaussian component c, i.e. the posterior probability that feature x_t of frame t falls into state c:
γ_t(c) = w_c p_c(x_t) / ∑_{j=1..C} w_j p_j(x_t)
and w_c is the mixture weight of the c-th Gaussian in the universal background (UBM) model;
the first-order centered statistic F̃_c(s) and the second-order centered statistic S̃_c(s) are defined as:
F̃_c(s) = ∑_t γ_t(c)(x_t − m_c)
S̃_c(s) = diag(∑_t γ_t(c)(x_t − m_c)(x_t − m_c)ᵀ)
where m_c is the mean vector of the c-th Gaussian in the UBM;
let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s)I, c = 1, ..., C; let F̃(s) be the supervector obtained by concatenating F̃_c(s), c = 1, 2, ..., C; and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s), c = 1, 2, ..., C.
5. The speaker verification method based on Rnorm score normalization according to claim 3, characterized in that computing the posterior distribution of ω is as follows:
given speaker s, s = 1, 2, ..., S, and the feature sequence X = {x_t | t = 1, 2, ..., P} extracted from its h-th speech segment y_s,h(t), h = 1, 2, ..., N_s, let l(s) = I + Tᵀ Σ⁻¹ N_h(s) T, where Σ denotes the supervector of UBM covariance matrices; the posterior distribution of ω_s,h is then Gaussian with mean l⁻¹(s) Tᵀ Σ⁻¹ F̃(s) and covariance matrix l⁻¹(s), so that:
E[ω_s,h] = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
E[ω_s,h ω_s,hᵀ] = E[ω_s,h] E[ω_s,h]ᵀ + l⁻¹(s).
6. The speaker verification method based on Rnorm score normalization according to claim 1, characterized in that the identity authentication vector W_UBM of the universal background model in the training phase is obtained by the expectation-maximization (EM) algorithm.
7. The speaker verification method based on Rnorm score normalization according to claim 1, characterized in that obtaining the identity authentication vector ω_test of the test speech in the test phase comprises:
computing the Baum-Welch statistics of the test speech y_test(t);
computing the identity authentication vector (i-vector) model of y_test(t) from the trained total variability space matrix T using the formula:
ω = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
8. The speaker verification method based on Rnorm score normalization according to claim 7, characterized in that the total variability space matrix T is computed as follows:
compute the Baum-Welch statistics corresponding to the test speech;
randomly initialize the total variability space matrix T;
compute the posterior distribution of ω;
re-estimate by maximum likelihood and update T, the update formula being:
T_i Φ_c = Ω_i
where T_i denotes the i-th row of T, Ω_i denotes the i-th row of Ω, i = 1, 2, ..., CP; the steps "compute the posterior distribution of ω" and "maximum-likelihood re-estimation, update T" are repeated 10 times, after which the training of T is complete.
9. The speaker verification method based on Rnorm score normalization according to claim 7, characterized in that the Baum-Welch statistics corresponding to the test speech y_test(t) are computed as follows:
given the test speech and its h-th segment y_s,h(t), h = 1, 2, ..., N_s, extract the feature sequence X = {x_t | t = 1, 2, ..., P}; for Gaussian component c, the Baum-Welch statistics corresponding to the weight, mean and covariance matrix are defined as:
N_c(s) = ∑_t γ_t(c)
F_c(s) = ∑_t γ_t(c) x_t
S_c(s) = diag(∑_t γ_t(c) x_t x_tᵀ)
where, for any frame t, γ_t(c) is the state occupancy of feature vector x_t for Gaussian component c, i.e. the posterior probability that feature x_t of frame t falls into state c:
γ_t(c) = w_c p_c(x_t) / ∑_{j=1..C} w_j p_j(x_t)
and w_c is the mixture weight of the c-th Gaussian in the UBM;
the first-order centered statistic F̃_c(s) and the second-order centered statistic S̃_c(s) are defined as:
F̃_c(s) = ∑_t γ_t(c)(x_t − m_c)
S̃_c(s) = diag(∑_t γ_t(c)(x_t − m_c)(x_t − m_c)ᵀ)
where m_c is the mean vector of the c-th Gaussian in the UBM;
let N(s) be the CP × CP diagonal matrix whose diagonal blocks are N_c(s)I, c = 1, ..., C; let F̃(s) be the supervector obtained by concatenating F̃_c(s), c = 1, 2, ..., C; and let S̃(s) be the diagonal matrix whose diagonal blocks are the diagonal elements of S̃_c(s), c = 1, 2, ..., C;
computing the posterior distribution of ω is as follows:
for the test speech and the feature sequence X = {x_t | t = 1, 2, ..., P} extracted from its h-th segment y_s,h(t), h = 1, 2, ..., N_s, let l(s) = I + Tᵀ Σ⁻¹ N_h(s) T, where Σ denotes the supervector of UBM covariance matrices; the posterior distribution of ω_s,h is then Gaussian with mean l⁻¹(s) Tᵀ Σ⁻¹ F̃(s) and covariance matrix l⁻¹(s), so that:
E[ω_s,h] = l⁻¹(s) Tᵀ Σ⁻¹ F̃(s)
E[ω_s,h ω_s,hᵀ] = E[ω_s,h] E[ω_s,h]ᵀ + l⁻¹(s).
10. The speaker verification method based on Rnorm score normalization according to claim 1, characterized in that the score is calculated by the following Rnorm score normalization formula:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610172918.3A CN105976819A (en) | 2016-03-23 | 2016-03-23 | Rnorm score normalization based speaker verification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610172918.3A CN105976819A (en) | 2016-03-23 | 2016-03-23 | Rnorm score normalization based speaker verification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105976819A true CN105976819A (en) | 2016-09-28 |
Family
ID=56989505
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610172918.3A Pending CN105976819A (en) | 2016-03-23 | 2016-03-23 | Rnorm score normalization based speaker verification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105976819A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109997185A (en) * | 2016-11-07 | 2019-07-09 | 思睿逻辑国际半导体有限公司 | Method and apparatus for the biometric authentication in electronic equipment |
CN110110790A (en) * | 2019-05-08 | 2019-08-09 | 中国科学技术大学 | Using the regular method for identifying speaker of Unsupervised clustering score |
WO2020034628A1 (en) * | 2018-08-14 | 2020-02-20 | 平安科技(深圳)有限公司 | Accent identification method and device, computer device, and storage medium |
CN111883142A (en) * | 2020-07-30 | 2020-11-03 | 山东理工大学 | Speaker confirmation method based on log-likelihood value normalization |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102486922A (en) * | 2010-12-03 | 2012-06-06 | 株式会社理光 | Speaker recognition method, device and system |
CN103345923A (en) * | 2013-07-26 | 2013-10-09 | 电子科技大学 | Sparse representation based short-voice speaker recognition method |
CN103730121A (en) * | 2013-12-24 | 2014-04-16 | 中山大学 | Method and device for recognizing disguised sounds |
US20140244257A1 (en) * | 2013-02-25 | 2014-08-28 | Nuance Communications, Inc. | Method and Apparatus for Automated Speaker Parameters Adaptation in a Deployed Speaker Verification System |
US9165555B2 (en) * | 2005-01-12 | 2015-10-20 | At&T Intellectual Property Ii, L.P. | Low latency real-time vocal tract length normalization |
-
2016
- 2016-03-23 CN CN201610172918.3A patent/CN105976819A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9165555B2 (en) * | 2005-01-12 | 2015-10-20 | At&T Intellectual Property Ii, L.P. | Low latency real-time vocal tract length normalization |
CN102486922A (en) * | 2010-12-03 | 2012-06-06 | 株式会社理光 | Speaker recognition method, device and system |
US20140244257A1 (en) * | 2013-02-25 | 2014-08-28 | Nuance Communications, Inc. | Method and Apparatus for Automated Speaker Parameters Adaptation in a Deployed Speaker Verification System |
CN103345923A (en) * | 2013-07-26 | 2013-10-09 | 电子科技大学 | Sparse representation based short-voice speaker recognition method |
CN103730121A (en) * | 2013-12-24 | 2014-04-16 | 中山大学 | Method and device for recognizing disguised sounds |
Non-Patent Citations (3)
Title |
---|
ELHOCINE BOUTELLAA et al.: "Improving Online Signature Verification by User-Specific Likelihood Ratio Score Normalization", The 8th International Workshop on Systems, Signal Processing and Their Applications, 2013: Special Sessions * |
HONGKE NING et al.: "A New Score Normalization for Text-Independent Speaker Verification", Proceedings of the 19th International Conference on Digital Signal Processing * |
JIANG Ye: "Research on Speaker Recognition Based on Short Utterances and Channel Variation", China Doctoral Dissertations Full-text Database, Information Science and Technology * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109997185A (en) * | 2016-11-07 | 2019-07-09 | 思睿逻辑国际半导体有限公司 | Method and apparatus for the biometric authentication in electronic equipment |
WO2020034628A1 (en) * | 2018-08-14 | 2020-02-20 | 平安科技(深圳)有限公司 | Accent identification method and device, computer device, and storage medium |
CN110110790A (en) * | 2019-05-08 | 2019-08-09 | 中国科学技术大学 | Using the regular method for identifying speaker of Unsupervised clustering score |
CN111883142A (en) * | 2020-07-30 | 2020-11-03 | 山东理工大学 | Speaker confirmation method based on log-likelihood value normalization |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zeinali et al. | DeepMine Speech Processing Database: Text-Dependent and Independent Speaker Verification and Speech Recognition in Persian and English. | |
CN104143326B (en) | A kind of voice command identification method and device | |
McLaren et al. | Exploring the role of phonetic bottleneck features for speaker and language recognition | |
Lei et al. | Dialect classification via text-independent training and testing for Arabic, Spanish, and Chinese | |
Heck et al. | Robustness to telephone handset distortion in speaker recognition by discriminative feature design | |
Reynolds | Automatic speaker recognition: Current approaches and future trends | |
Zeinali et al. | Text-dependent speaker verification based on i-vectors, neural networks and hidden Markov models | |
CN105976819A (en) | Rnorm score normalization based speaker verification method | |
CN104240706A (en) | Speaker recognition method based on GMM Token matching similarity correction scores | |
Novotný et al. | Analysis of Speaker Recognition Systems in Realistic Scenarios of the SITW 2016 Challenge. | |
CN110364168A (en) | A kind of method for recognizing sound-groove and system based on environment sensing | |
Charisma et al. | Speaker recognition using mel-frequency cepstrum coefficients and sum square error | |
Ajili | Reliability of voice comparison for forensic applications | |
CN114220419A (en) | Voice evaluation method, device, medium and equipment | |
Meyer et al. | Autonomous measurement of speech intelligibility utilizing automatic speech recognition. | |
Wildermoth et al. | GMM based speaker recognition on readily available databases | |
Sadeghian et al. | Towards an automated screening tool for pediatric speech delay | |
Sarkar et al. | Incorporating pass-phrase dependent background models for text-dependent speaker verification | |
CN109273012A (en) | A kind of identity identifying method based on Speaker Identification and spoken digit recognition | |
Kenai et al. | Forensic gender speaker recognition under clean and noisy environments | |
Dumpala et al. | Analysis of the Effect of Speech-Laugh on Speaker Recognition System. | |
CN108694950A (en) | A kind of method for identifying speaker based on depth mixed model | |
Karbasi et al. | Blind Non-Intrusive Speech Intelligibility Prediction Using Twin-HMMs. | |
Rozi et al. | Language-aware PLDA for multilingual speaker recognition | |
Beigi | Effects of time lapse on speaker recognition results |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160928 |