EP1072035A1 - Threshold setting and training of a speaker verification system - Google Patents
Threshold setting and training of a speaker verification system
Info
- Publication number
- EP1072035A1 (application EP99924813A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- speaker
- model
- speech
- utterances
- verification system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
Definitions
- Speaker verification (SV) systems are systems in which models of each customer must be built during an enrolment process, accept/reject thresholds must be established during the same enrolment process and speech of customers who claim a certain identity must be compared to the claimed speaker's model, to determine whether the identity claim is likely to be true
- Speech is a behavioural biometric measure. Like all other behaviour, speech behaviour is variable. Therefore, it is not possible to build exact models of a person's speech behaviour. Rather, models must always consist of some combination of central tendencies and the attendant variance around the central tendency value of all parameters with which the speech is characterised. Consequently, the process of verifying a claimed identity is always statistical in nature: one must test how likely it is that the newly observed speech pattern was indeed produced by the person who enrolled the model (i.e., the person whose identity is claimed by the speaker).
- Speaker verification systems may use a wide range of parameters to characterise the speech, including spectral coefficients, mel-frequency coefficients, cepstral coefficients, mel-cepstral coefficients, pitch, loudness, etc. All these different parameter representations are used in essentially the same process during model enrolment: for all individual coefficients, central tendencies and variances must be estimated.
- Another pair of statistical distributions must be estimated during the enrolment process, viz. the distribution of the distances to the speaker model of new utterances of the same speaker, and the distribution of the distances of suitable utterances produced by impostor speakers to this speaker's model.
- This pair of distributions is needed to enable the system to determine whether a new utterance is more likely to have been produced by the speaker who has enrolled the model or by an impostor speaker.
- For estimating the distribution of the distances of impostor speaker utterances to the newly enrolled speaker's model, it may be possible to use speech of many speakers that has been recorded well before the start of the enrolment session.
- A false accept decision means that the distance to the model of an impostor utterance is so small that it falls well within the distribution of the distances of the true customer to her/his own model, and must therefore be accepted as if it were indeed produced by the true customer.
- False reject means that the distance between an utterance of the true customer and her/his own model happened to be so large that it falls well within the distribution of impostor utterances, and therefore must be considered as an utterance produced by a speaker different from the true customer.
- both classes can be combined, so as to obtain even better results.
- Both classes of techniques address the issue of improving the estimates of the distance between the newly built model and utterances of the true customer.
- Th(new) = b * CTi + (1 - b) * CTt
- Th(new) is the optimal threshold
- CTi is the central tendency obtained from pre-recorded impostor speech
- CTt is the central tendency estimated from the enrolment speech of the new customer
- b is the interpolation parameter, which is optimised using additional pre-recorded impostor utterances that were not used in estimating CTi.
- the distance distributions of both true customers and impostors approach the Gaussian distribution.
- enrolment speech and pre-recorded impostor utterances are segmented into a large number of theoretically independent parts, for each of which the distance to the newly enrolled model is computed.
- the Central Tendencies of the distance distributions are then corrected to remove the bias caused by the fact that the enrolment speech has been used both for building the model and for computing the distances to the model.
- the optimal correction parameter h is optimised using additional pre-recorded impostor speech.
- a receiving module 1 receives utterances of a speaker 2 during an enrolment process, during which speaker 2 produces n tokens of some set of phrases.
- Model building module 3 builds one or more models consisting of explicit or implicit sets of central tendencies and variances of the speech coefficients of the utterances received via receiving module 1.
- Threshold module 4 establishes accept/reject thresholds during said enrolment process, while estimating module 5.
- Model building module 3 builds n different speaker models, each based on n-1 tokens, for each model one independent token being available for estimating, by the estimating module 5, the distance between the model and an utterance that has not been used to build the model.
- the estimation module 5 estimates the central tendency of the distance between the speaker's model and newly produced utterances of the same speaker, and also its variance.
- the estimation of the accept/reject threshold from enrolment speech is combined, by combining module 6, with pre-recorded impostor speech, whereby the central tendency of the distances between the model and utterances is optimised by optimising module 7.
- Optimisation in the optimisation module 7 is executed by linear interpolation: Th(new) = b * CTi + (1 - b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from pre-recorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional pre-recorded utterances not used in estimating CTi.
- Enrolment speech and pre-recorded impostor utterances are segmented, in said optimising module 7, into a large number of theoretically independent parts, for each of which the distance to the newly enrolled model is computed.
- the central tendencies of the distance distributions are corrected, in the optimising module 7, removing a bias caused by the fact that the enrolment speech has been used both for building the model and for computing the distances to the model.
- a single optimal value is computed that applies to all speakers.
- an optimal correction factor is estimated for each newly enrolled speaker.
- ABSTRACT A key problem for field applications in speaker verification is the issue of a priori threshold setting.
- The EER gives a good estimate of the modeling module of the SV system.
- The EER does not, however, give much
- the decision threshold(s) must be estimated a priori during the enrollment phase; speaker-independent and speaker-dependent decision thresholds were compared.
- Relevant parameters are estimated from development data only. Bayesian theory indicates that the decision threshold(s) could be
- the CAVE project (CAller VErification in Banking and Telecommunications) was a 2-year project
- If we denote as $X$ (resp. $\bar{X}$) the acceptance (resp. rejection)
- the minimisation of C in equation (1) is obtained by implementing the PDF Ratio (PR) test [4]: accept if $P_X(Y) / P_{\bar{X}}(Y) > R$, reject otherwise, where R is the Bayesian threshold.
- If n is large enough, the utterance log-likelihood ratio can be assumed to follow a Gaussian distribution. This distribution is different depending on whether the speech utterance Y was pronounced by speaker X or by an impostor $\bar{X}$: $\log LR_X(Y \mid X) \sim G(M_X; S_X)$ (8), and similarly $\log LR_X(Y \mid \bar{X}) \sim G(M_{\bar{X}}; S_{\bar{X}})$.
- the fourth SD method can be viewed as a speaker dependent
- SD 1 consists of estimating $\Theta(R)$ as a linear combination of the log LR mean M and variance S, following an approach similar to the one proposed by Furui [6].
- During each call, the speaker was asked to utter a number of items, including a speaker-dependent sequence of 14 digits (twice) and a few other sequences of 14 digits corresponding to other speakers.
- the second method relies on an estimation of $\Theta(R)$ using enrollment material
- the rest of the calls were used as test
- also the client score obtained with the enrollment data
- $\Theta(R)$ is obtained as a linear combination of estimates of $M_X$ and $M_{\bar{X}}$, where $M_{\bar{X}}$ is obtained from pseudo-impostor data whereas $\hat{M}_X$ is the (biased) estimate of $M_X$. The combination parameter is optimised on a development population.
- we have split the SESP data into 2 sub-populations, which we denote SESP-a and SESP-b. SESP-a contains 11 male and 10 female speakers, while
- SESP-b contains 10 male and 10 female speakers
- This data set is composed of approximately 800 genuine trials and 250 impostor attempts from other clients (out of which about 75 % are same-sex attempts)
- SESP-b as pseudo-impostors and development data for SESP-a, and vice versa
- Acoustic features are 16 LPC cepstral coefficients with log
- ABSTRACT Laboratory evaluations of SV systems are usually based on the Equal Error Rate (EER), obtained by
- the CAVE project (CAller VErification in Banking and Telecommunications) is a 2-year project, supported by the Language Engineering Sector of the Telematics Applications Programme of the European Union, and for the Swiss partners by the Office Federal de
- real data distributions requires the adjustment of the threshold for an efficient decision.
- This paper reports on a series of comparative experiments on a priori Threshold Setting (TS) carried out by WP4.
- TS Threshold Setting
- the logarithm of $LR_X(Y)$ is obtained as the sum of the logarithm of the frame-based likelihood ratio scores $lr_X(y_i)$
- the C terms represent the corresponding costs (assuming a null cost for a true acceptance and a true rejection).
- the optimal threshold should only depend on the false acceptance / rejection cost ratio and the impostor / client a priori probability
- SI SPEAKER-INDEPENDENT
- $P_X$ and $P_{\bar{X}}$ denote the respective model likelihood functions for the speaker and the non-speaker
- SI without normalisation
- SI-N with normalisation.
- However, the obvious factor that makes them significantly different from those that could be expected from a field test data collection is the lack of intentional impostor attempts.
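The literature snippets above describe an utterance score formed as the sum of frame-based log-likelihood ratios, and a threshold that depends only on the false acceptance / rejection cost ratio and the impostor / client prior probabilities. The following is a minimal sketch of that decision rule, not the method of the cited papers. It assumes one-dimensional Gaussians as toy stand-ins for the speaker and non-speaker models, and all function and variable names (`gaussian_logpdf`, `utterance_log_lr`, `bayes_log_threshold`, `client`, `world`) are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a 1-D Gaussian (assumed toy model, not from the source)."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def utterance_log_lr(frames, speaker, non_speaker):
    """Sum of frame-level log-likelihood ratios: speaker model vs non-speaker model."""
    return sum(
        gaussian_logpdf(y, *speaker) - gaussian_logpdf(y, *non_speaker)
        for y in frames
    )


def bayes_log_threshold(cost_fa, cost_fr, p_impostor):
    """log R, with R = (C_FA * P(impostor)) / (C_FR * P(client)).

    Standard minimum-expected-cost threshold, assuming zero cost for
    correct decisions.
    """
    p_client = 1.0 - p_impostor
    return math.log((cost_fa * p_impostor) / (cost_fr * p_client))


# Hypothetical (mean, std) parameters for the two models.
client = (0.0, 1.0)
world = (3.0, 1.0)

llr_true = utterance_log_lr([0.1, -0.2, 0.05], client, world)
llr_imp = utterance_log_lr([2.9, 3.2, 3.1], client, world)

# A false acceptance is assumed to cost ten times a false rejection.
log_r = bayes_log_threshold(cost_fa=10.0, cost_fr=1.0, p_impostor=0.5)

print("accept true speaker:", llr_true > log_r)
print("accept impostor:", llr_imp > log_r)
```

Raising `cost_fa` or `p_impostor` raises the threshold and makes the system stricter. In practice the score distributions rarely match the model assumptions exactly, which is why the snippets discuss adjusting the threshold on real data.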
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
- Stereophonic System (AREA)
Abstract
The invention relates to a speaker verification system comprising model building means for building n different speaker models, each model being based on n-1 tokens. For each model, one independent token is available for estimating the distance between the model and an utterance that has not been used to build the model. Estimation means estimate the central tendency of the distance between the speaker's model and newly produced utterances of the same speaker, and also its variance. The central tendency of the distances between the model and utterances is optimised by linear interpolation: Th(new) = b * CTi + (1-b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from pre-recorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional pre-recorded utterances not used in estimating CTi.
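The procedure in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: the "model" is simply a per-dimension mean of toy feature vectors, the distance is Euclidean, and b is chosen as the largest grid value that keeps the false-accept rate on extra impostor distances at or below a target. The source does not specify the model, the distance measure, or the optimisation criterion, and all names and data values here are hypothetical.

```python
from statistics import mean, pvariance


def build_model(tokens):
    """Toy speaker model: per-dimension mean of the feature vectors (assumption)."""
    dims = len(tokens[0])
    return [mean(t[d] for t in tokens) for d in range(dims)]


def distance(model, token):
    """Euclidean distance, standing in for the system's real distance measure."""
    return sum((m - x) ** 2 for m, x in zip(model, token)) ** 0.5


def leave_one_out_distances(tokens):
    """Build n models from n-1 tokens each and score the held-out token."""
    distances = []
    for i, held_out in enumerate(tokens):
        model = build_model(tokens[:i] + tokens[i + 1:])
        distances.append(distance(model, held_out))
    return distances


def interpolate_threshold(ct_impostor, ct_true, b):
    """Th(new) = b * CTi + (1 - b) * CTt."""
    return b * ct_impostor + (1.0 - b) * ct_true


def choose_b(ct_impostor, ct_true, extra_impostor_distances, max_far, steps=20):
    """Largest b on a grid whose false-accept rate stays within max_far.

    The criterion is an assumption; an utterance is accepted when its
    distance is at or below the threshold.
    """
    best = 0.0
    for k in range(steps + 1):
        b = k / steps
        th = interpolate_threshold(ct_impostor, ct_true, b)
        far = sum(d <= th for d in extra_impostor_distances) / len(extra_impostor_distances)
        if far <= max_far:
            best = b
    return best


# Hypothetical enrolment tokens (n = 4) and pre-recorded impostor tokens.
tokens = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.0]]
true_d = leave_one_out_distances(tokens)
ct_t, var_t = mean(true_d), pvariance(true_d)

model = build_model(tokens)
ct_i = mean(distance(model, t) for t in [[3.0, 0.5], [2.8, 0.2], [3.3, 0.9]])

# Extra impostor utterances that were not used to estimate CTi.
extra = [distance(model, t) for t in [[2.5, 1.0], [3.1, 0.4]]]

b = choose_b(ct_i, ct_t, extra, max_far=0.0)
th = interpolate_threshold(ct_i, ct_t, b)
print("b =", b, "threshold =", th)
```

Because each held-out token never contributes to the model it is scored against, the leave-one-out distances avoid the optimistic bias that arises when the same enrolment speech is used both to build the model and to measure distances to it.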
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
NL1008930 | 1998-04-20 | ||
NL1008930 | 1998-04-20 | ||
PCT/EP1999/002641 WO1999054868A1 (fr) | 1998-04-20 | 1999-04-16 | Threshold setting and training of a speaker verification system |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1072035A1 (fr) | 2001-01-31 |
Family
ID=19766981
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP99924813A Withdrawn EP1072035A1 (fr) | 1998-04-20 | 1999-04-16 | Threshold setting and training of a speaker verification system |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP1072035A1 (fr) |
AU (1) | AU4135199A (fr) |
WO (1) | WO1999054868A1 (fr) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8725514B2 (en) | 2005-02-22 | 2014-05-13 | Nuance Communications, Inc. | Verifying a user using speaker verification and a multimodal web-based interface |
US9251792B2 (en) | 2012-06-15 | 2016-02-02 | Sri International | Multi-sample conversational voice verification |
CN110838295B (zh) * | 2019-11-17 | 2021-11-23 | 西北工业大学 | Model generation method, voiceprint recognition method and corresponding device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5839103A (en) * | 1995-06-07 | 1998-11-17 | Rutgers, The State University Of New Jersey | Speaker verification system using decision fusion logic |
1999
- 1999-04-16 EP EP99924813A patent/EP1072035A1/fr not_active Withdrawn
- 1999-04-16 AU AU41351/99A patent/AU4135199A/en not_active Abandoned
- 1999-04-16 WO PCT/EP1999/002641 patent/WO1999054868A1/fr not_active Application Discontinuation
Non-Patent Citations (1)
Title |
---|
See references of WO9954868A1 * |
Also Published As
Publication number | Publication date |
---|---|
AU4135199A (en) | 1999-11-08 |
WO1999054868A1 (fr) | 1999-10-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
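CN106782507B (zh) | Method and device for speech segmentation | |
CN108766441B (zh) | Voice control method and device based on offline voiceprint recognition and speech recognition | |
JPH0354600A (ja) | Method for verifying the identity of an unknown person | |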
US5216720A (en) | Voice verification circuit for validating the identity of telephone calling card customers | |
Lindberg et al. | Techniques for a priori decision threshold estimation in speaker verification | |
EP1159737B1 (fr) | Speaker recognition | |
Li et al. | Automatic verbal information verification for user authentication | |
CN102324232A (zh) | Voiceprint recognition method and system based on a Gaussian mixture model | |
EP0528990A1 (fr) | Simultaneous multi-speaker voice recognition and verification over a telephone network | |
TW546632B (en) | System and method for efficient storage of voice recognition models | |
Pierrot et al. | A comparison of a priori threshold setting procedures for speaker verification in the CAVE project | |
US8050920B2 (en) | Biometric control method on the telephone network with speaker verification technology by using an intra speaker variability and additive noise unsupervised compensation | |
Bimbot et al. | Speaker verification in the telephone network: research activities in the CAVE project | |
KR100779242B1 (ko) | Speaker recognition method in an integrated speech recognition/speaker recognition system | |
EP1072035A1 (fr) | Threshold setting and training of a speaker verification system | |
Bimbot et al. | An overview of the PICASSO project research activities in speaker verification for telephone applications | |
Naik et al. | Evaluation of a high performance speaker verification system for access Control | |
Chenafa et al. | Biometric system based on voice recognition using multiclassifiers | |
Olsson | Text dependent speaker verification with a hybrid HMM/ANN system | |
Jokinen et al. | Comparison of Gaussian process regression and Gaussian mixture models in spectral tilt modelling for intelligibility enhancement of telephone speech. | |
Vivaracho et al. | A comparative study of MLP-based artificial neural networks in text-independent speaker verification against GMM-based systems | |
Bellegarda et al. | Language-independent, short-enrollment voice verification over a far-field microphone | |
Rosenberg et al. | Small group speaker identification with common password phrases | |
Ali et al. | A comparative study of Arabic speech recognition | |
Burnett | Rapid speaker adaptation for neural network speech recognizers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PUAI | Public reference made under Article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
 | 17P | Request for examination filed | Effective date: 20001120 |
 | AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LI LU NL PT SE |
 | 17Q | First examination report despatched | Effective date: 20010406 |
 | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
 | 18D | Application deemed to be withdrawn | Effective date: 20011017 |
Effective date: 20011017 |