WO1999054868A1 - Threshold setting and training of a speaker verification system - Google Patents

Threshold setting and training of a speaker verification system

Info

Publication number
WO1999054868A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
model
speech
utterances
verification system
Prior art date
Application number
PCT/EP1999/002641
Other languages
English (en)
Inventor
Lodewijk Willem Johan Boves
Original Assignee
Koninklijke Kpn N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Kpn N.V.
Priority to EP99924813A (EP1072035A1, fr)
Priority to AU41351/99A (AU4135199A, en)
Publication of WO1999054868A1 (fr)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building

Definitions

  • Speaker verification (SV) systems are systems in which models of each customer must be built during an enrolment process, accept/reject thresholds must be established during the same enrolment process and speech of customers who claim a certain identity must be compared to the claimed speaker's model, to determine whether the identity claim is likely to be true
  • Speech is a behavioural biometric measure. Like all other behaviour, speech behaviour is variable. Therefore, it is not possible to build exact models of a person's speech behaviour. Rather, models must always consist of some combination of central tendencies and the attendant variance around the central tendency value of all parameters with which the speech is characterised. Consequently, the process of verifying a claimed identity is always statistical in nature: one must test how likely it is that the newly observed speech pattern was indeed produced by the person who enrolled the model (i.e., the person whose identity is claimed by the speaker).
  • Speaker verification systems may use a wide range of parameters to characterise the speech, including spectral coefficients, Mel frequency coefficients, cepstral coefficients, Mel cepstral coefficients, pitch, loudness, etc. All these different parameter representations are used in essentially the same process during model enrolment: for all individual coefficients, central tendencies and variances must be estimated.
  • Another pair of statistical distributions must be estimated during the enrolment process, viz. the distribution of the distances to the speaker model of new utterances of the same speaker, and the distribution of the distances of suitable utterances produced by impostor speakers to this speaker's model.
  • This pair of distributions is needed to enable the system to determine whether a new utterance is more likely to have been produced by the speaker who has enrolled the model or by an impostor speaker.
  • When estimating the distribution of the distances of impostor speaker utterances to the newly enrolled speaker's model, it may be possible to use speech of many speakers that has been recorded well before the start of the enrolment session.
  • a false accept decision means that the distance to the model of an impostor utterance is so small, that it falls well within the distribution of the distances of the true customer to her/his own model, and must therefore be accepted as if it was indeed produced by the true customer.
  • False reject means that the distance between an utterance of the true customer and her/his own model happened to be so large that it falls well within the distribution of impostor utterances, and therefore must be considered as an utterance produced by a speaker different from the true customer.
  • both classes can be combined, so as to obtain even better results.
  • Both classes of techniques address the issue of improving the estimates of the distance between the newly built model and utterances of the true customer.
  • Th(new) = b * CTi + (1 - b) * CTt
  • Th(new) is the optimal threshold
  • CTi is the central tendency obtained from pre-recorded impostor speech
  • CTt is the central tendency estimated from the enrolment speech of the new customer
  • b is the interpolation parameter, which is optimised using additional pre-recorded impostor utterances that were not used in estimating CTi.
  • the distance distributions of both true customers and impostors approach the Gaussian distribution.
  • enrolment speech and pre-recorded impostor utterances are segmented into a large number of theoretically independent parts, for each of which the distance to the newly enrolled model is computed.
  • the Central Tendencies of the distance distributions are then corrected to remove the bias caused by the fact that the enrolment speech has been used both for building the model and for computing the distances to the model.
  • the optimal correction parameter h is optimised using additional pre-recorded impostor speech.
  • a receiving module 1 receives utterances of a speaker 2 during an enrolment process, during which speaker 2 produces n tokens of some set of phrases.
  • Model building module 3 builds one or more models consisting of explicit or implicit sets of central tendencies and variances of the speech coefficients of the utterances received via receiving module 1.
  • Threshold module 4 establishes accept/reject thresholds during said enrolment process, while estimating module 5.
  • Model building module 3 builds n different speaker models, each based on n-1 tokens, for each model one independent token being available for estimating, by the estimating module 5, the distance between the model and an utterance that has not been used to build the model.
  • the estimation module 5 estimates the central tendency of the distance between the speaker's model and newly produced utterances of the same speaker, and also its variance.
  • the estimation of the accept/reject threshold from enrolment speech is combined, by combining module 6, with pre-recorded impostor speech, whereby the central tendency of the distances between the model and utterances is optimised by optimising module 7.
  • Optimisation in the optimisation module 7 is executed by linear interpolation: Th(new) = b * CTi + (1 - b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from pre-recorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional pre-recorded utterances not used in estimating CTi.
  • Enrolment speech and pre-recorded impostor utterances are segmented, in said optimising module 7, into a large number of theoretically independent parts, for each of which the distance to the newly enrolled model is computed.
  • the central tendencies of the distance distributions are corrected, in the optimising module 7, removing a bias caused by the fact that the enrolment speech has been used both for building the model and for computing the distances to the model.
  • a single optimal value is computed that applies to all speakers.
  • an optimal correction factor is estimated for each newly enrolled speaker.
  • ABSTRACT A key problem for field applications in speaker verification is the issue of a priori threshold setting
  • The EER gives a good estimate of the modeling module of the SV system.
  • the EER does, however, not give much
  • the decision threshold(s) must be estimated a priori during the enrollment phase
  • independent and speaker-dependent decision thresholds were compared. Relevant parameters are estimated from development data only.
  • Bayesian theory indicates that the decision threshold(s) could be
  • the CAVE project (CAller VErification in Banking and Telecommunications) was a 2-year pro
  • If we denote as X (resp. $\bar{X}$) the acceptance (resp. rejection) distributions, the minimisation of C in equation (1) is obtained by implementing the PDF Ratio (PR) test [4]: accept if the ratio for Y exceeds R, reject otherwise, where R is the Bayesian threshold
  • If n is large enough, the utterance log-likelihood ratio can be assumed to follow a Gaussian distribution. This distribution is different depending on whether the speech utterance Y was pronounced by speaker X or by an impostor $\bar{X}$: $\log LR_X(Y \mid X) \sim G(M_X; S_X)$ (8) and similarly $\log LR_X(Y \mid \bar{X}) \sim G(M_{\bar{X}}; S_{\bar{X}})$
  • the fourth SD method can be viewed as a speaker dependent
  • SD 1 consists of estimating ⁇(R) as a linear combination of the log LR mean M and variance S, following an approach similar to the one proposed by Furui [6]
  • places. During each call, the speaker was asked to utter a number of items, including a speaker-dependent sequence of 14 digits (twice) and a few other sequences of 14 digits corresponding to other speakers
  • the second method relies on an estimation of ⁇(R) using enrollment material
  • the rest of the calls were used as test
  • also the client score obtained with the enrollment data
  • ⁇(R) is obtained as a linear combination of estimates of M and M
  • TS we have split the SESP data into 2 sub-populations which we denote SESP-a and SESP-b. SESP-a contains 11 male and 10 female speakers while
  • SESP-b contains 10 male and 10 female speakers
  • This data set is composed of approximately 800 genuine trials and 250 impostor attempts from other clients (out of which about 75 % are same-sex attempts)
  • where M x is obtained from pseudo-impostor data whereas M is the (biased) estimate of M. Parameter ⁇ is optimised on a development population
  • SESP-b as pseudo-impostors and development data for SESP-a and vice versa
  • Acoustic features are 16 LPC cepstral coefficients with log
  • ABSTRACT Laboratory evaluations of SV systems are usually based on the Equal Error Rate (EER), obtained by
  • the CAVE project (CAller VErification in Banking and Telecommunications) is a 2-year project, supported by the Language Engineering Sector of the Telematics Applications Programme of the European Union, and for the Swiss partners by the Office Federal de
  • real data distributions requires the adjustment of the threshold for an efficient decision.
  • This paper reports on a series of comparative experiments on a priori Threshold Setting (TS) carried out by WP4.
  • TS Threshold Setting
  • the logarithm of $LR_X(Y)$ is obtained as the sum of the logarithm of the frame-based likelihood ratio scores $lr_X(y_i)$:
  • while the C terms represent the corresponding costs (assuming a null cost for a true acceptance and a true rejection).
  • the optimal threshold should only depend on the false acceptance / rejection cost ratio and the impostor / client a priori probability
  • $\log LR_X(Y \mid X) \sim G(M_X; S_X)$ (8)
  • SI SPEAKER-INDEPENDENT
  • $P_X$ and $P_{\bar{X}}$ denote the respective model likelihood functions for the speaker and the non-speaker
  • SI without normalisation
  • SI-N with normalisation.
  • However, the obvious factor that makes them significantly different from those that could be expected from a field test data collection is the lack of intentional impostor attempts.
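
The leave-one-out enrolment described in the definitions above (n speaker models, each built from n-1 tokens and scored against the one held-out token) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the per-coefficient mean/variance model, the variance-normalised squared distance and the use of the arithmetic mean as the "central tendency" are all choices made for the example.

```python
import numpy as np


def build_model(tokens):
    """Hypothetical model: per-coefficient mean and variance over all frames.

    Each token is an array of shape (frames, coefficients).
    """
    frames = np.vstack(tokens)
    # A small floor keeps the variance strictly positive (assumption).
    return frames.mean(axis=0), frames.var(axis=0) + 1e-8


def distance(model, token):
    """Assumed distance: mean variance-normalised squared deviation."""
    mean, var = model
    return float(np.mean((token - mean) ** 2 / var))


def leave_one_out_distances(tokens):
    """Build n models from n-1 tokens each and score the held-out token."""
    dists = []
    for i in range(len(tokens)):
        rest = tokens[:i] + tokens[i + 1:]
        dists.append(distance(build_model(rest), tokens[i]))
    return np.array(dists)


def true_speaker_statistics(tokens):
    """Central tendency (here: the mean) and variance of the distances."""
    d = leave_one_out_distances(tokens)
    return float(d.mean()), float(d.var())


if __name__ == "__main__":
    # Synthetic stand-in for enrolment speech: 5 tokens, 50 frames, 12 coefficients.
    rng = np.random.default_rng(0)
    enrolment = [rng.normal(size=(50, 12)) for _ in range(5)]
    ct_true, var_true = true_speaker_statistics(enrolment)
    print("central tendency:", ct_true, "variance:", var_true)
```

Because every distance is measured on a token that the corresponding model has not seen, the resulting estimate avoids the bias that arises when the same speech is used both to build the model and to compute distances to it.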

Abstract

The invention relates to a speaker verification system comprising a model building device for building n different speaker models, each model being based on n-1 tokens. For each model, one independent token is available for estimating the distance between the model and an utterance that has not been used to build the model. An estimating means estimates the central tendency of the distance between the speaker's model and newly produced utterances of the same speaker, and also its variance. The central tendency of the distances between the model and utterances is optimised by linear interpolation: Th(new) = b * CTi + (1 - b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from pre-recorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional pre-recorded utterances not used in estimating CTi.
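
The interpolation Th(new) = b * CTi + (1 - b) * CTt and the tuning of b on additional impostor utterances might be sketched as below. The source does not say how b is optimised, so the grid search and the cost criterion (false-reject rate plus false-accept rate) are assumptions made for this example, as are all function and parameter names. An utterance is assumed to be accepted when its distance to the model does not exceed the threshold.

```python
import numpy as np


def interpolate_threshold(ct_impostor, ct_true, b):
    """Th(new) = b * CTi + (1 - b) * CTt."""
    return b * ct_impostor + (1.0 - b) * ct_true


def error_cost(threshold, true_dists, impostor_dists):
    """Assumed criterion: false-reject rate plus false-accept rate."""
    false_reject = np.mean(np.asarray(true_dists) > threshold)
    false_accept = np.mean(np.asarray(impostor_dists) <= threshold)
    return float(false_reject + false_accept)


def optimise_b(ct_impostor, ct_true, true_dists, heldout_impostor_dists, grid=None):
    """Grid-search b in [0, 1] on impostor distances not used to estimate CTi."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else np.asarray(grid)
    costs = [
        error_cost(
            interpolate_threshold(ct_impostor, ct_true, b),
            true_dists,
            heldout_impostor_dists,
        )
        for b in grid
    ]
    best = int(np.argmin(costs))
    return float(grid[best]), costs[best]


if __name__ == "__main__":
    # Toy distances: true-speaker utterances lie close to the model, impostors far.
    b, cost = optimise_b(10.0, 2.0, [1.0, 2.0, 3.0], [9.0, 10.0, 11.0])
    print("b:", b, "threshold:", interpolate_threshold(10.0, 2.0, b), "cost:", cost)
```

With b = 0 the threshold sits at the true-speaker central tendency, and with b = 1 at the impostor central tendency; intermediate values trade false rejections against false acceptances.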
PCT/EP1999/002641 1998-04-20 1999-04-16 Threshold setting and training of a speaker verification system WO1999054868A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP99924813A EP1072035A1 (fr) 1998-04-20 1999-04-16 Threshold setting and training of a speaker verification system
AU41351/99A AU4135199A (en) 1998-04-20 1999-04-16 Threshold setting and training of a speaker verification system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NL1008930 1998-04-20
NL1008930 1998-04-20

Publications (1)

Publication Number Publication Date
WO1999054868A1 (fr) 1999-10-28

Family

ID=19766981

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP1999/002641 WO1999054868A1 (fr) 1998-04-20 1999-04-16 Threshold setting and training of a speaker verification system

Country Status (3)

Country Link
EP (1) EP1072035A1 (fr)
AU (1) AU4135199A (fr)
WO (1) WO1999054868A1 (fr)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996041334A1 (fr) * 1995-06-07 1996-12-19 Rutgers University Speaker verification system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FAKOTAKIS N ET AL: "SPEAKER VERIFICATION OVER TELEPHONE LINES BASED ON DIGITAL STRINGS", SIGNAL PROCESSING THEORIES AND APPLICATIONS, BRUSSELS, AUG. 24 - 27, 1992, vol. 1, no. CONF. 6, 24 August 1992 (1992-08-24), VANDEWALLE J;BOITE R; MOONEN M; OOSTERLINCK A, pages 399 - 402, XP000348685, ISBN: 0-444-89587-6 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8725514B2 (en) 2005-02-22 2014-05-13 Nuance Communications, Inc. Verifying a user using speaker verification and a multimodal web-based interface
US10818299B2 (en) 2005-02-22 2020-10-27 Nuance Communications, Inc. Verifying a user using speaker verification and a multimodal web-based interface
US9443523B2 (en) 2012-06-15 2016-09-13 Sri International Multi-sample conversational voice verification
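CN110838295A (zh) * 2019-11-17 2020-02-25 西北工业大学 Model generation method, voiceprint recognition method and corresponding device
CN110838295B (zh) * 2019-11-17 2021-11-23 西北工业大学 Model generation method, voiceprint recognition method and corresponding device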

Also Published As

Publication number Publication date
EP1072035A1 (fr) 2001-01-31
AU4135199A (en) 1999-11-08

Similar Documents

Publication Publication Date Title
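CN106782507B (zh) Method and device for speech segmentation
CN108766441B (zh) Voice control method and device based on offline voiceprint recognition and speech recognition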
Reynolds et al. Speaker verification using adapted Gaussian mixture models
JPH0354600A (ja) Method for verifying the identity of an unknown person
US5216720A (en) Voice verification circuit for validating the identity of telephone calling card customers
Matsui et al. Likelihood normalization for speaker verification using a phoneme-and speaker-independent model
Lindberg et al. Techniques for a priori decision threshold estimation in speaker verification
CA2189011C (fr) Method for reducing the database requirements of speech recognition systems
CN102324232A (zh) Voiceprint recognition method and system based on a Gaussian mixture model
EP0528990A1 (fr) Simultaneous multi-speaker voice recognition and verification over a telephone network
US20040098259A1 (en) Method for recognition verbal utterances by a non-mother tongue speaker in a speech processing system
Bimbot et al. Speaker verification in the telephone network: research activities in the CAVE project
US8050920B2 (en) Biometric control method on the telephone network with speaker verification technology by using an intra speaker variability and additive noise unsupervised compensation
KR100779242B1 (ko) Speaker recognition method in an integrated speech recognition/speaker recognition system
WO1999054868A1 (fr) Threshold setting and training of a speaker verification system
Bimbot et al. An overview of the PICASSO project research activities in speaker verification for telephone applications
Naik et al. Evaluation of a high performance speaker verification system for access control
Chenafa et al. Biometric system based on voice recognition using multiclassifiers
Olsson Text dependent speaker verification with a hybrid HMM/ANN system
Sailaja et al. Text independent speaker identification with finite multivariate generalized gaussian mixture model and hierarchical clustering algorithm
Rosenberg et al. Small group speaker identification with common password phrases
Ali et al. A comparative study of Arabic speech recognition
Bellegarda et al. Language-independent, short-enrollment voice verification over a far-field microphone
Dean Synchronous HMMs for audio-visual speech processing
Matsui et al. A study of models and a priori threshold updating in speaker verification

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
WWE WIPO information: entry into national phase

Ref document number: 1999924813

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: KR

WWP WIPO information: published in national office

Ref document number: 1999924813

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW WIPO information: withdrawn in national office

Ref document number: 1999924813

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: CA