EP1072035A1 - Threshold setting and training of a speaker verification system - Google Patents

Threshold setting and training of a speaker verification system

Info

Publication number
EP1072035A1
EP1072035A1 EP99924813A EP99924813A EP1072035A1 EP 1072035 A1 EP1072035 A1 EP 1072035A1 EP 99924813 A EP99924813 A EP 99924813A EP 99924813 A EP99924813 A EP 99924813A EP 1072035 A1 EP1072035 A1 EP 1072035A1
Authority
EP
European Patent Office
Prior art keywords
speaker
model
speech
utterances
verification system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP99924813A
Other languages
German (de)
English (en)
Inventor
Lodewijk Willem Johan Boves
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Koninklijke PTT Nederland NV
Koninklijke KPN NV
Original Assignee
Koninklijke PTT Nederland NV
Koninklijke KPN NV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke PTT Nederland NV, Koninklijke KPN NV
Publication of EP1072035A1
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building

Definitions

  • Speaker verification (SV) systems are systems in which a model of each customer must be built during an enrolment process, accept/reject thresholds must be established during the same enrolment process, and the speech of customers who claim a certain identity must be compared to the claimed speaker's model to determine whether the identity claim is likely to be true.
  • Speech is a behavioural biometric measure. Like all other behaviour, speech behaviour is variable. Therefore, it is not possible to build exact models of a person's speech behaviour. Rather, models must always consist of some combination of central tendencies and the attendant variance around the central tendency value of all parameters with which the speech is characterised. Consequently, the process of verifying a claimed identity is always statistical in nature: one must test how likely it is that the newly observed speech pattern was indeed produced by the person who enrolled the model (i.e., the person whose identity is claimed by the speaker).
  • Speaker verification systems may use a wide range of parameters to characterise the speech, including spectral coefficients, Mel frequency coefficients, cepstral coefficients, Mel cepstral coefficients, pitch, loudness, etc. All these different parameter representations are used in essentially the same process during model enrolment: for all individual coefficients, central tendencies and variances must be estimated.
  • Another pair of statistical distributions must be estimated during the enrolment process, viz. the distribution of the distances to the speaker model of new utterances of the same speaker, and the distribution of the distances of suitable utterances produced by impostor speakers to this speaker's model.
  • This pair of distributions is needed to enable the system to determine whether a new utterance is more likely to have been produced by the speaker who has enrolled the model or by an impostor speaker.
  • For estimating the distribution of the distances of impostor speaker utterances to the newly enrolled speaker's model, it may be possible to use speech of many speakers that has been recorded well before the start of the enrolment session.
  • A false accept decision means that the distance to the model of an impostor utterance is so small that it falls well within the distribution of the distances of the true customer to her/his own model, and must therefore be accepted as if it were indeed produced by the true customer.
  • A false reject means that the distance between an utterance of the true customer and her/his own model happened to be so large that it falls well within the distribution of impostor utterances, and must therefore be considered an utterance produced by a speaker different from the true customer.
  • both classes can be combined, so as to obtain even better results.
  • Both classes of techniques address the issue of improving the estimates of the distance between the newly built model and utterances of the true customer.
  • Th(new) = b * CTi + (1 - b) * CTt
  • Th(new) is the optimal threshold
  • CTi is the central tendency obtained from pre-recorded impostor speech
  • CTt is the central tendency estimated from the enrolment speech of the new customer
  • b is the interpolation parameter, which is optimised using additional pre-recorded impostor utterances that were not used in estimating CTi.
  • the distance distributions of both true customers and impostors approach the Gaussian distribution.
  • enrolment speech and pre-recorded impostor utterances are segmented into a large number of theoretically independent parts, for each of which the distance to the newly enrolled model is computed.
  • the Central Tendencies of the distance distributions are then corrected to remove the bias caused by the fact that the enrolment speech has been used both for building the model and for computing the distances to the model.
  • the optimal correction parameter h is optimised using additional pre-recorded impostor speech.
  • a receiving module 1 receives utterances of a speaker 2 during an enrolment process, during which speaker 2 produces n tokens of some set of phrases.
  • Model building module 3 builds one or more models consisting of explicit or implicit sets of central tendencies and variances of the speech coefficients of the utterances received via receiving module 1.
  • Threshold module 4 establishes accept/reject thresholds during said enrolment process, while estimating module 5.
  • Model building module 3 builds n different speaker models, each based on n-1 tokens, for each model one independent token being available for estimating, by the estimating module 5, the distance between the model and an utterance that has not been used to build the model.
  • the estimation module 5 estimates the central tendency of the distance between the speaker's model and newly produced utterances of the same speaker, and also its variance.
  • the estimation of the accept/reject threshold from enrolment speech is combined, by combining module 6, with pre-recorded impostor speech, whereby the central tendency of the distances between the model and utterances is optimised by optimising module 7.
  • Optimisation in the optimisation module 7 is executed by linear interpolation: Th(new) = b * CTi + (1 - b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from pre-recorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional pre-recorded utterances not used in estimating CTi.
  • Enrolment speech and pre-recorded impostor utterances are segmented, in said optimising module 7, into a large number of theoretically independent parts, for each of which the distance to the newly enrolled model is computed.
  • the central tendencies of the distance distributions are corrected, in the optimising module 7, removing a bias caused by the fact that the enrolment speech has been used both for building the model and for computing the distances to the model.
  • a single optimal value is computed that applies to all speakers.
  • an optimal correction factor is estimated for each newly enrolled speaker.
  • ABSTRACT The EER gives a good estimate of the modeling module of
  • a key problem for field applications in speaker verification is the SV system.
  • the EER does, however, not give much the issue of a priori threshold setting
  • the decision threshold(s) must be independent and speaker-dependent decision thresholds were estimated a priori during the enrollment phase
  • Bayesian compared Relevant parameters are estimated from theory indicate that the decision threshold(s) could be development data only.
  • the CAVE project (CAller VErification in Banking and Telecommunications) was a 2-year project. If we denote as X (resp. X̄) the acceptance (resp. rejection)
  • distributions, the minimisation of C in equation (1) is obtained by implementing the PDF Ratio (PR) test [4]: accept if P_X(Y) / P_X̄(Y) > R, reject otherwise, where R is the Bayesian threshold. If n is large enough, the utterance log-likelihood ratio can be assumed to follow a Gaussian distribution. This distribution is different depending on whether the speech utterance Y was pronounced by speaker X or by an impostor X̄: log LR_X(Y|X) ~ G(M_X; S_X) and similarly log LR_X(Y|X̄) ~ G(M_X̄; S_X̄) (8)
  • the fourth SD method can be viewed as a speaker dependent
  • SD 1 consists of estimating ⁇(R) as a linear combination of the log LR mean M and variance S, following an approach similar to the one proposed by Furui [6]. places. During each call, the speaker was asked to utter a number of items, including a speaker-dependent sequence of 14 digits (twice) and a few other sequences of 14 digits corresponding to other speakers
  • the second method relies on an estimation of ⁇(R) using enrollment material
  • the rest of the calls were used as test also the client score obtained with the enrollment data
  • ⁇ (R) is obtained as a linear combination of
  • estimates of M and M. TS we have split the SESP data into 2 sub-populations which we denote SESP-a and SESP-b. SESP-a contains 11 male and 10 female speakers while
  • SESP-b contains 10 male and 10 female speakers
  • This data set is composed of approximately 800 genuine trials and 250 impostor attempts from other clients (out of which about 75 are same-sex attempts), where M x is obtained from pseudo-impostor data whereas f M . is the (biased) estimate of M. Parameter ⁇ is optimised
  • SESP-b as pseudo-impostors on a development population and development data for SESP-a and vice versa
  • Acoustic features are 16 LPC cepstral coefficients with log
  • ABSTRACT Laboratory evaluations of SV systems are usually based on the Equal Error Rate (EER), obtained by
  • the CAVE project (CAller VErification in Banking and Telecommunications) is a 2-year project, supported real data distributions requires the adjustment of the threshold for an efficient decision.
  • the Language Engineering Sector of the Telematics Applications Programme of the European Union, and for the Swiss partners by the Office Federal de. This paper reports on a series of comparative experiments on a priori Threshold Setting (TS) carried out by WP4.
  • TS Threshold Setting
  • the logarithm of LR_X(Y) is obtained as the sum of the logarithm of the frame-based likelihood ratio scores lr_X(y_i): while ⁇ " ( ⁇ - ⁇ - 1 and C j .,... represent the corresponding costs (assuming a null cost for a true acceptance and a true rejection).
  • the optimal threshold should only depend on the false acceptance / rejection cost ratio and the impostor / client a priori probability. log LR_X(Y|X) ~ G(M_X; S_X) (8) and log LR_X(Y
  • SI SPEAKER-INDEPENDENT
  • P_X and P_X̄ denote the respective model likelihood functions for the speaker and the non-speaker
  • SI without normalisation
  • SI-N with normalisation. However, the obvious factor that makes them significantly different from those that could be expected from a field test data collection is the lack of intentional impostor attempts.
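
The leave-one-out enrolment step described in the definitions above (n speaker models, each built from n-1 tokens, with the remaining token used to measure a distance) could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy model (per-coefficient mean and variance), the variance-normalised squared distance, and every function name are hypothetical choices, not taken from the patent text, which does not fix a model type or distance measure.

```python
from statistics import mean, pvariance


def build_model(tokens):
    """Toy speaker model (assumption): per-coefficient mean and variance.

    Each token is a list of frames; each frame is a list of floats.
    """
    frames = [frame for token in tokens for frame in token]
    dims = range(len(frames[0]))
    means = [mean(f[d] for f in frames) for d in dims]
    # A small floor keeps the distance finite if a coefficient is constant.
    variances = [max(pvariance([f[d] for f in frames]), 1e-6) for d in dims]
    return means, variances


def distance(model, token):
    """Average variance-normalised squared distance of a token's frames."""
    means, variances = model
    per_frame = [
        sum((x - m) ** 2 / v for x, m, v in zip(frame, means, variances))
        for frame in token
    ]
    return mean(per_frame)


def leave_one_out_distances(tokens):
    """Build n models from n-1 tokens each and score the held-out token."""
    distances = []
    for i, held_out in enumerate(tokens):
        rest = tokens[:i] + tokens[i + 1:]
        distances.append(distance(build_model(rest), held_out))
    return distances


def true_speaker_statistics(tokens):
    """Central tendency (here: the mean) and variance of the distances."""
    d = leave_one_out_distances(tokens)
    return mean(d), pvariance(d)


# Hypothetical enrolment data: 3 tokens, 2 frames each, 2 coefficients.
enrolment_tokens = [
    [[1.0, 2.0], [1.2, 2.1]],
    [[0.9, 1.8], [1.1, 2.2]],
    [[1.0, 2.1], [1.3, 1.9]],
]
ct_true, var_true = true_speaker_statistics(enrolment_tokens)
```

Because each distance is computed on a token that was not used to build the corresponding model, the resulting estimate is not biased by reusing enrolment speech for both purposes, which is the motivation given in the text.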

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Stereophonic System (AREA)

Abstract

The invention relates to a speaker verification system comprising a model building device for building n different speaker models, each model being based on n-1 tokens. For each model, one independent token is available for estimating the distance between the model and an utterance that was not used to build the model. An estimating means estimates the central tendency of the distance between the speaker's model and newly produced utterances of the same speaker, as well as its variance. The central tendency of the distances between the model and utterances is optimised by linear interpolation: Th(new) = b * CTi + (1 - b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from pre-recorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional pre-recorded utterances not used in estimating CTi.
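
As a rough illustration of the interpolation formula in the abstract, the sketch below computes Th(new) = b * CTi + (1 - b) * CTt and picks b by a simple grid search. The optimisation criterion (minimising the count of false rejects plus false accepts on held-out distances), the grid search itself, the accept rule "distance <= threshold", and all names and numbers are assumptions made for this example; the source only states that b is optimised on additional pre-recorded utterances not used in estimating CTi.

```python
def interpolate_threshold(ct_impostor, ct_true, b):
    """Th(new) = b * CTi + (1 - b) * CTt."""
    return b * ct_impostor + (1.0 - b) * ct_true


def error_count(threshold, true_distances, impostor_distances):
    """False rejects plus false accepts, accepting when distance <= threshold."""
    false_rejects = sum(1 for d in true_distances if d > threshold)
    false_accepts = sum(1 for d in impostor_distances if d <= threshold)
    return false_rejects + false_accepts


def optimise_b(ct_impostor, ct_true, true_distances,
               heldout_impostor_distances, steps=100):
    """Grid-search b in [0, 1]; the criterion is an assumption (see lead-in)."""
    best_b, best_err = 0.0, None
    for k in range(steps + 1):
        b = k / steps
        th = interpolate_threshold(ct_impostor, ct_true, b)
        err = error_count(th, true_distances, heldout_impostor_distances)
        if best_err is None or err < best_err:
            best_b, best_err = b, err
    return best_b, best_err


# Hypothetical values: CTi from pre-recorded impostor speech, CTt from
# enrolment speech, and held-out distances that were not used for CTi.
ct_i, ct_t = 10.0, 2.0
true_d = [1.5, 2.0, 2.5, 3.0]
impostor_d = [6.0, 8.0, 9.0, 12.0]
best_b, best_err = optimise_b(ct_i, ct_t, true_d, impostor_d)
threshold = interpolate_threshold(ct_i, ct_t, best_b)
```

With b = 0 the threshold sits at the true-speaker central tendency and rejects many genuine utterances; raising b moves it toward the impostor central tendency. The grid search keeps the smallest b that reaches the lowest error count on the held-out data.
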
EP99924813A 1998-04-20 1999-04-16 Threshold setting and training of a speaker verification system Withdrawn EP1072035A1 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
NL1008930 1998-04-20
NL1008930 1998-04-20
PCT/EP1999/002641 WO1999054868A1 (fr) 1998-04-20 1999-04-16 Threshold setting and training of a speaker verification system

Publications (1)

Publication Number Publication Date
EP1072035A1 (fr) 2001-01-31

Family

ID=19766981

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99924813A Withdrawn EP1072035A1 (fr) Threshold setting and training of a speaker verification system 1998-04-20 1999-04-16

Country Status (3)

Country Link
EP (1) EP1072035A1 (fr)
AU (1) AU4135199A (fr)
WO (1) WO1999054868A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8725514B2 (en) 2005-02-22 2014-05-13 Nuance Communications, Inc. Verifying a user using speaker verification and a multimodal web-based interface
US9251792B2 (en) 2012-06-15 2016-02-02 Sri International Multi-sample conversational voice verification
CN110838295B (zh) * 2019-11-17 2021-11-23 西北工业大学 Model generation method, voiceprint recognition method and corresponding device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5839103A (en) * 1995-06-07 1998-11-17 Rutgers, The State University Of New Jersey Speaker verification system using decision fusion logic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO9954868A1 *

Also Published As

Publication number Publication date
AU4135199A (en) 1999-11-08
WO1999054868A1 (fr) 1999-10-28

Similar Documents

Publication Publication Date Title
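CN106782507B (zh) Speech segmentation method and apparatus
CN108766441B (zh) Voice control method and device based on offline voiceprint recognition and speech recognition
JPH0354600A (ja) Method for verifying the identity of an unknown person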
US5216720A (en) Voice verification circuit for validating the identity of telephone calling card customers
Lindberg et al. Techniques for a priori decision threshold estimation in speaker verification
EP1159737B1 (fr) Speaker recognition
Li et al. Automatic verbal information verification for user authentication
CN102324232A (zh) Voiceprint recognition method and system based on Gaussian mixture model
EP0528990A1 (fr) Simultaneous multi-speaker voice recognition and verification via a telephone network
TW546632B (en) System and method for efficient storage of voice recognition models
Pierrot et al. A comparison of a priori threshold setting procedures for speaker verification in the CAVE project
US8050920B2 (en) Biometric control method on the telephone network with speaker verification technology by using an intra speaker variability and additive noise unsupervised compensation
Bimbot et al. Speaker verification in the telephone network: research activities in the CAVE project
KR100779242B1 (ko) 음성 인식/화자 인식 통합 시스템에서의 화자 인식 방법
EP1072035A1 (fr) Threshold setting and training of a speaker verification system
Bimbot et al. An overview of the PICASSO project research activities in speaker verification for telephone applications
Naik et al. Evaluation of a high performance speaker verification system for access Control
Chenafa et al. Biometric system based on voice recognition using multiclassifiers
Olsson Text dependent speaker verification with a hybrid HMM/ANN system
Jokinen et al. Comparison of Gaussian process regression and Gaussian mixture models in spectral tilt modelling for intelligibility enhancement of telephone speech.
Vivaracho et al. A comparative study of MLP-based artificial neural networks in text-independent speaker verification against GMM-based systems
Bellegarda et al. Language-independent, short-enrollment voice verification over a far-field microphone
Rosenberg et al. Small group speaker identification with common password phrases
Ali et al. A comparative study of Arabic speech recognition
Burnett Rapid speaker adaptation for neural network speech recognizers

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20001120

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LI LU NL PT SE

17Q First examination report despatched

Effective date: 20010406

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20011017