WO2002065455A1

WO2002065455A1 - Evaluation system and method for binary classification systems utilizing unsupervised database

Info

Publication number: WO2002065455A1
Application number: PCT/ZA2002/000019
Authority: WO
Inventors: Johan Nikolaas Langenhoven Brummer
Original assignee: Spescom Datavoice (Pty) Limited
Priority date: 2001-02-15
Filing date: 2002-02-15
Publication date: 2002-08-22

Abstract

A method of evaluating performance of a computerized binary event classification system (30) comprises the steps of utilizing an unsupervised database (46) comprising a plurality of recorded data samples each relating to an event of a first kind or alternatively a second kind, but wherein it need not be known to which of the first kind or the second kind any data sample relates. For each sample in the database the system (30) is utilized to map the sample to a decision score (20), to yield a set of scores. The set of scores is modeled with a probabilistic parametric model having a first part being a probability distribution for scores obtained from data samples relating to events of the first kind and a second part being a probability distribution for scores obtained from data samples relating to events of the second kind. The parameters of the probabilistic parametric model are determined to produce a DET curve, to evaluate performance of the system (30), for example by determining estimates of false rejection and false acceptance rates.

Description

EVALUATION SYSTEM AND METHOD FOR BINARY CLASSIFICATION SYSTEMS UTILIZING UNSUPERVISED DATABASE

TECHNICAL FIELD

THIS invention relates to computerized event classification systems

and more particularly to apparatus for and a method of evaluating the

performance of such a classification system.

BACKGROUND ART

Computerized event classification systems of the kind which

automatically provide for each input data sample a decision score to

be used in a decision to be made by the system on whether the input

data sample relates to an event of a first kind alternatively to an event

of a second kind, are known in the art. Examples of such systems are

computerized speaker verification systems for use with voice

communication systems such as telephone systems. These

verification systems provide a decision whether a given input speech

utterance was spoken by a speaker who claims to have been the

speaker. The system utilizes a speaker model of the claiming speaker

and compares it to the input speech. The model is pre-generated by

the system during a training step by utilizing speech utterances of the

speaker. The decision comprises a "yes" for an acceptance or a "no"

for a rejection, alternatively or in addition a score which is proportional to how well the input speech utterance fits the speaker model. The

decision is arrived at by comparing the score to a threshold level.

Systems of the aforementioned kind are prone to two kinds of errors,

namely false rejections and false acceptances. A false rejection is

when the system indicates "no", while the utterance was in fact

spoken by the claiming speaker. Similarly, a false acceptance is when

the system indicates "yes", while the speaker was in fact not the

claiming speaker. The probability of occurrences of these two errors is

used to evaluate and characterize the performance of the system.

The performance of such a speaker verification system is often

evaluated by means of a so-called Detection Error Trade-off (DET)

curve of false acceptances against false rejections for various

threshold levels. It is known to determine this curve and/or other

evaluation parameters for a speaker verification system utilizing so-

called supervised databases. A supervised database comprises speech

utterances by many speakers and the identity of the speaker of each

utterance is known. The disadvantages of this known method and

system are that suitable supervised databases are expensive. They

need to be large and since the identity of the speaker of each

utterance must be known, they are tedious, time consuming and expensive to compile. Furthermore, a supervised database which may

be suitable for use in evaluating a verification system used in one

landline environment, may not be suitable in a landline environment in

another language jurisdiction or in a mobile environment, for example.

OBJECT OF THE INVENTION AND DEFINITIONS

Accordingly, it is an object of the present invention to provide an

alternative system for and method of evaluating performance of a

computerized event classification system of the aforementioned kind

and with which the applicant believes the aforementioned

disadvantages may at least be alleviated.

SUMMARY OF THE INVENTION

According to the invention there is provided a method of evaluating

performance of a computerized event classification system of a kind

which automatically provides for each input data sample a decision

score to be used in a decision on whether the input data sample

relates to either an event of a first kind or an event of a second kind,

the method comprising the steps of:

- utilizing an unsupervised database comprising a plurality of recorded data samples each relating to an event of a first kind or alternatively a second kind, but wherein it need not be known to which of the first kind or the second kind every data sample relates;

- for each sample in the database utilizing said system to map the sample to a decision score, to yield a set of scores;

- modeling the set of scores with an overall probabilistic parametric model having a first part being a probability distribution for scores obtained from data samples relating to events of the first kind and a second part being a probability distribution for scores obtained from data samples relating to events of the second kind; and

- estimating parameters of the overall probabilistic parametric model.

The probability distribution for the scores in the case of scores that are

real numbers is a probability density distribution and in the case of

discrete scores, it is a probability distribution.

According to another aspect of the invention there is provided an

evaluation system for evaluating performance of a computerized event

classification system of a kind which automatically provides for each

input data sample a decision score to be used in a decision on whether the input data sample relates to either an event of a first kind or an

event of a second kind, the system comprising:

- an unsupervised database comprising a plurality of recorded data samples each of which relates to any one of an event of a first kind and an event of a second kind, and wherein it need not be known to which of the first kind or the second kind every data sample relates;

- a data handler for providing the classification system with said data samples to output for each data sample a respective decision score, thereby to provide a set of decision scores;

- a data processor for estimating parameters of an overall probabilistic parametric model of the set of scores having a first part being a probability distribution for scores obtained from data samples relating to events of the first kind and a second part being a probability distribution for scores obtained from data samples relating to events of the second kind; and

- a performance estimation stage utilizing the parameters to generate estimates of the classification system performance.

BRIEF DESCRIPTION OF THE ACCOMPANYING DIAGRAMS

The invention will now further be described, by way of example only, with reference to the accompanying diagrams wherein:

figure 1 is a block diagram of a known computerized event classification system;

figure 2 is a very basic block diagram of a known speaker verification system under evaluation;

figure 3 shows examples of typical Detection Error Trade-off (DET) curves for speaker verification systems;

figure 4 is a block and flow diagram of the system and method according to the invention for evaluating performance of a classification system of the kind shown in figures 1 and 2;

figure 5 shows distribution curves of decision scores obtained by the method according to the invention and modeling steps relating thereto; and

figure 6 shows DET curves for a typical speaker verification system under evaluation in accordance with the method of the invention.

DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION

In figure 1 there is shown a known computerized event classification system 10 for automatically classifying input events 12 which are elements of a set comprising events 14 of a first kind or targets and events 16 of a second kind or non-targets, as either target or non-target. In use, indirect observations 18 of the events are made and the system 10 generates a decision score 20, which is used by the system in conjunction with a threshold 22, to arrive at a decision 24 which is an element of a set comprising "yes" or "no" for the sample.

Systems of the aforementioned kind are prone to the decision errors

referred to in the introduction of this specification. Therefore, it is

important to evaluate the error rates of these systems. The known

supervised database systems that are being used for this purpose are

also described in the introduction and the disadvantages of these

known systems and methods are also set out.

As one example of a classification system of the aforementioned

kind, there is shown in figure 2 a known computerized speaker

verification system designated 30. The system is typically used with

telephone networks automatically to verify the identity of a speaker.

As described in the introduction of this specification, during a training

step, system 30 utilizes an input utterance 32 by a speaker to derive a

speaker model 34. Subsequently, and during a verification step, when

another utterance by a speaker claiming to be the speaker and the

model are input to the system at 36, the system derives a verification

score 38 which is proportional to how well the other utterance matches the model. Based on the score and a selectable threshold

value, the system accepts or rejects the identity claimed, as shown at

40. The probabilities of occurrence of false acceptances and false

rejections are used as an indication of the performance of the system

30.

As shown at 39, the threshold of the system may be adjusted to

either increase false rejections and hence decrease false acceptances,

or conversely to decrease false rejections and hence increase false

acceptances. If the threshold is adjusted from one extreme to another

in a series of steps and the error rates are determined for each step, a

curve of false rejection rate (f_r) against false acceptance rate (f_a) is

obtained. Typical curves are shown at 41, 42 and 44 in figure 3. The

curve is called the Detection Error Trade-off (DET) curve. It will be

appreciated that the curve 42 represents a better system 30 than the

curve 44 and that curve 41 represents an even better system. One of

the objects of the invention is to obtain such a curve for a system 30,

but utilizing an unsupervised database 46, as opposed to the

supervised databases utilized in the prior art.
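The threshold sweep described above can be sketched for the supervised case, where target and impostor scores are known separately. This is only an illustration: the function name, the toy scores and the convention that a trial is accepted when its score is at or above the threshold are assumptions, not taken from the specification.

```python
def empirical_error_rates(target_scores, impostor_scores, thresholds):
    """Return a list of (threshold, f_r, f_a) tuples for a threshold sweep.

    Assumed convention: a trial is accepted when score >= threshold.
    """
    points = []
    for t in thresholds:
        # False rejection: a target trial that scores below the threshold.
        f_r = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial that scores at or above it.
        f_a = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        points.append((t, f_r, f_a))
    return points


# Hypothetical toy scores, for illustration only.
targets = [2.1, 1.7, 3.0, 0.4]
impostors = [-1.2, 0.1, 0.9, -0.5]
curve = empirical_error_rates(targets, impostors, [-2.0, 0.5, 4.0])
```

Raising the threshold moves along the curve from the low false rejection end to the low false acceptance end, which is the trade-off the DET curve displays.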

An unsupervised database for use with a speaker verification (SV)

system 30 is shown at 46 in figures 2 and 4. Required properties of the database 46 are that the database must contain data relating to

single-speaker utterances from many speakers, where the identities of

the speakers of the utterances need not be known. The database

must provide data relating to test pairs of utterances, where a pair

comprises a training utterance and a test utterance. A significant

proportion of the pairs must be impostor pairs (that is where the two

utterances of a pair are not spoken by the same speaker) and a

significant proportion must be target pairs (that is where both

utterances are spoken by the same speaker). The database 46 must

further provide a first section 48 of impostor pairs only and a second

section 50 which is a mixture of impostor pairs and target pairs.

Referring to figure 4, the method comprises the step of utilizing the

unsupervised database 46 to obtain a decision score 20 for each

data sample in the database, to yield a set of decision scores. The

result is a sub-set of pure impostor scores 52 shown in figure 5

derived from section 48 of the database 46 and a sub-set of mixed

scores 58 shown in figure 5, derived from section 50 of the database.

The set of scores therefore comprises an unknown number of scores

obtained from impostor tests and an unknown number of scores

obtained from true speaker tests. The system according to the

invention comprises a data processor 51 shown in figure 4 for estimating parameters of an overall probabilistic parametric model of

the scores having a first part 53 shown in figure 5 being a probability

distribution of impostor scores and a second part 55 being a

probability distribution of target or true speaker scores. A

performance estimation stage 57 is utilized to compute DET curves,

detection cost function (DCF) and equal error rates (EER). These

results may be fed back to the system 30 for automatic adjustment

and improvement of the system 30.
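As a rough illustration of the two scalar measures named above, the sketch below computes a detection cost of the weighted form commonly used in speaker recognition evaluations and locates the equal error rate by bisection. The function names, the default cost and prior values, and the assumption that the miss probability is non-decreasing and the false acceptance probability non-increasing in the threshold are all illustrative assumptions rather than details given in the specification.

```python
def detection_cost(p_miss, p_fa, p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Weighted detection cost; the default weights are assumed values."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def equal_error_rate(miss_fn, fa_fn, lo, hi, iterations=60):
    """Bisect for the threshold where miss_fn(t) and fa_fn(t) cross.

    miss_fn must be non-decreasing and fa_fn non-increasing on [lo, hi],
    with the crossing point inside the interval.
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if miss_fn(mid) < fa_fn(mid):
            lo = mid  # crossing lies at a higher threshold
        else:
            hi = mid  # crossing lies at a lower threshold
    t = 0.5 * (lo + hi)
    return 0.5 * (miss_fn(t) + fa_fn(t)), t
```

Any pair of error probability functions can be passed in, for example ones derived from an estimated score model.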

The pure impostor pairs from the section 48 and the SV system 30

under test are used to generate a set of pure impostor decision scores

52 shown in figure 5. As further shown in figure 5, the impostor distribution is modeled with an n-component, 1-dimensional Gaussian mixture model (GMM) 54, where the likelihood of a score, x, is:

$$p(x) = \sum_{i=1}^{n} q_i \, N(x, \mu_i, \sigma_i) \qquad (1)$$

with

$$\sum_{i=1}^{n} q_i = 1 \qquad (2)$$

where $N(x, \mu, \sigma)$ is the normal distribution with mean $\mu$ and standard deviation $\sigma$. A concentric initialization is adopted where all components are

initialized with the sample mean of the score set and the variances are

spread over a range. The range begins smaller than the sample

variance and ends larger than the sample variance. Next, the model

parameters are adapted with several iterations of the so-called EM

algorithm 56. The EM algorithm and its use are described in Dempster,

A., Laird N., and Rubin, D., "Maximum likelihood from incomplete data

via the EM algorithm", J. Roy. Stat. Soc. 39:1-38, 1977 and Reynolds,

D.A. and Rose R.C., "Robust text-independent speaker identification

using Gaussian mixture speaker models", IEEE Trans. Speech Audio

Process. 3:72-83, 1995.
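The concentric initialization and EM adaptation of the impostor model might be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the geometric spacing of the initial variances, the 0.25 and 4.0 spread limits and the variance floor are all choices made here and are not specified in the text.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def concentric_init(scores, n, low=0.25, high=4.0):
    """All means start at the sample mean; variances are spread from
    low * sample variance to high * sample variance (assumed limits)."""
    mean = sum(scores) / len(scores)
    var = sum((x - mean) ** 2 for x in scores) / len(scores)
    comps = []
    for i in range(n):
        frac = i / (n - 1) if n > 1 else 0.5
        scale = low * (high / low) ** frac  # geometric spacing, an assumption
        comps.append((1.0 / n, mean, math.sqrt(scale * var)))
    return comps  # list of (weight q_i, mean mu_i, std dev sigma_i)


def log_likelihood(scores, comps):
    """Log-likelihood of the scores under the mixture."""
    return sum(math.log(sum(q * normal_pdf(x, mu, sd) for q, mu, sd in comps))
               for x in scores)


def em_step(scores, comps):
    """One EM iteration for a 1-dimensional Gaussian mixture."""
    n = len(comps)
    w = [0.0] * n    # sum of posteriors per component
    wx = [0.0] * n   # posterior-weighted sum of x
    wxx = [0.0] * n  # posterior-weighted sum of x squared
    for x in scores:
        lik = [q * normal_pdf(x, mu, sd) for q, mu, sd in comps]
        total = sum(lik)
        for i in range(n):
            p = lik[i] / total  # posterior of component i given x
            w[i] += p
            wx[i] += p * x
            wxx[i] += p * x * x
    new_comps = []
    for i in range(n):
        mu = wx[i] / w[i]
        var = max(wxx[i] / w[i] - mu * mu, 1e-12)  # floor avoids collapse
        new_comps.append((w[i] / len(scores), mu, math.sqrt(var)))
    return new_comps


def fit_impostor_gmm(scores, n=2, iterations=20):
    """Concentric initialization followed by a fixed number of EM steps."""
    comps = concentric_init(scores, n)
    for _ in range(iterations):
        comps = em_step(scores, comps)
    return comps
```

Starting all components at the same mean means that only the differing variances separate them initially, so the components adapt outward from a common centre as EM proceeds.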

The mixed pairs from the section 50 in figure 4 and the SV system 30

under test are used to generate a set of mixed impostor and target

scores 58 shown in figure 5. This distribution is modeled with a structured GMM 60 of the form:

$$p(x) = P_{imp} \sum_{i \in I} q_i \, N(x, \alpha + \gamma\mu_i, \gamma\sigma_i) + P_{tar} \sum_{i \in T} r_i \, N(x, \beta_i, \delta_i) \qquad (3)$$

where

$$P_{imp} + P_{tar} = 1 \qquad (4)$$

and

$$\sum_{i \in T} r_i = 1 \quad \text{and} \quad \sum_{i \in I} q_i = 1 \qquad (5)$$

Here, $P_{imp}$ is the fraction of impostor scores in the mixed score set and $P_{tar}$ the fraction of target scores. This model has two sets of components: I, the impostor components, and T, the target components.

The impostor component parameters are formed from the previously

estimated parameters and are left unchanged throughout re-estimation.

A global offset parameter α and a global scale parameter γ are added to modify the impostor distribution. These allow for a possible change in the impostor distribution between the pure impostor set and the mixed score set. Note that $N(x, \alpha + \gamma\mu, \gamma\sigma) = N((x-\alpha)/\gamma, \mu, \sigma)/\gamma$, effecting

a transformation of the random variable x. The target components are

left to adapt freely.
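The identity noted above, that shifting the mean by α and scaling mean and standard deviation by γ is the same as transforming the random variable x, can be checked numerically. The values below are arbitrary illustrative choices, and γ is assumed to be positive.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Arbitrary illustrative values; gamma must be positive.
x, mu, sigma, alpha, gamma = 0.7, -1.0, 0.5, 0.2, 1.3

# Left-hand side: density with adapted mean and standard deviation.
lhs = normal_pdf(x, alpha + gamma * mu, gamma * sigma)
# Right-hand side: original density evaluated at the transformed score.
rhs = normal_pdf((x - alpha) / gamma, mu, sigma) / gamma
```

Both sides agree up to floating-point rounding, so adapting α and γ amounts to a single affine warp applied to every impostor component at once.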

The a priori parameters $P_{imp}$ and $P_{tar}$ may be initialized with guesses. Adaptation parameter $\alpha$ can be set to zero and $\gamma$ to one. Impostor parameters $\{q_i, \mu_i, \sigma_i\}$ are fixed as previously estimated. An equal number of target components are initialized from the impostor components, with an offset and enlarged variances: $r_i = q_i$, $\beta_i = \mu_i + s$, $\delta_i^2 = 20\sigma_i^2$, where the offset s can be roughly estimated by inspection of a histogram of the mixed scores.

All parameters except the original impostor parameters $\{q_i, \mu_i, \sigma_i\}$ are re-estimated with several iterations of the aforementioned EM algorithm on the mixed score data.
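The initialization just described might be sketched as below. The dictionary layout and function name are hypothetical, the default prior of 0.5 is merely a guess of the kind the text permits, and the sketch assumes the initialization is read as target weights copied from the impostor weights, means offset by s, and variances enlarged by a factor of 20.

```python
import math


def init_structured_gmm(impostor_comps, s, p_imp=0.5, var_factor=20.0):
    """Build an initial structured model from fitted impostor components.

    impostor_comps: list of (q_i, mu_i, sigma_i) tuples, kept fixed.
    s: offset for the target means, e.g. read off a score histogram.
    p_imp and var_factor are assumed starting values.
    """
    targets = [
        # Same weight, mean shifted by s, variance enlarged by var_factor.
        (q, mu + s, math.sqrt(var_factor) * sigma)
        for q, mu, sigma in impostor_comps
    ]
    return {
        "p_imp": p_imp,
        "p_tar": 1.0 - p_imp,
        "alpha": 0.0,  # global offset starts at zero
        "gamma": 1.0,  # global scale starts at one
        "impostor": list(impostor_comps),
        "target": targets,
    }


# Hypothetical impostor components and offset, for illustration only.
model = init_structured_gmm([(0.5, -1.0, 0.5), (0.5, 0.0, 1.0)], s=3.0)
```

The broad initial target variances let the target components cover the upper part of the mixed score histogram before EM narrows them.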

The EM algorithm is used to obtain a local maximum of the likelihood of the observed data $\{x_t\}$ given a model $\lambda$ for the data. Specifically, $\prod_t p(x_t|\lambda)$ is maximized with respect to $\lambda$ by iteratively re-estimating the parameters of $\lambda$. An iteration of the EM algorithm starts with a model $\lambda$ and yields a re-estimated model $\bar{\lambda}$ so that

$$\prod_t p(x_t|\bar{\lambda}) \geq \prod_t p(x_t|\lambda) \qquad (6)$$

In the case of a simple GMM:

$$p(x_t|\lambda) = \sum_i q_i \, N(x_t, \mu_i, \sigma_i) \qquad (7)$$

$$\sum_i q_i = 1 \qquad (8)$$

It can be shown that the inequality of equation 6 is ensured by maximizing the auxiliary function $Q(\lambda, \bar{\lambda})$ with respect to $\bar{\lambda}$, where

$$Q(\lambda, \bar{\lambda}) = \sum_t \sum_i P(i|x_t, \lambda) \log\left(\bar{q}_i \, N(x_t, \bar{\mu}_i, \bar{\sigma}_i)\right) \qquad (9)$$

and where

$$P(i|x_t, \lambda) = q_i \, N(x_t, \mu_i, \sigma_i) / p(x_t|\lambda) \qquad (10)$$

is the posterior probability of component i, given the data and the old model. Q(·) is augmented by adding a Lagrange-multiplier term to ensure the constraint of equation 8 and is globally maximized by setting its partial derivatives, with respect to each of the parameters in $\bar{\lambda}$, to 0.
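For the simple GMM of equation 7, carrying out this maximization gives the familiar closed-form re-estimation formulae. They are not written out in the text and are given here only as the standard result, with M denoting the number of scores (a symbol introduced for this illustration) and a bar marking a re-estimated parameter:

```latex
\bar{q}_i = \frac{1}{M} \sum_{t=1}^{M} P(i \mid x_t, \lambda), \qquad
\bar{\mu}_i = \frac{\sum_{t} P(i \mid x_t, \lambda)\, x_t}{\sum_{t} P(i \mid x_t, \lambda)}, \qquad
\bar{\sigma}_i^{2} = \frac{\sum_{t} P(i \mid x_t, \lambda)\, x_t^{2}}{\sum_{t} P(i \mid x_t, \lambda)} - \bar{\mu}_i^{2}
```

The weight formula follows from the Lagrange-multiplier term, since the posteriors of all components sum to one for every score.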

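In the case of the structured GMM of equations 3 through 5, the auxiliary function becomes

$$Q(\lambda, \bar{\lambda}) = \sum_t \sum_{i \in I} P_{it} \log\left(\bar{P}_{imp} \, q_i \, N(x_t, \bar{\alpha} + \bar{\gamma}\mu_i, \bar{\gamma}\sigma_i)\right) + \sum_t \sum_{i \in T} P_{it} \log\left(\bar{P}_{tar} \, \bar{r}_i \, N(x_t, \bar{\beta}_i, \bar{\delta}_i)\right) \qquad (11)$$

where the posterior component probabilities, given the old model, for brevity may be written as:

$$P_{it} = P(i|x_t, \lambda) \qquad (12)$$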

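Three Lagrange multipliers for the three constraints (equations 4 and 5) are required. After differentiating and solving, the required formulae are:

$$P_{imp} = \left(\sum_t \sum_{i \in I} P_{it}\right) / \left(\sum_t \sum_{i \in I \cup T} P_{it}\right) \qquad (13)$$

$$P_{tar} = \left(\sum_t \sum_{i \in T} P_{it}\right) / \left(\sum_t \sum_{i \in I \cup T} P_{it}\right) \qquad (14)$$

$$r_k = \left(\sum_t P_{kt}\right) / \left(\sum_t \sum_{i \in T} P_{it}\right) \qquad (15)$$

$$\beta_k = \left(\sum_t P_{kt} x_t\right) / \left(\sum_t P_{kt}\right) \qquad (16)$$

$$\delta_k^2 = \left(\sum_t P_{kt} x_t^2\right) / \left(\sum_t P_{kt}\right) - \beta_k^2 \qquad (17)$$

$$[-E]\gamma^2 + [AC/B - F]\gamma + [D - A^2/B] = 0 \qquad (18)$$

$$\alpha = (A - \gamma C)/B \qquad (19)$$

where

$$A = \sum_t \sum_{i \in I} P_{it} x_t / \sigma_i^2 \qquad (20)$$

$$B = \sum_t \sum_{i \in I} P_{it} / \sigma_i^2 \qquad (23)$$

$$C = \sum_t \sum_{i \in I} P_{it} \mu_i / \sigma_i^2 \qquad (24)$$

$$D = \sum_t \sum_{i \in I} P_{it} x_t^2 / \sigma_i^2 \qquad (25)$$

$$E = \sum_t \sum_{i \in I} P_{it} \qquad (26)$$

$$F = \sum_t \sum_{i \in I} P_{it} x_t \mu_i / \sigma_i^2 \qquad (27)$$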
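One re-estimation pass over the mixed scores might be sketched as follows, under the reading of the formulae as posterior-weighted sums over impostor and target components. The dictionary layout and function names are hypothetical, and taking the positive root of the quadratic in γ is an assumption made here: since E is positive and D − A²/B is non-negative, the quadratic has one non-negative root, which is the only one usable as a scale.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def structured_em_step(scores, model):
    """One EM iteration of the structured model.

    model keys (hypothetical layout): "p_imp", "p_tar", "alpha", "gamma",
    "impostor" as [(q, mu, sigma)] kept fixed, "target" as [(r, beta, delta)].
    """
    a, g = model["alpha"], model["gamma"]
    imp, tar = model["impostor"], model["target"]
    A = B = C = D = E = F = 0.0
    tw = [0.0] * len(tar)   # sum of posteriors per target component
    tx = [0.0] * len(tar)   # posterior-weighted sum of x
    txx = [0.0] * len(tar)  # posterior-weighted sum of x squared
    for x in scores:
        li = [model["p_imp"] * q * normal_pdf(x, a + g * mu, g * sd)
              for q, mu, sd in imp]
        lt = [model["p_tar"] * r * normal_pdf(x, b, d) for r, b, d in tar]
        total = sum(li) + sum(lt)
        for (q, mu, sd), lik in zip(imp, li):
            p = lik / total  # posterior of this impostor component
            v = sd * sd
            A += p * x / v
            B += p / v
            C += p * mu / v
            D += p * x * x / v
            E += p
            F += p * x * mu / v
        for k, lik in enumerate(lt):
            p = lik / total  # posterior of this target component
            tw[k] += p
            tx[k] += p * x
            txx[k] += p * x * x
    n_tar = sum(tw)
    # Positive root of -E*g^2 + (A*C/B - F)*g + (D - A^2/B) = 0.
    b1 = A * C / B - F
    c0 = max(D - A * A / B, 0.0)
    gamma = (b1 + math.sqrt(b1 * b1 + 4.0 * E * c0)) / (2.0 * E)
    alpha = (A - gamma * C) / B
    targets = []
    for k in range(len(tar)):
        beta = tx[k] / tw[k]
        var = max(txx[k] / tw[k] - beta * beta, 1e-12)  # floor, an assumption
        targets.append((tw[k] / n_tar, beta, math.sqrt(var)))
    return {
        "p_imp": E / (E + n_tar),  # E + n_tar equals the number of scores
        "p_tar": n_tar / (E + n_tar),
        "alpha": alpha,
        "gamma": gamma,
        "impostor": imp,  # left unchanged throughout re-estimation
        "target": targets,
    }
```

Calling the function repeatedly on the mixed score set, starting from an initial model, gives the iterative adaptation; the impostor component parameters are passed through untouched on every pass.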

A Detection Error Tradeoff (DET) curve (see figure 3) for a

classification or verification system is a non-linear scaling of the

receiver operating characteristic (ROC) curve, where the threshold of the system is

varied to produce a curve of miss or false rejection probability against

false acceptance probability.

Given the impostor and target parts of the estimated structured GMM,

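the DET curve can be calculated. For a threshold t, the error probabilities are:

$$P_{fa}(t) = \int_{t}^{\infty} \sum_{i \in I} q_i \, N(x, \alpha + \gamma\mu_i, \gamma\sigma_i) \, dx$$

$$P_{fr}(t) = \int_{-\infty}^{t} \sum_{i \in T} r_i \, N(x, \beta_i, \delta_i) \, dx$$

The integrals are evaluated using the well-known error function.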

Varying t over a range of values and applying the DET transform,

produces the estimated DET curve.
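The calculation of an estimated DET curve from the two parts of the model might be sketched as below. Several points are assumptions: a score at or above the threshold is taken to mean acceptance, the "DET transform" is taken to be the customary normal-deviate (probit) warping of both axes, and the dictionary layout and function names are hypothetical.

```python
import math
from statistics import NormalDist


def normal_cdf(x, mu, sigma):
    """Normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def error_probabilities(t, model):
    """Return (p_fr, p_fa) at threshold t.

    model keys (hypothetical layout): "alpha", "gamma",
    "impostor" as [(q, mu, sigma)], "target" as [(r, beta, delta)].
    """
    a, g = model["alpha"], model["gamma"]
    # False acceptance: impostor score mass at or above the threshold.
    p_fa = sum(q * (1.0 - normal_cdf(t, a + g * mu, g * sd))
               for q, mu, sd in model["impostor"])
    # False rejection: target score mass below the threshold.
    p_fr = sum(r * normal_cdf(t, b, d) for r, b, d in model["target"])
    return p_fr, p_fa


def det_curve(model, thresholds):
    """Return (probit(p_fa), probit(p_fr)) points for plotting."""
    probit = NormalDist().inv_cdf
    points = []
    for t in thresholds:
        p_fr, p_fa = error_probabilities(t, model)
        # The probit is undefined at exactly 0 or 1, so skip such points.
        if 0.0 < p_fr < 1.0 and 0.0 < p_fa < 1.0:
            points.append((probit(p_fa), probit(p_fr)))
    return points
```

On the warped axes a model with one Gaussian per class plots as a straight line, which is one reason this scaling is convenient for comparing systems.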

In figure 6 there are shown practical examples of DET curves for a

typical speaker verification system 30. The curve 70 was determined

directly from the data utilizing the "1-speaker detection" part of the

NIST 2000 Speaker Recognition Evaluation Database, which is

supervised in the sense that the speakers' identities are known.

Curve 72 was determined by the method according to the

invention with an unsupervised database. It will be seen that for the

system 30 under evaluation the method according to the invention

provides results which are comparable to the results obtained with the

prior art supervised database.

The system and method according to the invention may also be

utilized to evaluate other classification systems, more particularly

binary classification systems. Such systems may include, but are not

limited to systems of a kind which automatically provide a decision

on whether input data relates to given data relating to biometric

features. Such features may include iris patterns, fingerprints, face

and/or hand shapes and profiles.

Claims

1. A method of evaluating performance of a computerized event

classification system of a kind which automatically provides for

each input data sample a decision score to be used in a decision

on whether the input data sample relates to either an event of a

first kind or an event of a second kind, the method comprising

the steps of:

utilizing an unsupervised database comprising a plurality

of recorded data samples each relating to an event of a

first kind alternatively a second kind, but wherein it need

not be known to which of the first kind or the second

kind every data sample relates;

for each sample in the database utilizing said system to

map the sample to a decision score, to yield a set of

scores;

modeling the set of scores with an overall probabilistic

parametric model having a first part being a probability

distribution for scores obtained from data samples relating

to events of the first kind and a second part being a

probability distribution for scores obtained from data

samples relating to events of the second kind; and

estimating parameters of the overall probabilistic

parametric model.

2. A method as claimed in claim 1, wherein the database

preferably comprises first and second sections, the first section

comprising data samples relating to events of the first kind only

and the second section comprising data samples relating to

events of both the first kind and events of the second kind and

wherein in the second section it need not be known to which

of the first kind or the second kind any data sample relates.

3. A method as claimed in claim 2 wherein with each data sample

in the first section a set of first event scores is generated and

with each data sample in the second section a set of mixed

event scores is generated.

4. A method as claimed in claim 3 wherein the set of first event

scores is modeled with a first event probabilistic parametric

model and wherein said first event probabilistic model is utilized

in estimating said parameters of said overall probabilistic

parametric model.

5. A method as claimed in claim 2 wherein the overall probabilistic

parametric model comprises a third part being a probability

distribution of a ratio of data samples relating to events of the

first kind and data samples relating to events of the second kind

in the second section of the database.

6. An evaluation system for evaluating performance of a

computerized event classification system of a kind which

automatically provides for each input data sample a decision

score to be used in a decision on whether the input data sample

relates to either an event of a first kind or an event of a second

kind, the system comprising:

an unsupervised database comprising a plurality of

recorded data samples each of which relates to any one

of an event of a first kind and an event of a second kind,

and wherein it need not be known to which of the first

kind or the second kind every data sample relates;

a data handler for providing the classification system with

said data samples to output for each data sample a

respective decision score, thereby to provide a set of

decision scores;

a data processor for estimating parameters of an overall

probabilistic parametric model having a first part being a

probability distribution for scores obtained from data

samples relating to events of the first kind and a second

part being a probability distribution for scores obtained

from data samples relating to events of the second kind;

and

a performance estimation stage utilizing the parameters to

generate estimates of the classification system

performance.

7. A system as claimed in claim 6, wherein the database preferably

comprises first and second sections, the first section comprising

data samples relating to events of the first kind only and the

second section comprising data samples relating to events of

both the first kind and events of the second kind.

8. A system as claimed in claim 6 or claim 7, wherein the

classification system comprises a speaker detection system of

the kind which decides on whether speech spoken by a target

speaker is present in a given speech sample.

9. A system as claimed in claim 8 wherein each data sample

comprises data relating to a pair of first and second utterances,

the first utterance being a training utterance and the second

utterance being a test utterance.

10. A system as claimed in claim 9 wherein at least some of the

pairs of utterances comprise impostor utterances wherein the

first and second utterances of a pair are not spoken by a

common speaker; and wherein at least some of the pairs of

utterances comprise true speaker utterances wherein the first

and second utterances of a pair are spoken by a common

speaker.