WO1990008379A1 - Speaker recognition - Google Patents

Speaker recognition Download PDF

Info

Publication number
WO1990008379A1
Authority
WO
WIPO (PCT)
Prior art keywords
individual
vocal
bid
certainty
features
Prior art date
Application number
PCT/GB1990/000068
Other languages
French (fr)
Inventor
Andrew Mackinnon Sutherland
Steven Mark Hiller
Mervyn Abraham Jack
John David Michael Henry Laver
Original Assignee
The University Court Of The University Of Edinburgh
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB898900931A external-priority patent/GB8900931D0/en
Priority claimed from GB898926989A external-priority patent/GB8926989D0/en
Priority claimed from GB898926988A external-priority patent/GB8926988D0/en
Application filed by The University Court Of The University Of Edinburgh
Publication of WO1990008379A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

A method of recognising that samples of speech uttered at different times were uttered by the same speaker. The method comprises determining whether values related to characteristics of the two speech samples satisfy a predetermined relationship, characterized in that at least one of the characteristics is a waveform perturbation feature of each sample. A method of verifying an individual is also described. The method comprises obtaining (40, 41) from the individual a first, bid vocal feature characteristic of the individual; comparing the first, bid feature with a reference first vocal feature; and generating a first certainty value (42) representing the degree of similarity between the first bid and first reference features. If the first certainty value lies within predetermined limits, obtaining (45, 46) from the individual a second, bid vocal feature characteristic of the individual; comparing the second vocal feature with a reference second vocal feature; generating (47) a second certainty value representing the degree of similarity between the second bid and second reference vocal features; and verifying (48) the individual if at least one of the first and second certainty values satisfies a predetermined condition.

Description

SPEAKER RECOGNITION The invention relates to methods and apparatus for speaker recognition, which in this context concerns generally the comparison of samples of speech uttered at different times to obtain an indication of whether the same speaker was responsible for both samples. Such methods include speaker identification, in which a speaker is identified from a population of enrolled speakers, as well as more conventional speaker verification, in which a speaker is verified against one particular enrolled speaker.
Speaker recognition techniques can be divided into two main types. These are known as text-dependent and text-independent techniques. Text-dependent speaker recognition requires that exactly the same utterance is employed during the enrolment and the bid sessions, while text-independent techniques remove the effects of different text or utterances. Text-independent systems have been developed because the processing times involved can be considerably reduced over text-dependent techniques. Text-independent techniques are generally preferred in order to avoid the limitation of requiring a speaker to utter particular words.
We have developed a new text-independent method of recognizing that samples of speech uttered at different times were uttered by the same speaker, the method comprising determining whether values related to characteristics of the two speech samples satisfy a predetermined relationship, characterized in that at least one of the characteristics is a waveform perturbation feature of each sample.
In accordance with a second aspect of the present invention, apparatus for recognizing that samples of speech uttered at different times were uttered by the same speaker comprises means for obtaining values relating to characteristics of the two samples and for determining whether the values satisfy a predetermined relationship, characterized in that at least one of the characteristics is a waveform perturbation feature of each sample.
We have investigated in some depth the speech production system and this can be divided into three largely independent parts: the vocal folds, the oral-pharyngeal tract, and the nasal tract. The vocal folds generate the acoustical signal which is modulated by the oral-pharyngeal tract and the nasal tract. The independent nature of these parts means that it is possible to extract information about the vocal folds of the speaker independently of the effects of the other two parts. This information is extracted by looking at waveform perturbation features of one or more of the pitch and amplitude of the sample and is generally referred to below as a time domain analysis in contrast to the known frequency domain analyses which have previously been adopted for extracting information about the oral-pharyngeal tract.
In its simplest form, the method comprises extracting a single value relating to one waveform perturbation feature from each of the samples. Preferably, however, at least two and most preferably ten waveform perturbation features are extracted from each sample. The waveform perturbation features will be described in more detail below but may be chosen from the following:
a. the mean of the absolute perturbations in pitch,
b. the mean of the absolute perturbations in amplitude,
c. the standard deviation of the perturbations in pitch,
d. the standard deviation of the perturbations in amplitude,
e. the number of perturbations in pitch over a set threshold (for example set at three percent of some trend value),
f. the number of perturbations in amplitude over a set threshold,
g. a directional perturbation factor for pitch,
h. a directional perturbation factor for amplitude,
i. mean absolute pitch, and
j. standard deviation in absolute pitch.
Typically, each sample of speech will comprise a number of words.
If the waveform perturbation features alone are not sufficient, average cepstral coefficients from the same utterance can be obtained and processed with the waveform perturbation values.
For example, the utterance can be broken down into a number, say three hundred, of equally time-spaced intervals and a vector of, say, twenty-four cepstral coefficients determined for each interval. The mean of these 300 vectors, averaged over the utterance, can then be appended to the ten waveform perturbation features to generate a 34-component vector.
The similarity between the two vectors for the two speech samples (with or without cepstral values) may be compared by determining the Euclidean distance between them and comparing that with a threshold. If the distance is less than the threshold, the speaker is verified.
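As a concrete illustration of this step, the sketch below (not from the patent; the function names and the threshold value are illustrative assumptions) builds such a 34-component vector from ten perturbation features and the mean of the per-interval cepstral vectors, then verifies a bid against an enrolled reference by Euclidean distance.

```python
import numpy as np

def build_feature_vector(perturbation_features, cepstral_frames):
    """Append mean cepstral coefficients to the waveform perturbation features.

    perturbation_features: length-10 array of waveform perturbation measures.
    cepstral_frames: (n_intervals, 24) array, e.g. 300 intervals x 24 cepstra.
    Returns a 34-component feature vector.
    """
    mean_cepstra = np.asarray(cepstral_frames).mean(axis=0)  # 24 values
    return np.concatenate([np.asarray(perturbation_features), mean_cepstra])

def verify(bid_vector, reference_vector, threshold=1.0):
    """Accept the speaker if the Euclidean distance to the enrolled
    reference vector is below the threshold (threshold value is illustrative)."""
    distance = np.linalg.norm(bid_vector - reference_vector)
    return distance < threshold, distance
```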
The apparatus may be implemented by hard-wired circuits or a suitably programmed computer. Many different processes have been proposed to verify that an individual is the person he claims to be. In a typical system, a vocal characteristic of an individual is analysed to generate a number of features unique to that characteristic and then those features are compared individually with corresponding reference features so as to generate a final score which is compared with a threshold so as to arrive at a final accept/reject decision. There is a need to improve the performance of such verification processes.
In accordance with a third aspect of the present invention, a method of verifying an individual comprises obtaining from the individual a first, bid vocal feature characteristic of the individual; comparing the first, bid feature with a reference first vocal feature; generating a first certainty value representing the degree of similarity between the first bid and first reference features; and, if the first certainty value lies within predetermined limits, obtaining from the individual a second, bid vocal feature characteristic of the individual; comparing the second vocal feature with a reference second vocal feature; generating a second certainty value representing the degree of similarity between the second bid and second reference vocal features; and verifying the individual if at least one of the first and second certainty values satisfies a predetermined condition.
In accordance with a fourth aspect of the present invention, apparatus for verifying an individual comprises means for obtaining from the individual first and second bid vocal features characteristic of the individual; comparison means for comparing each of the first and second vocal features with corresponding reference vocal features and for generating respective first and second certainty values representing the degree of similarity between the compared features; and verifying means for verifying the individual if at least one of the first and second certainty values satisfies a predetermined condition.
We have developed a sophisticated method for combining the results of two or more comparisons of vocal features in which a series of certainty values is generated, for example in the form of percentages, which may then be combined for example by simple summing or by summing weighted versions to arrive at the final value which can be compared with a threshold. The final value represents the overall degree of success with which the different comparison steps have been passed. In particular, this method permits a significant degree of choice to be available over the setting of the final threshold which is unavailable with AND or OR logic methods.
The generation of certainty values allows a significant advantage to be achieved in which a second, bid vocal feature is only obtained if the first certainty value lies within predetermined limits. Thus, in this case, if the first certainty value lies within predetermined limits which are set to define a range in which the first comparison is marginally successful or marginally unsuccessful then the second, bid vocal feature is obtained and processed in an attempt to improve upon the first test. However, if the first certainty value lies below the lower limit or above the higher limit (in a second predetermined range) then the individual is immediately rejected or accepted respectively.
In this latter case, the individual may be accepted following the obtaining of the second, bid vocal feature solely by comparing the second certainty value with a threshold, or the first and second certainty values may be combined, as mentioned above.
In accordance with a fifth aspect of the present invention, a method of verifying an individual comprises obtaining from the individual a first, bid vocal feature characteristic of the individual; comparing the first, bid feature with a reference first vocal feature; generating a first certainty value representing the degree of similarity between the first bid and first reference features; obtaining from the individual a second, bid vocal feature characteristic of the individual; comparing the second vocal feature with a reference second vocal feature; generating a second certainty value representing the degree of similarity between the second bid and second reference vocal features; and verifying the individual if at least one of the first and second certainty values satisfies a predetermined condition.
It should be noted that although the vocal features are characteristic of the individual they may not be unique to the individual.
The vocal features which can be used in this invention may be chosen from any known features. Thus, the features could include cepstral coefficients derived from a single word (in combination with a dynamic time warping technique); or cepstral coefficients obtained from a number of words. Preferably, however, waveform perturbation values as described above are used.
More than two certainty values may be obtained. In one preferred example, a first certainty value may be obtained by looking at cepstral coefficients obtained from one word; a second certainty value may be obtained by looking at cepstral coefficients derived from a number of words; and a third certainty value may be derived by looking at waveform perturbation and these may then be combined together to generate a final certainty value.
In the preferred mode certainty values are derived from a waveform perturbation measurement, and cepstral coefficients from predetermined words. This allows the benefit of combining two dissimilar types of vocal characteristics, giving enhanced discrimination and reduced overall error rates. In this mode a text dependent utterance is combined with a text independent utterance.
An example of a method and apparatus for speaker recognition in accordance with the present invention will now be described with reference to the accompanying drawings, in which:- Figure 1 is a block diagram of the apparatus;
Figure 2 illustrates the output signal from a microphone;
Figure 3 is a flow diagram illustrating operation of the waveform perturbation analysis processor of Figure 1;
Figure 4 illustrates the variation of pitch with time;
Figure 5 illustrates the variation of smoothed pitch with time;
Figure 6 illustrates excursions of pitch from a median level;
Figure 7 illustrates typical frequency distribution curves;
Figure 8 illustrates error rate curves derived from Figure 7;
Figure 9 illustrates a typical variation of certainty score;
Figure 10 illustrates error rate as a function of certainty score;
Figure 11 illustrates the frequency distribution of certainty scores for combined biological feature processors;
Figure 12 illustrates the variation of error rate with certainty for a combination of processors;
Figure 13 is a block diagram of apparatus for carrying out the method; and,
Figure 14 is a flow diagram illustrating operation of the Figure 13 apparatus.
The apparatus shown in Figure 1 comprises a data input device 1 such as a microphone and digitizing circuit connected in parallel to two separate analyzing processors. The processors can be implemented on a suitably programmed computer (Figure 3) or by circuits (Figure 1).
The first processor, with which the present invention is primarily concerned, analyses pitch and amplitude perturbations in the incoming signal and comprises a waveform analysis circuit 2 to which the signal is fed, a perturbation analysis circuit 3 the output of which is input to a matcher circuit 5, and a reference store 6 for holding the enrolment perturbation measures.
The second processor analyses the speech signal by employing a Linear Predictive model. This model effects the removal of the pitch of the signal, and limits parametrization of the signal to vocal tract effects. The second processor comprises a cepstral coefficient analyzer 4 for computing the average of the cepstral coefficients over the duration of the sentence utterance which are input to the matcher 5, and the reference store 6 for holding the enrolment cepstral coefficients.
To summarize, the first processor analyses the vibration of the vocal folds, and the second processor analyses the characteristics of the oral-pharyngeal vocal tract while the function of the matcher 5 is to output a value representing the relationship between the bid and the enrolment data over both processors.
The first processor will now be described in more detail.
A typical example of a speech waveform as captured by a microphone is shown in Figure 2. A period of the waveform is immediately identifiable. This periodicity in the waveform reflects the motion of the vocal folds, which vibrate in the airflow from the lungs. Any rise in the pitch or tone of the voice is caused by an increase in the rate of the vocal fold vibrations, and is thus accompanied by a shortening of the periods in the speech waveform. If this increase and decrease of vocal fold vibration is a gradual process, relative to the duration of a single period, it is evident in the perceived speech signal as intonation.
The durations of individual pitch periods also vary in an irregular fashion on a much shorter time base. Even if a steady note is held, the durations of successive pitch periods vary considerably. Although not all factors contributing to this phenomenon are known, it is generally accepted that both the anatomical make-up of the vocal folds and their dynamics during speech production influence the magnitude and rate of such irregularities. The amplitude of the speech waveform, at a point which may be associated with a given pitch period, also varies in a rapid and irregular fashion (in addition to the more gradual effects of increasing or decreasing the volume of the voice). Measures of pitch and amplitude perturbation may be generically termed waveform perturbation.
The analysis of waveform perturbation includes three major components: determination of the pitch from the speech waveform; removal of the intonational trend; and quantification of the resulting perturbation.
The algorithm which is adopted by the circuit 2 is shown in Figure 3. In a first step 10 the input signal is filtered to remove high frequency noise using, for example, a 600 Hz, 32-tap FIR (Finite Impulse Response) filter. The signal is then analysed in 10 ms periods and in each such period the maximum amplitude of the signal Amax is determined (step 11) and a threshold T1 set at 0.75 Amax (step 12). In a step 13 a number of extraneous waveform peaks are removed (on the basis of their amplitude being below the dynamically determined threshold T1). It should be noted that the 10 ms period is chosen so that the amplitude threshold is updated at a rate commensurate with the likely (long term) variation in the amplitude envelope (i.e. the volume) of the voice.
The remaining waveform peaks are then parameterised into energy, shape and width (step 14) to eliminate most extraneous peaks, while further extraneous peaks are removed on the basis of their temporal location with respect to other peaks. This further removal is achieved by computing an expected time of occurrence of a peak Tex by carrying out a running two point average of pitch period duration (step 15). Peaks occurring outside a time interval from 60% Tex to 175% Tex are removed, this time interval being determined from a knowledge of known speech patterns so as to restrict the durations of potential pitch periods to those which are physically possible in human speech (step 16). The finally remaining peaks are then arranged in pairs using a Euclidean distance classifier (step 17). Distances on each axis are weighted: shape × 1.0, width × 1.0, energy × 0.2.
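A simplified sketch of this peak-picking stage is given below. It is not the patent's exact algorithm: the FIR design, the peak parameterisation and the pairing step are only hinted at, and the function names are invented for illustration; only the 10 ms windowing, the 0.75 Amax threshold and the 60%-175% gating around the expected period follow the text.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def pick_pitch_peaks(signal, fs):
    """Rough peak-picking along the lines of steps 10-16 (illustrative only)."""
    # Step 10: low-pass FIR filter (32 taps, ~600 Hz cut-off) to remove
    # high-frequency noise before peak picking.
    taps = firwin(32, 600.0, fs=fs)
    filtered = lfilter(taps, 1.0, signal)

    # Steps 11-13: per 10 ms window, keep only local maxima above 0.75 * Amax.
    win = int(0.010 * fs)
    candidates = []
    for start in range(0, len(filtered) - win, win):
        frame = filtered[start:start + win]
        threshold = 0.75 * frame.max()
        for i in range(1, len(frame) - 1):
            if frame[i] > threshold and frame[i] >= frame[i - 1] and frame[i] >= frame[i + 1]:
                candidates.append((start + i, frame[i]))

    # Steps 15-16: gate peaks by the expected period (running two-point average);
    # only inter-peak intervals between 60% and 175% of the expectation survive.
    peaks, periods = [], []
    for idx, amp in candidates:
        if not peaks:
            peaks.append((idx, amp))
            continue
        interval = idx - peaks[-1][0]
        expected = np.mean(periods[-2:]) if periods else interval
        if 0.60 * expected <= interval <= 1.75 * expected:
            peaks.append((idx, amp))
            periods.append(interval)

    # Output: pitch period durations (seconds) and peak amplitudes, i.e. the
    # pitch and amplitude contours used by the rest of the analysis.
    durations = np.diff([i for i, _ in peaks]) / fs
    amplitudes = np.array([a for _, a in peaks[1:]])
    return durations, amplitudes
```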
The output of the algorithm at this stage is a sequence of pitch period duration values and amplitude values. An example of a sequence of pitch period duration values is shown in Figure 4. These are known as the pitch and amplitude contours.
The next stage in the process is to remove the underlying trend of intonation (for pitch) and volume (for amplitude). This is carried out using the technique of non-linear smoothing which is described in more detail in Rabiner L.R., Sambur M.R. and Schmidt C.E., "Applications of a non-linear smoothing algorithm to speech processing", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol 23, pp 552-557 (1975), incorporated herein by reference.
Initially, in the non-linear smoothing algorithm, five consecutive pitch or amplitude values are required (step 18) and these are sorted in ascending order (step 19). The mid-value or median value is then chosen (step 20) and passed to a low pass filter (step 21). The sequence of median values is then low pass filtered (step 22) and then steps 18-22 are repeated on successive sets of values which may or may not overlap.
The output of this part of the algorithm (i.e. from step 22) may be considered to reflect the intonational trend (for pitch) or volume trend (for amplitude) of the incoming sequence of values, as shown for pitch in Figure 5. Thus, a subtraction of the trend from the original sequence results in a residual of irregularities or perturbations of the form shown in Figure 6.
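The trend-removal stage can be sketched as follows; this is an interpretation of steps 18-22 (a five-point running median followed by a simple low-pass smoother), not the exact Rabiner, Sambur and Schmidt smoother, and the low-pass kernel is an assumed placeholder.

```python
import numpy as np

def running_median(values, width=5):
    """Steps 18-20: median of each group of five consecutive values."""
    values = np.asarray(values, dtype=float)
    half = width // 2
    padded = np.pad(values, half, mode="edge")
    return np.array([np.sort(padded[i:i + width])[half] for i in range(len(values))])

def low_pass(values, kernel=(0.25, 0.5, 0.25)):
    """Steps 21-22: a simple FIR smoother applied to the median sequence
    (the kernel here is an illustrative choice, not the one in the reference)."""
    return np.convolve(values, kernel, mode="same")

def perturbation_residual(contour):
    """Subtract the smoothed trend (intonation or volume) from the contour,
    leaving the short-term perturbations of Figure 6."""
    trend = low_pass(running_median(contour))
    return np.asarray(contour, dtype=float) - trend
```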
The final stage in perturbation analysis is to quantify these values. This is carried out simply by taking the mean of the (absolute) perturbations, the standard deviation of the perturbations, and the number of perturbations over a set threshold. These calculations are carried out on both the amplitude and pitch contours, resulting in six measures of perturbation. In addition to these measures, the mean value and standard deviation of the pitch itself are taken. One final measure of perturbation is employed on both the pitch and amplitude contours: the directional perturbation factor. This does not employ the non-linear smoother described above. Rather, it begins by calculating the differences between successive pitch periods. Then, whenever a difference is associated with a change in direction of the contour (i.e. if the last difference was positive and the current one negative), a directional perturbation is logged. The proportion of directional perturbations is then expressed as a percentage of all adjacent period differences. This is carried out on both the amplitude and pitch contours. The final profile of the speaker based on waveform perturbation thus comprises ten features.
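Gathering the measures just described, a sketch of the ten-feature perturbation profile might look like the following. The function names, the normalisation of the directional factor and the use of three percent of the mean trend as the threshold are illustrative assumptions; the patent only gives three percent of "some trend value" as an example.

```python
import numpy as np

def directional_perturbation_factor(contour):
    """Percentage of adjacent differences at which the contour changes direction
    (the sign of successive differences flips)."""
    diffs = np.diff(contour)
    changes = np.sum(np.sign(diffs[1:]) * np.sign(diffs[:-1]) < 0)
    return 100.0 * changes / max(len(diffs), 1)

def perturbation_profile(pitch, amplitude, pitch_resid, amp_resid, pitch_trend, amp_trend):
    """Ten-feature speaker profile from the pitch/amplitude contours and their
    perturbation residuals (trend already removed, cf. Figure 6)."""
    pitch_thr = 0.03 * np.abs(pitch_trend).mean()   # e.g. 3% of the pitch trend
    amp_thr = 0.03 * np.abs(amp_trend).mean()       # e.g. 3% of the amplitude trend
    return np.array([
        np.mean(np.abs(pitch_resid)),                # a. mean abs pitch perturbation
        np.mean(np.abs(amp_resid)),                  # b. mean abs amplitude perturbation
        np.std(pitch_resid),                         # c. std of pitch perturbations
        np.std(amp_resid),                           # d. std of amplitude perturbations
        np.sum(np.abs(pitch_resid) > pitch_thr),     # e. pitch perturbations over threshold
        np.sum(np.abs(amp_resid) > amp_thr),         # f. amplitude perturbations over threshold
        directional_perturbation_factor(pitch),      # g. directional factor for pitch
        directional_perturbation_factor(amplitude),  # h. directional factor for amplitude
        np.mean(pitch),                              # i. mean absolute pitch
        np.std(pitch),                               # j. std of absolute pitch
    ])
```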
The second processor performs a linear prediction based on the fact that the vocal tract system may be approximated by a simple linear short-term invariant filter. A linear predictive model is an adaptive filter which adjusts its own coefficients such that it models the incoming signal in an optimal fashion. It is adaptive on a short term basis, commensurate with the rate of change of the vocal tract articulators. A typical update rate of 10 ms is employed. The resulting filter coefficients describe the acoustical characteristics (and thus to some extent the shape of the vocal tract). The more coefficients employed, the better the model. In one example, 24 coefficients are employed. Although these coefficients are themselves useful for speaker recognition, it has been found that the cepstral coefficients are yet more efficient. Cepstral coefficients may be obtained by mathematically manipulating the linear predictive filter coefficients. This processor requires that an utterance of a given text is made on each occasion ("enrol" and "bid"). Having computed the cepstral coefficients for each 10 ms frame of the utterance, the values are averaged over the entire duration of the utterance to yield one set of mean coefficients. The technique of cepstral analysis has previously been applied to the problem of speaker verification as the only processor in a system and is more fully described in S. Furui, "Cepstral analysis technique for automatic speaker verification", IEEE Trans. Acoust. Speech and Sig. Proc. ASSP-29, pp. 254-271 (1981), incorporated herein by reference.
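For illustration only, the sketch below shows one conventional way to obtain averaged LPC-derived cepstral coefficients from 10 ms frames (autocorrelation LPC via Levinson-Durbin, then the standard LPC-to-cepstrum recursion). The patent does not spell out this implementation, and details such as windowing, pre-emphasis and gain handling are omitted.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin.
    Returns a_1..a_order for the convention A(z) = 1 + sum_k a_k z^-k."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0] + 1e-12        # small epsilon guards silent frames
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        a[m] = k
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        err *= (1.0 - k * k)
    return a[1:]

def lpc_to_cepstrum(a, n_ceps):
    """Standard recursion from LPC coefficients to cepstral coefficients."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        s = sum((k / n) * c[k] * a[n - k - 1] for k in range(max(1, n - p), n))
        c[n] = (-a[n - 1] if n <= p else 0.0) - s
    return c[1:]

def mean_cepstra(signal, fs, order=24, n_ceps=24):
    """Average the per-frame cepstra over the whole utterance (10 ms frames)."""
    win = int(0.010 * fs)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win, win)]
    return np.mean([lpc_to_cepstrum(lpc_coefficients(f, order), n_ceps) for f in frames], axis=0)
```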
The matcher circuit 5 appends the 10 values from the waveform perturbation processor to the 24 average cepstral coefficient values from the analysis circuit 4 to generate a 34-dimensional vector for the utterance. In the case of an enrolment, this 34-dimensional vector is stored in the reference store 6. Typically, during enrolment a number of such vectors will be generated which will form a cluster in feature space, allowing each speaker to be represented by a single "average" reference feature vector in which each of the values has been averaged.
During a bid sequence, a new 34 dimensional feature vector is generated and this is then compared by the matcher circuit 5 with the corresponding reference vector. This comparison is achieved by determining the distance in feature space between the bid and enrol vectors which in the simplest case is achieved by determining the Euclidean distance between the two vectors. This distance value is then compared with a predetermined threshold and if it is found to be less than the threshold the speaker is verified.
Figures 7-12 illustrate the principles on which a more sophisticated verification process is based. In order to understand the basic principles of the method it is first necessary to understand the origins of the false reject (Type I) and false accept (Type II) error rate curves. This will then lead directly to a definition of certainty value in the form of a certainty score and finally to the way in which the error rate curves for multiple processors may be combined. Assume a sample population of subjects and that, with reference to some vocal biometric feature vector, each individual has a reference vector against which a bid may be made. There are two classes of bid, namely: i) intra bids, which are bids made by individuals against their own references (valid bids); and ii) inter bids, which are bids made by individuals against other people's references (imposter bids).
For each bid of each type the distance between the bid and the relevant reference may be calculated. This distance may be viewed as the Euclidean distance between two vectors (i.e. the bid and the reference) in n-dimensional feature space.
Typically, for a large sample population the frequency distributions for the intra bids and the inter bids will look like the curves of Fig 7. These curves are only schematic. The error rate curves, shown in Fig 8, are derived directly from the frequency distribution curves of Fig 7.
The false reject (Type I) curve is derived by setting a sequence of increasing distance thresholds and, for each, calculating the area under the intra bid curve of Fig 7, which lies to the right of the threshold.
Similarly the false accept (Type II) curve is derived by again setting a sequence of increasing distance thresholds and, for each, calculating the area under the inter bid curve of Fig 7 which lies to the left of the threshold. There are three important points to notice: i) if the curves of Fig 7 did not overlap then a threshold could be set for which the error rate would be zero; ii) the intersection of the curves of Fig 8 corresponds to the equal error rate (EER) position (this does not necessarily correspond to the intersection point of Fig 7); iii) the error rate ordinate ranges from 0 to 100% whereas the range of values on the abscissa (distance/threshold) is dependent on the vocal biometric to which the curves correspond.
The curves of Fig 8 correspond to some particular vocal biometric processor, and in practice should be derived from a population large enough to be statistically significant.
To derive the certainty scores corresponding to that processor the following formula is used:

C(d) = (I(d) - II(d) + 100) / 2

where C(d) is the certainty and is a function of distance d according to the curves of Fig 8; I(d) is the value of the error rate taken from the Type I curve at distance d, and II(d) is the value of the error rate taken from the Type II curve at distance d.
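As an illustration of how these curves and the certainty formula could be realised from measured intra- and inter-bid distances, a minimal sketch follows; the distance grid and the interpolation are assumptions, not part of the patent.

```python
import numpy as np

def error_rate_curves(intra_distances, inter_distances, grid):
    """Type I (false reject) and Type II (false accept) rates, in percent,
    at each distance threshold on the grid (cf. Figs 7 and 8)."""
    intra = np.asarray(intra_distances)
    inter = np.asarray(inter_distances)
    type_one = np.array([100.0 * np.mean(intra > d) for d in grid])   # intra bids to the right
    type_two = np.array([100.0 * np.mean(inter <= d) for d in grid])  # inter bids to the left
    return type_one, type_two

def certainty(d, grid, type_one, type_two):
    """C(d) = (I(d) - II(d) + 100) / 2, interpolating the curves at distance d."""
    i_d = np.interp(d, grid, type_one)
    ii_d = np.interp(d, grid, type_two)
    return (i_d - ii_d + 100.0) / 2.0
```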
Certainty plotted as a function of distance for the biometric processor is shown in Fig 9.
By combining the curves of Fig 7 and Fig 9 the error rate characteristic for the processor may be plotted as a function of certainty rather than distance, as shown in Fig 10.
The Type I curve is now on the right side of the Type II curve (cf. Fig 8). What this means is that a bid giving a high certainty score will result more often in a false reject error than a false accept error. Conversely, a low certainty score will result more often in a false accept error than a false reject error.
The other crucial point to notice is that both scales of Fig 10 range from 0 to 100%. This point is crucial because it means that by "normalising" the abscissa of the error rate curve then error rate curves for different processors may now be combined.
Given two sets of frequency distribution curves and two sets of error rate curves of the types shown in Fig 7 and Fig 8, i.e. one set for each processor, then to combine them: i) derive two curves of certainty versus distance, each similar to that shown in Fig 9, i.e. one for each processor; ii) using these curves, convert the original two sets of frequency distribution curves into a single pair of curves as shown in Fig 11, in which the abscissa is now the certainty score C, combined from the individual certainties C1 and C2 by

C = W1 C1 + W2 C2, with W1 + W2 = 1,

where W1 and W2 are suitable weighting coefficients. More generally, C = Σ Wi Ci with Σ Wi = 1. iii) Fig 11 is now used to derive Fig 12 in the same way that Fig 8 was derived from Fig 7, and Fig 12 now represents the error rate characteristics in terms of certainty score (instead of distance) for the combined processors. Consequently, when two bids are made, i.e. one bid against each processor, the combined certainty score is calculated using C = W1 C1 + W2 C2 and the combined certainty is thresholded so as to make a decision. Fig 12 shows what the expected error rates will be corresponding to any particular thresholding value (T) for the combined certainty.
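A minimal sketch of the combination and decision step is shown below, assuming two processors; the example weights 0.6/0.4 and the 50% threshold are illustrative assumptions.

```python
def combined_certainty(certainties, weights):
    """C = sum_i W_i * C_i, with the weights normalised so that they sum to one."""
    total = sum(weights)
    return sum((w / total) * c for w, c in zip(weights, certainties))

# e.g. the cepstral processor weighted slightly more than the perturbation processor
accept = combined_certainty([62.0, 48.0], [0.6, 0.4]) > 50.0
```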
The principles of the certainty method described above can be implemented using the apparatus shown in Fig 13. The apparatus shown in Figure 13 comprises a central processing unit (CPU) 30 which performs the authentication process. The CPU 30 is coupled with a card reader 31 of conventional form for reading reference parameter information contained on a credit card like article (such as a smart card or a magnetic strip on such a card). The CPU 30 is also coupled with a waveform perturbation analyser 32 of the type described above in connection with circuits 2, 3 (Figure 1) and a cepstral coefficient analyser 33 similar to circuit 4 (Figure 1). In addition, the CPU 30 controls a display on a monitor 34. Once an individual has been authenticated, a signal is issued by the CPU 30 on a line 35 to the system to which access is to be given, such as a cash dispensing system or a lock which can be released to allow access to a secure area.
In one method, the CPU 30 instructs an individual to be authenticated (by displaying messages on the monitor 34) initially to utter a word or sequence of words into a microphone (not shown). The analyser 33, which is connected to the microphone, generates cepstral coefficient values which are fed to the CPU 30 which stores this information in an internal memory. In this example, the CPU 30 then instructs the individual, again by displaying a suitable message on the monitor 34, to utter another sequence of words into the same or a different microphone which generates an analogue signal which is supplied to the analyser 32. The analyser 32 in turn generates signals representing characteristic waveform perturbation parameters of the utterance. These parameters are also stored in the internal memory. The CPU 30 also reads predetermined (authentic) values relating to the cepstral coefficients and waveform perturbation parameters from a credit card inserted by the individual in the card reader 31.
The CPU 30 then calculates respective certainty values or scores using the formula set out above for each of the two processors and then sums the resultant certainty scores, after weighting, to derive a final certainty score. This is then compared with a final, predetermined threshold to determine whether or not the individual is authenticated. In an alternative approach (Figure 14), the first utterance is input (step 40), and the cepstral coefficient information is obtained and stored (step 41). Then a certainty score CS for that information is immediately generated. If the first certainty score is less than 45% then the individual is immediately rejected, while if the score is greater than 55% the individual is immediately accepted. If the certainty score lies in a predetermined range, for example 45%-55%, indicating a marginal situation, the CPU 30 then instructs the individual to utter a second sequence (step 45) so that a waveform perturbation analysis (steps 46, 47) can be performed in addition. The resulting certainty score CS' from that analysis can then be used (alone or in combination with the previous certainty score (not shown)) to authenticate the individual by comparison with a predetermined threshold (step 48).
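The staged flow of Figure 14 might be sketched as follows; the analyser interface is a placeholder callable, and only the 45%/55% limits and the ordering of the steps come from the text, while the second-stage threshold is an assumed value.

```python
def authenticate(cepstral_cs, perturbation_cs_fn, reject_below=45.0,
                 accept_above=55.0, second_threshold=50.0):
    """Two-stage decision along the lines of Figure 14 (illustrative sketch).

    cepstral_cs: certainty score from the cepstral comparison (steps 40-41).
    perturbation_cs_fn: callable that prompts the second utterance and returns
        the waveform perturbation certainty score CS' (steps 45-47).
    """
    if cepstral_cs < reject_below:
        return False                      # immediate reject
    if cepstral_cs > accept_above:
        return True                       # immediate accept
    # Marginal case (e.g. 45%-55%): request the second utterance and decide on
    # the perturbation certainty score (step 48); it could instead be combined
    # with the first score by the weighted sum described above.
    return perturbation_cs_fn() > second_threshold
```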

Claims

1. A method of recognising that samples of speech uttered at different times were uttered by the same speaker, the method comprising determining whether values related to characteristics of the two speech samples satisfy a predetermined relationship, characterized in that at least one of the characteristics is a waveform perturbation feature of each sample.
2. A method according to claim 1, wherein at least two of the characteristics are waveform perturbation features.
3. A method according to claim 2, wherein the characteristics comprise pitch and amplitude perturbation features.
4. A method according to claim 1, wherein the waveform perturbation features are chosen from:
a. the mean of the absolute perturbations in pitch,
b. the mean of the absolute perturbations in amplitude,
c. the standard deviation of the perturbations in pitch,
d. the standard deviation of the perturbations in amplitude,
e. the number of perturbations in pitch over a set threshold,
f. the number of perturbations in amplitude over a set threshold,
g. a directional perturbation factor for pitch,
h. a directional perturbation factor for amplitude,
i. mean absolute pitch, and
j. standard deviation in absolute pitch.
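Purely by way of example, the Python sketch below computes illustrative versions of several of these features from a per-cycle pitch (or amplitude) track. The exact definitions, in particular that of the directional perturbation factor, are assumed here from common usage and are not quoted from the specification.

    import statistics

    def perturbation_features(track, threshold):
        # track: successive per-cycle pitch or amplitude values
        perturbations = [b - a for a, b in zip(track, track[1:])]
        abs_perturbations = [abs(p) for p in perturbations]
        # Directional perturbation factor: fraction of successive perturbations
        # that change sign (one common definition, assumed here)
        sign_changes = sum(1 for p, q in zip(perturbations, perturbations[1:]) if p * q < 0)
        return {
            "mean_abs_perturbation": statistics.mean(abs_perturbations),            # (a)/(b)
            "std_perturbation": statistics.stdev(perturbations),                     # (c)/(d)
            "count_over_threshold": sum(p > threshold for p in abs_perturbations),   # (e)/(f)
            "directional_factor": sign_changes / max(len(perturbations) - 1, 1),     # (g)/(h)
            "mean_abs_value": statistics.mean(track),                                # (i)
            "std_abs_value": statistics.stdev(track),                                # (j)
        }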
5. A method according to claim 1, wherein each sample of speech comprises an utterance with a duration of at least three seconds.
6. A method according to claim 1, further comprising performing a first frequency domain analysis on the samples of speech uttered at different times to obtain cepstral coefficient values and determining whether the waveform perturbation and cepstral coefficient values together satisfy a predetermined relationship.
7. A method according to claim 1, in which the Euclidean distance between the values representing the two speech samples is determined and compared with a threshold to determine if the predetermined relationship is satisfied.
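For claim 7, a minimal Python sketch of the distance test is given below; the threshold value is application-dependent and is not specified here.

    import math

    def euclidean_match(bid_values, reference_values, threshold):
        # Accept if the Euclidean distance between the two feature vectors
        # is no greater than the chosen threshold
        return math.dist(bid_values, reference_values) <= threshold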
8. Apparatus for recognising that samples of speech uttered at different times were uttered by the same speaker, the apparatus comprising means for obtaining values relating to characteristics of the two samples and for determining whether the values satisfy a predetermined relationship characterized in that at least one of the characteristics is a waveform perturbation feature of each sample.
9. A method of verifying an individual, the method comprising obtaining (40, 41) from the individual a first, bid vocal feature characteristic of the individual; comparing the first, bid feature with a reference first vocal feature; generating a first certainty value (42) representing the degree of similarity between the first bid and first reference features; and, if the first certainty value lies within predetermined limits, obtaining (45,46) from the individual a second, bid vocal feature characteristic of the individual; comparing the second vocal feature with a reference second vocal feature; generating (47) a second certainty value representing the degree of similarity between the second bid and second reference vocal features; and verifying (48) the individual if at least one of the first and second certainty values satisfies a predetermined condition.
10. A method according to claim 9, wherein the predetermined limits define the range for the first certainty value of 45%-55%.
11. A method according to claim 9, wherein the predetermined condition is satisfied if the second certainty value exceeds a threshold.
12. A method according to claim 9, wherein if the first certainty value falls within a second predetermined range outside the first range, the individual is immediately verified.
13. A method according to claim 9, wherein the verifying step comprises combining the first and second certainty values to generate a final certainty value which is then compared with a predetermined threshold.
14. A method according to claim 9, wherein the first and second vocal features are of different types.
15. A method according to claim 14, wherein the vocal features are selected from cepstral coefficients obtained from one word, cepstral coefficients obtained from a number of words, and waveform perturbation values.
16. A method of verifying an individual, the method comprising obtaining from the individual a first, bid vocal feature characteristic of the individual; comparing the first, bid feature with a reference first vocal feature; generating a first certainty value representing the degree of similarity between the first bid and first reference features; obtaining from the individual a second, bid vocal feature characteristic of the individual; comparing the second vocal feature with a reference second vocal feature; generating a second certainty value representing the degree of similarity between the second bid and second reference vocal features; and verifying the individual if at least one of the first and second certainty values satisfies a predetermined condition.
17. A method according to claim 16, wherein the verifying step comprises combining the first and second certainty values to generate a final certainty value which is then compared with a predetermined threshold.
18. A method according to claim 16, wherein the first and second vocal features are of different types.
19. A method according to claim 18, wherein the vocal features are selected from cepstral coefficients obtained from one word, cepstral coefficients obtained from a number of words, and waveform perturbation values.
20. Apparatus for verifying an individual, the apparatus comprising means (32,33) for obtaining from the individual first and second bid vocal features characteristic of the individual; comparison means (30) for comparing each of the first and second vocal features with corresponding reference vocal features and for generating respective first and second certainty values representing the degree of similarity between the compared features; and verifying means (30) for verifying the individual if at least one of the first and second certainty values satisfies a predetermined condition.
21. Apparatus according to claim 20, wherein the comparison means and verifying means are provided by a suitably programmed computer.
PCT/GB1990/000068 1989-01-17 1990-01-16 Speaker recognition WO1990008379A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
GB898900931A GB8900931D0 (en) 1989-01-17 1989-01-17 Speaker verification
GB8900931.0 1989-01-17
GB8926988.0 1989-11-29
GB898926989A GB8926989D0 (en) 1989-11-29 1989-11-29 Verification method and apparatus
GB898926988A GB8926988D0 (en) 1989-11-29 1989-11-29 Speaker verification
GB8926989.8 1989-11-29

Publications (1)

Publication Number Publication Date
WO1990008379A1 true WO1990008379A1 (en) 1990-07-26

Family

ID=27264273

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1990/000068 WO1990008379A1 (en) 1989-01-17 1990-01-16 Speaker recognition

Country Status (1)

Country Link
WO (1) WO1990008379A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE2431458A1 (en) * 1974-07-01 1976-02-05 Philips Patentverwaltung Identifying speaker from sound of voice - uses labelling system and recording system correlating labels with known speakers
WO1987000332A1 (en) * 1985-07-01 1987-01-15 Ecco Industries, Inc. Speaker verification system

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
1979 Carnahan Conference on Crime Countermeasures, 16-18 May 1979, Lexington, Kentucky, University of Kentucky, (Lexington, US), U. HOEFKER et al.: "A New System for Authentication of Voice", pages 47-52 *
1980 Carnahan Conference on Crime Countermeasures, 14-16 May 1980, Lexington, Kentucky, University of Kentucky, (Lexington, US), C.E. CHAFEI: "A Real Time Automatic Speaker Recognition System on Mini-Computer", pages 53-56 *
IEEE ASSP Magazine, Volume 3, No. 4, October 1986, IEEE, (New York, US), D. O'SHAUGHNESSY: "Speaker Recognition", pages 4-17 *
IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume ASSP-27, No. 1, February 1979, IEEE, (New York, US), J.D. MARKEL et al.: "Text-Independent Speaker Recognition from a Large Linguistically Unconstrained Time-Spaced Data Base", pages 74-82 *
IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume ASSP-27, No. 3, June 1979, IEEE, (New York, US), H.M. DANTE et al.: "Automatic Speaker Identification for a Large Population", pages 255-263 *
Proceedings of the IEEE, Volume 64, No. 4, April 1976, (New York, US), A.E. ROSENBERG: "Automatic Speaker Verification: A Review", pages 475-487 *
The Journal of the Acoustical Society of America, Volume 46, No. 4, Part 2, 1969, (New York, US), J.E. LUCK: "Automatic Speaker Verification using Cepstral Measurements", pages 1026-1032 *
The Journal of the Acoustical Society of America, Volume 52, No. 6, Part 2, December 1972, (New York, US), B.S. ATAL: "Automatic Speaker Recognition Based on Pitch Contours", pages 1687-1697 *
The Official Proceedings of Speech Tech '86, Voice Input/Output Applications Show and Conference, 28-30 April 1986, New York, Volume 1, No. 3, Media Dimensions, Inc., (New York, NY, US), M.G.K. YANG: "A Speaker Identification System for Field use", pages 277-280 *

Similar Documents

Publication Publication Date Title
CN108281146B (en) Short voice speaker identification method and device
JP4802135B2 (en) Speaker authentication registration and confirmation method and apparatus
Tiwari MFCC and its applications in speaker recognition
JP3532346B2 (en) Speaker Verification Method and Apparatus by Mixture Decomposition Identification
US6519561B1 (en) Model adaptation of neural tree networks and other fused models for speaker verification
EP1399915B1 (en) Speaker verification
EP0891618B1 (en) Speech processing
US5522012A (en) Speaker identification and verification system
US7502736B2 (en) Voice registration method and system, and voice recognition method and system based on voice registration method and system
US20070129941A1 (en) Preprocessing system and method for reducing FRR in speaking recognition
US20030009333A1 (en) Voice print system and method
EP1159737B1 (en) Speaker recognition
JPH11507443A (en) Speaker identification system
AU2002311452A1 (en) Speaker recognition system
JPH02238495A (en) Time series signal recognizing device
US20070198262A1 (en) Topological voiceprints for speaker identification
JPH1083194A (en) Two-stage group selection method for speaker collation system
EP0424071A2 (en) Speaker recognition
WO2002029785A1 (en) Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm)
WO1990008379A1 (en) Speaker recognition
KR100917419B1 (en) Speaker recognition systems
Upadhyay et al. Analysis of different classifier using feature extraction in speaker identification and verification under adverse acoustic condition for different scenario
Wadehra et al. Comparative Analysis Of Different Speaker Recognition Algorithms
Djeghader et al. Hybridization process for text-independent speaker identification based on vector quantization model
Saswati et al. Text-constrained speaker verification using fuzzy C means vector quantization

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB IT LU NL SE