EP0215065A1 - Individual recognition by voice analysis - Google Patents

Individual recognition by voice analysis

Info

Publication number
EP0215065A1
Authority
EP
European Patent Office
Prior art keywords
acoustic feature
feature signals
utterance
identified speaker
representative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP19860901680
Other languages
German (de)
English (en)
French (fr)
Inventor
Lawrence Richard Rabiner
Aaron Edward Rosenberg
Frank Kao-Ping Soong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
American Telephone and Telegraph Co Inc
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by American Telephone and Telegraph Co Inc, AT&T Corp filed Critical American Telephone and Telegraph Co Inc
Publication of EP0215065A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • This invention relates to voice analysis and, more particularly, to recognition of individuals from their speech patterns.
  • An article in the Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 555-558, discloses a speaker recognition technique using a statistical model of a speaker's vector quantized speech.
  • Each speaker model includes speaker and imposter mean and standard deviation values for selected speech elements which are obtained from a frequency of occurrence analysis of the speech elements.
  • the unknown talker's speech pattern is compared to the speaker model and a statistical measure of the match is generated based on the distribution of distances for the compared speech elements.
  • the statistical measures are then processed according to a likelihood formulation derived from the speaker and imposter mean and standard deviation values.
  • the use of statistical speaker models and frequency of occurrence analysis causes the speaker recognition arrangement to be complex and its accuracy is dependent on the statistical measures used.
  • the present invention provides a simpler and more accurate speaker recognition arrangement.
  • a set of acoustic feature signals is generated that is characteristic of an identified talker from a plurality of his or her speech patterns.
  • The entire set of characteristic feature signals is compared to each speech feature signal of an unknown talker and the closest matching characteristic signal is selected.
  • the identity of the unknown talker is determined from the similarities of the closest matching feature signals to the feature signals of the unknown talker.
  • FIG. 1 depicts a general flow chart of a speaker identification arrangement illustrative of the invention;
  • FIG. 2 depicts a detailed flow chart of the speaker identification arrangement of FIG. 1;
  • FIG. 3 depicts a block diagram of a speaker identification arrangement illustrative of the invention;
  • FIG. 4 is a detailed flow chart illustrating the operation of the circuit of FIG. 3 as a speaker verification system;
  • FIGS. 5 and 6 are flow charts illustrating the operation of the circuit of FIG. 3 where the unknown talker utters a randomly selected phrase for purposes of verification.
  • FIG. 7 is a flow chart illustrating details of the operation of the flow chart of FIG. 5.
  • General Description: It is well known in the art that a set of short term acoustic feature vectors of a speaker can be used to represent the acoustic, phonological, and physiological characteristics of the speaker if the speech patterns from which the feature vectors are obtained contain sufficient variations. A direct representation by feature vectors, however, is not practical for large numbers of feature vectors, since memory requirements for storage and processing complexity in recognition are prohibitive.
  • the original set of feature vectors may be compressed into a smaller set of representative feature vectors which smaller set forms a vector quantization codebook for the speaker.
  • Each subspace Si forms a nonoverlapping region and every feature vector inside Si is represented by a corresponding centroid feature vector bi of Si.
  • the partitioning is performed so that the average distortion

    $$D = \frac{1}{N} \sum_{n=1}^{N} \min_{1 \le i \le M} d(x_n, b_i) \qquad (1)$$

    is minimized, where $d(\cdot,\cdot)$ is a distortion measure between feature vectors and each codebook entry $b_i$ is the centroid of its cell,

    $$b_i = \arg\min_{b} \sum_{x_n \in S_i} d(x_n, b). \qquad (2)$$

  • Since the feature vectors are LPC (linear prediction coefficient) vectors, a suitable distortion measure between an input vector $a$ and a codebook vector $b$ is the LPC likelihood ratio distortion

    $$d(a, b) = \frac{b^{T} R_a b}{a^{T} R_a a} - 1, \qquad (3)$$

    where $R_a$ is the autocorrelation matrix of the speech input associated with vector $a$.
  • the distortion measure of Equation (3) may be used to generate speaker-based VQ codebooks of different sizes. Such codebooks of quantized feature vectors characterize a particular speaker and may be used as reference features to which the feature vectors of an unknown speaker are compared for speaker verification and speaker identification.
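To make the codebook construction concrete, here is a minimal Python sketch of Lloyd-style iteration under the distortion of Equation (3). All names are invented for illustration, and the Euclidean mean is used as a convenient stand-in for the true centroid of Equation (2) under the likelihood ratio distortion; the patent itself prescribes no particular implementation.

```python
import numpy as np

def lpc_lr_distortion(a, b, Ra):
    """LPC likelihood ratio distortion of Equation (3):
    d(a, b) = (b' Ra b) / (a' Ra a) - 1, where Ra is the
    autocorrelation matrix of the frame that produced vector a."""
    return float(b @ Ra @ b) / float(a @ Ra @ a) - 1.0

def train_codebook(vectors, autocorrs, M, iters=20, seed=0):
    """Compress a speaker's training vectors into an M-entry VQ
    codebook: alternately partition the vectors into cells S_i around
    the current centroids b_i and re-estimate the centroids, reducing
    the average distortion D of Equation (1)."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=M, replace=False)]
    for _ in range(iters):
        # Partition step: nearest centroid index for every training vector.
        labels = np.array([
            min(range(M), key=lambda i: lpc_lr_distortion(a, codebook[i], Ra))
            for a, Ra in zip(vectors, autocorrs)
        ])
        # Update step: new centroid of each nonempty cell S_i
        # (Euclidean mean as a stand-in for the true centroid).
        for i in range(M):
            cell = labels == i
            if cell.any():
                codebook[i] = vectors[cell].mean(axis=0)
    return codebook
```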
  • FIG. 1 is a flow chart that illustrates the general method of talker identification illustrative of the invention.
  • the arrangement of FIG. 1 permits the identification of an unknown person by comparing the acoustic feature signals of that person's utterance with stored codebooks of acoustic feature signals corresponding to previously identified individuals.
  • an utterance of an unknown talker is received and partitioned into a sequence of time frames.
  • the partitioned speech pattern is analyzed to produce a speech feature signal for each successive frame as per step 101.
  • All of the codebook feature signals for the current reference talker are compared to the current frame feature signal of the unknown talker and the closest codebook feature signal for the current reference talker is selected (step 105).
  • a signal representative of the similarity between the selected closest corresponding reference talker feature signal and the current frame feature signal of the unknown talker as well as a cumulative similarity signal over the frames of the unknown utterance for the current reference talker are produced in step 110.
  • steps 105 through 110 are iterated for the set of reference talkers so that a frame similarity signal and a cumulative similarity signal are formed from the codebook set of feature signals of each reference talker for the current unknown utterance frame.
  • the unknown talker's speech pattern is tested to determine whether the last speech frame has occurred (step 120).
  • if the last speech frame has not occurred, step 101 is reentered via step 122 for the generation of the similarity signals in the loop including steps 101, 105, 110, 115 and 118.
  • when the last speech frame has occurred, the minimum cumulative distance signal is selected in step 125 so that the closest corresponding reference talker is determined.
  • the closest corresponding talker is then identified (step 130) from the selected minimum cumulative correspondence of step 125.
  • the identification is made by comparing each acoustic feature of the unknown talker's utterance with the acoustic feature codebook of each reference talker. The best matching codebook feature is determined, and its similarity to the unknown talker's frame feature signal is measured. The similarity measures for the frames of the unknown talker are combined to select his or her identity or to reject the unknown talker if the utterance features are not sufficiently similar to any of the reference features.
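Reduced to code, this decision logic is a pair of nested loops. The sketch below (hypothetical names, reusing lpc_lr_distortion from the earlier sketch) compares the cumulative distance of the best-matching reference talker to a rejection threshold, as step 272 of FIG. 2 does.

```python
def identify(frames, codebooks, reject_threshold):
    """Identify an unknown talker (FIGS. 1 and 2). `frames` is a list
    of (a, Ra) pairs for the unknown utterance; `codebooks` maps each
    reference talker's identity to an array of codebook vectors."""
    d_acc = {k: 0.0 for k in codebooks}
    for a, Ra in frames:
        for k, book in codebooks.items():
            # Closest codebook feature signal for talker k (step 240),
            # accumulated over the utterance frames (step 250).
            d_acc[k] += min(lpc_lr_distortion(a, b, Ra) for b in book)
    best = min(d_acc, key=d_acc.get)
    # Accept the closest reference talker only if the match is good
    # enough (step 272); otherwise reject the unknown talker.
    return best if d_acc[best] <= reject_threshold else None
```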
  • FIG. 3 shows a block diagram of a speaker identification arrangement utilizing the operations of the flow chart of FIG. 2 to provide identification of an unknown talker as one of a set of reference talkers for whom reference acoustic feature codebooks have been stored.
  • an unidentified person's speech pattern is applied to electroacoustic transducer 301 and an electrical signal representative thereof is supplied to speech feature signal generator 305.
  • the speech feature signal generator is operative to analyze the speech signal and to form a time frame sequence of acoustic feature signals corresponding thereto.
  • Generator 305 may, for example, comprise any of the linear prediction feature signal generators well known in the art.
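As one concrete possibility (the patent only requires some linear prediction analyzer), generator 305 could perform autocorrelation-method LPC analysis on fixed-length frames. The sketch below, with illustrative frame length, order, and window, returns for each frame the coefficient vector a and the autocorrelation matrix R_a consumed by the likelihood ratio distortion of Equation (3).

```python
import numpy as np

def lpc_features(x, fs=8000, frame_ms=45, order=8):
    """Per-frame LPC analysis of the kind generator 305 might perform.
    Returns one (a, Ra) pair per frame, where a = [1, a1, ..., ap] is
    the prediction-error filter found by Levinson-Durbin recursion and
    Ra is the Toeplitz matrix of autocorrelation lags 0..p."""
    n = int(fs * frame_ms / 1000)
    feats = []
    for start in range(0, len(x) - n + 1, n):
        frame = x[start:start + n] * np.hamming(n)
        full = np.correlate(frame, frame, mode="full")
        r = full[n - 1:n + order]            # autocorrelation lags 0..order
        a = np.zeros(order + 1)
        a[0], err = 1.0, r[0] + 1e-12        # guard against silent frames
        for i in range(1, order + 1):        # Levinson-Durbin recursion
            k = -(r[i] + a[1:i] @ r[1:i][::-1]) / err
            a[1:i] = a[1:i] + k * a[i - 1:0:-1]
            a[i] = k
            err *= 1.0 - k * k
        Ra = np.array([[r[abs(i - j)] for j in range(order + 1)]
                       for i in range(order + 1)])
        feats.append((a, Ra))
    return feats
```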
  • Reference talker feature signal store 335 contains a plurality of reference templates.
  • Each template includes vector quantized feature signals obtained by means of Equation (2).
  • These feature signals are derived from speech patterns of a predetermined reference talker and correspond to the entire range of his acoustic features.
  • these feature signals are not restricted to any particular speech pattern so that recognition of an unknown talker may be independent of the utterance used for identification.
  • Store 335 may be one of the well-known read-only memories and stores codebooks for a plurality of persons to which an unknown speaker may be compared.
  • Input speech store 315 is a random access memory well known in the art adapted to receive and store the acoustic feature signals produced in feature signal generator 305. Similarity signal store also comprises a random access memory that stores the similarity signals produced in signal processor 345 responsive to the unknown talker's acoustic feature signals from input speech feature signal store 315 and the reference talker's codebook feature signals from reference talker feature signal store 335.
  • Signal processor 345 is a microprocessor arrangement well known in the art, such as the M68000 microprocessor, operating under control of the permanently stored instructions of program instruction store 320 to perform the speaker recognition functions. In the speaker identification arrangement, store 320 contains the instructions shown in general form in the flow chart of FIG. 2. The circuit arrangement of FIG. 3 may comprise, for example, the VME-SBC single board computer MK75601, the VME-DRAM256 dynamic RAM memory card MK75701, and the VME-SIO serial I/O board made by MOSTEK Corporation, Carrollton, Texas, with appropriate power supply and mounting arrangements.
  • the flow chart of FIG. 2 may be utilized to identify a person by his utterance of a phrase that may be preselected or may be arbitrary. If his speech pattern matches that of one of the persons for whom codebooks have been stored, he can be identified. Assume for purposes of illustration that an unknown talker X is to be identified. The speech patterns of X have previously been analyzed and an acoustic feature codebook for X has been included in reference store 335 along with the codebooks of other authorized persons.
  • the unknown talker's speech pattern frame index I and the stored reference talkers' codebook index J are then set to zero in signal processor 345 (steps 201 and 205).
  • Processor 345 also resets the cumulative similarity signals d_acc(K) to zero for all reference talkers in step 210.
  • the unknown talker's input utterance is analyzed in speech feature signal generator 305 of FIG. 3 to produce a time frame sequence of acoustic feature signals which are transferred to input speech store 315 via interface 310 and bus 350 as per step 215.
  • the talker identification may be performed during the utterance so that the acoustic feature signals may be transferred one frame at a time for processing.
  • Step 220 is entered from step 215 and the utterance signal in feature signal generator 305 is tested to determine if a speech signal is present.
  • Speech signal detection can be done using known techniques, e.g., in generator 305 by the energy analysis technique disclosed in U. S. Patent 3,909,532.
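A much-simplified stand-in for that test (the referenced technique is considerably more elaborate, with adaptive thresholds and endpoint timing) might be a frame-energy comparison:

```python
import numpy as np

def speech_present(frame, noise_energy, ratio=4.0):
    """Crude energy test for step 220: declare speech when the frame
    energy exceeds a multiple of the estimated background noise
    energy. `ratio` is an illustrative constant, not a value taken
    from U.S. Patent 3,909,532."""
    energy = float(np.mean(np.asarray(frame, dtype=float) ** 2))
    return energy > ratio * noise_energy
```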
  • the loop from step 225 to step 260 is iterated to compare the sequence of unknown talker acoustic feature signals with the codebooks of the reference talkers.
  • in steps 225 and 230, the unknown talker's frame index I is incremented and the acoustic feature signals a(I) for the current frame are supplied to signal processor 345 from store 315 via bus 350.
  • Reference talker index J is incremented to obtain the next reference talker codebook in store 335 (step 235), and the reference talker codebook feature signals are compared to the unknown talker's a(I) feature signal in processor 345 to select the closest corresponding codebook J feature signal (step 240).
  • a signal representative of the distance between the selected codebook feature signal and the a(I) feature signal is formed in processor 345 in accordance with step 245, and a cumulative distance signal d_acc(J) for reference talker J is generated which is a measure of the similarity of the unknown talker's utterance and the reference talker's speech characteristics up to the current utterance frame I (step 250).
  • Step 255 is entered after the cumulative distance signal is formed for the current reference talker J and control is transferred to step 235 to access the next reference talker codebook until the last reference talker N has been processed.
  • when reference talker index J exceeds N, cumulative distance signals have been produced for all reference talkers in the current unknown talker's utterance frame I.
  • Step 215 is reentered from step 260 and the next portion of the unknown talker's utterance is input.
  • the loops from steps 215 through 255 and 260 are iterated once speech has started until no more speech is present at microphone 301. At this point in the operation of the flow chart of FIG. 2, a cumulative distance signal has been formed for each reference talker and the last utterance frame of the unknown talker has been processed.
  • Step 270 is entered.
  • the minimum cumulative distance signal is selected in processor 345 and the reference talker J* having the minimum cumulative distance signal is identified.
  • the selected cumulative distance signal is compared to a predetermined threshold in step 272. If the selected cumulative distance signal exceeds the threshold, the identity is rejected and a rejection indicative signal is sent to utilization device 330 via bus 350 and interface 325. Otherwise, the reference talker identification signal J* is accepted and supplied to utilization device 330 (step 275).
  • the unknown talker is identified and given access to the computer system. His identification may also be recorded so that the computer session may be charged to his account.
  • the flow chart of FIG. 4 illustrates the operation of the circuit of FIG. 3 as a speaker verification system.
  • the unknown talker inputs an identity signal J at keying device 303 in FIG. 3 (step 401).
  • Processor 345 addresses the feature signals of codebook J in store 335 responsive to identity signal J.
  • the unknown talker's utterance frame index is set to zero (step 410) and the cumulative distance signal d_acc(I,J) is also set to zero (step 415).
  • the unknown talker's utterance is analyzed in feature signal generator 305 to produce frame acoustic feature signals which are placed in store 315.
  • once speech is detected (step 425), the loop from step 430 to step 450 is iterated so that cumulative distance signals are formed for the sequence of utterance speech frames I.
  • the unknown talker frame index is incremented (step 430).
  • the acoustic feature signal a(I) for the frame is transferred from store 315 to processor 345 (step 440).
  • the closest corresponding feature signal of codebook J is determined and a distance signal d(I,J) is formed (step 445).
  • the cumulative distance signal for the current utterance frame I is produced according to step 450, and the next frame portion of the utterance is analyzed as per step 420.
  • when speech is no longer present, control is passed to step 455 in which a signal corresponding to the average cumulative distance signal is generated by dividing the cumulative distance signal of the last utterance frame by I.
  • the average cumulative distance signal is then compared to a threshold distance corresponding to an acceptable similarity between the unknown talker's utterance characteristics and the characteristics of the asserted identity. If the average cumulative distance signal exceeds the threshold, the identity is rejected (step 475) and step 480 is entered to await another keyed identity signal.
  • the identity is accepted (step 465), the threshold is adaptively altered as is well known in the art (step 470) and wait step 480 is entered.
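In code, the verification decision of FIG. 4 scores the utterance against the single claimed codebook and thresholds the per-frame average distance. The sketch below reuses lpc_lr_distortion; the adaptive-threshold rule is one plausible convention, since the text above does not specify how step 470 alters the threshold.

```python
def verify(frames, codebook, threshold, adapt=0.05):
    """Verify an asserted identity (FIG. 4): accumulate the distance
    from each utterance frame to the closest entry of the claimed
    talker's codebook, average over the frames (step 455), and accept
    if the average does not exceed the threshold (step 465); otherwise
    reject (step 475)."""
    d_acc = sum(min(lpc_lr_distortion(a, b, Ra) for b in codebook)
                for a, Ra in frames)
    avg = d_acc / len(frames)
    accepted = avg <= threshold
    if accepted:
        # Step 470: adaptively alter the threshold; nudging it toward
        # the accepted score is an assumed rule, not the patent's.
        threshold += adapt * (avg - threshold)
    return accepted, threshold
```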
  • the arrangements thus far described permit recognition of an unknown person from an arbitrary utterance. In telephone credit applications, it is desirable to obtain recognition of the caller with a relatively relaxed identity acceptance standard. Where a stricter standard of identity acceptance is needed, the individual utterance may be predetermined and the reference talker codebooks arranged so that only selected portions of the codebook are used for talker recognition.
  • the flow chart of FIG. 5 shows a method in which the phrase to be spoken by the unknown talker is randomly selected and indicated to him. The talker utters the indicated phrase which, for example, may be a sequence of digits. The utterance acoustic feature signals are then compared to reference talker codebook portions containing the acoustic features characteristic of the particular phrase. Since the phrase is randomly selected, the particular phrase is not known in advance and the security level is substantially improved.
  • each successive acoustic feature of the unknown talker's utterance is compared to the selected 24 reference feature signals of the reference speaker.
  • the best matching reference feature signal is chosen and a signal representative of the distance between the unknown talker's feature signal and the best matching reference speaker's feature signal is produced.
  • Identity is accepted or rejected responsive to the similarity signals for the unknown talker's utterance feature signal sequence.
  • the circuit of FIG. 3 may be utilized in performing the recognition process. Use is made of keyboard-display 360 to indicate the randomly selected digit sequence in response to a keyed input of an asserted identity. Referring to FIGS. 3 and 5, the person whose identity is to be verified enters his asserted identity q into keyboard and display device 360 as per step 501 in FIG. 5.
  • the asserted identity signal is transferred to signal processor 345 under control of program store 320 via interface 310 and bus 350. Responsive to the identity signal q, processor 345 is operative to generate a random sequence of three digits D1, D2 and D3, which digit sequence is transferred to display 360 (step 505).
  • unknown talker frame index I and cumulative distance signal D are set to zero in steps 510 and 515.
  • the portion of the reference speaker's codebook in store 335 representative of the selected digit sequence is produced in processor 345 (step 520).
  • the digit sequence feature signal generation of step 520 is shown in greater detail in the flow chart of FIG. 7.
  • the feature select signals L(m) are set to a value LPN corresponding to the largest possible number in the processor of FIG. 3.
  • Feature index m is set to one (step 715) and the feature select signals are formed for the current digit. Index m is incremented in step 725, and the loop from step 710 to step 730 is iterated until the eighth digit feature signal is selected for the reference speaker. Digit sequence index j is then incremented so that the feature signals for the next randomly selected digit are obtained. After the feature select signals L(m) for the third randomly selected digit are generated, the unknown talker's utterance is input in step 525 of FIG. 5. A rough analogue of this selection is sketched below.
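On one reading of FIG. 7 (the description above is fragmentary), the select signals L(m), initialized to the largest possible number LPN, act as running minima that identify the eight codebook entries best matching each displayed digit. A hedged sketch of that interpretation, reusing lpc_lr_distortion:

```python
def select_digit_features(codebook, digit_frames, per_digit=8):
    """Rough analogue of FIG. 7: keep the `per_digit` codebook entries
    of the reference speaker that best match reference frames (a, Ra)
    of one spoken digit. Each L[m] starts at LPN and is driven down as
    closer matches to entry m are found."""
    LPN = float("inf")
    L = [LPN] * len(codebook)
    for m, b in enumerate(codebook):
        for a, Ra in digit_frames:
            L[m] = min(L[m], lpc_lr_distortion(a, b, Ra))
    keep = sorted(range(len(codebook)), key=L.__getitem__)[:per_digit]
    return [codebook[m] for m in keep]
```

Verification then runs against the 24 entries selected for the three displayed digits, exactly as the following steps describe.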
  • until speech is detected, the loop of decision steps 525 and 530 is traversed.
  • the frame index for the unknown talker's speech pattern is incremented (step 540), the acoustic feature signals for the frame are generated in feature signal generator 305 (step 545), and the signals are stored in speech signal store 315.
  • the codebook for reference speaker q in store 335 is then addressed and a similarity signal is generated for each feature signal entry therein according to the distance measure (step 550).
  • the cumulative distance signal for the asserted identity up to the current frame I is produced in step 560 by adding the current frame similarity signal to the cumulative distance signal for the preceding frames. Step 525 is then reentered for the next utterance frame of the unknown talker.
  • the loop from step 535 to step 565 is iterated until speech is no longer present (step 565) once speech has begun (step 535).
  • a signal corresponding to the average frame distance is formed in step 601 and this average distance signal is compared to a predetermined acceptance threshold (step 605). If smaller than the threshold, the asserted identity is accepted (step 610) and an access permission signal is sent from processor 345 to utilization device 330 via bus 350 and interface 325. Otherwise, the identity is rejected and access is denied (step 615). In either case, step 620 is entered and the next asserted identity signal is awaited. In this manner, the asserted identity is evaluated on the basis of a randomly selected digit sequence, so that a stricter identity acceptance standard is obtained.

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Golf Clubs (AREA)
  • Traffic Control Systems (AREA)
  • Collating Specific Patterns (AREA)
  • Burglar Alarm Systems (AREA)
EP19860901680 1985-03-21 1986-02-17 Individual recognition by voice analysis Ceased EP0215065A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US71452485A 1985-03-21 1985-03-21
US714524 1985-03-21

Publications (1)

Publication Number Publication Date
EP0215065A1 true EP0215065A1 (en) 1987-03-25

Family

ID=24870377

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19860901680 Ceased EP0215065A1 (en) 1985-03-21 1986-02-17 Individual recognition by voice analysis

Country Status (6)

Country Link
EP (1) EP0215065A1 (es)
JP (1) JPS62502571A (es)
AU (1) AU580659B2 (es)
CA (1) CA1252567A (es)
ES (1) ES8708266A1 (es)
WO (1) WO1986005618A1 (es)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2709386B2 (ja) * 1987-06-24 1998-02-04 ATR Interpreting Telephony Research Laboratories Spectrogram normalization method
IT1229782B (it) * 1989-05-22 1991-09-11 Face Standard Ind Method and apparatus for recognizing unknown spoken words by extracting parameters and comparing them with reference words
AU670379B2 (en) * 1993-08-10 1996-07-11 International Standard Electric Corp. System and method for passive voice verification in a telephone network
DE4424735C2 (de) * 1994-07-13 1996-05-30 Siemens Ag Anti-theft system
AUPM983094A0 (en) * 1994-12-02 1995-01-05 Australian National University, The Method for forming a cohort for use in identification of an individual
US5835894A (en) * 1995-01-19 1998-11-10 Ann Adcock Corporation Speaker and command verification method
US6081660A (en) * 1995-12-01 2000-06-27 The Australian National University Method for forming a cohort for use in identification of an individual
DE19630109A1 (de) * 1996-07-25 1998-01-29 Siemens Ag Method for speaker verification by a computer using at least one speech signal spoken by a speaker
CN102496366B (zh) * 2011-12-20 2014-04-09 University of Shanghai for Science and Technology Text-independent speaker recognition method
US9282096B2 (en) 2013-08-31 2016-03-08 Steven Goldstein Methods and systems for voice authentication service leveraging networking
US10405163B2 (en) 2013-10-06 2019-09-03 Staton Techiya, Llc Methods and systems for establishing and maintaining presence information of neighboring bluetooth devices

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO8605618A1 *

Also Published As

Publication number Publication date
CA1252567A (en) 1989-04-11
AU580659B2 (en) 1989-01-27
ES8708266A1 (es) 1987-10-16
WO1986005618A1 (en) 1986-09-25
AU5456286A (en) 1986-10-13
ES553204A0 (es) 1987-10-16
JPS62502571A (ja) 1987-10-01

Similar Documents

Publication Publication Date Title
Tiwari MFCC and its applications in speaker recognition
KR0139949B1 Voice verification circuit for confirming the identity of an unknown person
EP0891618B1 (en) Speech processing
US6519565B1 (en) Method of comparing utterances for security control
US6580814B1 (en) System and method for compressing biometric models
US5339385A (en) Speaker verifier using nearest-neighbor distance measure
US5687287A (en) Speaker verification method and apparatus using mixture decomposition discrimination
EP0121248B1 (en) Speaker verification system and process
CN101154380B Method and apparatus for registration and verification of speaker authentication
JPS6217240B2 (es)
US4665548A (en) Speech analysis syllabic segmenter
EP0535380B1 (en) Speech coding apparatus
JPS5941600B2 Method and apparatus for confirming the identity of a speaker
AU580659B2 (en) Individual recognition by voice analysis
Campbell Speaker recognition
Kekre et al. Speaker identification using row mean vector of spectrogram
US4790017A (en) Speech processing feature generation arrangement
Trysnyuk et al. A method for user authenticating to critical infrastructure objects based on voice message identification
Naik et al. Evaluation of a high performance speaker verification system for access Control
Narendra et al. Classification of Pitch Disguise Level with Artificial Neural Networks
Ibiyemi et al. Face and speech recognition fusion in personal identification
KR101838947B1 (ko) 불완전한 발음에 적용 가능한 화자인증방법 및 장치
CN113870875A Timbre feature extraction method and apparatus, computer device, and storage medium
JPS63157199A Speaker verification device
Jin et al. A high-performance text-independent speaker identification system based on BCDM

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): BE DE FR GB IT NL SE

17P Request for examination filed

Effective date: 19870302

17Q First examination report despatched

Effective date: 19881025

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 19900125

RIN1 Information on inventor provided before grant (corrected)

Inventor name: RABINER, LAWRENCE, RICHARD

Inventor name: ROSENBERG, AARON, EDWARD

Inventor name: SOONG, FRANK, KAO-PING