WO1987000332A1 - Speaker verification system - Google Patents

Speaker verification system

Info

Publication number
WO1987000332A1
Authority
WO
WIPO (PCT)
Prior art keywords
values
value
acoustic
source
speaker
Prior art date
Application number
PCT/US1986/001402
Other languages
French (fr)
Inventor
Huseyin Abut
Thomas A. Denker
Jeffrey L. Elman
Bertram P. M. Tao
Original Assignee
Ecco Industries, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ecco Industries, Inc. filed Critical Ecco Industries, Inc.
Publication of WO1987000332A1 publication Critical patent/WO1987000332A1/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies

Definitions

  • This invention relates to speaker verification, and particularly to a method and system for identifying an individual based on speech input.
  • There are numerous situations in which it is necessary to establish an individual's identity. Such situations include controlling physical access to a secure environment, validating data entry or data transfer, controlling access to an automated teller machine, and establishing identification for credit card usage. Identification by voice is advantageous over alternative systems in all of these cases because it is more convenient, optionally may be carried out at a distance such as over the telephone, and can be very reliable.
  • One such type of system utilizes selected peaks and valleys of successive pitch periods to obtain characteristic coordinates of the voiced input of an unknown speaker. These coordinates are selectively compared to previously stored reference coordinates. As a result of the comparison, a decision is made as to the identity of the unknown speaker.
  • A serious limitation of this system is that it may experience problems resulting from changes in overall intensity levels of the received utterance as compared to the stored utterances.
  • Another system in this area of technology compares the characteristic way an individual utters a test sentence with a previously stored utterance of the same sentence. The system relies on a spectral and fundamental frequency match between the test utterance and the stored reference utterances. As a result, the system is subject to errors from changes in the pitch of the speaker's voice.
  • Still another type of arrangement which has been utilized for verification of a speaker filters each utterance to obtain parameters that are highly indicative of the individual, but are independent of the content of the utterance. This is accomplished through linear prediction analysis, based on the unique properties of the speaker's vocal tract.
  • A set of reference coefficient signals is adapted to transform signals identifying the properties of the speaker's vocal tract into a set of signals which are representative of the identity of the speaker, and indicative of the speaker's physical characteristics.
  • Prescribed linear prediction parameters are utilized in the system to produce a hypothesized identity of an unknown speaker, which is then compared with the signals representative of the identified speaker's physical characteristics, whereby the identity of the unknown speaker is recognized.
  • The reference describes no mechanism in its system by which the distortion between the test and reference utterances can be compared. As a result, no explicit method is provided for actually carrying out the stage of speaker verification, as opposed to the preliminary steps of utterance analysis.
  • Another system which takes a somewhat different approach comprises a speaker verification system utilizing moment invariants of a waveform which corresponds to a standard phrase uttered by a user.
  • a number of independent utterances of the same phrase by the same user are used to compile a set of moment invariants which each correspond to an utterance vector.
  • An average utterance vector is then computed. Comparison of the average utterance vector with later obtained utterance vectors provides for verification of the speaker.
  • Moment invariants from a group of persons of different ages and sexes are also stored, and the group of invariants from persons in the group who are closest in age and sex to the user are compared against the stored utterance vectors of the user to arrive at a weight vector.
  • The user's weight vector and computed utterance vector are stored on a card and used in computing a threshold, and in evaluating later utterances of the user.
  • The system provides no mechanism by which the distortion between the test and reference utterances may be compared. Further, the reliability of acceptances based on comparison of new utterances against the average utterance vector could be questionable. This is especially true if the utterances of the user tend to have a large variance in their various characteristics. For example, if the speaker says a word differently at any given time, the single average value provides no flexibility in recognizing such varying speech. Still another system of interest calculates "distance thresholds" and "mean values" between a test word to be classified and other reference samples.
  • Weighting factors are utilized to gauge the importance of particular variables.
  • a fixed threshold for a given user is required, except that a comparison of portions of the test word outside the threshold may still be used to verify the speech if those portions come within a minimum distance of portions of a reference sample. If no reference sample happens to be near the test sample, there is no means to gain acceptance where the test sample is outside the basic threshold. For example, a user may not be verified if he unintentionally does not properly pronounce the word.
  • The average values of acoustic features from a plurality of speakers are stored in standardized templates for a given reference word.
  • A set of signals representative of the correspondence of the identified speaker's features with the feature template of the reference word is generated.
  • An unknown speaker's utterance is analyzed by comparing the unknown speaker's utterance features and the stored templates for the recognized words.
  • this system experiences the problem of comparing the single sample of incoming speech with a threshold for that particular user.
  • the user may be unable to qualify for verification if his single attempt to pronounce the word varies by too great an amount from the reference information stored in the system.
  • Another problem with many prior art systems is that they have no reliable or tractable means for detecting the beginning and end of speech.
  • This invention comprises a method and apparatus for carrying out verification of a user's identity by evaluation of his utterance.
  • the invention comprises a system which initially develops a data base comprising word samples from a user which are processed by comparison with themselves and with generic utterance data, for developing measures of the probability of erroneously accepting an impostor with respect to verification based on a given word. With this data base created, the system operates to verify the identity of a speaker based upon plural trials and in light of the information in the data base.
  • Before a speaker can use the system for verification purposes, he must be enrolled. In order to enroll the user, the system repeatedly prompts the user for tokens (a token is a single utterance of a word) of a series of reference words until a sufficient number of tokens of the words are obtained and stored in a data base.
  • the tokens are subjected to feature analysis whereby certain coefficients representative of the speaker's vocal tract are obtained.
  • the tokens are also subjected to end point detection. By comparison of the tokens with themselves, and with corresponding tokens of a generic group of people, the system obtains measures of the probabilities of erroneously accepting an impostor or rejecting the true speaker.
  • When an enrolled user wishes to have his speech verified, he enters the identity he is claiming and the system then prompts him for an utterance. His utterance is digitally encoded and analyzed. The start and end points are detected, and coefficients corresponding to features of the utterance are developed. Selected coefficients developed from the user's previously recorded tokens of the selected word are compared with the coefficients of the newly received utterance, producing a measure of the distance between the new utterance and each of the reference tokens. This process may be repeated for additional utterances from the user. By analyzing one or more of the measures of distance against the probability information developed during enrollment, the system determines the probabilities of making erroneous decisions.
  • decisions are made at successive stages whether to accept or reject the user.
  • The use of cumulative probabilities from stage to stage provides a means of dynamically evaluating the speaker in conjunction with several trials of speech directed to different words, so that the verification decision is based on the user's performance on each of the various words, reducing the likelihood of erroneously accepting an impostor.
  • One general feature of the invention is in making the verification decision in stages, where in an earlier stage a decision is made whether or not to proceed to a subsequent stage, and in the subsequent stage a verification decision is based both on the analysis made in the first stage and the analysis made in subsequent stages.
  • Another general feature is in basing the verification decision on at least one probability value derived from probability data which is in turn derived from stored speech information.
  • Another general feature of the invention is in updating the stored information based on test information about a speaker's utterance, but only if the speaker has been verified as being the known person.
  • Another general feature is a non-mechanical, non-magnetic power switch for triggering a device (e.g., a solenoid of a door lock) upon verifying a speaker as a known person.
  • Another general feature is in both detecting and decoding coded tone signals, and performing speech verification using the same digital processor. Another general feature is in time interleaving the analyses of different utterances received from different stations so that the analyses can proceed simultaneously.
  • Another general feature is in the combination of a plurality of stations for receiving utterances, a plurality of processors for serving respective stations, and a host computer having a real-time operating system for serving respective stations in real time.
  • Figure 1 illustrates a general block diagram of the method of speech verification as used in the present invention.
  • Figure 2 is a detailed block diagram of one preferred embodiment of an apparatus for use in the speaker verification system of the present invention.
  • Figures 3 and 4 are flowcharts illustrating the operation of the system of the present invention.
  • Figure 5 is a state diagram of the speech detector system of the present invention.
  • Figures 6 through 8 are flow charts illustrating the operation of the system of the present invention.
  • Figure 9 is a graphical representation of the method for obtaining the global distortion representative of comparison of tokens of a given word.
  • Figures 10 and 11 are flow charts illustrating the operation of the system of the present invention.
  • Figure 12 is a tabular representation of the array "D" for organizing distortions developed in operation of the present invention.
  • Figure 13 is a tabular representation of a per-word STAT file created during operation of the present invention.
  • Figures 14 through 16 are flow charts illustrating the operation of the system of the present invention.
  • Figure 17 is a block-diagram of a multiple access module system served by a host computer with a real-time operating system.
  • Figure 18 is a diagram of an access module.
  • Figure 19 is a block diagram of a portion of the access module.
  • Figure 20 is a circuit diagram of a relay-type circuitry in the access module.
  • Figure 21 is a diagram of memory locations and tables for a real time operating system.
  • Figures 22, 23 show various tables used in the real-time operating system.
  • Figure 24 is a flow-chart of a verify procedure.
  • Figures 25, 26 are flow-charts of an alternate verify procedure.
  • Figure 27 shows Gaussian functions for use in the alternate verify procedure.
  • FUNCTIONAL DESCRIPTION The present invention may be functionally described by reference to Figure 1.
  • The operation of the system is initiated in block 20 by an external condition, such as the operation of a switch or other mechanical activating device, by electronic detection equipment, such as photosensors which detect the presence of a user, or by voice or sound activation which detects the user speaking or otherwise making a noise, such as activating a "touch tone" signal.
  • the activated system will function to perform operations which have been requested either by the user, or which have been preselected for system operation by an operator at an earlier time.
  • The system moves to decision block 22 and, based on the instructions which it has received, determines whether it is to enter the "enroll" mode which is indicated generally at 24.
  • The enroll mode accomplishes a procedure whereby information relating to a particular user is obtained from that user, processed and stored for use at subsequent times in verifying the identity of the user.
  • This information includes utterances (i.e., tokens) by the user of selected reference words which are processed to form a data base which is used for comparison purposes during subsequent verification of the user.
  • The system passes from decision block 22 to block 26 where the operator keys in information to the system, including the user's name, access points through which the user may pass, and maximum permissible levels for false acceptance and false rejection errors in verification of the user. If the access points and maximum permissible levels are not specified, default values are used.
  • the system passes to block 28 where it prompts the user to provide an utterance of one of the list of reference words which are to be recorded by the user.
  • Such utterances produce an acoustic message or communication whose pattern is indicative of the identity of the user.
  • After prompting the user, the system passes to block 30 where it samples incoming signals. Upon detecting an utterance the system passes to block 32 wherein the detected utterance is converted into digital form.
  • The system passes to block 34 where the incoming signals are periodically sampled, with each period forming a sample containing signals representing the detected speech during the particular period of time.
  • The samples are stored until a specified number of them form a frame, which is then processed to obtain autocorrelation functions, normalized autocorrelation functions, and linear prediction coefficients representative of the frame.
  • the set of coefficients which are extracted for each token are utilized in the verification process to be described subsequently.
  • The linear prediction coefficients comprise a representation of the characteristics of the speaker's vocal tract. From block 34 the system passes to block 36 wherein the features extracted in block 34 are utilized for detecting the beginning and end of the utterance. A state machine is employed to accomplish this end point detection, based upon energy and spectral levels of the incoming signals.
  • the extracted features and end point information are stored in a temporary store 38 and the system then passes from block 36 to decision block 40.
  • If it is determined in decision block 40 that the necessary tokens have not yet been obtained, the system returns to block 28 and the user is prompted to provide another utterance, and processing continues as described above. If it is determined in block 40 that the necessary tokens have been obtained, then the system passes to block 42 and waits in a loop there until such time as instructions are received to form a data base. Upon being instructed to form a data base, the system moves from block 42 to block 44 wherein global distortion values are formed by accessing the extracted features from block 38, and conducting comparisons between the individual templates of a given reference word against themselves, and also comparisons of these features against corresponding tokens from a generic group of speakers. As a result of these comparisons, a set of global distortions is produced, each distortion being representative of a distance measure between two corresponding templates.
  • The system passes to block 46 where the global distortions for both the intra-speaker comparisons of corresponding templates from the same speaker, and the inter-speaker comparisons of the user's templates with the generic templates, are processed and ordered into arrays which indicate the probability of a false acceptance of an impostor depending upon a selected threshold level.
  • the templates and corresponding thresholds are stored in reference templates block 47, and the identity of the user relating to the templates and thresholds stored in block 47 is stored in block 48. After all templates have been formed and thresholds have been computed, the system moves from block 46 to block 49 and terminates further operation of the enroll mode.
  • In decision block 50 it is determined whether the system is to enter the verify mode, generally indicated at 53.
  • The verification procedure is initiated when a user presents himself at any access point. If the system is instructed to enter the verify mode, the system moves to block 52 and awaits activation by a user. The user may activate the system in several ways including, but not limited to, entering a personal identification number on a key pad; inserting a plastic card in a reader; or saying his name into a microphone. Upon receiving this user data in block 52, the system moves to block 54 where the system sends a message to the user requesting that he input an utterance which is one of the reference words previously stored in the enrollment mode.
  • Each frame is processed to obtain its linear prediction coefficient parameters (referred to hereafter as LPC parameters).
  • Extraction of the LPC parameters first comprises computation of autocorrelation coefficients which measure the degree of statistical dependency between neighboring speech samples within the frame.
  • Each frame contains from 100 to 300 speech samples, and the autocorrelation coefficients represent the relationship between these various samples.
  • The autocorrelation coefficients are processed by use of a well-known algorithm referred to as "Levinson's Algorithm" to produce the group of coefficients representative of the frame.
  • These coefficients comprise the linear prediction coefficients, and are representative of the geometry of the vocal tract of the speaker.
  • These coefficients are then processed to transform them into a single, transformed coefficient which is representative of the group of linear prediction coefficients representing the frame of speech. This information is stored for future use.
  • The system detects the end point of an utterance, and forms a template comprising the group of transformed coefficients corresponding to that utterance, with the template thus being representative of the utterance.
  • The reference templates for comparison are retrieved from a reference templates store 66 which contains those templates which were produced by the individual user during the enrollment process, as well as templates of the corresponding word which are representative of a generic group of speakers.
  • The particular templates to be referenced in block 66 are identified by a signal from an identity store 68 which selects the particular reference templates in block 66 based upon the identity of the user which was entered through block 52, and by use of the particular word being prompted, which was identified in block 54.
  • The distance comparisons or "distortions" generated in block 64 comprise a set of values, with each value being a difference of the present utterance being tested as compared to one of the reference templates of this particular user. These difference values are then processed to provide a "score" which indicates the relative correspondence between the current utterance and the reference utterances.
  • The system next passes to decision block 70 and references the appropriate threshold value from block 72 in determining whether to accept, reject or make no decision yet.
  • The threshold values in block 72 may be set by the operator, but if not set by the operator they will be set to default values. These threshold values comprise a set of "scores" which are based upon the results of comparisons in the enrollment mode, and which indicate the probability of falsely accepting an impostor given a selected test utterance.
  • The score from block 64 is compared with the threshold value from block 72.
  • If the score is below the threshold and the user has already provided at least two samples, the system passes to block 74 from whence a signal is produced terminating operation of the verification mode. If the user has not provided at least two samples, even though the score is below the threshold, the system returns to block 54 where the user is prompted to provide an utterance corresponding to another of the reference words provided in the enrollment process.
  • In block 70, if the user has a score which is above any threshold value in block 72, this is counted as a failure. If there are fewer than two failures, the system proceeds to block 54 and prompts the user for another of the user's reference words. If the user has failed twice, the system passes from block 70 to block 74 and rejects the user as an impostor.
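  • For illustration, the staged accept/reject logic of blocks 54 through 74 can be sketched as follows. The two-success and two-failure rules follow the description above; the function and variable names (score_word, thresholds) are hypothetical placeholders for the distortion scoring and the per-word threshold values developed during enrollment, and the fall-through policy at the end of the word list is an assumption.

```python
# Hedged sketch of the staged accept/reject loop described above (blocks 54-74).
def verify(prompt_words, score_word, thresholds,
           successes_needed=2, failures_allowed=2):
    successes = 0
    failures = 0
    for word in prompt_words:
        score = score_word(word)          # distortion of the new utterance vs. reference tokens
        if score <= thresholds[word]:     # below threshold: counts toward acceptance
            successes += 1
            if successes >= successes_needed:
                return "accept"
        else:                             # above threshold: counts as a failure
            failures += 1
            if failures >= failures_allowed:
                return "reject"
    return "reject"                       # ran out of reference words without a decision (assumed policy)
```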
  • an IBM Personal Computer 112 includes a real-time operating system 114 that provides real-time servicing via a bus 116 connected to four speaker verification boards (SVBs) 118.
  • Each SVB in turn can serve two stations 102 on a multiprocessing basis.
  • One of the stations, station 103, is dedicated to enrollment and is denoted an enrollment station.
  • the system of the present invention has been designed to work on a host computer's bus in either a synchronous or direct memory access (DMA) manner after an interrupt protocol.
  • Although the host interface can be programmed for virtually any standard, the embodiment disclosed herein assumes use of an IBM PC/XT as the host.
  • the host may be selected from among many of the well-known IBM compatible personal computers which are presently on the commercial market, with only minor modifications necessary to interface to virtually any computer bus.
  • The host provides for user interaction to collect and store reference templates as described above, to calculate, update and store statistical information about the reference templates, and to perform certain verification tasks.
  • the physical configuration of the system illustrated in Figure 17 is such that the bus 116 and the SVBs 118 are housed internally in the host computer with the stations 102 being external to the host.
  • each station 590 includes a 16-key touch-tone pad 592, a microphone 594 and indicator lights 596.
  • Station 590 is connected to an SVB 600 via a multiplexor 602 and two-way three-wire signal lines 598.
  • the user depresses the touch-tone keys (e.g. to identify himself during verification) and speaks into the microphone (e.g. to speak the reference words) , although never simultaneously.
  • Depressing the touch-tone keys causes DTMF tone generator 604 to produce one of the standard DTMF tone signal pairs which is then passed through signal combiner 606 and out onto lines 598 for delivery to the SVB 118.
  • the microphone produces voice signals which also pass through the signal combiner 606 and out onto lines 598.
  • the SD 106 receives the signals and passes them through analog-to-digital (A/D) converter 608.
  • The digitized samples are passed from SD 106 to SP 122 which, as described below, assembles them into frames and, for each frame, derives linear prediction coefficients.
  • CP 110 analyzes the subsequent incoming frames to determine whether speech or tone signals are being received. If tone signals are detected, a conventional Fourier analysis is performed to determine which key has been depressed. If, however, speech signals are being received they are processed as described below.
  • The decoding of DTMF signals thus uses the same hardware and software as is required for the processing of speech signals.
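  • As an illustration only, one conventional way to carry out such a tone analysis is to measure the signal energy at the eight standard DTMF frequencies (for example with the Goertzel algorithm) and select the strongest low-group and high-group tones. The text says only that a Fourier analysis is performed, so the particular method and constants below are assumptions.

```python
import math

# Standard DTMF row (low-group) and column (high-group) frequencies and key layout.
LOW  = [697, 770, 852, 941]
HIGH = [1209, 1336, 1477, 1633]
KEYS = [["1", "2", "3", "A"], ["4", "5", "6", "B"], ["7", "8", "9", "C"], ["*", "0", "#", "D"]]

def goertzel_power(samples, freq, fs=8000):
    """Energy of a single frequency component (Goertzel algorithm)."""
    w = 2.0 * math.pi * freq / fs
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

def decode_dtmf_frame(samples, fs=8000):
    """Pick the strongest low-group and high-group tones and map them to a key."""
    low = max(LOW, key=lambda f: goertzel_power(samples, f, fs))
    high = max(HIGH, key=lambda f: goertzel_power(samples, f, fs))
    return KEYS[LOW.index(low)][HIGH.index(high)]
```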
  • a solid state relay 610 (for operating, e.g., a door lock solenoid) is connected to a power source 612 (power supply) which is capable of supplying 12 volts or 24 volts, AC or DC (pulsating DC) , and is switch selectable.
  • Relay 610 includes (within the dashed lines) an "open" collector non-inverting buffer 614 for providing power to a light source for a Dionics DIG-12-08-45 opto-coupler 616.
  • (The light source contained within the opto-coupler is a light emitting diode (LED) of the Gallium Aluminum Arsenide type, emitting light in the near infrared spectrum.)
  • a door open signal 618 of a specific wave shape delivered over line 598 resets a counter 620 which is driven by a clock signal from an on-board door strike oscillator 622.
  • a second counter 624 is switch selectable (switches 626) for various door strike “on” times (2 sec to 50 sec) .
  • the output of counter 624 is connected to a gate 628 which implements either "norm off” or “norm on” door strike power as selected by jumper 630.
  • the signal output of gate 628 directly controls the state of the solid state relay 610.
  • The optically coupled (electrically isolating) relay prevents referencing the power MOSFETs of the relay 610 or their input drive to system ground, and, secondly, it isolates the system logic from noise and any high voltage (11500 volt) transients.
  • The output of opto-coupler 616 drives two power MOSFETs 632, 634 (Motorola MTP25N06 or International Rectifier Corp. IRF-541) which are characterized by a very low drain-to-source resistance (0.085 ohms max., known as R "on").
  • All power MOSFETS have an integral diode, drain to source, connected in a reverse direction from the normal polarity of the operating voltage. The diode plays an important role in the operation of relay 610, which allows AC or DC operation. With one input polarity, the current flows through the bulk of the left hand power MOSFET and through the diode of the right power MOSFET. When the polarity is reversed the path of the current flow also reverses.
  • MOSFETS are not subject to changes over time in their contact resistance and are therefore considerably more reliable. They are also unaffected by humidity and require extremely low input currents ( ⁇ 100 x 10 amps) to turn the device "on" (thus a very low power voltage source can be employed for turn-on) .
  • Element 634 is a V47ZA7 available from General Electric and is employed for transient limiting and protection.
  • the system of Figure 2 is capable of multi-tasking two verification channels for handling two users at different stations simultaneously in real-time. In addition, the system is designed to support as many as 16 separate user stations 102.
  • The SD 106 is comprised of elements including a PAL20RA10CNS programmable logic array from MMI for control purposes, in addition to other well-known conventional logic elements for constructing such a data system.
  • The SD 106 receives analog signals from the stations 102, and multiplexes the signals to select two channels from among the set of 16. The two channels are then PCM filtered, both on their inputs and outputs, and are sampled every 125 to 150 microseconds. SD 106 is driven by its own clock 108 which is interconnected thereto to obtain the samples. The SD 106 is electrically connected to a control processor subsystem (CP) 110, and interrupts the CP 110 at every sample conversion time set by the clock 108. During this time, the CP 110 can perform successive read and write operations to both channels from the SD 106. At no other time can the CP 110 gain valid access to the SD 106 for reading or writing.
  • the CP 110 comprises a 10 MHz MC68000 microprocessor 112, a memory 114 and logic 116 to generate various control signals for input/output and memory accesses.
  • the CP 110 communicates with and coordinates all other functions in the system, including communicating and coordinating with the host computer.
  • Although the CP 110 can perform signal processing tasks, it mainly functions as a data administrator. Thus, it can directly control and gain access to each part of the system, and it also can communicate with the host computer in the form of coded data via a host interface unit (HIU) 118.
  • the CP 110 is also electrically connected to a central data buffer (CDB) 120 which is a fast 4K X 16 random access memory data buffer, through which a majority of the inter-processor data and messages can be quickly and efficiently passed.
  • the buffer could comprise, for example, four IMS1421S-50 RAM devices (4 X 4096) produced by INMOS.
  • CDB 120 is the sole source of data transfer between processors. The only other data path of this kind is the port between the CP 110 and the HIU 118.
  • CDB 120 includes data ports for the CP 110, for the HIU 118 and for a signal processor subsystem (SP) 122.
  • CDB 120 also includes an address port for memory-mapped input/output communication with the CP 110. For direct memory access purposes, there is a 12-bit counter 124 which allows fast sequential auto-incrementing access to the CDB.
  • the 4K X 16 memory is accessed by the SP 122, and the HIU 118, but only byte-wide by the latter.
  • the HIU 118 provides a signal on the line illustrated at 126 which, when the CDB 120 is placed in the proper mode, only allows auto-incrementing every two bytes. Either word or byte-wide accessing is available to the CP 110, however, for maximum efficiency.
  • One of the most important functions of the CP 110 is to manage allocation of the CDB 120. In light of this, and to maximize efficiency, the CDB 120 was designed to interface with the CP 110 in memory-mapped fashion. The SP 122 and HIU 118, however, view the CDB as an input/output device, but attainable via direct memory access. Most inter-processor communications must utilize the CDB 120, therefore much of the system time in the verification mode is spent in this element of the system. Nevertheless, any processor can gain access to the CDB 120 at any time since there is no hard-wired priority system.
  • Accesses to the CDB 120 are strictly controlled by the CP 110, to the extent that CDB 120 requests are first granted by the CP 110.
  • the protocol starts with an interrupt to the CP 110 by the requesting processor.
  • The CP checks the queue to see if the CDB 120 is being used. If not, an interrupt is sent back to the requester, which then can assume that it has been granted possession of the CDB bus.
  • Since the SP 122 has no direct communications port to either of the other processors, the CDB 120 must be employed for this purpose.
  • the HIU 118 can also deposit and retrieve messages and/or data via the CDB 120, even though there is a port between it and the CP 110.
  • Each processor must have a pre-assigned window of memory, the location of which is known to the other processors. Therefore, data can be passed to any processor from any processor by simply accessing the appropriate window in memory while in possession of the CDB 120.
  • the CP 110 is also connected to the SP 122 which is a 16/32-bit TMS32010 signal processing microprocessor.
  • the SP 122 also includes memory storage for the SP microcode provided by elements such as two 8 X 1024 bit PROMS of 754 ns speed, such as MMI6381.
  • This subsystem includes random logic for input/output decoding, communications with the CP 110, and a data path to the CDB 120.
  • SP 122 performs arithmetic calculations at a rate of 5 million operations per second, and therefore is used for all of the heavy number handling tasks.
  • the CP 110 initiates a task in the SP 122 by sending a special signal over line 128 which causes the SP 122 to branch to the location specified by a command written by the CP into the CDB.
  • SP 122 has a pre-stored library of commands comprising uncoded vector locations, which the CP 110 calls to perform various tasks. Therefore, the SP 122 is a slave system whose tasks are initiated by the CP 110. It is also possible for the CP 110 to assign the CDB 120 to the SP 122 for the entire duration of a selected task, in which case the CDB 120 becomes a dedicated storage medium for the SP 122.
  • The HIU 118 functions as the interface between the host processor (not shown) and the CP 110, as well as the CDB 120. All signal interfacing is buffered by the HIU 118, which conditions the timing and control signals from the host to meet compatibility requirements. For example, any incompatibilities between the timing system of Figure 2 and that of the host processor are taken care of by a programmable array logic device (PAL) (not shown) which is internal to the HIU 118.
  • The interrupt vector can either be read from the CP host processor data port or from the message window in the CDB as described above.
  • The host processor must know the system's base address.
  • The base address is installed in the HIU 118 input selector device, which is simply an address comparator.
  • Each system's base address is user programmable via switches, giving the host processor multiple systems to use for greater processing power.
  • Up to 64 stations 102 can be provided for use in the verification mode by merely installing more circuit boards onto the host processor's bus.
  • Figure 3 illustrates the enroll mode which is generally indicated at 24 of Figure 1.
  • the system begins operation so that in block 152 it obtains the user's name, the access points through which this user may pass, and the maximum permissible levels for false acceptance of a user and false rejection of a user. Default values are provided where necessary, if no values are specified.
  • the system also retrieves a list of reference words which are to be recorded by the user and stores them for reference. For purposes of discussion, the use of 10 reference words is described herein.
  • substantially any desired number of reference words may be selected without changing the function or structure of the system.
  • As the number of reference words used is increased, the ability to obtain words having this low probability of error is enhanced.
  • The enroll mode is functioning to generate a set of data from this particular user which may be utilized in the later verification portion of the system operation.
  • From block 152 the system passes to block 154 where the information is assigned a temporary identification number and then the system passes to step 156.
  • Block 156 is a decision block relating to the number of tokens of the 10 reference words which have been collected. If a selected number of tokens (or utterances) of each of the 10 words have not been made, the host processor passes to step 158 and requests a new word from the user by generating a prompt command. For purposes of discussion, the use of four tokens for each reference word is described herein. Of course, it will be appreciated that substantially any number of tokens could be utilized for proper operation of the method and apparatus of the present invention.
  • The CP 110 of Figure 2 produces a signal through SD 106 which is communicated via input/output line 104 to a station 102 whereby, through means of a speaker or other audio or visual communication means, the user is prompted to utter one of the ten reference words for which tokens are to be stored.
  • The microphone in station 102 receives an utterance of the reference word from the user, and communicates it from station 102 via line 104 and SD 106 to the CP 110, where it is further processed as explained hereafter.
  • the host processor moves to block 160 where it instructs the system to produce autocorrelation coefficients (r coefficients) for the incoming speech by calling the "enroll" routine.
  • the enroll routine will be discussed more completely hereafter in reference to Figure 4.
  • The system examines those coefficients to determine whether incoming speech is present in the system. If no speech is present within a preset time period following the prompt signal, the host processor determines that there is a failure and passes to block 162. If the number of times the system has detected no speech following a prompt signal exceeds a predetermined threshold number, the processor moves to block 164 and aborts further attempts to enroll the user.
  • the processor returns to block 158 and again causes the user to be prompted to input an utterance.
  • the host processor moves to block 166 and initiates conversion of the r coefficients to transformed linear prediction coefficients (aa coefficients) which provide a correlation between the detected utterance and the geometry of the user's vocal tract. This conversion is performed as part of the enroll routine to be described with respect to Figure 4.
  • The host processor moves to block 168 and causes the autocorrelation coefficients and transformed linear prediction coefficients to be stored in an array for future use.
  • The processor then moves to block 170 and increments the word and token counter to identify the next token of the current word if four tokens have not yet been received from the user, or to the next word if four tokens have been received. As was indicated above, the number of tokens is programmable, and is selected by an operator prior to use. From block 170, the processor returns to block 156.
  • If four tokens of each of the ten words have been received, the system moves from block 156 to block 169 and sets a flag indicating that the stored information from the enroll mode must be processed to produce the data base necessary for later system operation in the verify mode. Having set this flag, the system passes to block 171 where it exits the enroll mode of operation and returns to the initiate block 20 of Figure 1 to await further instructions. If four tokens of each of the ten words have not been received, the system passes from block 156 to block 158 and operates as described above. By reference to Figure 4, it is possible to describe the enroll routine which is activated by the host processor in block 160 of Figure 3.
  • Upon entering the enroll mode 160 of Figure 4, the system moves to block 172 wherein the system allows the MC 68000 microprocessor of the CP 110 of Figure 2 to digitize the analog speech signal received from SD 106 at a sampling rate of approximately 8,000 samples per second. Sample frames are formed and stored in buffers allocated in memory 114 of the CP 110. Each one of the samples is read from a coder/decoder (CODEC) chip in the SD 106 as an 8-bit wide µ-law encoded word whenever control is passed to the CODEC interrupt handler. Once a buffer in the CP memory 114 is filled with a frame of samples, the system moves to block 174 and copies the samples from the buffer into the CDB 120.
  • Each of the 8-bit code words making up the samples is decompanded to define a 16-bit sample. This decompansion is done on the fly utilizing a table driven procedure for the sake of speed.
  • The use of µ-law compansion is well-known in the technology. A comprehensive discussion of this subject matter is presented in "Digital Signal Processing Application
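  • A minimal sketch of the table-driven decompansion described above, assuming the conventional µ-law (G.711-style) expansion of each 8-bit code word into a 16-bit linear sample; the constants and function names are illustrative rather than the patent's.

```python
import numpy as np

def mulaw_expand(code):
    """Expand one 8-bit mu-law code word into a 16-bit linear sample (standard form)."""
    code = ~code & 0xFF
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    sample = ((mantissa << 3) + 0x84) << exponent
    sample -= 0x84
    return -sample if sign else sample

# Build the 256-entry lookup table once; decompansion of incoming samples
# then reduces to a table read, as described in the text.
MULAW_TABLE = np.array([mulaw_expand(c) for c in range(256)], dtype=np.int16)

def decompand_frame(codes):
    return MULAW_TABLE[np.asarray(codes, dtype=np.uint8)]
```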
  • the system moves to block 176 and accomplishes preemphasis by forming a frame which includes the present frame plus the last half of the previous frame, thereby permitting an overlapping analysis.
  • The samples are preemphasized by use of a first order finite impulse response filter which is applied to the input samples in a format as follows, where:
  • n is the index number of the sample;
  • x denotes the samples prior to preemphasis; and
  • y denotes the samples after preemphasis.
  • Preemphasis is performed to emphasize the high frequencies and cancel the low frequency emphasis caused by the sound transition from the speaker's lips to.the open space between the speaker and the microphone.
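  • Because the preemphasis equation itself is not reproduced in this text, the sketch below assumes the conventional first-order form y(n) = x(n) - a*x(n-1), with the preemphasis constant a (near 0.95) being an assumed value.

```python
import numpy as np

def preemphasize(x, a=0.95):
    """First-order FIR preemphasis: boosts high frequencies relative to low ones."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                     # first sample has no predecessor
    y[1:] = x[1:] - a * x[:-1]
    return y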
  • The unnormalized autocorrelations from Equation (2) are converted into normalized correlations by the relation:
  • the R(0) correlation is referred to as the "energy” and is a quantity which is utilized by the system in the detection of the beginning and end of an utterance as is explained hereafter.
  • The normalized autocorrelations (r coefficients) from Equation (3), together with the energy term R(0), are returned to the CP 110 as soon as the SP 122 has finished the calculations.
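  • As a hedged illustration of this step, the unnormalized autocorrelations R(k), the frame "energy" R(0), and the normalized coefficients r(k) = R(k)/R(0) can be computed as shown below; the analysis order p is an assumed value.

```python
import numpy as np

def autocorrelation(frame, p=10):
    """Return unnormalized R(0..p), normalized r(0..p), and the frame energy R(0)."""
    frame = np.asarray(frame, dtype=float)
    R = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    energy = R[0]
    r = R / energy if energy > 0 else np.zeros_like(R)
    return R, r, energy
```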
  • The system passes to block 178 where it develops the linear prediction (LP) coefficients. Specifically, the CP 110 retrieves a copy of the r coefficients from the CDB 120 and then starts the SP 122 to develop the LP coefficients.
  • Levinson's Algorithm delivers a set of filter coefficients by the recursive procedure which relies on the relationships indicated below:
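  • The recursion itself is not reproduced in this text; the sketch below shows the standard Levinson-Durbin form, which takes the autocorrelation values R(0)..R(p) and returns the linear prediction coefficients together with the residual prediction error. Variable names are illustrative, not the patent's.

```python
import numpy as np

def levinson_durbin(R):
    """Standard Levinson-Durbin recursion over autocorrelations R[0..p]."""
    p = len(R) - 1
    a = np.zeros(p + 1)            # a[1..p] are the LP coefficients; a[0] unused
    E = R[0]                       # prediction error, initialized to the frame energy
    for i in range(1, p + 1):
        acc = R[i] - np.dot(a[1:i], R[i - 1:0:-1])
        k = acc / E                # i-th reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        E *= (1.0 - k * k)         # update residual error
    return a[1:], E
```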
  • The system moves to block 180 and transforms the LP coefficients into aa coefficients.
  • The procedure for accomplishing this transformation is as follows:
  • the system moves to decision block 182.
  • the CP 110 feeds the extracted data into a state machine to determine whether a user was speaking into the microphone of station 102 during the time the frame was recorded. This state machine is described in more detail hereafter with reference to Figure 5.
  • If the speech detector finds that no speech is active, the system moves to block 184 and determines whether the speech detector state machine is in an exit state. If it is in an exit state this indicates that speech is completed and the system moves to block 186, terminates operation of the algorithm of Figure 4, and then returns to block 160 of Figure 3. If the state machine is not in an exit state while the system is in block 184, the system returns to block 172 to obtain the next speech samples for processing as described above.
  • If speech is active, the system moves to block 188 and determines whether the system is in the enroll mode. If it is in the enroll mode, the system moves to block 190 and stores the aa coefficients in the memory 114 of Figure 2. If the system is not in the enroll mode, then the system moves from block 188 to block 192 and stores the normalized autocorrelation coefficients and the residual energy from the Levinson's Algorithm in memory 114. From either of blocks 190 or 192, the system returns to block 172 and continues functioning as described above. As soon as the last frame of active speech from the user has been found, all of the parameter sets extracted so far are copied from the memory 114 into the CDB 120. The host is then interrupted by the CP 110 indicating that the parameters for a full utterance are being delivered to the CDB 120. The host then reads the parameters from CDB 120 and eventually stores the results on its mass storage devices.
  • The state diagram includes a silence state 200 which is both the initial and final state. In this state, the machine waits to detect the beginning of an utterance and returns after the end of the utterance is detected.
  • The state machine goes from the silence state 200 to the attention state 202. Specifically, the machine goes to the attention state if the energy of the detected signal is either above a certain upper threshold level or, if the energy is above a certain lower threshold level and the normalized autocorrelation function (r) has a Euclidean distance which is more than a preselected threshold distance "a" from that value of the autocorrelation function (r) which the machine has measured for noise.
  • This noise autocorrelation function is recursively updated by the machine in the silence state 200.
  • From the attention state 202 the machine will go to a speech state 204 when the detected energy is high enough to prevent return of the system to the silence state, and when the machine has spent three cycles in the attention state 202.
  • Once in the speech state 204, the system will remain there until the detected energy drops below an "end of speech" threshold, indicating that a possible end of the utterance has been detected. At that time, the machine will go to an exit state 206. From exit state 206 the machine will go to silence if the detected energy is not high enough to exceed a lower threshold after five cycles. If energy of a sufficiently high value is detected, the machine will move from exit state 206 to resumption state 208 which functions, similar to the attention state, to move the machine to the speech state 204 when the energy is high enough not to return to the exit state 206 and when the machine has spent three cycles in the resumption state 208.
  • the speech detection automaton also controls the attenuator system of the present invention.
  • the speech detection machine comprises software, although a hardware embodiment could be readily provided by one skilled in the technology and based on the above description of the machine.
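  • A software sketch of the detector follows, under assumptions: the states and the three-cycle and five-cycle dwell rules track Figure 5 as described above, while the threshold constants, the noise-vector update rule, and the spectral-distance test are illustrative placeholders.

```python
import numpy as np

SILENCE, ATTENTION, SPEECH, EXIT, RESUMPTION = range(5)

class SpeechDetector:
    def __init__(self, upper=1e6, lower=1e5, end_thresh=5e4, dist_a=0.3):
        self.state = SILENCE
        self.count = 0
        self.noise_r = None            # autocorrelation vector measured for noise
        self.upper, self.lower = upper, lower
        self.end_thresh, self.dist_a = end_thresh, dist_a

    def step(self, energy, r):
        """Advance one frame given its energy R(0) and normalized autocorrelations r."""
        r = np.asarray(r, dtype=float)
        if self.state == SILENCE:
            # recursively track the noise autocorrelation while silent
            self.noise_r = r if self.noise_r is None else 0.9 * self.noise_r + 0.1 * r
            far_from_noise = np.linalg.norm(r - self.noise_r) > self.dist_a
            if energy > self.upper or (energy > self.lower and far_from_noise):
                self.state, self.count = ATTENTION, 0
        elif self.state in (ATTENTION, RESUMPTION):
            fallback = SILENCE if self.state == ATTENTION else EXIT
            if energy <= self.lower:
                self.state, self.count = fallback, 0
            else:
                self.count += 1
                if self.count >= 3:    # three cycles of sustained energy
                    self.state, self.count = SPEECH, 0
        elif self.state == SPEECH:
            if energy < self.end_thresh:     # possible end of utterance
                self.state, self.count = EXIT, 0
        elif self.state == EXIT:
            if energy > self.lower:
                self.state, self.count = RESUMPTION, 0
            else:
                self.count += 1
                if self.count >= 5:    # five quiet cycles: utterance ended
                    self.state, self.count = SILENCE, 0
        return self.state
```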
  • the system can be activated either by an operator, or by other means such as a timing device, to utilize the information generated in the enrollment mode for forming the data base which will be used to accomplish comparisons of incoming speech signals in the verification mode.
  • Figure 6 illustrates a method of system performance by which the appropriate data base may be constructed.
  • Upon receiving a signal requesting formation of a data base, the system moves from block 220 of Figure 6 to block 222, and determines whether the flag set in block 169 of Figure 3 is still set. If the flag is not set, then there is no data to be processed and the system moves to block 224, terminates operation of the procedure for making the data base, and returns to initialize state 20 in Figure 1.
  • the system moves from block 222 to block 226 where it obtains the next word stored in CDB 120.
  • The system also determines whether it is finished with the preparation of the data base. If it is not, the system passes from block 226 to block 228. In block 228 the system makes "intra-speaker" comparisons wherein the statistics developed in the enrollment mode are compared with each other to develop "global distortions" indicating the extent to which the words differ amongst themselves. These intra-speaker comparisons are computed by obtaining, for each of the speakers, N tokens of that word.
  • The host processor obtains the global distortion G from this intra-speaker comparison by use of an intra-speaker routine which implements the above-described procedure and which will be described hereafter with respect to Figures 7 - 9.
  • The inter-speaker comparisons for a speaker's word are computed by obtaining, for each of the speaker's N tokens of that word, the global distortion (G) that results from comparing r(i) versions of that token with the aa(i) versions of all other tokens of the same word, produced by all speakers in a generic data base.
  • the system moves to block 232 where the distortions from the intra-speaker and the inter-speaker comparisons are merged, sorted in numerically descending order, and stored in an array D.
  • the distortions in this array continue to be labeled as inter-speaker or intra-speaker distortions.
  • An example of the array D created in block 232 is illustrated in Figure 12.
  • a statistics file STAT is created.
  • the system creates a file which, for each intra-speaker distortion, provides an indication of the likelihood that the system will erroneously reject the actual speaker, or erroneously accept an imposter, if the threshold value for making the accept/reject decision is based on a distortion value corresponding to the distortion of that particular intra-speaker distortion.
  • a STAT file based upon the information in array D of Figure 12 is illustrated in Figure 13.
  • The procedure utilized in block 234 for developing the STAT file of Figure 13 will be described hereafter with respect to Figure 11.
  • the system of Figure 6 returns to block 226 to obtain the enrollee's next word. If all of the words have been processed in the manner described above, the system moves to block 236 and constructs an ORDER file which indicates the relative ability of each of the reference words to distinguish the present speaker from other speakers whose data is stored in the generic data base. Words with high discriminability will be those that have, at any given level of false reject error rate, low false accept error rates. The system determines the relative discriminative power of the words for the current user by sorting those words based on the highest value of the false accept designation in that word's per-word STAT file.
  • The value used for sorting the word represented by the STAT file in Figure 13 would be the first entry under the ERROR FALSE ACCEPT heading, which is the value of 2/6.
  • Upon completing the comparison and obtaining the global distortion, the system moves to block 244 and stores the distortion for later use. From block 244 the system returns to block 240 and obtains the next of the N tokens, and then processes this as described above. If no more tokens are available for the given word, the system moves to block 230 which corresponds to block 230 of Figure 6 and initiates an inter-speaker comparison.
  • By reference to Figures 8 and 9, the COMPARE procedure utilized in block 242 of Figure 7 is described.
  • Upon entering the COMPARE block 242 of Figure 7, the system moves to block 242 of Figure 8 and initiates a comparison between a reference token and a test token.
  • the reference token is one of the previous utterances of the enrollee.
  • The test token is the token which is currently being processed in the making of the data base, as illustrated in block 228 of Figure 6.
  • The comparison is accomplished by comparing a reference pattern of aa coefficient sets against the r(i) and e values of the test template.
  • The circumstances of the test are represented graphically in Figure 9, where it is seen that the reference template is defined by designating the aa values for each frame from zero to j on the reference axis 280. Likewise, the r and e values for each frame from zero to i of the test template are located along the i axis 282.
  • the length of the reference template is designated as extending from the origin 284 to M on the j axis. Likewise, the length of the test template is indicated as extending from the origin 284 to the location indicated as N on the i axis.
  • the results of the test comprise a global distortion value G which is representative of the minimum amount of distortion experienced in traversing all possible paths between the origin 284 and the intersection of lines perpendicular to the axes of M and N locations, designated at 288.
  • This global distortion G is derived as the result of a two-stage process which is embodied in the flow diagram of Figure 8 and which will now be explained. From block 242 of Figure 8 the system passes to a decision block 246 where the number of frames M of the reference template is compared against the number of frames N of the test template. If the difference between these numbers of frames is outside preselected threshold values, the system moves to block 248 and provides an output signal "999" indicating that the lengths of the utterances are incompatible for comparison purposes.
  • the threshold values are:
  • the SP 122 of Figure 2 compares the "N" vectors of aa coefficients with the normalized autocorrelations "r". The "N" resulting local distortions are stored as 16-bit values in the CDB 120. The relationship for obtaining these local distortions is as follows:
  • a distortion value may be derived representative of the distance measures between the reference and test templates at that location.
  • The system next moves to block 256 and develops the global distance value g(i,j) for the particular location indicated by the coordinates i and j on the graph of Figure 9.
  • the process evaluates different paths to the given location and selects the path having the minimum distortion value.
  • the minimum distortion from among the three paths indicated at 292, 294 and 296 is accepted.
  • This minimum path can be determined mathematically by the following relation:
    g(i,j) = min[ g(i-1,j-2) + 2d(i,j-1) + d(i,j),
                  g(i-1,j-1) + 2d(i,j),
                  g(i-2,j-1) + 2d(i-1,j) + d(i,j) ]
  • The final global distortion, which defines the value of the minimum path, is found at point 288 in Figure 9.
  • The system in block 256 returns a final G value which is normalized based on the length of the tokens being compared. This final G value is defined as:
  • The CP 110 maintains only three rows of global distortions and two rows of local distortions, since this is all the information necessary to continue the above-described computation. This saves a significant amount of memory space. Further, due to the parallelogram boundaries in the dynamic programming algorithm described above, approximately 30% of the points can be excluded from an explicit search. These points would be outside of a parallelogram with its corners connecting the origin 284 and the end point 286 of the line in Figure 9. These points would be particularly concentrated near those boundaries of the graph which are in the vicinity of the M value on the j axis, and the N value on the i axis.
  • the system moves to block 260 and increments the j index by 1.
  • the system then moves to decision block 262 and determines whether the new j index is equal to the number of aa coefficients in the reference token. If the j index does not equal the number of aa coefficients, then the system returns to block 254 and computes the distortions as described above for the new location in the graph of Figure 9.
  • If the j index equals the number of aa coefficients, the system moves from block 262 to block 264 and increments the i index by 1. From block 264 the system moves to decision block 266 and determines whether the new i index is equal to the number of r coefficients of the test token. If it is not, the system returns to block 252, sets the j index equal to zero and continues developing the distortions at the new location in the graph of Figure 9.
  • If the new index i equals the number of r coefficient sets for the test token, then the system is at the final point 288 in the graph of Figure 9. The system then moves to block 268 where the final, normalized global distortion value at point 288 is defined as indicated in Equation (13) above. This normalized value then becomes the global distortion corresponding to the comparison of the selected reference token and the test token.
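  • The sketch below illustrates the general shape of this comparison as a slope-constrained dynamic programming (dynamic time warping) pass over a matrix of local distortions d(i,j), with the result normalized by the combined token length. The length-tolerance test, the recursion, and the normalization are conventional choices assumed here, not necessarily the patent's exact equations.

```python
import numpy as np

def global_distortion(d, length_tolerance=0.5):
    """d[i, j] = local distortion between test frame i and reference frame j."""
    N, M = d.shape                         # N test frames, M reference frames
    if abs(N - M) > length_tolerance * min(N, M):
        return 999.0                       # utterance lengths incompatible for comparison
    INF = float("inf")
    g = np.full((N, M), INF)
    g[0, 0] = d[0, 0]
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            paths = []                     # three allowed predecessor paths (slope constraint)
            if i >= 1 and j >= 2:
                paths.append(g[i - 1, j - 2] + 2 * d[i, j - 1] + d[i, j])
            if i >= 1 and j >= 1:
                paths.append(g[i - 1, j - 1] + 2 * d[i, j])
            if i >= 2 and j >= 1:
                paths.append(g[i - 2, j - 1] + 2 * d[i - 1, j] + d[i, j])
            if paths:
                g[i, j] = min(paths)
    return g[N - 1, M - 1] / (N + M)       # normalize by the combined token length
```

In practice only a few rows of g and d need to be held in memory at once, which is the storage saving noted above.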
  • the system passes to block 230 of Figure 6 where the inter-speaker comparisons are performed. By reference to Figure 10, the procedure for accomplishing the inter-speaker comparisons is more clearly described.
  • the system moves to block 300 wherein it advances to the user's next token for the particular word being dealt with. Once this token is identified, the system moves to block 302 where it advances to the next generic speaker, and then moves to block 304 where it loads that specific generic speaker's next token for the current word being tried. The system next passes to block 306 where it conducts the comparison of the current token of the word from the user with the token from the particular generic speaker as identified in block 304.
  • the procedure for accomplishing this comparison is the same as was described above with respect to Figures 8 and 9, with each comparison producing a global distortion representative of a score value for the compared tokens. After completing the comparison, the global distortion is stored for further use.
  • the system returns to block 304 and obtains the next token of that particular generic speaker. If there are no more tokens for that speaker relating to the word under consideration, the system passes from block 304 to block 302 and advances to the next generic speaker, and then proceeds as described above to evaluate the tokens for that generic speaker. If there are no more tokens of the present word from generic speakers, the system moves from block 302 to block 300, and advances to the user's next token of the particular word. If the comparisons for all of the user's tokens of that word have been completed, then the system moves from block 300 to block 232 in the flow chart of Figure 6 wherein the distortions which have been stored in block 308 of Figure 10 are merged and sorted as described previously.
  • each of the intra-speaker distortions in the sorted array D of Figure 12 is assigned a number
  • the first element of the ERROR_FA column in the per-word STAT file is arrived at by applying the above formula (14) to the first intra-speaker distortion value which is encountered when working from the top down in the D array of Figure 12.
  • this first encountered distortion value would be identified in order as number 6, having a distortion value of 87.
  • the K term equals 6, since there are 6 distortions which are equal to or less than the first intra-speaker distortion encountered, which is order number 6.
  • the system moves to block 314 where it develops the ERROR_FALSE REJECT (ERROR_FR) column of the per-word STAT file of Figure 13.
  • the ERROR_FR column can be developed by the following relation:
  • the system next passes to block 316 and stores the information developed in blocks 312 and 314 in the per-word STAT file as illustrated for discussion purposes in Figure 13.
  • the system moves from block 316 to block 310 where it gets the next intra-speaker distortion from the D array, and forms a per-word STAT file for that word, in the manner as described above. If there are no more intra-speaker distortions in the array D the system moves to block 226 of Figure 6 and functions as described previously.
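  • A hypothetical sketch of how the per-word STAT file entries might be tabulated is given below. Formula (14) is not reproduced in this excerpt, so the error estimates used here, i.e., counting the fraction of inter-speaker distortions at or below each candidate threshold (ERROR_FA) and the fraction of intra-speaker distortions above it (ERROR_FR), are assumptions for illustration, as are the function and variable names.

    #include <stdio.h>

    /* For each intra-speaker distortion treated as a candidate threshold,
       print an assumed ERROR_FA (impostor tokens that would pass) and an
       assumed ERROR_FR (true-speaker tokens that would be rejected). */
    void build_stat_entries(const double *intra, int n_intra,
                            const double *inter, int n_inter)
    {
        for (int i = 0; i < n_intra; i++) {
            double thr = intra[i];
            int k_fa = 0, k_fr = 0;
            for (int j = 0; j < n_inter; j++)
                if (inter[j] <= thr) k_fa++;     /* impostor distortions at or below */
            for (int j = 0; j < n_intra; j++)
                if (intra[j] > thr) k_fr++;      /* own-speaker distortions above    */
            printf("threshold %.1f  ERROR_FA %.3f  ERROR_FR %.3f\n",
                   thr, (double)k_fa / n_inter, (double)k_fr / n_intra);
        }
    }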
  • a user may activate the verify mode of the present invention in attempting to gain access to a secured area, or to otherwise use the present invention for purposes of access, identification, and the like for which the verification system of the present invention finds application.
  • the system in Figure 14 passes to block 400 and determines whether there is a present claim for access at one of the stations 102.
  • the host processor continuously scans all access points for activity. Thus, if no activity is indicated the system loops via line 402 back to block 400 and continues to function here until such time as a claim is made.
  • the user may present an identity claim in several ways, including (but not limited to) entering a personal identification number on a key pad, inserting a plastic card in a reader, or saying his name into a microphone.
  • the system passes to block 404 and issues a request to the user for an identification. This request is transmitted from the CP 110 through the SD 106 to the station 102.
  • the system passes to block 406 where it monitors incoming signals from the station 102 via SD 106, and detects energy levels of the signals corresponding to an indication that identification information may be present. If no incoming signals of this type are detected after a preset period of time, the system returns to block 404 and again requests an identification from the user. Upon detecting signals comprising identification, the system moves to block 408. In block 408 the system compares the identification information which was previously stored for the user to determine whether the identification corresponds to an enrolled user, and whether the identification permits access from the particular access point.
  • the system moves to block 410 and produces a signal rejecting the user. If the identification is found to be acceptable for the particular access point, the system moves to block 412 and sets the threshold values based upon data for the particular user which was previously entered and stored. If no threshold values are specified, the system will utilize default values.
  • The threshold values which are set include the maximum number of trials to permit before rejecting the claim (where each word constitutes a trial); the minimum number of trials to require before permitting a claim to be accepted; the level of false accept error rate
  • the system moves from block 412 to block 414 from whence it accesses the ORDER file 416 for this particular user.
  • the system gets the next word from the ORDER file which was created in the enrollment mode.
  • Upon receiving the current token uttered by the user, the system moves to block 420 and obtains the normalized autocorrelation coefficients r. These r coefficients are obtained by use of the test procedure.
  • the test procedure functions on the current speech to develop the r coefficients, as well as other parameters, in order to facilitate the verification process. This procedure was previously described in connection with the illustration of Figure 4.
  • in block 422 the system develops ERROR_FA by comparing the present token with the corresponding four templates of the user and then utilizing the global distortion obtained from those comparisons to obtain the correct value for ERROR_FA from the STAT file for this word.
  • the procedure for obtaining this ERROR_FA is illustrated in more detail in Figure 15 and will be discussed hereafter.
  • the system passes from block 422 to block 424, wherein the fail counter is incremented.
  • the system then passes to block 428 where further testing is conducted to determine whether any action as to acceptance or rejection should be taken.
  • the test procedure which is performed in block 428, is described hereafter with reference to Figure 16.
  • Based upon the decision of the test procedure in block 428, the system either makes no decision and returns to block 414 wherein the next word is obtained from the ORDER file and operation continues as described above, or the system passes to either block 430 and rejects the claimant, or to block 432 and accepts the claimant. From blocks 430 and 432 signals are produced which are transmitted to station 102 of Figure 2 to advise the claimant of the decision. Of course, these signals could also be provided to other external equipment to accomplish things such as opening doors to secured areas, initiating operation of selected equipment and the like.
  • the system moves to block 426 where the cumulative error is adjusted.
  • the cumulative error corresponds to the ERROR_FA value provided from block 422.
  • the cumulative error comprises a combination of the previous cumulative error value and the current ERROR_FA value received from block 422. In one preferred embodiment, this combination comprises the product of the previous cumulative error value and the current ERROR_FA value received from block 422.
  • the procedure for obtaining the ERROR_FA in block 422 may be described.
  • Upon entering block 422 the system immediately passes to block 440 of Figure 15 and gets the next of those four aa files developed in the enrollment mode which correspond to the word being tested.
  • the system passes to block 442 and compares the new r coefficients obtained in block 420 of Figure 14 with the aa values obtained from block 440. This comparison is accomplished by the procedure outlined previously with reference to Figures 8 and 9, producing a global distortion representative of the difference between the r values of the new token, and the aa values of the token obtained in block 440.
  • the system next passes to block 444 and reads the global distortion value obtained in block 442. This distortion value is saved and the system passes to block 446 and determines whether any of the N stored tokens corresponding to this user have not yet been compared with the r values of the current utterance. If there are tokens that have not been compared the system returns to block 440 and functions as described above. If no more tokens remain to be compared, the system moves to block 448 where the global distortions produced by the comparisons of the new r values with the N stored templates of the user are processed to develop a composite distortion. This composite distortion may comprise any of several different values.
  • the token distortion values are averaged together to comprise the composite distortion.
  • the lowest value of the distortions is selected, and in yet another preferred embodiment the lowest two distortions are averaged.
  • the basis upon which the make-up of the composite distortion is selected may be dependent upon the type of application for which the invention is utilized and upon the desires of the operator.
  • the system passes to block 450 and references the per-word STAT file created in the enrollment mode.
  • the composite distortion is compared with the distortions in the per-word STAT file and that distortion which is closest to, but greater than, the composite distortion is identified.
  • the ERROR_FA associated with the identified distortion is extracted, and is utilized as the ERROR_FA which is provided to block 422 in Figure 14.
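  • The composite-distortion and table-lookup steps just described can be pictured with the following sketch. The averaging rule shown is only one of the options mentioned above, and the STAT-file layout and all names are illustrative assumptions rather than the actual record format.

    #include <stddef.h>

    /* Assumed per-word STAT entry: a distortion level and its ERROR_FA. */
    struct stat_entry {
        double distortion;
        double error_fa;
    };

    /* Composite distortion: here the plain average of the n global
       distortions; the text also allows the minimum, or the average of
       the two smallest values, as alternatives. */
    double composite_distortion(const double *g, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += g[i];
        return sum / (double)n;
    }

    /* Return the ERROR_FA of the entry whose distortion is closest to,
       but greater than, the composite distortion.  The table is assumed
       sorted by increasing distortion; if the composite exceeds every
       entry the last ERROR_FA is returned. */
    double lookup_error_fa(const struct stat_entry *stat, size_t n, double composite)
    {
        for (size_t i = 0; i < n; i++)
            if (stat[i].distortion > composite)
                return stat[i].error_fa;
        return stat[n - 1].error_fa;
    }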
  • block 428 of Figure 14, which comprises the test block, is described with respect to Figure 16. Specifically, upon entering block 428 the system passes to block 460 and determines whether the fail counter from block 424 indicates that the user has exceeded a preselected number of trials. In a default condition this preselected threshold value is two. Of course, this value may be set at the discretion of the operator. If the maximum number of failed trials has been exceeded, the system passes to block 430 and rejects the claimant as described previously.
  • the system passes to block 462 and determines whether the maximum number of trials has been exceeded. This condition arises when the cumulative error of the claimant is greater than the threshold value for acceptance, but the trial has not been failed in block 460. A default threshold value for the maximum number of trials is five. Again, the operator can select another value if he desires. If the maximum number of trials has been exceeded, the system passes to block 430 and rejects the claimant in the manner described above. If the maximum number of trials has not been exceeded, the system passes to block 464 and determines whether the cumulative error from block 426 of Figure 14 is greater than the threshold value. If it is, the system passes to block 414 of Figure 14 and obtains the next word from the ORDER file and then proceeds as described above.
  • the system passes to block 466 and determines whether the claimant has passed at least two trials. This means that at least two trials have been conducted for this claimant, and that the cumulative error at the end of the second trial is below the threshold value for acceptance.
  • the cumulative error is adjusted with each trial in block 426.
  • the threshold is also adjusted along with the cumulative error. This adjustment can be made as a proportion of the change in cumulative error, such as by multiplying the threshold by the present ERROR_FA value, in the same manner as is done for the cumulative error.
  • the tendency of the cumulative error value to go down as it is multiplied by fractions would be accompanied by a lowering of the threshold level, so that the likelihood of an impostor being accepted could not be increased with an increasing number of trials.
  • the system passes from block 466 to block 414, gets the next word from ORDER file 416, and proceeds as discussed previously. If at least two trials have been conducted, the system passes to block 468 and accepts the claimant by producing a signal which is transmitted through the station 102 to communicate to the claimant that he is accepted, and to otherwise activate equipment which may be operated by such acceptance signals.
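  • The decision logic of blocks 424 through 468 can be summarized by the following sketch. The default limits (two failed trials, five total trials, at least two passed trials before acceptance) follow the text; the state layout, the function names and the treatment of the optional threshold adjustment are assumptions for illustration only.

    enum { NO_DECISION = 0, ACCEPT = 1, REJECT = -1 };

    struct verify_state {
        double cumulative_error;  /* product of ERROR_FA values, starts at 1.0     */
        double threshold;         /* acceptance threshold for the cumulative error */
        int    fail_count;        /* words on which the trial failed               */
        int    pass_count;        /* words on which a score was obtained           */
        int    trial_count;       /* total words tried                             */
        int    max_fails;         /* default 2                                     */
        int    max_trials;        /* default 5                                     */
    };

    /* One word trial: returns ACCEPT, REJECT, or NO_DECISION (prompt another word). */
    int verify_step(struct verify_state *s, double error_fa, int word_failed)
    {
        s->trial_count++;
        if (word_failed) {
            s->fail_count++;                         /* block 424 */
            if (s->fail_count > s->max_fails)
                return REJECT;                       /* block 460 */
        } else {
            s->pass_count++;
            s->cumulative_error *= error_fa;         /* block 426 */
            /* The text also describes optionally lowering s->threshold by the
               same factor so extra trials cannot favor an impostor. */
        }
        if (s->trial_count > s->max_trials)
            return REJECT;                           /* block 462 */
        if (s->cumulative_error > s->threshold)
            return NO_DECISION;                      /* block 464: try another word */
        if (s->pass_count >= 2)
            return ACCEPT;                           /* blocks 466 and 468 */
        return NO_DECISION;                          /* need at least two passed trials */
    }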
  • the memory of the personal computer is organized into blocks to be used as stacks 124. These stacks will hold all modifiable data for each of twelve processes that may be contemporaneously in progress.
  • An "image" of each process to be run is stored in its stack.
  • Process 0 is designated to be special in that it needs no image initialization and it performs all other initializations and setups. Thereafter process 0 simply reads the time continuously and makes the "time" available to all other processes in a shared semaphored memory location. Only after the images have been initialized is a routine called the "clock interrupt service routine" or "clock ISR" permitted to begin operation.
  • the clock ISR pushes the image of the current process onto that process's stack. Then, based on the status of other processes, a decision is made by a "scheduler" as to which process deserves use of the digital processor in the personal computer.
  • the chosen process's stack address is then retrieved from the process table 126 and its image is popped off the stack. Operation of that process begins immediately upon returning from the interrupt.
  • a critical part of the initial image is the establishment of the correct interrupt return address and flags.
  • the procedure of changing stacks and images is called a "context switch”.
  • a second mechanism also performs a context switch.
  • a process waiting for some event hands control to the context switch mechanism, which builds an image as if it were interrupted, then continues with the stack decision and image popping of another process.
  • a semaphore is a flag which can be set and examined only as an atomic non-interruptable operation. At issue is the possibility that two processes may try to examine or modify the same location, device, or resource and expect no interference during that operation.
  • a semaphore is guaranteed validity by having a process turn off interrupts and read the flag. If a flag is free, the flag is set and interrupts are turned on again. If the flag is in use, do a context switch. The flag is examined again when the processor returns to this context. Interrupts remain off within this context until the semaphore is free.
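  • On a single processor, the semaphore discipline described above might look roughly like the sketch below, with interrupt masking standing in for the atomic test-and-set. The routine names are illustrative, and the interrupt and context-switch primitives are assumed to be supplied by the operating system rather than defined here.

    /* 0 = free, 1 = in use */
    typedef volatile int semaphore_t;

    /* Assumed operating-system primitives (not defined in this excerpt). */
    extern void cpu_disable_interrupts(void);
    extern void cpu_enable_interrupts(void);
    extern void context_switch(void);   /* hand the processor to another process */

    void semaphore_acquire(semaphore_t *sem)
    {
        for (;;) {
            cpu_disable_interrupts();    /* make the examine-and-set atomic */
            if (*sem == 0) {
                *sem = 1;                /* flag was free: claim it */
                cpu_enable_interrupts();
                return;
            }
            /* flag in use: switch out; interrupts stay off within this
               context and the flag is re-examined when it runs again */
            context_switch();
        }
    }

    void semaphore_release(semaphore_t *sem)
    {
        *sem = 0;
    }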
  • Each device i.e. the clock, screen, keyboard, disk, diskette, rs232 bus, and SVB
  • the SVBs are examined in a "round robin" fashion in the interrupt service routine to determine which board has the right to use a DMA function which is shared by up to four boards.
  • When using a device a process sets the device semaphore, turns off interrupts, then commands the device, then sets the device busy flag, and then context switches out.
  • the device interrupt service routine must reset the device busy flag.
  • the scheduler can then return to that process.
  • the process must then release the device semaphore and turn interrupts back on.
  • the decision of the scheduler as to which process gets use of the processor is based on priorities, device busy flags, and process states. There are only two levels of priority. High priority is given only to speech output which must remain contiguous (continuous) to sound correct to the user of the system. All other processes have low priority. High priority processes have absolute precedence over low priority processes.
  • the scheduler examines the process table in a "round robin" fashion starting at the process immediately following the one being switched out. During this round robin examination, processes waiting for devices and dead processes are skipped. The first runnable low priority process is remembered. If a high priority process is waiting, the first one found is served first. If none is found, the remembered low priority task gets service. It is not possible for process 0 to be blocked or "dead"; therefore the scheduler always has a process to run. Processes may be "killed" or "doomed". A killed process will terminate immediately, context switch out (if in) and be subsequently skipped in the round robin examination. A doomed process will continue until the process finishes, upon which time it kills itself.
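  • A schematic version of this round-robin selection is sketched below. The table layout and state names are assumptions, the count of twelve processes is taken from the stack description above, and actual context switching is left to the image push/pop mechanism already described.

    #define NUM_PROCESSES 12

    enum proc_state { DEAD, RUNNABLE, WAITING };   /* simplified process states */

    struct proc_entry {
        enum proc_state state;
        int high_priority;     /* nonzero only for contiguous speech output */
    };

    /* Pick the next process after 'current', scanning the table round robin.
       The first runnable high-priority process wins; otherwise the first
       runnable low-priority process found is used.  Process 0 is never
       dead, so a candidate always exists. */
    int schedule_next(const struct proc_entry table[NUM_PROCESSES], int current)
    {
        int low_candidate = -1;
        for (int n = 1; n <= NUM_PROCESSES; n++) {
            int p = (current + n) % NUM_PROCESSES;
            if (table[p].state == DEAD || table[p].state == WAITING)
                continue;                   /* skip dead and device-blocked processes */
            if (table[p].high_priority)
                return p;                   /* first high-priority process found      */
            if (low_candidate < 0)
                low_candidate = p;          /* remember first runnable low-priority   */
        }
        return low_candidate;               /* process 0 guarantees this is valid     */
    }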
  • the process driver is an infinite loop which executes the function specified in the process table. This infinite loop will kill itself upon finding the process doom flag set and then context switch out.
  • Processes may be "exec'ed". To exec a process, first it must be assuredly dead. A kill command is issued for that process and a function pointer is copied into the process table entry for that process, then the remainder of the process table entry is initialized except for the process state which must remain as killed until the entry is up to date. Then the process state is raised to "live".
  • the operating system must maintain the fixed disk in the equivalent of a file system.
  • the disk is organized into seven major blocks: the BOOT, OPERATING SYSTEM, STRUCTURES, HEADERS, IMPOSTORS, ENROLLEES, and AUDIT TRAIL.
  • the BOOT contains the program necessary to bring up the operating system including the disk linkage for finding the operating system and structures.
  • the OPERATING SYSTEM contains the program that operates, e.g., all devices and resources in the management of verification.
  • STRUCTURES is a map of where all the 7 major blocks are located. These major blocks are sometimes split into separated minor blocks. For instance, two copies of the HEADER are stored in widely separated disk sectors so that damage to one copy does not permanently destroy valuable data.
  • the HEADERS contain all the enrollee personal data and linkages into the ENROLLEES block and IMPOSTORS block.
  • the IMPOSTORS block is an array of tokens. Each IMPOSTOR has 5 tokens each of 5 words.
  • the ENROLLEES block is an array of tokens. For each enrollee there are 5 tokens each for 5 words.
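  • For illustration only, the fixed-disk organization described above could be captured by declarations along the following lines. The field types and sizes are assumptions; only the block names, the 5-words-by-5-tokens arrangement, the statistics kept in the header, and the duplicated header copies come from the text.

    #define WORDS_PER_SPEAKER  5
    #define TOKENS_PER_WORD    5
    #define TOKEN_BYTES      512            /* assumed size of one stored token */

    struct token {
        unsigned char data[TOKEN_BYTES];
    };

    struct speaker_tokens {                 /* one IMPOSTOR or ENROLLEE record */
        struct token token[WORDS_PER_SPEAKER][TOKENS_PER_WORD];
    };

    struct header {                         /* stored twice, in separated sectors */
        char          name[32];             /* enrollee personal data (assumed)   */
        unsigned long enrollee_index;       /* linkage into the ENROLLEES block   */
        unsigned long impostor_index;       /* linkage into the IMPOSTORS block   */
        double        intra_mean, intra_variance;
        double        inter_mean, inter_variance;
    };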
  • the verification procedure is coded in a separate module which uses primitive functions provided by the operating system.
  • Verification is broken down into three major sub-modules: VERIFY, ENROLL, and MANAGE.
  • ENROLL can only be seen from the action of the "Manager Access Point".
  • the two buffers are required for keeping data in RAM during the operation of generating intra- and inter-speaker statistics. Shuffling off and onto disk during this operation would slow the system down considerably.
  • (NOTE: These buffers are also used by other modules which may not run concurrently with ENROLL, such as BACKUP. BACKUP also shuffles blocks of 25 tokens each off and on disks and diskettes.)
  • the prompts are kept in RAM for similar speed reasons.
  • the sharing of the processor by the scheduler could cause portions of the output speech to fragment, leading to poor quality sound. Therefore, speech output is given high priority.
  • the process sets up a command for the interrupt handler, marks itself as waiting for service and signals the board interrupt handler. Since the interrupt handler has interrupts turned off there is never contention for the DMA so that semaphores and device busy flags are unnecessary.
  • the interrupt handler drops the waiting-for-service flag for a process as soon as it is finished with that process's request.
  • the process context switches out while waiting for service and is not executed again until the waiting-for-service flag drops.
  • the process semaphores the disk, sets up the command and DMA, marks the disk as busy, starts the DMA, and then requests a context switch to give other processes time to execute while the disk is busy.
  • the disk executes an interrupt which marks the disk as idle and turns off the semaphore.
  • the scheduler will, when the round robin loop reaches this process again, continue execution of this process.
  • ENROLL is started from the MANAGER which "dooms" process 1. Since process 1 is normally a verify process, the system waits for the verify to time out and kill itself. It would be inappropriate to interrupt a valid verification which may be occurring.
  • ENROLL begins by getting vital statistics, and then as each token is retrieved it is placed in the R token buffer in sequence according to its word number and token number. The tokens are then examined for large variances in length. If any such variances are found, an attempt is made to replace the tokens that have standout lengths. If too many tokens are "standout" within a word, a fresh set of tokens for that word is prompted. If this is unsuccessful, the ENROLL aborts. If the ENROLL is successful, the tokens are then converted to k tokens which are placed appropriately in the k token buffer.
  • the board is then commanded to generate scores on each token as compared to all the other tokens of the same word. These scores are used to determine the intra-speaker statistics which are stored in the header. Then each installed impostor is read into the r token buffer and inter-speaker statistics are generated in the same manner. For each impostor that is not installed a default set of statistics is generated. These statistics are combined into generalized inter-speaker statistics. These statistics are then stored in the header. The k token buffer is then assigned to a User Disk Index (udi) and written out. It is read back into the r token buffer and a byte-by-byte comparison is made to determine if there is a bad sector on the disk.
  • VERIFY is the normal function for processes 1 through 8.
  • Appendix A is a copy of the object code for the subprograms for verification, data
  • the object code is in octal form, written in the programming language C and the Masscomp version of the M68000 assembler language for use on the IBM PC/XT. Also attached, as Appendix B, is a copy of the SP object code which is written in the
  • the apparatus and method described above comprise a significant improvement over the prior art systems by providing a reliable and efficient speech verification system which: (1) includes an efficient and accurate means for detecting the beginning and end of speech; (2) provides parameters for use in verification which are related to features of the vocal cord, and which may provide for comparison between reference and test utterances without limitations based on time or intensity restrictions; (3) provides a verification mode which develops an indication of the probability of erroneously accepting an impostor or rejecting the true speaker.
  • the system may proceed to calculate (500) the intra-speaker global distortion values for the enrollee's utterances of the first reference word. Once all the global distortion values have been calculated for that word, the largest value is discarded and the mean and variance for the remainder of the global distortion values are calculated and stored (502) for later use in the verification operation.
  • the inter-speaker comparisons for calculating the inter-speaker global distortion values are performed (504) , the largest distortion value is discarded, and, for the remainder of the global distortion values, the inter-speaker mean and variance are calculated and stored (506) .
  • the intra-speaker and the inter-speaker mean and variance are calculated and stored, as described above.
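  • In code, the statistic-gathering step just described might look like this minimal sketch: the largest global distortion is dropped and the sample mean and variance of the remainder are computed. The variance convention (division by the remaining count) is an assumption, and at least two distortion values are required.

    #include <stddef.h>

    /* Discard the largest of the n global distortions (n >= 2) and compute
       the mean and variance of the remainder, as described for blocks
       500-506 of the enrollment procedure. */
    void distortion_statistics(const double *g, size_t n, double *mean, double *variance)
    {
        size_t largest = 0;
        double sum = 0.0, sumsq = 0.0;
        size_t count = n - 1;

        for (size_t i = 1; i < n; i++)
            if (g[i] > g[largest]) largest = i;   /* index of the value to drop */

        for (size_t i = 0; i < n; i++) {
            if (i == largest) continue;
            sum   += g[i];
            sumsq += g[i] * g[i];
        }
        *mean = sum / (double)count;
        *variance = sumsq / (double)count - (*mean) * (*mean);
    }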
  • the system retrieves from memory the intra-speaker and inter-speaker mean and variance values, the reference words, the k coefficients for the utterances associated with each reference word, and the two threshold values, U and V, corresponding to the user's claimed identity (522).
  • the system randomly selects one word and prompts the user to speak the selected word into the microphone (524) .
  • the received utterance is immediately processed to obtain the normalized autocorrelation coefficients r (526) .
  • the r coefficients are then compared with the k coefficients for each of the stored utterances of that word (previously spoken by the enrolled user) to calculate a new set of global distortions (528) . From the set of global distortions a single combined score (i.e. the mean global distortion) is calculated.
  • p and q are calculated; they represent, respectively, the probability that the enrolled user would produce an utterance yielding a combined score as poor as or worse than the one just calculated and the probability that an impostor would produce an utterance yielding a combined score as good as or better than the one just calculated.
  • p is calculated by integrating the Gaussian density function 540 characterized by the two intra-speaker values (i.e. the mean 542 and the variance 544) from the combined score 546 to positive infinity (i.e. p equals the area of the shaded region 548) .
  • q is calculated by integrating the Gaussian density function 550 characterized by the two inter-speaker values (i.e. the mean 552 and the variance 554) from negative infinity to the combined score 546 (i.e. q equals the area of shaded region 556).
  • P and Q are cumulative probability values which are updated each time new values for p and q are calculated, i.e. each time a new utterance is received from the user, P and Q (which are initialized to 1) are updated by multiplying each by its respective corresponding individual probability value, p or q. Consequently, after the user's first utterance, P and Q equal p and q, respectively (560).
  • P and Q are compared, respectively, with the two threshold values, U and V, for the enrolled user. If P is less than U but Q remains greater than V (564) the user is rejected; if, conversely, Q is less than V but P remains greater than U (566) the user is accepted. If, however, P remains greater than U and Q remains greater than V (568) no decision to accept or reject the user is made; instead another one of the reference words is chosen and the user is prompted for another utterance. New values for p, q, P, and Q are calculated for this utterance and the updated values of P and Q are once again compared to U and V. The system continues to prompt the user for new utterances until a verification decision is made to either accept or reject the user.
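  • The per-utterance probability update and threshold comparison described above can be sketched as follows. The Gaussian tails are evaluated with the C library's erfc(); the function names are assumptions and the decision ordering is illustrative rather than a transcription of Figures 25 and 26.

    #include <math.h>

    /* Upper-tail probability P(X >= x) for X ~ N(mean, variance). */
    static double gaussian_upper_tail(double x, double mean, double variance)
    {
        return 0.5 * erfc((x - mean) / sqrt(2.0 * variance));
    }

    /* One verification step: given the combined score for the latest utterance,
       update the cumulative probabilities P and Q and compare them with the
       thresholds U and V.  Returns +1 accept, -1 reject, 0 ask for another word. */
    int probability_step(double score,
                         double intra_mean, double intra_var,
                         double inter_mean, double inter_var,
                         double *P, double *Q, double U, double V)
    {
        /* p: chance the true speaker scores this badly or worse (area 548). */
        double p = gaussian_upper_tail(score, intra_mean, intra_var);
        /* q: chance an impostor scores this well or better (area 556).      */
        double q = 1.0 - gaussian_upper_tail(score, inter_mean, inter_var);

        *P *= p;                              /* P and Q are initialized to 1 */
        *Q *= q;

        if (*P < U && *Q >= V) return -1;     /* reject the claimant            */
        if (*Q < V && *P >= U) return +1;     /* accept the claimant            */
        return 0;                             /* no decision: prompt a new word */
    }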
  • the sensitivity of the verification decision, and thus the security level, is related directly to the value of the two threshold values, U and V. If U and V are assigned relatively small values the verification operation will be an extended one requiring a long sequence of utterances and analysis.
  • the benefit of low threshold values is an increased accuracy, i.e. greater certainty that an impostor has not been accepted and that a valid user has not been rejected; the benefit of increased accuracy must of course be weighed against a lengthy verification operation.
  • the utterances he or she spoke during the verification operation are used to update both the stored reference utterances and the intra-speaker and the inter-speaker mean and variance values. This is done to accommodate the changes which occur in the human voice over time. The updating is done autoregressively such that very recent utterances are given more weight than very old ones.
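  • The autoregressive updating mentioned above amounts to an exponentially weighted average; a one-line sketch, with an assumed weight alpha not given in the text, is:

    /* Exponentially weighted (autoregressive) update of a stored statistic,
       giving recent utterances more weight than old ones.  The weight alpha
       (0 < alpha < 1) is an assumed parameter, not a value from the patent. */
    double autoregressive_update(double stored, double recent, double alpha)
    {
        return alpha * stored + (1.0 - alpha) * recent;
    }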

Abstract

A speaker verification system and method develop, in an enrollment mode (22), a data base comprising plural templates (47) for each of a group of words from a speaker having parameters corresponding to the vocal tract, which parameters are compared with themselves and with parameters representing templates of those words from a generic group of speakers, to develop distortion values indicating closeness of match of the speaker's vocal tract to his own templates and of a generic vocal tract to the same templates of the speaker. These distortions are used to develop information indicative of probabilities for given words of erroneously accepting an imposter, or erroneously rejecting the correct speaker (46). In a verification mode (50), the speaker utters word tokens corresponding to his stored templates. Parameters are extracted (60) and compared with the parameters of the speaker's stored templates to develop distortion values representing the difference between the compared parameters. The distortion values for the current speech are analyzed against the probabilities of erroneous acceptance or rejection. A plurality of speech samples may be processed during verification. Acceptance or rejection is based on cumulative probabilities (70). Access to controlled areas is permitted in response to an acceptance signal from the verification system.

Description

Speaker Verification System Background of the Invention This application is a continuation-in-part of United States patent application serial number 751,031 filed July 1, 1985, and assigned to the same assignee as this application.
This invention relates to speaker verification, and particularly to a method and system for identifying an individual based on speech input. There are numerous situations in which it is necessary to establish an individual's identity. Such situations include controlling physical access to a secure environment, validating data entry or data transfer, controlling access to an automated teller machine, and establishing identification for credit card usage. Identification by voice is advantageous over alternative systems in all of these cases because it is more convenient, optionally may be carried out at a distance such as over the telephone, and can be very reliable.
Prior automated speaker verification systems have suffered a number of problems which have limited their reliability and have, therefore, restricted their application, in many cases, to areas not requiring high reliability.
One such type of system utilizes selected peaks and valleys of successive pitch periods to obtain characteristic coordinates of the voiced input of an unknown speaker. These coordinates are selectively compared to previously stored reference coordinates. As a result of the comparison, a decision is made as to the identity of the unknown speaker. A serious limitation of this system is that it may experience problems resulting from changes in overall intensity levels of the received utterance as compared to the stored utterances. Another system in this area of technology compares the characteristic way an individual utters a test sentence with a previously stored utterance of the same sentence. The system relies on a spectral and fundamental frequency match between the test utterance and the stored reference utterances. As a result, the system is subject to errors from changes in the pitch of the speaker's voice.
Still another type of arrangement which has been utilized for verification of a speaker, filters each utterance to obtain parameters that are highly indicative of the individual, but are independent*of the content of the utterance. This is accomplished through linear prediction analysis, based on the unique properties of the speaker's vocal tract. A set of reference coefficient signals are adapted to transform signals identifying the properties of the speaker's vocal tract into a set of signals which are representative of the identity of the speaker, and indicative of the speaker's physical characteristics. Prescribed linear prediction parameters are utilized in the system to produce a hypothesized identity of an unknown speaker, which is then compared with ,the signals representative of the identified speaker's physical characteristics, whereby the identity of the unknown speaker is recognized. The reference describes no mechanism in its system by which the distortion between the test and reference utterances can be compared. As a result, no explicit method is provided for actually carrying out the stage of speaker verification, as opposed to the preliminary steps of utterance analysis.
Another system which takes a somewhat different approach, comprises a speaker verification system utilizing moment invariants of a waveform which corresponds to a standard phrase uttered by a user. A number of independent utterances of the same phrase by the same user are used to compile a set of moment invariants which each correspond to an utterance vector. An average utterance vector is then computed. Comparison of the average utterance vector with later obtained utterance vectors provides for verification of the speaker. Moment invariants from a group of persons of different ages and sexes are also stored, and the group of invariants from persons in the group who are closest in age and sex to the user are compared against the stored utterance vectors of the user to arrive at a weight vector. The user's weight vector and computed utterance vector are stored on a card and used in computing a threshold, and in evaluating later utterances of the user. The system provides no mechanism by which the distortion between the test and reference utterances may be compared. Further, reliability of acceptances based on comparison of new utterances against the average utterance vector could be questionable. This is especially true if the utterances of the user tend to have a large variance in their various characteristics. For example, if the speaker says a word differently at any given time, the single average value provides no flexibility in recognizing such varying speech. Still another system of interest calculates "distance thresholds" and "mean values" between a test word to be classified and other reference samples. "Weighting factors" are utilized to gauge the importance of particular variables. A fixed threshold for a given user is required, except that a comparison of portions of the test word outside the threshold may still be used to verify the speech if those portions come within a minimum distance of portions of a reference sample. If no reference sample happens to be near the test sample, there is no means to gain acceptance where the test sample is outside the basic threshold. For example, a user may not be verified if he unintentionally does not properly pronounce the word. In still another system in this technology, the average values of acoustic features from a plurality of speakers are stored in standardized templates for a given reference word. In response to utterances of identified speakers, a set of signals representative of the correspondence of the identified speaker's features with the feature template of the reference word is generated. An unknown speaker's utterance is analyzed by comparing the unknown speaker's utterance features and the stored templates for the recognized words. Again, this system experiences the problem of comparing the single sample of incoming speech with a threshold for that particular user. As a result, depending upon the setting of the threshold, the user may be unable to qualify for verification if his single attempt to pronounce the word varies by too great an amount from the reference information stored in the system. Another problem with many prior art systems is that they have no reliable or tractable means for detecting the beginning and end of speech. This further reduces the reliability of prior art systems, or greatly increases the cost of systems with elaborate schemes for attempting to identify speech endpoints. 
Without such detection it is very difficult in many prior art systems to reliably compare reference speech with incoming speech having endpoints which are difficult to detect. In light of the various problems in the speaker verification technology, such as those indicated above, it would be a great advance in the technology to provide a system with increased reliability as to speaker verification. It would be a further important improvement if such a system were substantially independent of variations in intensity, relied on speaker-dependent features that reflect the unique properties of the vocal tract, provided a parameter used to evaluate acceptability which is based on a plurality of trials of a given user and upon speaker comparison with his own word templates, and provided information for a given user comprising probabilities of accepting an impostor based on the selected threshold level for verification of the user. It would be a still further improvement if the system provided for a plurality of samples to be taken from the user before acceptance or rejection, so that spurious or happenstance errors in the pronunciation of a word may not prevent recognition of the user.' It would be a still further improvement if the verification criteria were influenced by the number of tests performed, and were subject to adjustment or alteration by supervisory personnel in changing the stringency of the verification threshold. Brief Summary of the Invention This invention comprises a method and apparatus for carrying out verification of a user's identity by evaluation of his utterance. Specifically, the invention comprises a system which initially develops a data base comprising word samples from a user which are processed by comparison with themselves and with generic utterance data, for developing measures of the probability of erroneously accepting an impostor with respect to verification based on a given word. With this data base created, the system operates to verify the identity of a speaker based upon plural trials and in light of the information in the data base.
Before a speaker can use the system for verification purposes, he must be enrolled. In order to enroll the user, the system repeatedly prompts the user for tokens (a token is a single utterance of a word) of a series of reference words until a sufficient number of tokens of the words are obtained and stored in a data base. The tokens are subjected to feature analysis whereby certain coefficients representative of the speaker's vocal tract are obtained. The tokens are also subjected to end point detection. By comparison of the tokens with themselves, and with corresponding tokens of a generic group of people, the system obtains measures of the probabilities of erroneously accepting an impostor or rejecting the true speaker.
When an enrolled user wishes to have his speech verified, he enters the identity he is claiming and the system then prompts him for an utterance. His utterance is digitally encoded and analyzed. The start and end points are detected, and coefficients corresponding to features of the utterance are developed. Selected coefficients developed from the user's previously recorded tokens of the selected word are compared with the coefficients of the newly received utterance, producing a measure of the distance between the new utterance and each of the reference tokens. This process may be repeated for additional utterances from the user. By analyzing one or more of the measures of distance against the probability information developed during enrollment, the system determines the probabilities of making erroneous decisions. Based on those probabilities, decisions are made at successive stages whether to accept or reject the user. The use of cumulative probabilities from stage to stage provides a means of dynamically evaluating the speaker in conjunction with several trials of speech directed to different words, so that the verification decision is based on the user's performance on each of the various words, reducing the likelihood of erroneously accepting an imposter. One general feature of the invention is in making the verification decision in stages where in an earlier stage a decision is made whether or not to proceed to a subsequent stage, and in the subsequent stage a verification decision is based both on the analysis made in the first stage and the analysis made in subsequent stages.
Another general feature is in basing the verification decision on at least one probability value derived from probability data which is in turn derived from stored speech information.
Another general feature of the invention is in updating the stored information based on test information about a speaker's utterance, but only if the speaker has been verified as being the known person.
Another general feature is a non-mechanical, non-magnetic power switch for triggering a device (e.g., a solenoid of a door lock) upon verifying a speaker as a known person.
Another general feature is in both detecting and decoding coded tone signals, and performing speech verification using the same digital processor. Another general feature is in time interleaving the analyses of different utterances received from different stations so that the analyses can proceed simultaneously.
Another general feature is in the combination of a plurality of stations for receiving utterances, a plurality of processors for serving respective stations, and a host computer having a real-time operating system for serving respective stations in real time.
Other features and advantages will become apparent from the following description of the preferred embodiment, and from the claims.
Brief Description of the Drawings Figure 1 illustrates a general block diagram of the method of speech verification as used in the present invention.
Figure 2 is a detailed block diagram of one preferred embodiment of an apparatus for use in the speaker verification system of the present invention. Figures 3 and 4 are flowcharts illustrating the operation of the system of the present invention. Figure 5 is a state diagram of the speech detector system of the present invention. Figures 6 through 8 are flow charts illustrating the operation of the system of the present invention.
Figure 9 is a graphical representation of the method for obtaining the global distortion representative of comparison of tokens of a given word.
Figures 10 and 11 are flow charts illustrating the operation of the system of the present invention.
Figure 12 is a tabular representation of the array "D" for organizing distortions developed in operation of the present invention.
Figure 13 is a tabular representation of a per-word STAT file created during operation of the present invention. Figures 14 through 16 are flow charts illustrating the operation of the system of the present invention.
Figure 17 is a block-diagram of a multiple access module system served by a host computer with a real-time operating system.
Figure 18 is a diagram of an access module.
Figure 19 is a block diagram of a portion of the access module.
Figure 20 is a circuit diagram of a relay-type circuitry in the access module.
Figure 21 is a diagram of memory locations and tables for a real time operating system.
Figures 22, 23 show various tables used in the real-time operating system. Figure 24 is a flow-chart of a verify procedure.
Figures 25, 26 are flow-charts of an alternate verify procedure.
Figure 27 shows Gaussian functions for use in the alternate verify procedure. Detailed Description of the Preferred Embodiment 1. FUNCTIONAL DESCRIPTION The present invention may be functionally described by reference to Figure 1. The operation of the system is initiated in block 20 by an external condition, such as the operation of a switch or other mechanical activating device, by electronic detection equipment, such as photosensors which detect the presence of a user or by voice or sound activation which detect the user speaking or otherwise making a noise, such as activating a "touch tone" signal.
The activated system will function to perform operations which have been requested either by the user, or which have been preselected for system operation by an operator at an earlier time. In this regard, from the initiate block 20, the system moves to decision block 22, and based on the instructions which it has received, determines whether it is to enter the "enroll" mode which is indicated generally at 24. The enroll mode accomplishes a procedure whereby information relating to a particular user is obtained from that user, processed and stored for use at subsequent times in verifying the identity of the user. This information includes utterances (i.e., tokens) by the user of selected reference words which are processed to form a data base which is used for comparison purposes during subsequent verification of the user. Thus, before an individual can use the system for verification purposes, he or she must go through an enrollment process in the enrollment mode of operation. If the system is instructed to enter the enroll mode, the system passes from decision block 22 to block 26 where the operator keys in information to the system, including the user's name, access points through which the user may pass, and maximum permissible levels for false acceptance and false rejection errors in verification of the user. If the access points and maximum permissible levels are not specified, default values are used.
From block 26 the system passes to block 28 where it prompts the user to provide an utterance of one of the list of reference words which are to be recorded by the user. Such utterances produce an acoustic message or communication whose pattern is indicative of the identity of the user.
After prompting the user the system passes to block 30 where it samples incoming signals. Upon detecting an utterance the system passes to block 32 wherein the detected utterance is converted into digital form.
From block 32 the system passes to block 34 where the incoming signals are periodically sampled with each period forming a sample containing signals representing the detected speech during the particular period of time. The samples are stored until a specified number of them form a frame, which is then processed to obtain autocorrelation functions, normalized autocorrelation functions, and linear prediction coefficients representative of the frame.
The set of coefficients which are extracted for each token (termed a "template") are utilized in the verification process to be described subsequently. The linear prediction coefficients comprise a representation of the characteristics of the speaker's vocal tract. From block 34 the system passes to block 36 wherein the features extracted in block 34 are utilized for detecting the beginning and end of the utterance. A state machine is employed in accomplishing this end point detection, based upon energy and spectral levels of the incoming signals. The extracted features and end point information are stored in a temporary store 38 and the system then passes from block 36 to decision block 40.
If a specified number of tokens of each of the selected words have not yet been obtained then decision block 40 returns to block 28 and the user is prompted to provide another utterance, and processing continues as described above. If it is determined in block 40 that the necessary tokens have been obtained then the system passes to block 42 and awaits in a loop there until such time as instructions are received to form a data base. Upon being instructed to form a data base, the system moves from block 42 to block 44 wherein global distortion values are formed by accessing the extracted features from block 38, and conducting comparisons between the individual templates of a given reference word against themselves, and also comparisons of these features against corresponding tokens from a generic group of speakers. As a result of these comparisons, a set of global distortions is produced, each distortion being representative of a distance measure between two corresponding templates.
From block 44 the system passes to block 46 where the global distortions for both the intra-speaker comparisions of corresponding templates from the same speaker, and inter-speaker comparisons of the user's templates with the generic templates are processed and ordered into arrays which indicate the probability of a false acceptance of an impostor depending upon a selected threshold level. The templates and corresponding thresholds are stored in reference templates block 47, and the identity of the user relating to the templates and thresholds stored in block 47 is stored in block 48. After all templates have been formed and thresholds have been computed, the system moves from block 46 to block 49 and terminates further operation of the enroll mode.
If the system in decision block 22 is not instructed to enter the enroll mode, the system moves to decision block 50 where it is determined whether the system is to enter the verify mode, generally indicated at 53. The verification procedure is initiated when a user presents himself at any access point. If the system is instructed to enter the verify mode, the system moves to block 52 and awaits activation by a user. The user may activate the system in several ways including, but not limited to, entering a personal identification number on a key pad; inserting a plastic card in a reader; or saying his name into a microphone. Upuu receiving this user data in block 52, the system moves to block 54 where the system sends a message to h user requesting that he input an utterance which is one of _he reference words previously stored in the enrollment mode.
From block 54 the system moves to block 56 where it detects incoming electronic signals representative of speech. When speech is detected, the system moves to block 58 where the signals are digitally encoded and formed into 8-bit words representative of the incoming speech detected during the preselected period of time referred to hereinafter as a "frame".
The system next moves to block 60 where each frame is processed to obtain its linear prediction coefficient parameters (referred to hereafter as LPC parameters). The development of the LPC parameters first comprises computation of autocorrelation coefficients which measure the degree of statistical dependency between neighboring speech samples within the frame. Typically, each frame contains from 100 to 300 speech samples, and the autocorrelation coefficients represent the relationship between these various samples. The autocorrelation coefficients are processed by use of a well-known algorithm referred to as "Levinson's Algorithm" to produce the group of coefficients representative of the frame. These coefficients comprise the linear prediction coefficients, and are representative of the geometry of the vocal tract of the speaker. These coefficients are then processed to transform them into a single, transformed coefficient which is representative of the group of linear prediction coefficients representing the frame of speech. This information is stored for future use.
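A conventional Levinson-Durbin recursion of the kind referred to above is sketched below; it converts a frame's autocorrelation values r[0..p] into linear prediction coefficients a[1..p]. This is the textbook algorithm, not a transcription of the SP firmware of Appendix B, and the sign convention is one of several in common use.

    /* Textbook Levinson-Durbin recursion: given autocorrelation values
       r[0..order] for one frame, compute LPC coefficients a[1..order]
       (a[0] is set to 1) and return the final prediction error energy. */
    double levinson_durbin(const double r[], double a[], int order)
    {
        double error = r[0];
        a[0] = 1.0;
        for (int i = 1; i <= order; i++)
            a[i] = 0.0;

        for (int i = 1; i <= order; i++) {
            /* reflection coefficient k for stage i */
            double acc = r[i];
            for (int j = 1; j < i; j++)
                acc += a[j] * r[i - j];
            double k = -acc / error;

            /* update coefficients a[1..i-1] in place, pairing j with i-j */
            for (int j = 1; j <= i / 2; j++) {
                double tmp = a[j] + k * a[i - j];
                a[i - j] += k * a[j];
                a[j] = tmp;
            }
            a[i] = k;

            error *= (1.0 - k * k);   /* remaining prediction error energy */
        }
        return error;
    }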
The production of transformed coefficients as described above is continued for segments of the utterance until the utterance is complete. Thus, in block 62, the system detects the end point of an utterance, and forms a template comprising the group of transformed coefficients corresponding to that utterance, with the template thus being representative of the utterance.
From block 62 the system passes to block 64 where the template for the utterance is compared by
means of a dynamic programming analysis to reference templates corresponding to the particular utterance. The reference templates for comparison are retrieved from a reference templates store 66 which contains those templates which were produced by the individual user during the enrollment process, as well as templates of the corresponding word which are representative of a generic group of speakers. The particular templates to be referenced in block 66 are identified by a signal from an identity store 68 which selects the particular reference templates in block 66 based upon the identity of the user which was entered through block 52, and by use of the particular word being prompted, which was identified in block 54. The distance comparisons or "distortions" generated in block 64 comprise a set of values, with each value being a difference of the present utterance being tested as compared to one of the reference templates of this particular user. These difference values are then processed to provide a "score" which indicates the relative correspondence between the current utterance and the reference utterances.
The system next passes to decision block 7C'and references the appropriate threshold value from block '/2 in determing whether to accept, reject or make no decision yet. The threshold values in block 72 may be set by the operator, but if not set by the operator they will be set to default values. These threshold values comprise a set of "scores" which are based upon the results of comparisons in the enablement mode, and which indicate the probability of falsely accepting an impostor given a selected test utterance. In block' 70, the score from block 64 is compared with the threshold value from block 72. If the score is below the threshold value, and if the user has provided utterances for at least two test words then he 5 is accepted and the system passes to block 74 from whence a signal is produced terminating operation of the verification mode. If the user has not provided at least two samples, even though the score is below the threshold, the system returns to block 54 where the user 0 is prompted to provide an utterance corresponding to another of the reference words provided in the enrollment process.
In block 70, if the user has a score which is above any threshold value in block 72, this is counted 5 as a failure. If there are less than two failures, the ../stem proceeds to the block 54 and prompts the user for another of the user's reference words. If the user has failed twice, the system passes from block 70 to block 74 an rejects the user as an imposter. From block 74,
20 the system produces a signal preventing further op_r=*tion based upon the nonverification of the individual.
The above reference to the number of trials or -.o a failures for achieving verification or rejection
".5 are for example purposes only, and these parameters may be varied by the operator as desired. ith each new test θi.d, the threshold for acceptance or rejection remains fixed in the illustrated embodiment. However, this optionally could be made to change in proportioned
30 relationship to the change in the composite score from block 64.- which changes over time as more speech trials are conducted. t
Referring to Fig. 17, in system 110, an IBM Personal Computer 112 (XT or equivalent) includes a real-time operating system 114 that provides real-time servicing via a bus 116 connected to four speaker verification boards (SVBs) 118. Each SVB in turn can serve two stations 102 on a multiprocessing basis. One of the stations, station 103, is dedicated to enrollment and is denoted an enrollment station.
The system of the present invention has been designed to work on a host computer's bus in either a synchronous or direct memory access (DMA) manner after an interrupt protocol. Although the host interface can be programmed for virtually any standard, the embodiment disclosed herein assumes use of an IBM PC/XT as the host. Thus, the host may be selected from among many of the well-known IBM compatible personal computers which are presently on the commercial market, with only minor modifications necessary to interface to virtually any computer bus. The host provides for user interaction to collect and store reference templates as described above, to calculate, update and store statistically information about the reference templates, and to perform certain verification tasks. The physical configuration of the system illustrated in Figure 17 is such that the bus 116 and the SVBs 118 are housed internally in the host computer with the stations 102 being external to the host.
Referring to Fig. 18, each station 590 includes a 16-key touch-tone pad 592, a microphone 594 and indicator lights 596. Station 590 is connected to an SVB 600 via a multiplexor 602 and two-way three-wire signal lines 598.
Referring to Fig. 19, at various times during enrollment and verification the user depresses the touch-tone keys (e.g. to identify himself during verification) and speaks into the microphone (e.g. to speak the reference words), although never simultaneously. Depressing the touch-tone keys causes DTMF tone generator 604 to produce one of the standard DTMF tone signal pairs which is then passed through signal combiner 606 and out onto lines 598 for delivery to the SVB 118. When spoken into, the microphone produces voice signals which also pass through the signal combiner 606 and out onto lines 598.
At the SVB, the SD 106 (described below) receives the signals and passes them through analog-to-digital (A/D) converter 608. The digitized samples are passed from SD 106 to SP 122 (described below) which, as described below, assembles them into frames and, for each frame, derives linear prediction coefficients. Once a signal indicative of activity has been detected (by the state machine of Fig. 5), CP 110 (described below) analyzes the subsequent incoming frames to determine whether speech or tone signals are being received. If tone signals are detected, a conventional Fourier analysis is performed to determine which key has been depressed. If, however, speech signals are being received they are processed as described below. The decoding of DTMF signals thus uses the same hardware and software as is required for the processing of speech signals. Referring to Fig. 20, within station 590, a solid state relay 610 (for operating, e.g., a door lock solenoid) is connected to a power source 612 (power supply) which is capable of supplying 12 volts or 24 volts, AC or DC (pulsating DC), and is switch selectable. Relay 610 includes (within the dashed lines) an "open" collector non-inverting buffer 614 for providing power to a light source for a Dionics DIG-12-08-45 opto-coupler 616. (The light source contained within the opto-coupler is a light emitting diode (LED) of the Gallium Aluminum Arsenide type, emitting light in the near infrared spectrum.)
A door open signal 618 of a specific wave shape delivered over line 598 resets a counter 620 which is driven by a clock signal from an on-board door strike oscillator 622. A second counter 624 is switch selectable (switches 626) for various door strike "on" times (2 sec to 50 sec). The output of counter 624 is connected to a gate 628 which implements either "norm off" or "norm on" door strike power as selected by jumper 630. The signal output of gate 628 directly controls the state of the solid state relay 610. The reason for an optically coupled (electrically isolating) relay is twofold: firstly, it prevents referencing the power MOSFETS of the relay 610 or their input drive to system ground and, secondly, it isolates the system logic from noise and any high voltage (11500 volt) transients. The output of opto-coupler 616 drives two power
MOSFETS 632, 634 (Motorola MTP25N06 or International Rectifier Corp. IRF-541) which are characterized by a very low resistance drain to source (0.085 ohms max., known as R "on"). All power MOSFETS have an integral diode, drain to source, connected in a reverse direction from the normal polarity of the operating voltage. The diode plays an important role in the operation of relay 610, which allows AC or DC operation. With one input polarity, the current flows through the bulk of the left hand power MOSFET and through the diode of the right power MOSFET. When the polarity is reversed the path of the current flow also reverses. Unlike mechanical contacts, power MOSFETS are not subject to changes over time in their contact resistance and are therefore considerably more reliable. They are also unaffected by humidity and require extremely low input currents (< 100 x 10 amps) to turn the device "on" (thus a very low power voltage source can be employed for turn-on). Element 634 is a V47ZA7 available from General Electric and is employed for transient limiting and protection. The system of Figure 2 is capable of multi-tasking two verification channels for handling two users at different stations simultaneously in real-time. In addition, the system is designed to support as many as 16 separate user stations 102. The SD 106 is comprised of elements including a PAL20RA10CNS programmable logic array from MMI for control purposes, in addition to other well-known conventional logic elements for constructing such a data system.
The SD 106 receives analog signals from the stations 102, and multiplexes the signals to select two channels from among the set of 16. The two channels are then PCM filtered, both on their inputs and outputs, and are sampled every 125 to 150 microseconds. SD 106 is driven by its own clock 108 which is interconnected thereto to obtain the samples. The SD 106 is electrically connected to a control processor subsystem (CP) 110, and interrupts the CP 110 at every sample conversion time set by the clock 108. During this time, the CP 110 can perform successive read and write operations to both channels from the SD 106. At no other time can the CP 110 gain valid access to the SD 106 for reading or writing.
The CP 110 comprises a 10 MHz MC68000 microprocessor 112, a memory 114 and logic 116 to generate various control signals for input/output and memory accesses. The CP 110 communicates with and coordinates all other functions in the system, including communicating and coordinating with the host computer. Although the CP 110 can perform signal processing tasks, it mainly functions as a data administrator. Thus, it can directly control and gain access to each part of the system and it also can communicate with the host computer in the form of coded data via a host interface unit (HIU) 118.
The CP 110 is also electrically connected to a central data buffer (CDB) 120 which is a fast 4K X 16 random access memory data buffer, through which a majority of the inter-processor data and messages can be quickly and efficiently passed. The buffer could comprise, for example, four IMS1421S-50 RAM devices (4 X 4096) produced by INMOS. In the case of the verification mode of operation, CDB 120 is the sole source of data transfer between processors. The only other data path of this kind is the port between the CP 110 and the HIU 118. CDB 120 includes data ports for the CP 110, for the HIU 118 and for a signal processor subsystem (SP) 122. CDB 120 also includes an address port for memory-mapped input/output communication with
the CP 110. For direct memory access purposes, there is a 12-bit counter 124 which allows fast sequential auto-incrementing access to the CDB. The 4K X 16 memory is accessed by the SP 122, and the HIU 118, but only byte-wide by the latter. The HIU 118 provides a signal on the line illustrated at 126 which, when the CDB 120 is placed in the proper mode, only allows auto-incrementing every two bytes. Either word or byte-wide accessing is available to the CP 110, however, for maximum efficiency.
One of the most important functions of the CP 110 is to manage allocation of the CDB 120. In light of this, and to maximize efficiency, the CDB 120 was designed to interface with the CP 110 in memory-mapped fashion. The SP 122 and HIU 118, however, view the CDB as an input/output device, but attainable via direct memory access. Most inter-processor communications must utilize the CDB 120, therefore much of the system time in the verification mode is spent in this element of the system. Nevertheless, any processor can gain access to the CDB 120 at any time since there is no hard-wired priority system.
Accesses to the CDB 120 are strictly controlled by the CP 110, to the extent that CDB 120 requests are first granted by the CP 110. The protocol starts with an interrupt to the CP 110 by the requesting processor. The CP then checks the queue to see if the CDB 120 is being used. If not, an interrupt is sent back to the requester, which then can assume that it has been granted possession of the CDB bus.
No specific ports are enabled by the CP 110; instead the processor possessing the CDB 120 takes care of this through the use of its own control signals. Although this scheme conserves hardware, it means that any processor can gain access to the CDB at any time, as indicated above. For this reason, the CP 110 must control the queueing of requests for the CDB 120.
Since the SP 122 has no direct communications port to either of the other processors, the CDB 120 must be employed for this purpose. The HIU 118 can also deposit and retrieve messages and/or data via the CDB 120, even though there is a port between it and the CP 110. To provide this system of communication, each processor must have a pre-assigned window of memory, the location of which is known to the other processors. Therefore, data can be passed to any processor from any processor by simply accessing the appropriate window in memory while in possession of the CDB 120.
The CP 110 is also connected to the SP 122 which is a 16/32-bit TMS32010 signal processing microprocessor. The SP 122 also includes memory storage for the SP microcode provided by elements such as two 8 X 1024 bit PROMS of 754 ns speed, such as MMI6381. This subsystem includes random logic for input/output decoding, communications with the CP 110, and a data path to the CDB 120. SP 122 performs arithmetic calculations at a rate of 5 million operations per second, and therefore is used for all of the heavy number handling tasks. The CP 110 initiates a task in the SP 122 by sending a special signal over line 128 which causes the SP 122 to branch to the location specified by a command written by the CP into the CDB. In the verification mode, SP 122 has a pre-stored library of commands comprising uncoded vector locations, which the CP 110 calls to perform various tasks. Therefore, the SP 122 is a slave system whose tasks are initiated by the CP 110. It is also possible for the CP 110 to assign the CDB 120 to the SP 122 for the entire duration of a selected task, in which case the CDB 120 becomes a dedicated storage medium for the SP 122.
The HIU 118 functions as the interface between the host processor (not shown) and the CP 110, as well as the CDB 120. All signal interfacing is buffered by the HIU 118, which conditions the timing and control signals from the host to meet compatibility requirements. For example, any incompatibilities between the timing system of Figure 2 and that of the host processor are taken care of by a programmable array logic device (PAL) (not shown) which is internal to the HIU 118.
Any data transfers from the host processor must first be requested in the form of an interrupt to the CP 110. The interrupt vector can either be read from the CP host processor data port or from the message window in the CDB as described above. However, prior to system accesses, the host processor must know the base address. The base address is installed in the HIU 118 input selector device, which is simply an address comparator. Each system's base address is user programmable via switches, giving the host processor multiple systems to use for greater processing power. Thus up to 64 stations 102 can be provided for use in the verification mode by merely installing more circuit boards onto the host processor's bus. SP 122 and the HIU 118 are connected to a ubiquitous reset and multi-phase clock generator (RCG) 130 which provides all the reset and clock signals for the entire system. It will be appreciated that the various elements illustrated in block diagram form in Figure 2 can each be constructed in any of a number of ways which are well-known to those skilled in the technology.
3. THE METHOD The method by which the system accomplishes its purpose can be described in more detail by reference to Figures 3 - 16. Specifically, Figure 3 illustrates the enroll mode which is generally indicated at 24 of Figure 1. Referring to Figure 3, it is seen that once the system moves to the enroll mode in block 150, the system begins operation so that in block 152 it obtains the user's name, the access points through which this user may pass, and the maximum permissible levels for false acceptance of a user and false rejection of a user. Default values are provided where necessary, if no values are specified. In step 152 the system also retrieves a list of reference words which are to be recorded by the user and stores them for reference. For purposes of discussion, the use of 10 reference words is described herein. However, it will be appreciated that substantially any desired number of reference words may be selected without changing the function or structure of the system. By using a number of different reference words, it is possible to identify those words of the particular user which have the lowest probability of resulting in an erroneous acceptance of an impostor. As the number of reference words used is increased, the ability to obtain words having this low probability of error is enhanced. Thus, the enroll mode is functioning to generate a set of data from this particular user which may be utilized in the later verification portion of the system operation.
From block 152 the system passes to block 154 where the information is assigned a temporary identification number and then the system passes to step 156.
Block 156 is a decision block relating to the number of tokens of the 10 reference words which have been collected. If a selected number of tokens (or utterances) of each of the 10 words have not been made, the host processor passes to step 158 and requests a new word from the user by generating a prompt command. For purposes of discussion, the use of four tokens for each reference word is described herein. Of course, it will be appreciated that substantially any number of tokens could be utilized for proper operation of the method and apparatus of the present invention.
In response to the prompt command, the CP 110 of Figure 2 produces a signal through SD 106 which is communicated via input/output line 104 to a station 102 whereby, through means of a speaker or other audio or visual communication means, the user is prompted to utter one of the ten reference words for which tokens are to be stored. The microphone in station 102 receives an utterance of the reference word from the user, and communicates it from station 102 via line 104 and SD 106 to the CP 110, where it is further processed as explained hereafter. From block 158 of Figure 3, the host processor moves to block 160 where it instructs the system to produce autocorrelation coefficients (r coefficients) for the incoming speech by calling the "enroll" routine. The enroll routine will be discussed more completely hereafter in reference to Figure 4. As it is producing the r coefficients, the system examines those coefficients to determine whether incoming speech is present in the system. If no speech is present within a preset time period following the prompt signal, the host processor determines that there is a failure and passes to block 162. If the number of times the system has detected no speech following a prompt signal exceeds a predetermined threshold number, the processor moves to block 164 and aborts further attempts to enroll the user. On the other hand, if the number of failures does not exceed the threshold figure, the processor returns to block 158 and again causes the user to be prompted to input an utterance. If speech is detected in block 160, the host processor moves to block 166 and initiates conversion of the r coefficients to transformed linear prediction coefficients (aa coefficients) which provide a correlation between the detected utterance and the geometry of the user's vocal tract. This conversion is performed as part of the enroll routine to be described with respect to Figure 4.
After the aa coefficients have been generated, the host processor moves to block 168 and causes the autocorrelation coefficients and transformed linear prediction coefficients to be stored in an array for future use. The processor then moves to block 170 and increments the word and token counter to identify the next token of the current word if four tokens have not yet been received from the user, or to the next word if four tokens have been received. As was indicated above, the number of tokens is programmable, and is selected by an operator prior to use. From block 170, the processor returns to block 156.
If the system has collected four tokens for each of the words in its list, then the system moves from block 156 to block 169 and sets a flag indicating that the stored information from the enroll mode must be processed to produce the data base necessary for later system operation in the verify mode. Having set this flag, the system passes to block 171 where it exits the enroll mode of operation and returns to the initiate block 20 of Figure 1 to await further instructions. If four tokens of each of the ten words have not been received, the system passes from block 156 to block 158 and operates as described above. By reference to Figure 4, it is possible to describe the enroll routine which is activated by the host processor in block 160 of Figure 3. Upon entering the enroll mode 160 of Figure 4, the system moves to block 172 wherein the system allows the MC 68000 microprocessor of the CP 110 of Figure 2 to digitize the analog speech signal received from SD 106 at a sampling rate of approximately 8,000 samples per second. Sample frames are formed and stored in buffers allocated in memory 114 of the CP 110. Each one of the samples is read from a coder/decoder (CODEC) chip in the SD 106 as an 8-bit wide µ-law encoded word whenever control is passed to the CODEC interrupt handler. Once a buffer in the CP memory 114 is filled with a frame of samples, the system moves to block 174 and copies the samples from the buffer into the CDB 120. Each of the 8-bit code words making up the samples is decompanded to define a 16-bit sample. This decompansion is done on the fly utilizing a table driven procedure for the sake of speed. The use of µ-law compansion is well-known in the technology. A comprehensive discussion of this subject matter is presented in "Digital Signal Processing Application
Report: Companding Routines for the TMS32010" by Texas Instruments (1984) , No. SPRA001. This document is hereby incorporated herein by reference.
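Purely for illustration (and not a reproduction of the Texas Instruments routines cited above), a table-driven µ-law decompansion of the kind described might be sketched in Python as follows; the function names are hypothetical and the expansion follows the standard µ-law rule rather than any particular CODEC data sheet.

def build_mulaw_table():
    # Precompute the 16-bit linear value for each of the 256 mu-law code words.
    table = []
    for code in range(256):
        c = ~code & 0xFF                      # mu-law code words are stored inverted
        sign = -1 if c & 0x80 else 1
        exponent = (c >> 4) & 0x07
        mantissa = c & 0x0F
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
        table.append(sign * magnitude)        # range is roughly +/- 32124
    return table

MULAW_TABLE = build_mulaw_table()

def decompand(frame_bytes):
    # Expand a buffer of 8-bit code words into 16-bit samples "on the fly".
    return [MULAW_TABLE[b] for b in frame_bytes]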
From block 174 the system moves to block 176 and accomplishes preemphasis by forming a frame which includes the present frame plus the last half of the previous frame, thereby permitting an overlapping analysis. The samples are preemphasized by use of a first order finite impulse response filter which is applied to the input samples in a format as follows:
y(n) = x(n) - 0.94 x(n-1)    (1)
where n equals the index number of the sample; x equals the samples prior to preemphasis; and y equals samples after preemphasis. Preemphasis is performed to emphasize the high frequencies and cancel the low frequency emphasis caused by the sound transition from the speaker's lips to the open space between the speaker and the microphone.
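A minimal sketch of the Equation (1) filter (assuming, for simplicity, that the sample preceding the frame is taken as zero, and omitting the overlapping-frame bookkeeping described above) might be:

def preemphasize(samples, coeff=0.94):
    # y(n) = x(n) - 0.94 * x(n-1), per Equation (1)
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - coeff * prev)
        prev = x
    return out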
Once preemphasis is accomplished, unnormalized autocorrelations are calculated as follows:

R(i) = Σ (j = 0 to L-i-1) [ s(j) x s(j+i) ]    (2)

for i = 0, ... P; where s is the preemphasis value of the sample; L is the analysis frame size; and P is the autocorrelation order.
The unnormalized autocorrelations from Equation (2) are converted into normalized correlations by the relation:
r(i) = R(i)/R(0) (3)
for i = 1, ... P.
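A sketch of Equations (2) and (3), with hypothetical function and variable names, might read:

def autocorrelations(s, P):
    # Equation (2): unnormalized autocorrelations over a frame of length L
    L = len(s)
    R = [sum(s[j] * s[j + i] for j in range(L - i)) for i in range(P + 1)]
    energy = R[0]                              # R(0), used for speech detection
    # Equation (3): normalization by the frame energy
    r = [R[i] / energy for i in range(P + 1)] if energy else [0.0] * (P + 1)
    return energy, r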
It is noted that the R(0) correlation is referred to as the "energy" and is a quantity which is utilized by the system in the detection of the beginning and end of an utterance as is explained hereafter. The normalized autocorrelations (r coefficients) from Equation (3) together with the energy term R(0) are returned to the CP 110 as soon as the SP 122 has finished the calculations. From block 176 the system passes to block 178 where it develops the linear prediction (LP) coefficients. Specifically, the CP 110 retrieves a copy of the r coefficients from the CDB 120 and then starts the SP 122 to develop the LP coefficients. The recursive procedure employed for accomplishing this is well-known in the technology as Levinson's Algorithm. This algorithm is discussed in detail in J. D. Markel and A. H. Gray, "Linear Prediction of Speech," Springer-Verlag, New York, (1976). This text is incorporated herein by reference.
Levinson's Algorithm delivers a set of filter coefficients by the recursive procedure which relies on the relationships indicated below:
K_n = - [ Σ (i = 0 to n-1) a[n-1](i) x r(n-i) ] / e(n-1)    (4)

e(n) = e(n-1) x (1 - K_n^2)    (5)

a[n](i) = a[n-1](i) + K_n x a[n-1](n-i)    (6)

for i = 0, ... n; where a[n](0) = 1 for all n; a[0](i) = 0 for all i not equal to 0; and e(0) = 1.
In a Pth order LPC analysis, the Levinson Algorithm set forth in Equations (4) - (6) delivers a set of filter coefficients a(i), i = 0, ... P comprising the LP coefficients. In this Pth order analysis, the value of n ranges from 1 to P, and a(i) is calculated for each value of i = 0, ... P for each value of n. From block 178 the system moves to block 180 and transforms the LP coefficients into aa coefficients. The procedure for accomplishing this transformation is as follows:
aa(0) = Σ (i = 0 to P) a(i)^2

aa(j) = 2 x Σ (i = 0 to P-j) a(i) x a(i+j)

for j = 1, ... P. This transformation is done to increase the speed of utterance comparison operations in the verification mode.
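The following sketch illustrates the recursion of Equations (4) - (6) and the aa transformation. It assumes normalized autocorrelations r with r(0) = 1 and is an illustration of the textbook Levinson procedure, not a transcription of the SP 122 microcode.

def levinson(r, P):
    a = [1.0] + [0.0] * P            # order-zero coefficients: a(0) = 1, others 0
    e = 1.0                          # e(0) = 1
    for n in range(1, P + 1):
        k = -sum(a[i] * r[n - i] for i in range(n)) / e               # Equation (4)
        a = [a[i] + k * a[n - i] for i in range(n + 1)] + a[n + 1:]   # Equation (6)
        e = e * (1.0 - k * k)                                         # Equation (5)
    return a, e                      # LP coefficients a(0..P) and residual energy

def aa_transform(a):
    P = len(a) - 1
    aa = [sum(x * x for x in a)]                                      # aa(0)
    for j in range(1, P + 1):
        aa.append(2.0 * sum(a[i] * a[i + j] for i in range(P + 1 - j)))
    return aa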
The aa coefficients together with the binary logarithm of the so-called residual energy (e) are left in the CDB 120 when the SP 122 completes the transformation.
Once a frame is analyzed and its normalized autocorrelations, its energy R(0), the aa coefficients, and the logarithm of the residual energy e have been extracted, the system moves to decision block 182. In block 182 the CP 110 feeds the extracted data into a state machine to determine whether a user was speaking into the microphone of station 102 during the time the frame was recorded. This state machine is described in more detail hereafter with reference to Figure 5.
If the speech detector finds that no speech is active, the system moves to block 184 and determines whether the speech detector state machine is in an exit state. If it is in an exit state this indicates that speech is completed and the system moves to block 186, terminates operation of the algorithm of Figure 4, and then returns to block 160 of Figure 3. If the state machine is not in an exit state while the system is in block 184, the system returns to block 172 to obtain the next speech samples for processing as described above.
If the speech detector indicates that speech is currently active while the system is in decision block 182, the system moves to block 188 and determines whether the system is in the enroll mode. If it is in the enroll mode, the system moves to block 190 and stores the aa coefficients in the memory 114 of Figure 2. If the system is not in the enroll mode, then the system moves from block 188 to block 192 and stores the normalized autocorrelation coefficients and the residual energy from the Levinson's Algorithm in memory 114. From either of blocks 190 or 192, the system returns to block 172 and continues functioning as described above. As soon as the last frame of active speech from the user has been found, all of the parameter sets extracted so far are copied from the memory 114 into the CDB 120. The host is then interrupted by the CP 110 indicating that the parameters for a full utterance are being delivered to the CDB 120. The host then reads the parameters from CDB 120 and eventually stores the results on its mass storage devices.
By reference to Figure 5, the state diagram which is utilized for detection of the start and end point of speech may be discussed. Whenever autocorrelations and the energy for a new frame have been derived from an input signal, a finite state machine having the states indicated in Figure 5 is employed to determine from these parameters whether speech is active, or which of the transition states between speech and no speech the system is in.
The state diagram includes a silence state 200 which is both the initial and final state. In this state, the machine waits to detect the beginning of an utterance and returns after the end of the utterance is detected.
When a possible beginning of an utterance is detected, the state machine goes from the silence state 200 to the attention state 202. Specifically, the machine goes to the attention state if the energy of the detected signal is either above a certain upper threshold level or, if the energy is above a certain lower threshold level and the normalized autocorrelation function (r) has a Euclidean distance which is more than a preselected threshold distance "a" from that value of the autocorrelation function (r) which the machine has measured for noise. This noise autocorrelation function is recursively updated by the machine in the silence state 200.
From the attention state 202 the machine will go to a speech state 204 when the detected energy is high enough to prevent return of the system to the silence state, and when the machine has spent three cycles in the attention state 202.
Once in the speech state 204 the system will remain there until the detected energy drops below an "end of speech" threshold, indicating that a possible end of the utterance has been detected. At that time, the machine will go to an exit state 206. From exit state 206 the machine will go to silence if the detected energy is not high enough to exceed a lower threshold after five cycles. If energy of a sufficiently high value is detected, the machine will move from exit state 206 to resumption state 208 which functions, similar to the attention state, to move the machine to the speech state 204 when the energy is high enough not to return to the exit state 206 and when the machine has spent three cycles in the resumption state 208.
It is noted that the speech detection automaton also controls the attenuator system of the present invention. Thus, whenever the state is silence, the attenuator register is altered to keep the noise on some constant level. Once the state machine has left the silence state, the attenuation is kept constant. In one presently preferred embodiment, the speech detection machine comprises software, although a hardware embodiment could be readily provided by one skilled in the technology and based on the above description of the machine.
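A software sketch of the Fig. 5 automaton follows. The numeric energy and distance thresholds are placeholders (no values are given above), the noise-autocorrelation update is an assumed exponential average, and the class and method names are hypothetical; only the state structure and the three- and five-cycle counts follow the description.

import math

SILENCE, ATTENTION, SPEECH, EXIT, RESUMPTION = range(5)

class SpeechDetector:
    def __init__(self, upper=1e6, lower=1e5, end=5e4, dist_a=0.3):
        self.state, self.count = SILENCE, 0
        self.upper, self.lower, self.end, self.dist_a = upper, lower, end, dist_a
        self.noise_r = None              # running estimate of the noise autocorrelation

    def _noise_distance(self, r):
        if self.noise_r is None:
            return 0.0
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(r, self.noise_r)))

    def step(self, energy, r):
        if self.state == SILENCE:
            # update noise estimate while idle, then test for a possible utterance start
            self.noise_r = r if self.noise_r is None else [
                0.9 * n + 0.1 * x for n, x in zip(self.noise_r, r)]
            if energy > self.upper or (energy > self.lower and
                                       self._noise_distance(r) > self.dist_a):
                self.state, self.count = ATTENTION, 0
        elif self.state in (ATTENTION, RESUMPTION):
            fallback = SILENCE if self.state == ATTENTION else EXIT
            if energy < self.lower:
                self.state, self.count = fallback, 0
            else:
                self.count += 1
                if self.count >= 3:      # three cycles before entering speech
                    self.state = SPEECH
        elif self.state == SPEECH:
            if energy < self.end:        # possible end of the utterance
                self.state, self.count = EXIT, 0
        else:                            # EXIT state
            if energy > self.lower:
                self.state, self.count = RESUMPTION, 0
            else:
                self.count += 1
                if self.count >= 5:      # five quiet cycles: back to silence
                    self.state = SILENCE
        return self.state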
Having completed the enrollment operation, the system can be activated either by an operator, or by other means such as a timing device, to utilize the information generated in the enrollment mode for forming the data base which will be used to accomplish comparisons of incoming speech signals in the verification mode. Specifically, Figure 6 illustrates a method of system performance by which the appropriate data base may be constructed.
Upon receiving a signal requesting formation of a data base, the system moves from block 220 of Figure 6 to block 222, and determines whether the flag set in 169 of Figure 3 is still set. If the flag is not set, then there is no data to be processed and the system moves to block 224, terminates operation of the procedure for making the data base, and returns to initialize state 20 in Figure 1.
If data is present to be processed, the system moves from block 222 to block 226 where it obtains the next word stored in CDB 120. In block 226 the system also determines whether it is finished with the preparation of the data base. If it is not, the system passes from block 226 to block 228. In block 228 the system makes "intra-speaker" comparisons wherein the statistics developed in the enrollment mode are compared with each other to develop "global distortions" indicating the extent to which the words differ amongst themselves. These intra-speaker comparisons are computed by obtaining, for each of the speaker's N tokens of that word, the global distortion produced by comparing the r(i) version of a given token with the aa(i) versions of all other tokens of the same word, produced by the same speaker. Thus, if there are t different tokens, there are T_W = (t x t) - t intra-speaker comparisons. The host processor obtains the global distortion G from this intra-speaker comparison by use of an intra-speaker routine which implements the above-described procedure and which will be described hereafter with respect to Figures 7 - 9. Once the intra-speaker global distortions have been developed in block 228, they are stored for future use and the system moves to block 230.
In block 230 the inter-speaker comparisons for a speaker's word are computed by obtaining, for each of the speaker's N tokens of that word, the global distortion (G) that results from comparing r(i) versions of that token with the aa(i) versions of all other tokens of the same word, produced by all speakers in a generic data base. This generic data base encompasses a wide range of variation in the rendition of the ten reference words that are used by the verification system, since it represents the utterance of these tokens by a group of individuals who each have distinctly different vocal tract characteristics. If there are n speakers, there are T_B = n x t inter-speaker comparisons. A more detailed discussion of the method of accomplishing the inter-speaker comparisons in block 230 will be discussed hereafter with respect to Figure 10.
After obtaining the inter-speaker distortions in block 230, the system moves to block 232 where the distortions from the intra-speaker and the inter-speaker comparisons are merged, sorted in numerically descending order, and stored in an array D. The total number of distortions in this array is d = T_W + T_B. The distortions in this array continue to be labeled as inter-speaker or intra-speaker distortions. An example of the array D created in block 232 is illustrated in Figure 12.
From block 232 the system passes to block 234 where a statistics file STAT is created. By use of the ordered distortions of Figure 12, the system creates a file which, for each intra-speaker distortion, provides an indication of the likelihood that the system will erroneously reject the actual speaker, or erroneously accept an impostor, if the threshold value for making the accept/reject decision is based on a distortion value corresponding to the distortion of that particular intra-speaker distortion. One example of a STAT file based upon the information in array D of Figure 12 is illustrated in Figure 13. The procedure utilized in block 234 for developing the STAT file of Figure 13 will be described hereafter with respect to Figure 11.
From block 234 the system of Figure 6 returns to block 226 to obtain the enrollee's next word. If all of the words have been processed in the manner described above, the system moves to block 236 and constructs an ORDER file which indicates the relative ability of each of the reference words to distinguish the present speaker from other speakers whose data is stored in the generic data base. Words with high discriminability will be those that have, at any given level of false reject error rate, low false accept error rates. The system determines the relative discriminative power of the words for the current user by sorting those words based on the highest value of the false accept designation in that word's per-word STAT file. Thus, for example, the value used for sorting the word represented by the STAT file in Figure 13 would be the first entry under the ERROR_FALSE ACCEPT heading, which is the value of 2/6. These representative values from each of the per-word STAT files for the reference words are ordered in numerically descending order, and are saved in a file "ORDER" which is identified on a per-user basis.
While in block 236 the system also enters the information obtained at the time of enrollment (name, access points, error levels) for the particular user in an identity-store file in the system, which is linked to the data base information corresponding to that user's identity.
By reference to Figures 7-9 it is possible to describe the procedure utilized in block 228 of Figure 6 for doing the intra-speaker comparisons. With reference to Figure 7, it is seen that upon entering block 228 of Figure 6, the system moves to block 240 where it loads the next token of the word which is presently under consideration and which was uttered by the enrollee. Once the next token is loaded, the system passes to block 242 where the procedure "COMPARE" is called. The COMPARE procedure provides the global distortion for the particular token under consideration, based on a comparison of that token with the other tokens of that word as spoken by the enrollee.
Upon completing the comparison and obtaining the global distortion, the system moves to block 244 and stores the distortion for later use. From block 244 the system returns to block 240 and obtains the next of the N tokens, and then processes this as described above. If no more tokens are available for the given word, the system moves to block 230 which corresponds to block 230 of Figure 6 and initiates an inter-speaker comparison. Referring now to Figures 8 and 9, the COMPARE procedure utilized in block 242 of Figure 7 is described. Upon entering the COMPARE block 242 of Figure 7, the system moves to block 242 of Figure 8 and initiates a comparison between a reference token and a test token. In the case of an intra-speaker comparison, the reference token is one of the previous utterances of the enrollee. The test token is the token which is currently being processed in the making of the data base, as illustrated in block 228 of Figure 6.
The comparison is accomplished by comparing a reference pattern of aa coefficient sets against the r(i) and e values of the test template. The circumstances of the test are represented graphically in Figure 9 where it is seen that the reference template is defined by designating the aa values for each frame from zero to j on the reference axis 280. Likewise, the r and e values for each frame from zero to i of the test template are located along the i axis 282. The length of the reference template is designated as extending from the origin 284 to M on the j axis. Likewise, the length of the test template is indicated as extending from the origin 284 to the location indicated as N on the i axis.
The results of the test comprise a global distortion value G which is representative of the minimum amount of distortion experienced in traversing all possible paths between the origin 284 and the intersection of lines perpendicular to the axes at the M and N locations, designated at 288. This global distortion G is derived as the result of a two-stage process which is embodied in the flow diagram of Figure 8 and which will now be explained. From block 242 of Figure 8 the system passes to a decision block 246 where the number of frames M of the reference template is compared against the number of frames N of the test template. If the difference between these numbers of frames is outside preselected threshold values, the system moves to block 248 and provides an output signal "999" indicating that the lengths of the utterances are incompatible for comparison purposes. For the illustrated embodiment, the threshold values are:
M < 2N and
N < 2M
If the lengths of the utterances fall within the threshold requirements, the system passes to block 250 and sets the initial value of i = 0. From block 250 the system passes to block 252 and sets the initial value of j = 0. From block 252 the system next passes to block 254 and accomplishes the first stage of the comparison procedure by computing the local distortion at a given point i,j in the system. Since the initial values of i and j are 0, the procedure begins at the origin 284 from whence the minimum path to point 288 must be determined. In block 254, the SP 122 of Figure 2 compares the "N" vectors of aa coefficients with the normalized autocorrelations "r". The "N" resulting local distortions are stored as 16-bit values in the CDB 120. The relationship for obtaining these local distortions is as follows:
d(i,j) = LOG [ Σ (n = 0 to P) aa(n) x r(n) ] - LOG(e)    (11)
Thus, for each point i,j in the matrix a distortion value may be derived representative of the distance measures between the reference and test templates at that location.
Having computed the local distortion d(i,j) in block 254, the system next moves to block 256 and develops the global distance value g(i,j) for the particular location indicated by the coordinates i and j on the graph of Figure 9. To determine this minimum distance path, the process evaluates different paths to the given location and selects the path having the minimum distortion value. Thus, to determine the minimum global distortion for point 290 on the graph of Figure 9, the minimum distortion from among the three paths indicated at 292, 294 and 296 is accepted. This minimum path can be determined mathematically by the following relation:

g(i,j) = min [ g(i-2,j-1) + 2 d(i-1,j) + d(i,j) ,
               g(i-1,j-1) + 2 d(i,j) ,
               g(i-1,j-2) + 2 d(i,j-1) + d(i,j) ]    (12)
Where d is the local distortion developed previously; and where any indexes which have not previously been determined, or which result in a negative value, are assigned a value of zero. Thus, for example, the value of g(i,j) at the point i=1, j=1 would require values for g at points i-2, j-1 and i-1, j-2. Since these g values are on the negative side of the axes, they do not exist in the system and, therefore, they are assigned the value of zero.
The final global distortion, which defines the value of the minimum path, is found at point 288 in Figure 9. For this point, the system in block 256 returns a final G value which is normalized based on the length of the tokens being compared. This final G value is defined as:
G_F (normalized) = g(M,N) / (naa + nr) = g(nr-1, naa-1) / (naa + nr)    (13)
In actual operation, the CP 110 maintains only three rows of global distortions and two rows of local distortions, since this is all the information necessary to continue the above-described computation. This saves a significant amount of memory space. Further, due to the parallelogram boundaries in the dynamic programming algorithm described above, approximately 30% of the points can be excluded from an explicit search. These points would be outside of a parallelogram with its corners connecting the origin 284 and the end point 286 of the line in Figure 9. These points would be particularly concentrated near those boundaries of the graph which are in the vicinity of the M value on the j axis, and the N value on the i axis.
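The two-stage comparison of Equations (11) - (13) might be sketched as follows. The row-buffer and parallelogram optimizations just described are omitted for clarity, and the function names, the zero treatment of out-of-range indexes (per the rule stated in connection with Equation (12)), and the guard against a non-positive logarithm argument are illustrative choices rather than details taken from the embodiment.

import math

def local_distortion(aa_ref, r_test, log_e_test):
    # Equation (11): d(i,j) = LOG[ sum over n of aa(n) * r(n) ] - LOG(e)
    s = sum(a * r for a, r in zip(aa_ref, r_test))
    return math.log(max(s, 1e-12)) - log_e_test

def dtw_global_distortion(ref_frames, test_frames):
    # ref_frames: aa-coefficient vectors (length M); test_frames: (r vector, log e) pairs (length N)
    M, N = len(ref_frames), len(test_frames)
    if M >= 2 * N or N >= 2 * M:
        return 999.0                     # utterance lengths incompatible (block 248)
    d = [[local_distortion(ref_frames[j], r, le) for j in range(M)]
         for (r, le) in test_frames]
    g = [[0.0] * M for _ in range(N)]

    def G(i, j):                         # indexes off the grid are taken as zero
        return g[i][j] if i >= 0 and j >= 0 else 0.0

    for i in range(N):
        for j in range(M):
            g[i][j] = min(
                G(i - 2, j - 1) + (2 * d[i - 1][j] if i >= 1 else 0.0) + d[i][j],
                G(i - 1, j - 1) + 2 * d[i][j],
                G(i - 1, j - 2) + (2 * d[i][j - 1] if j >= 1 else 0.0) + d[i][j],
            )
    return g[N - 1][M - 1] / (M + N)     # Equation (13): length-normalized final G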
After deriving the global distortion for the current point i,j on the graph of Figure 9, the system moves to block 260 and increments the j index by 1. The system then moves to decision block 262 and determines whether the new j index is equal to the number of aa coefficients in the reference token. If the j index does not equal the number of aa coefficients, then the system returns to block 254 and computes the distortions as described above for the new location in the graph of Figure 9.
If the j index equals the number of aa coefficients in the reference token, the system moves from block 262 to block 264 and increments the i index by 1. From block 264 the system moves to decision block 266 and determines whether the new i index is equal to the number of r coefficients of the test token. If it is not, the system returns to block 252, sets the j index equal to zero and continues developing the distortions at the new location in the graph of Figure 9.
If the new index i equals the number of r coefficient sets for the test token, then the system is at the final point 288 in the graph of Figure 9. The system then moves to block 268 where the final, normalized global distortion value at point 288 is defined as indicated in Equation (13) above. This normalized value then becomes the global distortion corresponding to the comparison of the selected reference token and the test token. After completing the intra-speaker comparisons in block 228 of Figure 6, and using the procedures described above, the system passes to block 230 of Figure 6 where the inter-speaker comparisons are performed. By reference to Figure 10, the procedure for accomplishing the inter-speaker comparisons is more clearly described. In particular, upon entering block 230, the system moves to block 300 wherein it advances to the user's next token for the particular word being dealt with. Once this token is identified, the system moves to block 302 where it advances to the next generic speaker, and then moves to block 304 where it loads that specific generic speaker's next token for the current word being tried. The system next passes to block 306 where it conducts the comparison of the current token of the word from the user with the token from the particular generic speaker as identified in block 304. The procedure for accomplishing this comparison is the same as was described above with respect to Figures 8 and 9, with each comparison producing a global distortion representative of a score value for the compared tokens. After completing the comparison, the global distortion is stored for further use. From block 308, the system returns to block 304 and obtains the next token of that particular generic speaker. If there are no more tokens for that speaker relating to the word under consideration, the system passes from block 304 to block 302 and advances to the next generic speaker, and then proceeds as described above to evaluate the tokens for that generic speaker. If there are no more tokens of the present word from generic speakers, the system moves from block 302 to block 300, and advances to the user's next token of the particular word. If the comparison for all of the user's tokens of that word have been completed then the system moves from block 300 to block 232 in the flow chart of Figure 6 wherein the distortions which have been stored in block 308 of Figure 10 are merged and sorted as described previously. Referring to Figure 11, it is possible to describe the procedure for accomplishing construction of the STAT file as it is performed pursuant to block 234 in Figure 6. The system initiates the construction of the STAT file by jumping to block 310 of Figure 11, where the next global distortion value for the intra-speaker comparison having the highest order number not previously considered is obtained from array "D" of Figure 12. In reference to Figure 12 and by way of example, the activity of block 310 would obtain the distortion value 87 corresponding to intra-speaker comparison number 6. This information is stored, and the system then passes to block 312 where the ERROR_FALSE ACCEPT (ERROR_FA) value is developed.
To develop the information in the per-word STAT file of Figure 13, each of the intra-speaker distortions in the sorted array D of Figure 12 is assigned a number i = 0, ..., T_W. For each of the T_W intra-speaker distortions in array D, the number of inter-speaker distortions in the array that are numerically less than that distortion is computed and stored in array J. There are i = 0, ..., T_W elements in these arrays. For the first intra-speaker distortion encountered in array D, the total number of distortions (both inter- and intra-speaker) which are numerically equal to or less than that distortion is computed and stored as K. Based on information developed above, the ERROR_FA column of the per-word STAT file of Figure 13 may be developed pursuant to the requirements of block 312 of Figure 11 by use of the following relationship:
ERROR_FA(i) = J(i) / K    (14)

for i = 0, ..., T_W.
For purposes of understanding, the first element of the ERROR_FA column in the per-word STAT file is arrived at by applying the above formula (14) to the first intra-speaker distortion value which is encountered when working from the top down in the D array of Figure 12. In this case, this first encountered distortion value would be identified in order as number 6, having a distortion value of 87. For this distortion value, the above described relationship would produce J(i) = 2, since there are 2 inter-speaker distortions which are numerically less than order number 6. The K term equals 6 since there are 6 distortions which are equal to or less than the first intra-speaker distortion encountered, which is order number 6. On the other hand, for the intra-speaker distortion having a value of 40, it is noted that there are no inter-speaker distortions below that point, and, therefore, the numerator of the ERROR_FA term is zero, while the denominator remains as 6 since there are 6 distortions which occur from and including the first encountered intra-speaker distortion. Upon completing the development of the ERROR_FA column in block 312, the system moves to block 314 where it develops the ERROR_FALSE REJECT (ERROR_FR) column of the per-word STAT file of Figure 13. Using the relationships developed above, the ERROR_FR column can be developed by the following relation:
ERROR_FR(i) = (i - 1) / T_W    (15)

for i = 0, ..., T_W.
By reference to the array D of Figure 12, the above-identified Equation (15) for the ERROR_FR defines the first element for that value in Figure 13 by looking at the first intra-speaker distortion which is encountered and which would therefore be assigned the number i=1. Being aware of the fact that, in the general case, there are N intra-speaker distortions in the system, the above Equation would indicate that at a distortion of 87 the value of the ERROR_FR term would be (1-1)/N = 0/N = 0.
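A sketch of the per-word STAT construction of Equations (14) and (15) is given below. The distortion values in the usage comment are illustrative only (they reproduce the 87, 2/6 and 40, 0/6 pattern of the worked example above, not the actual contents of Figure 12), and the fixed denominator K tied to the first intra-speaker distortion follows that worked example.

def build_stat_file(intra, inter):
    # Array D of Fig. 12: all distortions merged and sorted in descending order,
    # each entry still labeled as intra-speaker or inter-speaker.
    D = sorted([(v, "intra") for v in intra] + [(v, "inter") for v in inter],
               key=lambda e: e[0], reverse=True)
    # K: total distortions (either kind) equal to or less than the first
    # (largest) intra-speaker distortion -- the fixed denominator of Equation (14).
    K = sum(1 for v, _ in D if v <= max(intra))
    stat, i = [], 0
    for value, kind in D:
        if kind != "intra":
            continue
        i += 1
        J = sum(1 for v, k in D if k == "inter" and v < value)    # array J entry
        stat.append({"distortion": value,
                     "ERROR_FA": J / K,                           # Equation (14)
                     "ERROR_FR": (i - 1) / len(intra)})           # Equation (15)
    return stat

# Illustrative use (made-up values): build_stat_file(intra=[87, 55, 43, 40],
# inter=[120, 110, 70, 65]) yields ERROR_FA = 2/6 for the 87 entry and 0/6 for 40.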
From block 314 the system next passes to block 316 and stores the information developed in blocks 312 and 314 in the per-word STAT file as illustrated for discussion purposes in Figure 13. With the per-word STAT file completed, the system moves from block 316 to block 310 where it gets the next intra-speaker distortion from the D array, and forms a per-word STAT file for that word, in the manner as described above. If there are no more intra-speaker distortions in the array D the system moves to block 226 of Figure 6 and functions as described previously. With the enrollment procedure complete, a user may activate the verify mode of the present invention in attempting to gain access to a secured area, or to otherwise use the present invention for purposes of access, identification, and the like for which the verification system of the present invention finds application. Thus, referring again to Figure 1, if the system is not in the enroll state it passes to block 50 and determines whether it is in the verify mode. Upon determining that it is in the verify mode, it passes to block 50 of Figure 14 and activates the procedures for accomplishing verification.
From block 50 the system in Figure 14 passes to block 400 and determines whether there is a present claim for access at one of the stations 102. In block 400 the host processor continuously scans all access points for activity. Thus, if no activity is indicated the system loops via line 402 back to block 400 and continues to function there until such time as a claim is made. The user may present an identity claim in several ways, including (but not limited to) entering a personal identification number on a key pad, inserting a plastic card in a reader, or saying his name into a microphone. Upon detecting a claim the system passes to block 404 and issues a request to the user for an identification. This request is transmitted from the CP 110 through the SD 106 to the station 102.
Having prompted the user for an identification, the system passes to block 406 where it monitors incoming signals from the station 102 via SD 106, and detects energy levels of the signals corresponding to an indication that identification information may be present. If no incoming signals of this type are detected after a preset period of time, the system returns to block 404 and again requests an identification from the user. Upon detecting signals comprising identification, the system moves to block 408. In block 408 the system compares the identification information which was previously stored for the user to determine whether the identification corresponds to an enrolled user, and whether the identification permits access from the particular access point.
If the identification does not meet the indicated criteria, the system moves to block 410 and produces a signal rejecting the user. If the identification is found to be acceptable for the particular access point, the system moves to block 412 and sets the threshold values based upon data for the particular user which was previously entered and stored. If no threshold values are specified, the system will utilize default values.
The threshold values which are set include the maximum number of trials to permit before rejecting the claim (where each word constitutes a trial); the minimum number of trials to require before permitting a claim to be accepted; the level of false accept error rate
(ERROR_FA) required for acceptance; the level of false rejection error rate (ERROR_FR) required for acceptance; and the maximum number of failed trials which are permitted. Upon setting the thresholds, the system moves from block 412 to block 414 from whence it accesses the ORDER file 416 for this particular user. The system gets the next word from the ORDER file which was created in the enrollment mode.
Having the new word the system moves from block 414 to block 418 and prompts the user, requesting that the user speak the selected word into the microphone of station 102.
Upon receiving the current token uttered by the user the system moves to block 420 and obtains the normalized autocorrelation coefficients r. These r coefficients are obtained by use of the test procedure. The test procedure functions on the current speech to develop the r coefficients, as well as other parameters, in order to facilitate the verification process. This procedure was previously described in connection with the illustration of Figure 4.
Upon receiving the r coefficients the system moves to block 422 where it produces ERROR_FA by comparing the present token with the corresponding four templates of the user and then utilizing the global distortion obtained from those comparisons to obtain the correct value for ERROR_FA from the
STAT file for this word. The procedure for obtaining this ERROR_FA is illustrated in more detail in Figure 15 and will be discussed hereafter. During the effort to retrieve the ERROR_FA from the STAT file, if it is found that there is no record in the STAT file with a distortion greater than the current distortion, the trial is considered failed. At that point, the system passes from block 422 to block 424, wherein the fail counter is incremented. The system then passes to block 428 where further testing is conducted to determine whether any action as to acceptance or rejection should be taken. The test procedure, which is performed in block 428, is described hereafter with reference to Figure 16.
Based upon the decision of the test procedure in block 428, the system either makes no decision and returns to block 414 wherein the next word is obtained from the ORDER file and operation continues as described above, or the system passes to either block 430 and rejects the claimant, or to block 432 and accepts the claimant. From blocks 430 and 432 signals are produced which are transmitted to station 102 of Figure 2 to advise the claimant of the decision. Of course, these signals could also be provided to other external equipment to accomplish things such as opening doors to secured areas, initiating operation of selected equipment and the like.
If the ERROR_FA is recovered in block 422 then the system moves to block 426 where the cumulative error is adjusted. On the initial pass through block 426 the cumulative error corresponds to the ERROR_FA value provided from block 422. On subsequent passes through block 426, the cumulative error comprises a combination of the previous cumulative error value and the current ERROR_FA value received from block 422. In one preferred embodiment, this combination comprises the product of the previous cumulative error value and the current ERROR_FA value received from block 422. Thus, the cumulative error is adjusted based upon the probability of false acceptance of an impostor each time a new claim is evaluated.
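In the multiplicative embodiment just described, the block 426 update might be sketched as:

def update_cumulative_error(cumulative, error_fa):
    # First pass: the cumulative error is simply ERROR_FA; thereafter it is the
    # running product of the per-trial ERROR_FA values (one preferred embodiment).
    return error_fa if cumulative is None else cumulative * error_fa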
From block 426 the system passes to block 428 where decisions are made with respect to the rejection or acceptance of the claimant in accordance with a procedure which is described hereafter.
By reference to Figure 15, the procedure for obtaining the ERROR_FA in block 422 may be described. Upon entering block 422 the system immediately passes to block 440 of Figure 15 and gets the next of those four aa files developed in the enrollment mode, which correspond to the word being tested. Upon obtaining the aa values the system passes to block 442 and compares the new r coefficients obtained in block 420 of Figure 14 with the aa values obtained from block 440. This comparison is accomplished by the procedure outlined previously with reference to Figures 8 and 9, producing a global distortion representative of the difference between the r values of the new token, and the aa values of the token obtained in block 440.
The system next passes to block 444 and reads the global distortion value obtained in block 442. This distortion value is saved and the system passes to block 446 and determines whether any of the N stored tokens corresponding to this user have not yet been compared with the r values of the current utterance. If there are tokens that have not been compared the system returns to block 440 and functions as described above. If no more tokens remain to be compared, the system moves to block 448 where the global distortions produced by the comparisons of the new r values with the N stored templates of the user are processed to develop a composite distortion. This composite distortion may comprise any of several different values. For example, in one preferred embodiment, the token distortion values are averaged together to comprise the composite distortion. In another preferred embodiment the lowest value of the distortions is selected, and in yet another preferred embodiment the lowest two distortions are averaged. The basis upon which the make-up of the composite distortion is selected may be dependent upon the type of application for which the invention is utilized and upon the desires of the operator.
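The alternative composite-distortion rules mentioned above might be sketched as follows; the rule names are illustrative only:

def composite_distortion(distortions, rule="average"):
    ordered = sorted(distortions)
    if rule == "lowest":                      # single lowest distortion
        return ordered[0]
    if rule == "lowest_two":                  # average of the two lowest
        return sum(ordered[:2]) / len(ordered[:2])
    return sum(distortions) / len(distortions)   # default: average of all tokens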
Having determined the composite distortion value representative of the current utterance, the system passes to block 450 and references the per-word STAT file created in the enrollment mode.
In block 450 the composite distortion is compared with the distortions in the per-word STAT file and that distortion which is closest to, but greater than, the composite distortion is identified. The ERROR_FA associated with the identified distortion is extracted, and is utilized as the ERROR_FA which is provided to block 422 in Figure 14.
If the composite distortion is greater than any distortion in the per-word STAT file, then no ERROR_FA is extracted and, instead, an ERROR_FA equal to 1 is provided to the block 422 indicating that the trial has failed. It is noted that if the system proceeds from block 422 in Figure 14 to block 424 and then to the block 428, the cumulative error 426 is not adjusted in response to a failed trial. Thus, an inadvertent noise, hiccup by the user, or any other sort of problem which prevents the user from correctly entering his utterance will not adversely affect the likelihood that his further utterances will be accepted. Nevertheless, the erroneous utterance or sound which was detected will be noted in the system and can affect the acceptance or rejection if such unusual utterances recur. The operation of block 428 of Figure 14, which comprises the test block, is described with respect to Figure 16. Specifically, upon entering block 428 the system passes to block 460 and determines whether the fail counter from block 424 indicates that the user has exceeded a preselected number of trials. In a default condition this preselected threshold value is two. Of course, this value may be set at the discretion of the operator. If the maximum failed trials has been exceeded, the system passes to block 430 and rejects the claimant as described previously.
If the maximum number of failed trials has not been exceeded, the system passes to block 462 and determines whether the maximum number of trials has been exceeded. This condition arises when the cumulative error of the claimant is greater than the threshold value for acceptance, but the trial has not been failed in block 460. A default threshold value for the maximum number of trials is five. Again, the operator can select another value if he desires. If the maximum number of trials has been exceeded, the system passes to block 430 and rejects the claimant in the manner described above. If the maximum number of trials has not been exceeded, the system passes to block 464 and determines whether the cumulative error from block 426 of Figure 14 is greater than the threshold value. If it is, the system passes to block 414 of Figure 14 and obtains the next word from the ORDER file and then proceeds as described above. If the cumulative error is less than the threshold value, the system passes to block 466 and determines whether the claimant has passed at least two trials. This means that at least two trials have been conducted for this claimant, and that the cumulative error at the end of the second trial is below the threshold value for acceptance.
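The test block of Figure 16 can be summarized by a small decision function. This is a sketch only; the structure and field names are assumptions, with the default limits of two failed trials and five total trials taken from the text.

```c
enum test_result { REJECT, ACCEPT, CONTINUE };

struct verify_state {
    int    failed_trials;     /* fail counter from block 424                  */
    int    trials;            /* total trials conducted                       */
    int    passed_trials;     /* trials ending with cumulative error below
                                 the acceptance threshold                     */
    double cumulative_error;  /* block 426                                    */
    double threshold;         /* acceptance threshold                         */
    int    max_failed;        /* default 2                                    */
    int    max_trials;        /* default 5                                    */
};

enum test_result test_block(const struct verify_state *s)
{
    if (s->failed_trials > s->max_failed)   return REJECT;    /* block 460 */
    if (s->trials > s->max_trials)          return REJECT;    /* block 462 */
    if (s->cumulative_error > s->threshold) return CONTINUE;  /* block 464: next word */
    if (s->passed_trials >= 2)              return ACCEPT;    /* blocks 466, 468 */
    return CONTINUE;                        /* fewer than two trials so far */
}
```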
In the above-described embodiment, the cumulative error is adjusted with each trial in block 426. In another preferred embodiment the threshold is also adjusted along with the cumulative error. This adjustment can be made as a proportion of the change in cumulative error, such as by multiplying the threshold by the present ERROR_FA value, in the same manner as is done for the cumulative error. As a result, the tendency of the cumulative error value to go down as it is multiplied by fractions would be accompanied by a lowering of the threshold level, so that the likelihood of an impostor being accepted could not be increased with an increasing number of trials.
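Continuing the sketch above, this alternative embodiment might scale the threshold by the same factor applied to the cumulative error; the helper name is hypothetical.

```c
/* Alternative embodiment: scale the acceptance threshold by the same
   ERROR_FA factor applied to the cumulative error, so repeated trials
   cannot make an impostor progressively easier to accept.            */
void apply_trial(struct verify_state *s, double error_fa)
{
    s->cumulative_error *= error_fa;
    s->threshold        *= error_fa;   /* keeps the two in proportion */
}
```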
If it is determined that less than two trials have been conducted, then the system passes from block 466 to block 414, gets the next word from ORDER file 416, and proceeds as discussed previously. If at least two trials have been conducted, the system passes to block 468 and accepts the claimant by producing a signal which is transmitted through the station 102 to communicate to the claimant that he is accepted, and to otherwise activate equipment which may be operated by such acceptance signals.
To achieve the real time operating system 114 in an IBM Personal Computer 112 the following steps are required. Referring to Fig. 21, the memory of the personal computer is organized into blocks to be used as stacks 124. These stacks will hold all modifiable data for each of twelve processes that may be contemporaneously in progress. An "image" of each process to be run is stored in its stack. Process 0 is designated to be special in that it needs no image initialization and it performs all other initializations and setups. Thereafter process 0 simply reads the time continuously and makes the "time" available to all other processes in a shared, semaphored memory location. Only after the images have been initialized is a routine called the "clock interrupt service routine" or "clock ISR" permitted to begin operation. The clock ISR pushes the image of the current process onto that process's stack. Then, based on the status of other processes, a decision is made by a "scheduler" as to which process deserves use of the digital processor in the personal computer. The chosen process's stack address is then retrieved from the process table 126 and its image is popped off the stack. Operation of that process begins immediately upon returning from the interrupt. A critical part of the initial image is the establishment of the correct interrupt return address and flags. The procedure of changing stacks and images is called a "context switch". A second mechanism also performs a context switch. A process waiting for some event hands control to the context switch mechanism, which builds an image as if it were interrupted and then continues with the stack decision and image popping of another process.
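A rough C rendering of the process table and the clock ISR's role, offered only to fix ideas; the image save/restore helpers would in practice be machine-dependent assembly, and all names here are assumptions rather than the code of the appendices.

```c
#define NPROC 12

enum proc_state { PROC_DEAD, PROC_LIVE, PROC_WAITING };

struct proc {
    void           *stack_ptr;     /* saved stack pointer; the pushed "image"
                                      (registers, flags, interrupt return
                                      address) lives on the process's stack  */
    enum proc_state state;
    int             high_priority; /* nonzero only for speech output         */
    int             device_wait;   /* waiting-for-service flag               */
    int             doomed;        /* finish current work, then self-kill    */
    void          (*entry)(void);  /* function run by the process driver     */
};

extern struct proc process_table[NPROC];

/* Assumed machine-level helpers: save the interrupted image onto the
   current stack, restore another image and return from the interrupt,
   report the currently running process, and pick the next one.         */
extern void *save_image(void);
extern void  restore_image(void *stack_ptr);     /* does not return     */
extern int   current_process(void);
extern int   schedule(void);                     /* sketched further below */

/* Clock interrupt service routine: push the current image, let the
   scheduler pick the next runnable process, and resume its image.      */
void clock_isr(void)
{
    process_table[current_process()].stack_ptr = save_image();
    restore_image(process_table[schedule()].stack_ptr);
}
```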
Processes share program space. It is absolutely necessary that code be "pure" or "reentrant". Each shared datum, device, or resource must be protected by its own semaphore. A semaphore is a flag which can be set and examined only as an atomic, non-interruptable operation. At issue is the possibility that two processes may try to examine or modify the same location, device, or resource and expect no interference during that operation. A semaphore is guaranteed validity by having a process turn off interrupts and read the flag. If the flag is free, the flag is set and interrupts are turned on again. If the flag is in use, a context switch is performed. The flag is examined again when the processor returns to this context. Interrupts remain off within this context until the semaphore is free.
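A sketch of the semaphore discipline just described, assuming cli()/sti() stand in for the interrupt-disable/enable instructions and context_switch() for the voluntary switch mentioned above; all three are placeholder names, not the actual routines.

```c
extern void cli(void);              /* disable interrupts (assumed helper) */
extern void sti(void);              /* enable interrupts (assumed helper)  */
extern void context_switch(void);   /* voluntary switch (assumed helper)   */

typedef volatile int semaphore;     /* 0 = free, 1 = in use */

void sem_acquire(semaphore *s)
{
    for (;;) {
        cli();                      /* make the test-and-set atomic        */
        if (*s == 0) {
            *s = 1;                 /* claim the resource                  */
            sti();
            return;
        }
        /* In use: give the processor away and re-examine the flag when
           the scheduler returns to this context.                          */
        context_switch();
    }
}

void sem_release(semaphore *s)
{
    *s = 0;
}
```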
Each device (i.e. the clock, screen, keyboard, disk, diskette, RS-232 bus, and SVB) has associated with it a "command" function and an "interrupt service routine" function. The SVBs are examined in a "round robin" fashion in the interrupt service routine to determine which board has the right to use a DMA function which is shared by up to four boards.
When using a device a process sets the device semaphore, turns off interrupts, then commands the device, then sets the device busy flag, and then context switches out. The device interrupt service routine must reset the device busy flag. The scheduler can then return to that process. The process must then release the device semaphore and turn interrupts back on.

The decision of the scheduler as to which process gets use of the processor is based on priorities, device busy flags, and process states. There are only two levels of priority. High priority is given only to speech output, which must remain contiguous (continuous) to sound correct to the user of the system. All other processes have low priority. High priority processes have absolute precedence over low priority processes. The scheduler examines the process table in a "round robin" fashion starting at the process immediately following the one being switched out. During this round robin examination, processes waiting for devices and dead processes are skipped. The first runnable low priority process is remembered. If a high priority process is waiting, the first one found is served first. If none is found, the remembered low priority task gets service. It is not possible for process 0 to be blocked or "dead"; therefore the scheduler always has a process to run.

Processes may be "killed" or "doomed". A killed process will terminate immediately, context switch out (if in), and be subsequently skipped in the round robin examination. A doomed process will continue until the process finishes, at which time it kills itself. This is accomplished by having the initial image of the process point to a "process driver". The process driver is an infinite loop which executes the function specified in the process table. This infinite loop will kill itself upon finding the process doom flag set and then context switch out.
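Continuing the sketch begun above, the scheduler's round-robin choice might be rendered as follows; the field names reuse the assumed process-table structure.

```c
/* Round-robin choice of the next process, starting just after the one
   being switched out: skip dead processes and those waiting on devices,
   prefer the first runnable high-priority (speech output) process, and
   otherwise fall back to the first runnable low-priority process found.
   Process 0 is always runnable, so a choice always exists.              */
int schedule(void)
{
    int low_found = -1;
    int cur = current_process();
    for (int i = 1; i <= NPROC; i++) {
        int p = (cur + i) % NPROC;
        struct proc *pp = &process_table[p];
        if (pp->state == PROC_DEAD || pp->device_wait)
            continue;
        if (pp->high_priority)
            return p;                 /* first high-priority process wins  */
        if (low_found < 0)
            low_found = p;            /* remember first low-priority one   */
    }
    return low_found;                 /* never -1: process 0 can always run */
}
```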
Processes may be "exec'ed". To exec a process, first it must be assuredly dead. A kill command is issued for that process and a function pointer is copied into the process table entry for that process, then the remainder of the process table entry is initialized except for the process state which must remain as killed until the entry is up to date. Then the process state is raised to "live".
The operating system must maintain the fixed disk in the equivalent of a file system. The disk is organized into seven major blocks: the BOOT, OPERATING SYSTEM, STRUCTURES, HEADERS, IMPOSTORS, ENROLLEES, and AUDIT TRAIL. The BOOT contains the program necessary to bring up the operating system, including the disk linkage for finding the operating system and structures. The OPERATING SYSTEM contains the program that operates, e.g., all devices and resources in the management of verification. STRUCTURES is a map of where all seven major blocks are located. These major blocks are sometimes split into separated minor blocks. For instance, two copies of the HEADER are stored in widely separated disk sectors so that damage to one copy does not permanently destroy valuable data. The same is true of the STRUCTURES, of which there are two copies. The locations of these structures are stored in the BOOT. The HEADERS contain all the enrollee personal data and linkages into the ENROLLEES block and IMPOSTORS block. The IMPOSTORS block is an array of tokens. Each IMPOSTOR has 5 tokens for each of 5 words. The ENROLLEES block is an array of tokens. For each enrollee there are 5 tokens for each of 5 words.
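The seven-block disk organization might be represented by a table such as the following sketch; the sector geometry is not specified in the text and is left as placeholders.

```c
/* The seven major blocks of the fixed disk. STRUCTURES and the HEADER are
   duplicated in widely separated sectors; the sizes and sector numbers
   below are placeholders, not taken from the appendices.                 */
enum disk_block {
    BLK_BOOT,             /* bootstrap + linkage to the OS and STRUCTURES  */
    BLK_OPERATING_SYSTEM,
    BLK_STRUCTURES,       /* map of where all major blocks live (2 copies) */
    BLK_HEADERS,          /* enrollee data + linkages (2 copies)           */
    BLK_IMPOSTORS,        /* 5 tokens for each of 5 words per impostor     */
    BLK_ENROLLEES,        /* 5 tokens for each of 5 words per enrollee     */
    BLK_AUDIT_TRAIL,
    BLK_COUNT
};

struct block_extent { unsigned long first_sector, n_sectors; };

/* STRUCTURES is essentially this table, stored twice on disk and located
   through the BOOT block. */
struct block_extent structures[BLK_COUNT];
```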
The verification procedure is coded in a separate module which uses primitive functions provided by the operating system.
Verification is broken down into three major sub-modules: VERIFY, ENROLL, and MANAGE. ENROLL can only be seen from the action of the "Manager Access Point". Internally there are two very large token buffers capable of holding 50 tokens altogether: 25 tokens each, consisting of 5 words by 5 tokens per word. The two buffers are required for keeping the operation of generating intra- and inter-speaker statistics in RAM. Shuffling off and onto disk during this operation would slow the system down considerably. (NOTE: These buffers are also used by other modules which may not run concurrently with ENROLL, such as BACKUP. BACKUP also shuffles blocks of 25 tokens each off and on disks and diskettes.) The prompts are kept in RAM for similar speed reasons.

During speech prompting (voice output), the sharing of the processor by the scheduler could cause portions of the output speech to fragment, leading to poor quality sound. Therefore, speech output is given high priority. As the prompts are output the process sets up a command for the interrupt handler, marks itself as waiting for service, and signals the board interrupt handler. Since the interrupt handler has interrupts turned off there is never contention for the DMA, so semaphores and device busy flags are unnecessary. The interrupt handler drops the waiting-for-service flag for a process as soon as it is finished with that process's request. The process context switches out while waiting for service and is not executed again until the waiting-for-service flag drops.

During token retrieval from disk the process semaphores the disk, sets up the command and DMA, marks the disk as busy, starts the DMA, and then requests a context switch to give other processes time to execute while the disk is busy. When the disk is finished it executes an interrupt which marks the disk as idle and turns off the semaphore. The scheduler will, when the round robin loop reaches this process again, continue execution of this process.

ENROLL is started from the MANAGER, which "dooms" process 1. Since process 1 is normally a verify process, the system waits for the verify to time out and kill itself. It would be inappropriate to interrupt a valid verification which may be occurring. After process 1 kills itself, MANAGER sees the KILL flag come up on process 1 and performs an EXEC of ENROLL on process 1.

ENROLL begins by getting vital statistics, and then as each token is retrieved it is placed in the r token buffer in sequence according to its word number and token number. The tokens are then examined for large variances in length. If any such variances are found, an attempt is made to replace the tokens that have standout lengths. If too many tokens are "standout" within a word, a fresh set of tokens for that word is prompted. If this is unsuccessful, the ENROLL aborts. If the ENROLL is successful, the tokens are then converted to k tokens which are placed appropriately in the k token buffer. The board is then commanded to generate scores on each token as compared to all the other tokens of the same word. These scores are used to determine the intra-speaker statistics which are stored in the header. Then each installed impostor is read into the r token buffer and inter-speaker statistics are generated in the same manner. For each impostor that is not installed a default set of statistics is generated. These statistics are combined into generalized inter-speaker statistics. These statistics are then stored in the header.
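One of the steps above, screening the enrollment tokens for standout lengths, might be realized along these lines; the 25% tolerance is purely illustrative and not taken from the specification.

```c
#include <stddef.h>

/* Flag tokens whose length differs from the mean token length of the word
   by more than a fixed fraction; returns the number of standouts so the
   caller can decide whether to re-prompt the whole word.                  */
size_t find_standout_tokens(const size_t *len, size_t n_tokens,
                            int *standout /* out: one flag per token */)
{
    double mean = 0.0;
    size_t flagged = 0;
    for (size_t i = 0; i < n_tokens; i++) mean += (double)len[i];
    mean /= (double)n_tokens;

    for (size_t i = 0; i < n_tokens; i++) {
        standout[i] = ((double)len[i] > 1.25 * mean ||
                       (double)len[i] < 0.75 * mean);
        flagged += (size_t)standout[i];
    }
    return flagged;
}
```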
The k token buffer is then assigned to a User Disk Index (udi) and written out. It is read back into the r token buffer and a byte-by-byte comparison is made to determine if there is a bad sector on the disk. If so, another udi is issued until a successful write/read sequence is executed or the disk is full. The in-RAM header is then marked as dirty and is written out immediately. This is in contrast with verification, where changes to tokens and statistics mark the header as dirty and are NOT written out until the 15 minute flush arrives.
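The write-then-verify step might be sketched as follows, with the disk primitives and buffer sizes assumed; on a byte mismatch the next udi is tried until one succeeds or the disk is exhausted.

```c
#include <stddef.h>
#include <string.h>

/* Assumed disk primitives; 0 on success. */
extern int disk_write(int udi, const unsigned char *buf, size_t n);
extern int disk_read(int udi, unsigned char *buf, size_t n);

/* Write the k token buffer to a User Disk Index, read it back into the
   r token buffer, and compare byte by byte; returns the udi actually
   issued, or -1 if no good sector could be found.                       */
int store_tokens(const unsigned char *k_buf, unsigned char *r_buf,
                 size_t n, int first_udi, int last_udi)
{
    for (int udi = first_udi; udi <= last_udi; udi++) {
        if (disk_write(udi, k_buf, n) != 0) continue;
        if (disk_read(udi, r_buf, n) != 0)  continue;
        if (memcmp(k_buf, r_buf, n) == 0)
            return udi;             /* good sector: this udi is issued   */
    }
    return -1;                      /* disk full or only bad sectors     */
}
```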
VERIFY is the normal function for processes 1 through 8. A flow chart of the VERIFY process, set forth in Fig. 22, illustrates the steps taken by operating system 114 in performing the verification operation previously described.
Attached hereto as Appendix A is a copy of the object code for the subprograms for verification, data base construction and enrollment. The object code is in octal form, written in the programming language C and the Masscomp version of the M68000 assembler language for use on the IBM PC/XT. Also attached, as Appendix B, is a copy of the SP object code, which is written in the TMS32010-specific assembler language. In addition, attached as Appendix C, is a copy of the CP object code in Motorola S-record format. The CP code is written in the programming language "C" and the MASSCOMP version of the M68000 assembler language. Of course, it will be appreciated that the code provided in these appendices could be programmed in substantially any of the numerous languages which are well-known in the technology. The embodiments disclosed in the attached appendices are, therefore, provided as an example of one preferred embodiment of the invention and should not be construed to limit the scope of the invention in any way.

The apparatus and method described above comprise a significant improvement over the prior art systems by providing a reliable and efficient speech verification system which: (1) includes an efficient and accurate means for detecting the beginning and end of speech; (2) provides parameters for use in verification which are related to features of the vocal cord, and which may provide for comparison between reference and test utterances without limitations based on time or intensity restrictions; (3) provides a verification mode which develops an indication of the probability of erroneously accepting an impostor or rejecting the actual speaker, and which utilizes a result for acceptance which is a function of the results of a plurality of comparisons of different utterances, on both intra- and inter-speaker bases; and (4) provides an optionally variable threshold for acceptance or rejection of speech, which permits maintaining approximately constant standards while utilizing cumulative results of plural tests in determining whether to accept or reject the speaker.
The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. For example, referring to Fig. 25, after the enrollment operation is complete the system may proceed to calculate (500) the intra-speaker global distortion values for the enrollee's utterances of the first reference word. Once all the global distortion values have been calculated for that word, the largest value is discarded and the mean and variance for the remainder of the global distortion values are calculated and stored (502) for later use in the verification operation.
Similarly, the inter-speaker comparisons for calculating the inter-speaker global distortion values are performed (504), the largest distortion value is discarded, and, for the remainder of the global distortion values, the inter-speaker mean and variance are calculated and stored (506).
For each of the other reference words and their corresponding group of utterances the intra-speaker and the inter-speaker mean and variance are calculated and stored, as described above.
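A sketch of the per-word statistic computation of steps 500-506: the largest global distortion is discarded and the mean and variance of the remainder are returned. The same routine would serve for both the intra-speaker and the inter-speaker comparisons; the function name is illustrative.

```c
#include <stddef.h>

/* Given the global distortion values for one reference word (n >= 2),
   discard the single largest value and return the mean and variance of
   the remainder. */
void distortion_stats(const double *d, size_t n, double *mean, double *var)
{
    size_t worst = 0;
    for (size_t i = 1; i < n; i++)
        if (d[i] > d[worst]) worst = i;

    double sum = 0.0, sumsq = 0.0;
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == worst) continue;           /* drop the largest distortion */
        sum   += d[i];
        sumsq += d[i] * d[i];
        kept++;
    }
    *mean = sum / (double)kept;
    *var  = sumsq / (double)kept - (*mean) * (*mean);
}
```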
Referring to Figure 26, during the verification operation (518), after the system has obtained and verified as valid the user's claimed identity (520), it retrieves from memory the intra-speaker and inter-speaker mean and variance values, the reference words, the k coefficients for the utterances associated with each reference word, and the two threshold values, U and V, corresponding to the user's claimed identity (522).
From the reference words, the system randomly selects one word and prompts the user to speak the selected word into the microphone (524). The received utterance is immediately processed to obtain the normalized autocorrelation coefficients r (526). The r coefficients are then compared with the k coefficients for each of the stored utterances of that word (previously spoken by the enrolled user) to calculate a new set of global distortions (528). From the set of global distortions a single combined score (i.e. the mean global distortion) is calculated. Using the combined score and the intra-speaker and inter-speaker mean and variance values, two values, p and q, are calculated; they represent, respectively, the probability that the enrolled user would produce an utterance yielding a combined score as poor as or worse than the one just calculated, and the probability that an impostor would produce an utterance yielding a combined score as good as or better than the one just calculated. Referring to Figure 27, p is calculated by integrating the Gaussian density function 540 characterized by the two intra-speaker values (i.e. the mean 542 and the variance 544) from the combined score 546 to positive infinity (i.e. p equals the area of the shaded region 548). Similarly, q is calculated by integrating the
Gaussian density function 550 characterized by the two inter-speaker values (i.e. the mean 552 and the variance 554) from negative infinity to the combined score 546 (i.e. q equals the area of shaded region 556). Referring again to Figure 26, while p and q are the probability values corresponding to the current utterance of the current reference word, P and Q are cumulative probability values which are updated each time new values for p and q are calculated, i.e. each time a new utterance is received from the user. P and Q (which are initialized to 1) are updated by multiplying each by its respective corresponding individual probability value, p or q. Consequently, after the user's first utterance, P and Q equal p and q, respectively (560).
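The probabilities p and q can be computed in closed form from the Gaussian tails using the complementary error function, as in this sketch (which assumes the stored variances are expressed in the same units as the combined score).

```c
#include <math.h>

/* Upper-tail probability of a Gaussian with the given mean and variance:
   the integral of its density from x to +infinity.                       */
static double gaussian_upper_tail(double x, double mean, double var)
{
    return 0.5 * erfc((x - mean) / sqrt(2.0 * var));
}

/* p: probability the enrolled user would score as poorly as, or worse
      than, the combined score (area 548 under the intra-speaker density).
   q: probability an impostor would score as well as, or better than, the
      combined score (area 556 under the inter-speaker density).          */
void score_probabilities(double combined_score,
                         double intra_mean, double intra_var,
                         double inter_mean, double inter_var,
                         double *p, double *q)
{
    *p = gaussian_upper_tail(combined_score, intra_mean, intra_var);
    *q = 1.0 - gaussian_upper_tail(combined_score, inter_mean, inter_var);
}
```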
To make a verification decision (562), P and Q are compared, respectively, with the two threshold values, U and V, for the enrolled user. If P is less than U but Q remains greater than V (564), the user is rejected; if, conversely, Q is less than V but P remains greater than U (566), the user is accepted. If, however, P remains greater than U and Q remains greater than V (568), no decision to accept or reject the user is made; instead another one of the reference words is chosen and the user is prompted for another utterance. New values for p, q, P, and Q are calculated for this utterance and the updated values of P and Q are once again compared to U and V. The system continues to prompt the user for new utterances until a verification decision is made to either accept or reject the user.
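The per-utterance update of P and Q and the three-way decision against U and V might be sketched as follows; the behavior when both products fall below their thresholds on the same utterance is not specified in the text, and this sketch simply continues prompting in that case.

```c
enum decision { UNDECIDED, ACCEPTED, REJECTED };

/* Multiply the cumulative products P and Q by the latest per-utterance
   probabilities and compare them with the user's thresholds U and V.    */
enum decision update_and_decide(double *P, double *Q,
                                double p, double q,
                                double U, double V)
{
    *P *= p;
    *Q *= q;
    if (*P < U && *Q > V) return REJECTED;   /* unlikely to be the enrollee */
    if (*Q < V && *P > U) return ACCEPTED;   /* unlikely to be an impostor  */
    return UNDECIDED;                        /* prompt for another word     */
}
```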
The sensitivity of the verification decision, and thus the security level, is related directly to the values of the two thresholds, U and V. If U and V are assigned relatively small values the verification operation will be an extended one requiring a long sequence of utterances and analysis. The benefit of low threshold values is increased accuracy, i.e. greater certainty that an impostor has not been accepted and that a valid user has not been rejected; the benefit of increased accuracy must of course be weighed against a lengthy verification operation. A lower value for U relative to the value of V will make the system more tolerant of erratic utterances which exhibit a large combined global distortion value, whereas a lower value for V relative to the value of U will make it more difficult for both valid users and impostors to be accepted. If, at the beginning of the verification operation, the user provides an invalid identity (570) he is not immediately rejected; instead, he is prompted to utter several of a group of default reference words. The resulting utterances are, however, discarded and the user is rejected shortly after speaking the final utterance. This procedure prevents the user from learning whether or not he has delivered a valid identity.
If a user has been accepted as being the enrolled user, the utterances he or she spoke during the verification operation are used to update both the stored reference utterances and the intra-speaker and the inter-speaker mean and variance values. This is done to accommodate the changes which occur in the human voice over time. The updating is done autoregressively such that very recent utterances are given more weight than very old ones.
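A first-order autoregressive update is one way to weight recent utterances more heavily than old ones; the weight used here is illustrative only and is not taken from the specification.

```c
/* Autoregressive update of a stored quantity (e.g. an intra-speaker mean
   or a reference coefficient) after a successful verification: recent
   utterances carry more weight than very old ones.                      */
double ar_update(double stored, double observed)
{
    const double alpha = 0.8;   /* memory of the old value (illustrative) */
    return alpha * stored + (1.0 - alpha) * observed;
}
```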
The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims

1. A method for verifying whether a speaker is a known person based on stored information that includes reference information about speech of said known person, comprising in a first stage of said verification, analyzing test information about an utterance of said speaker, relative to said stored information, deciding whether or not to proceed to a subsequent stage on the basis of said analysis, and if not, then deciding whether to accept or reject said speaker, and if the decision is to proceed, then, in said subsequent stage of said verification, analyzing test information about at least an additional utterance of said speaker relative to said stored information, and making a verification decision based on both the analysis made in the first stage, and the analysis made in the subsequent stage.
2. The method of claim 1 wherein said analyses comprise comparing said test information to said reference information to develop difference data indicative of the results of said comparison, and said decisions are based on probability values derived from said difference data and on at least one threshold value.
3. The method of claim 1 wherein said stored information comprises probability data indicative of the probability value for any given said difference data.
4. The method of claim 1 wherein said method comprises a plurality of said subsequent stages, each said subsequent stage including analyzing test information about at least another additional utterance of said speaker relative to said stored information, said verification decision being made based on said analyses in said first stage and said subsequent stages.
5. The method of claim 2 wherein there are two said probability values, one representing the likelihood of erroneously accepting an impostor, the other representing the likelihood of erroneously rejecting said known person, and there are two said thresholds.
6. The method of claim 5 wherein said thresholds are adjustable.
7. The method of claim 5 wherein in said subsequent stage said probability values are based in part on the probability values of said first stage.
8. The method of claim 1 wherein said verification decision comprises a decision whether said speaker is or is not said known person.
9. The method of claim 1 further comprising updating said stored information based on said test information if said verification decision is that said speaker is said known person.
10. The method of claim 1 wherein said utterance and said additional utterance comprise two words randomly drawn from a set of words which formed the basis of said reference information.
11. A method for verifying whether a speaker is a known person based on stored information that includes reference values based on linear prediction coefficients for speech of said known person, comprising deriving from said stored information probability data indicative of the probability that, for a given result of said test information analysis, said speaker is or is not said known person, during verification, analyzing test information about an utterance of said speaker, relative to said stored information, and applying the result of said test information analysis for said utterance to said probability data to derive at least one probability value for said utterance of said speaker, and making a verification decision based on at least said probability value.
12. The method of claim 11 wherein said probability data comprises numbers from which probability distribution functions can be derived, and each said probability value is obtained by integrating a portion of one of said probability distribution functions.
13. The method of claim 12 wherein said values include a mean and a variance.
14. The method of claim 12 wherein there are two said probability distribution functions and two said probability values, one said probability value representing the likelihood of incorrectly deciding that an impostor is said known person, the other said probability value representing the likelihood of incorrectly deciding that said speaker is said known person.
15. The method of claim 11 wherein said test information comprises test values based on normalized auto-correlation coefficients for said utterance of said speaker, and said analysis comprises comparing said test values to said reference values to derive a global distortion value representative of differences between said test values and said reference values.
16. The method of claim 15 wherein said probability data is derived from a plurality of utterances spoken by said known person and a plurality of utterances spoken by other persons, said probability data comprising an intra-speaker mean equal to the mean of global distortion values each representative of the difference between the linear predictive coefficients for each possible combination of two said utterances spoken by said known person, an intra-speaker variance equal to the variance of said intra-speaker global distortion values, an inter-speaker mean equal to the mean value of global distortion values each representative of the differences between the linear predictive coefficients for each possible combination of a said
utterance by said known person and a said utterance by a said other person, and an inter-speaker variance equal to the variance of said inter-speaker global distortion values, and wherein a first said probability value equals the integral over the Gaussian density function characterized by said intra-speaker mean and variance values from said global distortion value representative of differences between said test and reference values to positive infinity, and a second said probability value equals the integral of the Gaussian density function characterized by the inter-speaker mean and variance from negative infinity to said global distortion value representative of differences between said test and reference values.
17. The method of claim 16 wherein said stored information includes first and second threshold values, said method further comprising making a verification decision based on a comparison between said first and second probability values and said first and second threshold values, respectively.
18. The method of claim 17 wherein the steps of claim 17 are repeated a plurality of times, and a first cumulative probability value equals the product of all previous first probability values, and a second cumulative probability value equals the product of all previous second probability values, and said verification decision is based on a comparison of first and second cumulative probability values to said two threshold values, respectively, said verification decision comprising if the first cumulative probability value is less than said first threshold value and said second probability value is greater than said second threshold value then the speaker is rejected as not being said known person, and if said first cumulative probability value is greater than the said first threshold value and said second cumulative probability value is less than said second threshold value a decision is made to verify said speaker as being said known person.
19. The method of claim 16 further comprising adjusting said threshold values in accordance with a desired degree of security.
20. A method for verifying whether a speaker is a known person based on stored reference information that includes reference information about speech of said known person, comprising verifying whether said speaker is or is not said known person based on an analysis of test information about an utterance of said speaker relative to said stored information, and if said speaker is verified as being said known person, updating said stored information based on said test information.
21. Apparatus for providing power to a device requiring power based upon the verification of a speaker as a known person authorized to cause said device to receive power, comprising apparatus for verifying that said person is said known person based upon analysis of a current utterance of said person relative to stored information about the speech of said known person, and for thereupon issuing a logical verification signal, and a non-mechanical, non-magnetic power switch for sending power to said device in response to said logical signal.
22. The apparatus of claim 21 wherein said power switch comprises a power MOSFET device.
23. The apparatus of claim 21 wherein said device comprises a solenoid for opening a door lock to admit said speaker via a door.
24. Apparatus for verifying whether a speaker is a known person based on stored information that includes reference information about speech of said known person, comprising a key-operated tone generator for receiving keyed information relevant to said verification and for generating corresponding coded tone signals, a microphone for generating analog electrical signals in response to an utterance of said speaker, a channel for carrying both said tone signals and said analog signals, an analog-to-digital converter for converting signals received from said channel to digital signals, and a digital processor controlled by a stored program for analyzing said digital signals corresponding to said utterances relative to said stored information as part of said verification, said digital processor being programmed to detect and decode digital signals corresponding to said tone signals to derive said keyed information, and said apparatus being adapted to conduct said verification on the basis of said keyed information.
25. Apparatus for verifying whether a plurality of speakers are known persons based on stored information that includes information about speech of said known persons, comprising a plurality of stations for receiving utterances from said plurality of speakers, and a digital processor controlled by a stored program arranged to enable said processor to analyze each said utterance relative to said stored reference information as part of said verification, said stored program being arranged to cause said digital processor to time interleave the steps of said analyses for different said utterances, so that the analyses of said different utterances can proceed simultaneously.
26. Apparatus for verifying whether a plurality of speakers are known persons based on stored information about speech of said known persons, comprising a plurality of stations for receiving utterances from said plurality of speakers, and for providing information to said speakers, a plurality of processors for serving respective users of said stations, and a host computer for supervising said processors, said host computer being controlled by a real time operating system that time interleaves control of said plurality of processors to enable them to serve said respective stations in real time.
27. A method for verifying the identity of a source of acoustic signals, comprising: receiving acoustic signals from said source; extracting first parameters from said acoustic signals, said first parameters corresponding to physical characteristics of said source; comparing said first parameters with second parameters extracted from acoustic signals previously received from said source, thereby producing a first distortion value representative of the comparison of the first and second parameters; comparing said first distortion value with plural second distortion values representative of comparisons of each of plural groups of said second parameters with others of said groups, said second distortion values each identifying corresponding probability values for incorrect verification of an acoustic signal from another source; generating a first error value based on the comparison of the first distortion value with said second distortion values, and corresponding to a probability identified by one of said second distortion values; repeating the above steps for a second acoustic signal from said source; producing a second error value reflecting the change in the first error value for each repetition of the above steps; comparing the second error value with a threshold value; and generating an acceptance signal if the relationship between the second error value and the threshold value are in accordance with preselected acceptance criteria.
28. A method for verifying the identity of a source of acoustic signals as defined in Claim 27, wherein said second parameters comprise groups of parameters, with each group of parameters corresponding to acoustic signals having characteristics similar to the acoustic signals corresponding to the first parameters.
29. A method for verifying the identity of a source of acoustic signals as defined in Claim 28, wherein the step of comparing the first parameters with the second parameters comprises the steps of: separately comparing the first parameters with each group of said second parameters, thereby producing a distortion value for each comparison; and processing the distortion values from each comparison to produce a composite distortion value.
30. A method for verifying the identity of a source of acoustic signals as defined in Claim 27, wherein the step of generating a first error value comprises the steps of: selecting, in response to the comparison of the first and second distortion values, that second distortion value which is next greater than the first distortion value; and designating the probability value associated with the selected second distortion value as the first error value.
31. A method for verifying the identity of a source of acoustic signals as defined in Claim 30, wherein, if no second distortion value is greater than the first distortion value, the step of designating comprises assigning a value to the first error value which indicates that the first distortion value represents acoustic signals from a source other than the source which produced the acoustic signals represented by the second distortion values.
32. A method for verifying the identity of a source of acoustic signals as defined in Claim 30, wherein the second error value comprises a composite value which is changed in response to the first error value produced in each repetition of the prior steps of the method, thereby providing a comparison value which is representative of influences of the error values, with respect to each other, produced in practicing the method to verify a selected source of acoustic signals.
33. A method for verifying the identity of a source of acoustic signals as defined in Claim 27, wherein the first and second parameters include autocorrelation coefficients and aa coefficients, and wherein normalized autocorrelation coefficients of one group of parameters are compared with aa coefficients of another group of parameters to develop the distortions utilized in the present method.
34. A method for verifying the identity of a source of acoustic signals as defined in Claim 27 wherein, prior to the step of receiving acoustic signals, the method comprises the steps of: selecting an acoustic signal from a table of values representing different acoustic signals previously received from said source; and producing a prompt signal which is communicated to said source and which identifies the selected acoustical signal which is to be provided from said source.
35. A method for verifying the identity of a source of acoustic signals comprising: receiving a plurality of acoustic signals from said source; developing parameters representative of each of said acoustic signals; for groups of said acoustic signals wherein each signal attempts to present the same acoustic message pattern, developing first values representative of the differences between the parameters representing the acoustic signals in the group; comparing parameters from said group with parameters representing signals from other sources which are similar in acoustic message patterns; in response to said comparing step, developing second values representative of the differences between said parameters representing signals from other sources and said parameters representing the acoustic signals in the group; developing probability values representing, for said first values, probability of incorrect verification of an acoustic signal from another source which is represented by parameters which have similar values to the parameters representative of an acoustic signal from said acoustic source; receiving a new acoustic signal to be verified; developing new parameters representative of said new signal; comparing said new parameters with the parameters of the group of acoustic signals which each attempt to present the same acoustic message pattern as said new acoustic signal; in response to said new parameter comparison step, producing a new value representative of the differences between the new parameters and the parameters of said group of acoustic signals; comparing said new difference value with said first values to select one of said first values which is close to said new difference value; providing a first error value which corresponds to the probability value associated with the selected first value; receiving another new acoustic signal to be verified; repeating the above steps which follow the step of receiving a new acoustic signal, for said another new acoustic signal; producing a second error value which reflects a composite of the first error values resulting from said repeating step; comparing the second error value with a threshold value; and generating an acceptance signal if the relationship between the second error value and the threshold value are in accordance with preselected acceptance criteria.
36. A method for verifying the identity of a source of acoustic signals as defined in Claim 35 wherein, prior to the step of receiving a plurality of acoustic signals, the method comprises storing information identifying said acoustic source.
37. A method for verifying the identity of a source of acoustic signals as defined in Claim 35 wherein said plurality of acoustic signals comprises a plurality of groups of acoustic signals, with each said group comprising a plurality of signals representing the same acoustic message.
38. A method for verifying the identity of a source of acoustic signals as defined in Claim 35 wherein the step of developing first values comprises the step of comparing parameters of the acoustic signals in the group with parameters of other acoustic signals in the group.
39. A method for verifying the identity of a source of acoustic signals as defined in Claim 35 wherein the step of receiving a plurality of acoustic signals comprises the steps of: monitoring said acoustic signals; and detecting beginnings and endings of acoustic messages represented by said signals, thereby forming a plurality of acoustic signals which each represent an acoustic message.
40. An apparatus for verifying the identity of a source of acoustic signals, comprising: means for receiving acoustic signals from said source; means for extracting first parameters from said acoustic signals, said first parameters corresponding to physical characteristics of said source; means for comparing said first parameters with second parameters extracted from acoustic signals previously received from said source, thereby producing a first distortion value representative of the comparison of the first and second parameters; means for generating a first error value based on the comparison of the first distortion value with said second distortion values, and corresponding to a probability identified by one of said second distortion values; means for producing a second error value reflecting changes in the first error value resulting from receiving a second acoustic signal from said source; means for comparing the second error value with a threshold value; and means for generating an acceptance signal if the relationship between the second error value and the threshold value are in accordance with preselected acceptance criteria.
41. A method for verifying the identity of a source of acoustic signals comprising: receiving a first acoustic signal; producing test values representative of said first acoustic signal; receiving a second acoustic signal; adjusting said test values to represent a combination of the first and second acoustic signals; comparing said adjusted test values with a threshold value; and generating an acceptance signal if the relationship between the adjusted test values and the threshold value are in accordance with preselected acceptance criteria.
42. A method for verifying the identity of a source of acoustic signals as defined in Claim 41 wherein the method further comprises the step of repeating the above steps, beginning with the step of receiving a second signal, if the relationship between the adjusted test values and the threshold value are not in accordance with preselected acceptance criteria.
43. A method for verifying the identity of a source of acoustic signals as defined in Claim 42 wherein, prior to the repeating step, the method comprises the step of terminating operation of the method if the relationship between the adjusted test values and the threshold value are not in accordance with preselected acceptance criteria, and if the repeating step has been performed a preselected number of times.
44. A method for verifying the identity of a source of acoustic signals as defined in Claim 43 wherein, prior to terminating the operation of the method, a nonacceptance signal is generated if the relationship between the adjusted test values and the threshold value are not in accordance with the preselected acceptance criteria, and if the repeating step has been performed a preselected number of times.
45. A method for verifying the identity of a source of acoustic signals as defined in Claim 41 wherein the step of adjusting said test values comprises the steps of: producing test values representative of said second acoustic signal; and multiplying the test values representative of the first acoustic signal with the test values representative of the second acoustic signal, thereby producing a value which comprises said representation of the combination of the first and second acoustic signals.
46. A method for verifying the identity of a source of acoustic signals as defined in Claim 41, wherein said first acoustic signal represents a first communication from said source, and wherein said second acoustic signal represents a second communication from said source, said second communication being different than said first communication.
47. A method for verifying the identity of a source of acoustic signals comprising: providing a plurality of first templates, wherein each said first template comprises a representation of a separate acoustic transmission of a selected communication from a selected source; providing a plurality of second templates, wherein each said second template comprises a representation of an acoustic transmission of the selected communication from different sources; comparing individual ones of the plurality of first templates with others of said first templates to define first distortion values for said individual ones, said first distortion values being representative of differences in separate acoustic transmissions of the same message from the selected source; comparing ones of the plurality of first templates with ones of the plurality of second templates to define second distortion values representative of differences in acoustic transmissions of the selected communication by the selected source as compared to other sources; comparing the first distortion values with the second distortion values to develop probability values for said first distortion values, wherein said probability values provide an indication of the likelihood that an acoustic transmission of the selected communication from another source will have a distortion value which has a predetermined relationship to a selected first distortion value; selecting a threshold value for acceptance of a source of acoustic signals based on said probability values; receiving a third template comprising a representation of an acoustic transmission of the selected communication from a source of acoustic signals; comparing the third template with the first templates to define a third distortion value; and accepting the source of acoustic signals represented by the third template if the relationship between the threshold and the third distortion value is in accordance with preselected acceptance criteria.
48. A method for verifying the identity of a source of acoustic signals as defined in Claim 47 wherein, prior to the step of accepting the source, the method comprises the steps of: repeating the steps of the method for a second selected communication which is different from the first selected communication; generating an error value which comprises a combination of the third distortions produced by the step of comparing the third template with the first template; comparing the error value with the threshold value; and accepting the source of acoustic signals represented by the third template if the relationship between the error value and the threshold value are in accordance with preselected acceptance criteria.
49. A method for verifying the identity of a source of acoustic signals as defined in Claim 48 further comprising the step of repeating the steps of Claim 48 if the relationship between the error value and the threshold value are not in accordance with preselected acceptance criteria.
50. A method for verifying the identity of a source of acoustic signals as defined in Claim 49 wherein, prior to the step of repeating the steps of Claim 48, the method comprises the step of terminating the operation of the method if the relationship between the error value and the threshold value are not in accordance with preselected acceptance criteria, and if the step of repeating the steps of Claim 48 has been performed a preselected number of times.
51. A method for verifying the identity of a source of acoustic signals as defined in Claim 48 wherein the error value comprises the product of the third distortions produced by the step of comparing the third template with the first template.
52. A method for defining an acceptance threshold in identifying a source of acoustic signals comprising: providing a plurality of first templates, wherein each said first template comprises a representation of a separate acoustic transmission of a selected communication from a selected source; providing a plurality of second templates, wherein each said second template comprises a representation of an acoustic transmission of the selected communication from different sources; comparing individual ones of the plurality of first templates with others of said first templates to define first distortion values for said individual ones, said first distortion values being representative of differences in separate acoustic transmissions of the same message from the selected source; comparing ones of the plurality of first templates with ones of the plurality of second templates to define second distortion values representative of differences in acoustic transmissions of the selected communication by the selected source as compared to other sources; comparing the first distortion values with the second distortion values to develop probability values for said first distortion values, wherein said probability values provide an indication of the likelihood that an acoustic transmission of the selected communication from another source will have a distortion value which has a predetermined relationship to a selected first distortion value; and selecting a threshold value for acceptance of a source of acoustic signals based on said probability values.
53. An apparatus for verifying the identity of a source of acoustic signals comprising: means for receiving a first acoustic signal; means for producing test values representative of said first acoustic signal; means for receiving a second acoustic signal; means for adjusting said test values to represent a combination of the first and second acoustic signals; means for comparing said adjusted test values with a threshold value; and means for generating an acceptance signal if the relationship between the adjusted test values and the threshold value are in accordance with preselected acceptance criteria.
54. A method for detecting the beginning and end of an acoustic signal comprising the steps of: detecting acoustic energy above a selected begin threshold level; identifying a beginning of an acoustic signal after acoustic energy above the begin signal threshold is detected and when said acoustic energy remains above a selected first attention threshold level for a period of time exceeding a selected first threshold time period; detecting when said acoustic energy drops below a selected end signal threshold level; and identifying an end of said acoustic signal after detecting when said acoustic energy drops below said selected end signal threshold level and when said acoustic energy remains below a selected second attention threshold level for a period of time exceeding a selected second threshold time period.
55. A method for detecting the beginning and end of an acoustic signal as defined in Claim 54 wherein, following the step of detecting acoustic energy, and if said acoustic energy is not above the selected begin speech threshold level, the method comprises the step of detecting acoustic energy above another begin signal threshold level in a situation where a normalized autocorrelation function has a Euclidean distance which is more than a preselected distance from a value of the autocorrelation function which has been measured for signals comprising noise.
PCT/US1986/001402 1985-07-01 1986-07-01 Speaker verification system WO1987000332A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US75103185A 1985-07-01 1985-07-01
US751,031 1985-07-01

Publications (1)

Publication Number Publication Date
WO1987000332A1 true WO1987000332A1 (en) 1987-01-15

Family

ID=25020188

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1986/001402 WO1987000332A1 (en) 1985-07-01 1986-07-01 Speaker verification system

Country Status (4)

Country Link
EP (1) EP0233285A4 (en)
JP (1) JPS63500126A (en)
AU (1) AU6128586A (en)
WO (1) WO1987000332A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1990008379A1 (en) * 1989-01-17 1990-07-26 The University Court Of The University Of Edinburgh Speaker recognition
EP0424071A2 (en) * 1989-10-16 1991-04-24 Logica Uk Limited Speaker recognition
US7545960B2 (en) 2004-12-11 2009-06-09 Ncr Corporation Biometric system
US9443523B2 (en) 2012-06-15 2016-09-13 Sri International Multi-sample conversational voice verification

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5054083A (en) * 1989-05-09 1991-10-01 Texas Instruments Incorporated Voice verification circuit for validating the identity of an unknown person
JP5272141B2 (en) * 2009-05-26 2013-08-28 学校法人早稲田大学 Voice processing apparatus and program

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4058710A (en) * 1975-03-14 1977-11-15 Dornier Gmbh. Process for preventing undesired contact with land or water by low-flying aircraft
US4158750A (en) * 1976-05-27 1979-06-19 Nippon Electric Co., Ltd. Speech recognition system with delayed output
US4292471A (en) * 1978-10-10 1981-09-29 U.S. Philips Corporation Method of verifying a speaker
US4385359A (en) * 1980-03-18 1983-05-24 Nippon Electric Co., Ltd. Multiple-channel voice input/output system
US4581755A (en) * 1981-10-30 1986-04-08 Nippon Electric Co., Ltd. Voice recognition system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3673331A (en) * 1970-01-19 1972-06-27 Texas Instruments Inc Identity verification by voice signals in the frequency domain
US4032711A (en) * 1975-12-31 1977-06-28 Bell Telephone Laboratories, Incorporated Speaker recognition arrangement
JPS5870292A (en) * 1981-10-22 1983-04-26 日産自動車株式会社 Voice recognition equipment for vehicle
JPS59178587A (en) * 1983-03-30 1984-10-09 Nec Corp Speaker confirming system
IT1160148B (en) * 1983-12-19 1987-03-04 Cselt Centro Studi Lab Telecom SPEAKER VERIFICATION DEVICE

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4058710A (en) * 1975-03-14 1977-11-15 Dornier Gmbh. Process for preventing undesired contact with land or water by low-flying aircraft
US4158750A (en) * 1976-05-27 1979-06-19 Nippon Electric Co., Ltd. Speech recognition system with delayed output
US4292471A (en) * 1978-10-10 1981-09-29 U.S. Philips Corporation Method of verifying a speaker
US4385359A (en) * 1980-03-18 1983-05-24 Nippon Electric Co., Ltd. Multiple-channel voice input/output system
US4581755A (en) * 1981-10-30 1986-04-08 Nippon Electric Co., Ltd. Voice recognition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP0233285A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1990008379A1 (en) * 1989-01-17 1990-07-26 The University Court Of The University Of Edinburgh Speaker recognition
EP0424071A2 (en) * 1989-10-16 1991-04-24 Logica Uk Limited Speaker recognition
EP0424071A3 (en) * 1989-10-16 1991-08-07 Logica Uk Limited Speaker recognition
US7545960B2 (en) 2004-12-11 2009-06-09 Ncr Corporation Biometric system
US9443523B2 (en) 2012-06-15 2016-09-13 Sri International Multi-sample conversational voice verification

Also Published As

Publication number Publication date
EP0233285A4 (en) 1987-12-01
AU6128586A (en) 1987-01-30
JPS63500126A (en) 1988-01-14
EP0233285A1 (en) 1987-08-26

Similar Documents

Publication Publication Date Title
US5548647A (en) Fixed text speaker verification method and apparatus
EP3599606B1 (en) Machine learning for authenticating voice
US6510415B1 (en) Voice authentication method and system utilizing same
US5794196A (en) Speech recognition system distinguishing dictation from commands by arbitration between continuous speech and isolated word modules
US5339385A (en) Speaker verifier using nearest-neighbor distance measure
RU2161336C2 (en) System for verifying the speaking person identity
US6463415B2 (en) 69voice authentication system and method for regulating border crossing
US6182037B1 (en) Speaker recognition over large population with fast and detailed matches
EP0983587B1 (en) Speaker verification method using multiple class groups
EP1199708B1 (en) Noise robust pattern recognition
US4802231A (en) Pattern recognition error reduction system
US20070129941A1 (en) Preprocessing system and method for reducing FRR in speaking recognition
KR100406307B1 (en) Voice recognition method and system based on voice registration method and system
EP0892388B1 (en) Method and apparatus for providing speaker authentication by verbal information verification using forced decoding
EP0954854A1 (en) Subword-based speaker verification using multiple classifier fusion, with channel, fusion, model, and threshold adaptation
CA2318262A1 (en) Multi-resolution system and method for speaker verification
US8433569B2 (en) Method of accessing a dial-up service
US5937381A (en) System for voice verification of telephone transactions
US5677991A (en) Speech recognition system using arbitration between continuous speech and isolated word modules
US20060178885A1 (en) System and method for speaker verification using short utterance enrollments
Campbell Speaker recognition
WO1987000332A1 (en) Speaker verification system
JPH1173196A (en) Method for authenticating speaker's proposed identification
Lee A tutorial on speaker and speech verification
KR100917419B1 (en) Speaker recognition systems

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU BR DK FI JP KP NO US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT NL SE

WWE Wipo information: entry into national phase

Ref document number: 1986907234

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1986907234

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1986907234

Country of ref document: EP